Choosing an API

The PHP SDK offers only a blocking API, but this is not necessarily a limitation. Using a combination of data batching and process forking, or the batching API (not available just yet), we can perform effective bulk operations over data.

Batching operations allows you to make better use of your network and speed up your application by increasing network throughput and reducing latency. Batched operations work by pipelining requests over the network: when requests are pipelined, they are sent to the cluster in one large group, and the cluster in turn pipelines its responses back to the client. When operations are batched there are fewer IP packets to send over the network, since there are fewer individual TCP segments.

Batching with process forks

Bulk loading with multiple PHP processes is a useful way of achieving the effect of parallel operations. In the following example we will load a set of JSON files and upload them to Couchbase Server in concurrent batches.

To begin with, let’s look at loading the data from one of the Couchbase sample datasets, the beer-sample dataset. This dataset contains around 7,300 JSON files, each file representing a document. The sample looks for the dataset in the default location for a Linux install; you can find the default locations for other operating systems in our CLI reference.

$concurrency = 4; // number of child processes to fork
$sample_name = "beer-sample";
$sample_zipball = "/opt/couchbase/samples/$sample_name.zip";
printf("Using '%s' as input\n", $sample_zipball);
// Extract the sample dataset into /tmp
system("rm -rf /tmp/$sample_name");
system("unzip -q -d /tmp $sample_zipball");
$files = glob("/tmp/$sample_name/docs/*.json");
// Prepare one (initially empty) batch per process
$batches = [];
for ($i = 0; $i < $concurrency; $i++) {
  $batches[$i] = [];
}
printf("Bundle '%s' contains %d files\n", $sample_name, count($files));
// Distribute the file names round-robin across the batches
for ($i = 0; $i < count($files); $i++) {
  array_push($batches[$i % $concurrency], $files[$i]);
}

Here we’ve unzipped the zip file containing the dataset and then set up the relevant number of batches, where each batch is a list of filenames whose contents we will later read and upload.

In the next snippet we call pcntl_fork to fork the process. After forking we check whether we are now running as the child or as the parent process. If we are running as the child then we run the upload_batch function, which iterates over the filenames in the batch, reading the contents of each file and uploading it to Couchbase Server. If we are in the parent process then, instead of running upload_batch, we add the PID of the child process to the $children array. The parent then uses pcntl_waitpid to wait for each child process to complete.

$children = [];
for ($i = 0; $i < $concurrency; $i++) {
  $pid = pcntl_fork();
  if ($pid == -1) {
    die("unable to spawn child process");
  } else if ($pid == 0) {
    // Child process: upload the batch assigned to this index, then exit
    printf("Start a process to upload a batch of %d files\n", count($batches[$i]));
    upload_batch($i, $batches[$i]);
    exit(0);
  } else {
    // Parent process: remember the child's PID so we can wait for it
    array_push($children, $pid);
  }
}

// Wait for all child processes to finish
foreach ($children as $child) {
  pcntl_waitpid($child, $status);
}

use \Couchbase\Cluster;
use \Couchbase\ClusterOptions;

// The upload function run by each child process
function upload_batch($id, $batch) {
  // Each child opens its own connection to the cluster after forking
  $options = new ClusterOptions();
  $options->credentials("Administrator", "password");
  $cluster = new Cluster("couchbase://10.112.193.101", $options);
  $collection = $cluster->bucket("default")->defaultCollection();
  foreach ($batch as $path) {
    // The file path is used as the document ID
    $collection->upsert($path, json_decode(file_get_contents($path)));
  }
}
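
In a real bulk load you would probably also want to guard each upsert so that one problematic document does not abort the whole batch. As a rough sketch (the retry count, the backoff, and the generic catch of \Exception are illustrative choices of ours, not part of the sample), upload_batch could instead be written as:

function upload_batch($id, $batch) {
  $options = new ClusterOptions();
  $options->credentials("Administrator", "password");
  $cluster = new Cluster("couchbase://10.112.193.101", $options);
  $collection = $cluster->bucket("default")->defaultCollection();
  foreach ($batch as $path) {
    $document = json_decode(file_get_contents($path));
    // Try each document a few times before giving up on it
    for ($attempt = 1; $attempt <= 3; $attempt++) {
      try {
        $collection->upsert($path, $document);
        break;
      } catch (\Exception $e) {
        if ($attempt == 3) {
          printf("Batch %d: failed to upload '%s': %s\n", $id, $path, $e->getMessage());
        } else {
          usleep(100000 * $attempt); // brief backoff before retrying
        }
      }
    }
  }
}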

In the output we can see something like:

Bundle 'beer-sample' contains 7303 files
Start a process to upload a batch of 1826 files
Start a process to upload a batch of 1826 files
Start a process to upload a batch of 1826 files
Start a process to upload a batch of 1825 files

The application has split the files into 4 batches and then uploaded the batches in parallel.