Migration scripts. We’ve all had to write them. They are the background, throwaway scripts that don’t have any performance requirements. Who cares if the script you planned to run overnight takes an extra hour to finish, especially since it doesn’t directly impact users? And generally all of that is true. Of course, when the situation falls outside of the “generally” bucket, it fails spectacularly.

$max_size = 4200479;
for ($i = 0; $i <= $max_size; $i++) {
  // connect to the source database and process row
  $dbh = new DB(...);
  $sth = $dbh->prepare("select * from source_table where id = ?");
  $sth->execute($i);
  
  _process_row($sth->fetch()); 
  
  // connect to the target database and store row
  $dbh_trgt = new DB(...);
  $sth_trgt = $dbh_trgt->prepare("insert into target_table ... ");
  $sth_trgt->execute(...);
}

This is a pretty common base migration script. It iterates through the rows in the old table (4,200,479 of them), modifies the data and structure to fit the new schema and inserts the row into the new table. Now, it has a lot of performance issues (n+1 probably being one of the most visible ones). But do we care? This is a one-time script that will never be used after the migration is complete. Right?

As it turns out, we do care. The script above would take about 6 seconds per row, which seems decent enough for a throwaway. But to complete the full migration it would take 25,202,874 seconds, or a little over 291 days.
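A quick sanity check of that estimate (the 6-seconds-per-row figure is the measured per-row cost from above):

```php
<?php
// Back-of-the-envelope math for the naive script: rows x seconds per row.
$rows            = 4200479; // rows in source_table
$seconds_per_row = 6;       // measured cost of one iteration

$total_seconds = $rows * $seconds_per_row;
echo $total_seconds, "\n";                // 25202874
echo floor($total_seconds / 86400), "\n"; // 291 (days; 86,400 seconds per day)
```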

Yikes.

So we need to optimize, and optimize we shall. Luckily, there are a few simple performance improvements that can be made to reduce the migration runtime from a year to an overnight success.

Minimize number of connects

Connecting to and disconnecting from a database is not free. Many people don’t think about it, but those connections (especially over the network) cost resources. It is usually a minimal overhead, but overhead nonetheless. If you connect and disconnect with every query, that overhead grows linearly with the number of rows.

What you want to do is maintain the connections for the duration of the script.

// connect to source and target databases
$dbh = new DB(...);
$dbh_trgt = new DB(...);

$max_size = 4200479;

for ($i = 0; $i <= $max_size; $i++) {
  // get and process row
  $sth = $dbh->prepare("select * from source_table where id = ?");
  $sth->execute($i);
  
  _process_row($sth->fetch());
  
  // store row
  $sth_trgt = $dbh_trgt->prepare("insert into target_table ... ");
  $sth_trgt->execute(...);
}

Condensing 8,400,958 (4,200,479 x 2 databases) connects and disconnects into 2 makes an immediate impact on overall processing time.

Minimize number of queries

A standard solution to the n+1 problem is to consolidate all the queries into one (or close to it). Theoretically, that can be done in this migration as well. Instead of selecting individual rows, we can select the whole dataset and then process the returned rows in a loop. Unfortunately, returning millions of rows, storing them in memory and processing them at the same time is not a good answer. Fortunately, there is a middle ground.

Instead of returning one row at a time (or all the rows at once), we can return the data in reasonably sized batches.

// connect to source and target databases
$dbh = new DB(...);
$dbh_trgt = new DB(...);

$batch_start = 0;
$batch_end = 100000;
$max_size = 4200479;

while($batch_start <= $max_size) {
  // get rows in batch
  $sth = $dbh->prepare("select * from source_table 
                         where id between ? and ?");
  $sth->execute($batch_start, $batch_end);
  
  // increment batch
  $batch_start = $batch_end + 1;
  $batch_end += 100000;
  foreach ($sth->fetchAll() as $row) {
    _process_row($row);
    
    // store row
    $sth_trgt = $dbh_trgt->prepare("insert into target_table ... ");
    $sth_trgt->execute(...);
  }
}

The size of batches can vary based on your environment and resource availability. By combining the selected rows into batches of 100,000 we’ve effectively reduced the number of select queries from 4+ million to 43.
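The 43 comes straight from the arithmetic:

```php
<?php
// How many batches does it take to cover ids 0..4,200,479
// in steps of 100,000?
$max_size   = 4200479;
$batch_size = 100000;

$batches = (int)ceil(($max_size + 1) / $batch_size);
echo $batches, "\n"; // 43
```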

Minimize number of transactions

To continue the theme of minimizing the number of round trips to the database, reduce the number of insert transactions sent to the database. Note: transactions, not queries. Wrapping multiple queries in a transaction allows you to submit multiple (read: thousands of) queries in one go. This can be accomplished either by turning the AutoCommit flag off and managing batch commits manually, or by using a BEGIN block to build your own batch transaction.

// connect to source and target databases
$dbh = new DB(...);
$dbh_trgt = new DB(...);

$batch_start = 0;
$batch_end = 100000;
$max_size = 4200479;

while($batch_start <= $max_size) {
  // get rows in batch
  $sth = $dbh->prepare("select * from source_table 
                         where id between ? and ?");
  $sth->execute($batch_start, $batch_end);
  
  // increment batch
  $batch_start = $batch_end + 1;
  $batch_end += 100000;
  
  // begin transaction on the target database (where the inserts go)
  $dbh_trgt->beginTransaction();
  
  foreach ($sth->fetchAll() as $row) {
    _process_row($row);
    
    // store row
    $sth_trgt = $dbh_trgt->prepare("insert into target_table ... ");
    $sth_trgt->execute(...);
  }
  // commit transaction
  $dbh_trgt->commit();
}

Adding all the inserts for a batch into a single transaction means the database only has to commit once per batch. So instead of 100,000 individual commits going to the database, you send one transaction with all the insert statements and commit once, effectively reducing the total number of insert transactions for the migration from 4+ million to 43.
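A related trick, for databases that support it, is to collapse many single-row inserts into one multi-row INSERT statement. A minimal sketch of building such a statement; the column names (col_a, col_b) are hypothetical placeholders for your actual schema:

```php
<?php
// Build a single multi-row INSERT for a batch of processed rows.
// target_table, col_a and col_b are placeholders -- substitute your schema.
function build_bulk_insert(array $rows): array {
    $placeholders = [];
    $params = [];
    foreach ($rows as $row) {
        $placeholders[] = '(?, ?)';
        $params[] = $row['col_a'];
        $params[] = $row['col_b'];
    }
    $sql = 'insert into target_table (col_a, col_b) values '
         . implode(', ', $placeholders);
    return [$sql, $params];
}
```

You would then prepare `$sql` once on the target handle and execute it with `$params`. Keep the batch size modest here: most databases and drivers cap the number of bind parameters or the packet size per statement.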

Maximize parallelism

If the database is the biggest offender in the 6-second processing time, the techniques above will likely give the script a significant enough boost in performance to complete in a reasonable timeframe. However, if the increase in performance on the database side is not enough, or the bottleneck is in the _process_row() function, you can fork your script or run multiple instances of it in parallel, processing multiple batches at the same time.

If transactions and batching are implemented properly, you can fork multiple processes inserting millions of records at the same time in parallel. This, of course, is limited by CPU (don’t fork bomb the machine!), but even with a small number of parallel processes you’re significantly reducing time to completion.

$batch_start = 0;
$batch_end = 100000;
$max_size = 4200479;

while($batch_start <= $max_size) {
  // Fork a process
  $pid = pcntl_fork();

  // if a child - process a batch
  if (!$pid) {
    // connections can't be safely shared across a fork,
    // so each child opens its own
    $dbh = new DB(...);
    $dbh_trgt = new DB(...);

    // get rows in batch
    $sth = $dbh->prepare("select * from source_table 
                           where id between ? and ?");
    $sth->execute($batch_start, $batch_end);
    
    // begin transaction on the target database
    $dbh_trgt->beginTransaction();
  
    foreach ($sth->fetchAll() as $row) {
      _process_row($row);
      
      // store row
      $sth_trgt = $dbh_trgt->prepare("insert into target_table ... ");
      $sth_trgt->execute(...);
    }
    // commit transaction
    $dbh_trgt->commit();
    // Die to exit the child process
    die();
  }

  // increment batch in the parent, after the fork, so the loop
  // terminates and each child gets its own range
  $batch_start = $batch_end + 1;
  $batch_end += 100000;
}

This final version of the migration script will spawn 43 children to run batches of 100,000 records in parallel, finishing in minutes or hours (depending on the processing time and complexity of queries). You can tune both the size of the batches and the number of parallel processes to best match the environment you’re running on.
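The version above forks all 43 children at once. To cap the number of parallel processes, the parent can wait for a child to exit before forking the next one. A minimal, self-contained sketch of that idea; the run_in_parallel helper and its parameters are illustrative, not part of the original script, and it requires the pcntl extension (CLI PHP on Linux/macOS):

```php
<?php
// Run $jobs through $worker with at most $max_procs children alive at once.
function run_in_parallel(array $jobs, int $max_procs, callable $worker): void {
    $running = 0;
    foreach ($jobs as $job) {
        if ($running >= $max_procs) {
            pcntl_wait($status); // block until one child finishes
            $running--;
        }
        $pid = pcntl_fork();
        if ($pid === 0) {
            $worker($job); // child: do the work, then exit
            exit(0);
        }
        $running++;
    }
    // reap the remaining children
    while ($running-- > 0) {
        pcntl_wait($status);
    }
}
```

In the migration, each job would be a [$batch_start, $batch_end] pair and the worker would be the batch-processing body from the script above.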

There are always additional performance improvements that could be done, but for most throwaway scripts these techniques should be enough to bring the migration time to a much more reasonable (read: acceptable) level.

ABOUT THE AUTHOR

Leon Fayer

Leon Fayer (@papa_fire) has spent two decades architecting and operating complex, web-based systems built to withstand crushing traffic (often unexpected). Over the years, he's had a somewhat unique opportunity to design and build systems that run some of the most visited websites in the world. While his core expertise is in application development, he works his way around the whole technology stack, from system architecture to database design and optimization to front/back-end programming. He's considered a professional naysayer by peers and is of the opinion that nothing really works until it works for at least a million people.

One Response to “Migrations: getting there faster”

  1. Brian LePore

    I love this advice, but I think the topic is a bit limited. I actually had to do almost this exact same process for something that end users actually experienced. Our dataset was more in the way of many minutes, not days, but still something an end user shouldn’t have to deal with.

    One thing with the final script, I think it has a slight bug. The script should actually run forever and execute the same rows on each query. The solution would be to move the “increment batch” part after the “if a child – process a batch” part is closed. That way each child would run on a different portion, and the parent would eventually die.

    Another piece of advice for those using MySQL and can accept some database-specific code: LOAD DATA INFILE yields significantly better performance than even bulk inserts.