So, you have a ton of records to update in a really large table. Say you need to update 3 million records in a table with 100 million rows, and you have a way of finding those records. Even with all of this info, updating 3M records in a single transaction is troublesome if the table is under moderate use during the data fix: you have a high probability of running into a deadlock or having your query time out.
There is a way around this: update your data in small batches. The idea is to first find the ids of the records you want to update, and then update a small batch of them in each transaction.
For our example, let’s say we have a users table with 3M records created in the year 2019 whose authentication tokens need to be reset. Simple enough!
1. Doing this in a single update
Doing this in a single update is the easiest approach, and it is possible if you don’t use this table a lot. However, as I said, it is prone to deadlocks and statement timeouts.
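Something like the following, assuming the token is stored in an `authentication_token` column and the creation time in a `created_at` timestamp (both column names are stand-ins for this example):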
```sql
UPDATE users
SET authentication_token = NULL
WHERE created_at >= '2019-01-01'
  AND created_at < '2020-01-01';
```
2. Doing this in multiple batches through a CTE
Doing this through a CTE in multiple batches works, but it is not the most efficient option.
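One way to sketch this is with a rolling OFFSET that grows by the batch size on every run (same assumed columns as before):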
```sql
-- first get all the records that you want to update by using rolling OFFSETs
WITH users_to_be_updated AS (
  SELECT id
  FROM users
  WHERE created_at >= '2019-01-01'
    AND created_at < '2020-01-01'
  ORDER BY id
  LIMIT 1000
  OFFSET 0 -- 0 on the first run, 1000 on the second, and so on
)
UPDATE users
SET authentication_token = NULL
WHERE id IN (SELECT id FROM users_to_be_updated);
```
That works. However, it is not the most efficient update: for every batch (in this example, a batch of 1000) we repeat the filtering and ordering of all the data, so we end up running the same expensive query 3M/1K = 3000 times. Not the most efficient use of our database resources!
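To see why, look at what the inner query has to do by the time we reach the last batch: Postgres still filters and orders the full 3M-row result, only to throw away everything before the final 1000 rows.

```sql
-- the 3000th and final batch: the full result is still filtered
-- and sorted, then the first 2,999,000 rows are skipped
SELECT id
FROM users
WHERE created_at >= '2019-01-01'
  AND created_at < '2020-01-01'
ORDER BY id
LIMIT 1000
OFFSET 2999000;
```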
3.1. Doing this in multiple batches using a temporary table
So, to remove the inefficiency from the previous step, we can create a temporary table to store the filtered user ids while we update the records. And since it is a temp table, it is discarded automatically once the session ends.
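Here is a sketch of the whole flow, using `row_number()` to build the serial id described below (column assumptions as before):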
```sql
CREATE TEMP TABLE users_to_be_updated AS
SELECT ROW_NUMBER() OVER (ORDER BY id) AS row_id, id
FROM users
WHERE created_at >= '2019-01-01'
  AND created_at < '2020-01-01';

CREATE INDEX ON users_to_be_updated (row_id);

-- first batch: rows 1..1000 of the temp table
UPDATE users
SET authentication_token = NULL
WHERE id IN (
  SELECT id FROM users_to_be_updated
  WHERE row_id > 0 AND row_id <= 1000
);
```
So, in the above SQL we create a temporary table containing a row_id, a serial number running from 1 to the total number of rows. We also add an index on it, since we’ll be using it in the WHERE clause of our batch updates. Finally, we do the batch update itself by selecting rows 0..1000 in the first iteration, 1000..2000 in the second, and so on.
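For example, the second iteration only moves the row_id window; everything else stays the same:

```sql
-- second batch: rows 1001..2000 of the temp table
UPDATE users
SET authentication_token = NULL
WHERE id IN (
  SELECT id FROM users_to_be_updated
  WHERE row_id > 1000 AND row_id <= 2000
);
```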
3.2. Tying this up via a Ruby script to do the full update
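Nobody wants to hand-write 3000 of those UPDATE statements, so we generate them with a script instead. A sketch of what that generator could look like (the constants and overall structure are illustrative):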
```ruby
# sql_generator.rb
# Writes user_batch_update.psql: the temp table setup followed by
# one UPDATE statement per batch of 1000 rows.

BATCH_SIZE = 1_000
TOTAL_ROWS = 3_000_000

File.open("user_batch_update.psql", "w") do |file|
  file.puts <<~SQL
    CREATE TEMP TABLE users_to_be_updated AS
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS row_id, id
    FROM users
    WHERE created_at >= '2019-01-01'
      AND created_at < '2020-01-01';

    CREATE INDEX ON users_to_be_updated (row_id);
  SQL

  (0...TOTAL_ROWS).step(BATCH_SIZE) do |offset|
    file.puts <<~SQL
      UPDATE users
      SET authentication_token = NULL
      WHERE id IN (
        SELECT id FROM users_to_be_updated
        WHERE row_id > #{offset} AND row_id <= #{offset + BATCH_SIZE}
      );
    SQL
  end
end
```
This tiny script generates a SQL file which can then be executed via psql to do the whole process in one fell swoop.
```bash
# generate the sql file
ruby sql_generator.rb
```
Once we have the SQL file, we run it through psql like so (with DATABASE_URL replaced by the actual connection string):
```bash
psql --echo-all --file=user_batch_update.psql "DATABASE_URL"
```
That’s all, folks! Now your updates run in batches and shouldn’t cause any deadlocks or statement timeouts.