Fastest way to bulk delete in Django
Was faced with an optimisation problem today at work. Our product analyses large datasets every day and as our number of user grew over the last few months, we’ve had to optimise our algorithm that generates and saves recommendations to our database. Poking around a bit, I noticed the step that deleted rows in our recommendations table was taking a long time to process.
Photo by Cara Fuller on Unsplash
Time taken for code to run
Before going into the optimisation step, we need to find out which part needs optimising. Here’s a neat way to use the built-in print
statement to find out how long a code block takes to run that I use a lot. This will have to do until I hone my sense of runtime efficiency just by looking at code…
import time
def delete_recommendations(cutoff_time): """ Delete all old recommendations before cutoff_time. """ start = time.time()
Code block that does the deletion
end = time.time()
print ‘delete_recommendations took {}s’.format(end - start)
=> delete_recommendations took 1.461352134s
That’s the simplest way I know of to identify processes that take a long time to run. I just wrap a start
and end
time around any block I suspect to be slow and print the difference.
Now on to deleting rows in the database using Django.
The fastest: raw SQL
By first principles, nothing beats raw SQL in terms of speed because it operates closest to the database!
To execute raw SQL queries in Django, use the connection
and cursor
APIs like this:
from django.db import connection from app.account.models import Store from app.recommendations.models import Recommendation
store = Store.objects.get(name=‘amazon’) recommendations = Recommendation.objects.filter(store=store) if recommendations.exists(): cursor = connection.cursor()
Raw SQL delete, using params for protection against SQL injection
cursor.execute(“DELETE FROM app_recommendation WHERE store_id = %s”, [store.id])
(In case you’re wondering, this obviously isn’t our company code. Just an illustration!)
This method was indeed the fastest when I tested it, blazing through 20,000 deletions in about 4 seconds on my Macbook Pro. YMMV. (That means Your Mileage May Vary, I learned it from the cool kids on Reddit recently. Think I’ll use it more often now. Handy!)
The fast and maintainable way: _raw_delete()
The second fastest way to bulk delete entries in the database with Django is to use the private method _raw_delete
. I got this idea when my colleague pointed me to this [answer on Stack Overflow]2.
recommendations = Recommendation.objects.filter(store=store) if recommendations.exists():
Raw delete with the convenience of using Django QuerySet
recommendations._raw_delete(recommendations.db)
We ended up using this approach because it maintained just the right level of abstraction for us to work with Django. In the raw SQL approach we wrote actual structured query language, which is verbose and more problematically, it meant writing code that would be harder to maintain.
This approach leverages the Django QuerySet ORM, allowing us to write Django QuerySet queries instead of raw SQL. It executed just slightly slower than the raw SQL approach, taking 6 seconds for 20,000 deletions.
One important limitation of using _raw_delete
is that it bypasses the model layer of Django, so dependent database tables will not be informed (“signalled” is the right term in Django I think). That means if a dependent entry in another table is supposed to be deleted along with this entry, that dependent delete will not be executed.
The normal way: delete()
If you don’t want to bypass the model layer (because you know this deletion affects other tables, for example), the best way is to use Django’s .delete()
QuerySet method.
recommendations = Recommendation.objects.filter(store=store) if recommendations.exists():
Standard Django delete
recommendations.delete()
This was the slowest approach. It took about 18 seconds for 20,000 deletions, about 3-4 times slower than the above two approaches.
There are obviously other considerations when deciding which approach works best in a particular scenario. Here I’ve merely stated the three approaches I’ve tried to optimise our algorithm.