Wednesday 24 September 2014

Working with Entity Framework Code First and SQL Bulk Copy

There's a few of these that I haven't written, but it seems that you could keep a blog going pretty well with just "Working with Entity Framework and ..." posts. That's not because Entity Framework is particularly bad, I really quite like it myself, but because if you're writing an application of any considerable size or complexity you will at some point reach the limitations of what is possible "out of the box". I'm also aware of the number of "ORMs are evil" posts circulating currently, and possibly you're someone who thinks the same, but for what I'm working on now they make perfect sense, and I'm all for using the right tool for the job.

So what's the problem this time?

General purpose ORMs are great when you're working with transactions, it's what they're meant for.  A user interacts with the system and you need to persist the data from their most recent transaction.  The problem this time is that as the system grows you suddenly find yourself with a couple of use cases where you need to persist a lot of data.  In the example application I've put together for this post I'm using an XML data source with 10,000 records which need to be persisted.

The problem here is that when running with this size data set (with auto-tracking changes disabled) is that it is taking around 40 seconds to run.  10,000 records in 40 seconds is certainly more that I can process manually but for a modern computer it's not so great.  The problem is that as you're adding more records to the context it's getting bloated, it has to start tracking more and more entities, then each entity is persisted individually.  That last point is important because each insert in Entity Framework inserts the new record and then pulls back out the newly created ID code and updates the entity in memory with the new ID code, which is not a trivial amount of work.

So what are the solutions?

Disable features: The first thing to check is that you are telling the context to not auto-track changes, it's a small thing but you're giving the context less work to do, performance isn't about making your code faster, it's about making it do less.

If you were to run this again you would find that you've taken a few seconds of the total run time, which is better but it's still no where near fast enough.

Disabling auto-detect changes


Commit early, commit often: A fairly obvious option but rather than waiting until we've attached all of the entities to the context before saving, save more frequently (e.g. every 100 entities).  Again this is reducing the amount of work the context is having to perform when it figures out which entities it needs to persist and makes a more significant impact in our figures, but it's still got a way to go.

You might also remember that I mentioned about the context getting bloated, well we can do something about that as well by re-creating the context after each time we save the changes.  This stops the context from getting too bloated and again reduces the amount of effort needed to work out which entities need to be persisted.  We've added in some work now for context initialisation but this is typically cheaper.  This does take a bit more effort to maintain and ensure that we're not breaking the system by doing anything stupid, but it again takes a bit more of a chunk out of the run time.

Committing frequently


Get to the DbSet differently: The typical route to adding an entity is to add it to the contexts DbSet collection.  Bizarrely this collection doesn't have an AddRange method, but there is a way to get at one by asking the context for the set directly.  By adding the entities using the AddRange method we can skip all of the tedious foreach looping and adding the entities one at a time.  So we can now make a simple call to AddRange followed by a call to SaveChanges, this is much more performant than the previous solutions, getting down to some slightly more reasonable numbers.

Using AddRange


But what about SQL Bulk Copy?

So the title of the post is about bulk copying, and having read through the above you're probably wondering why I didn't just jump to this as a solution.  Well, Entity Framework has no out of the box support for SQL Bulk Copy because it's database agnostic.  But I've done this before when working with SQL FILESTREAM so why can't I do it again?

Being a lazy developer the first thing I did was look for an existing solution and one turned up in the shape of EntityFramework.BulkInsert.  It seems pretty popular online for this kind of a problem, is available through NuGet and is pretty easy to use.  After adding the package and creating a method to try it out I ran the sample application and waited for it to finish.  It took me a while before I realised that it already had!  For 10,000 records it ran in under 1 second.

Using EntityFramework.BulkInsert


So surely EntityFramework.BulkInsert is the answer then?  Well if you want to stop reading here and go and download it then please do, it's a great little package.  Naturally there are a few things that you need to take into consideration.  First of all bulk copying doesn't bring the ID codes back, so if you need the values you will have to think of a way around this (think SQL 2012 sequences and sp_sequence_get_range).  Next you have to think about how bulk copying works and make sure you get the bulk copy options correct.  By default it won't check any constraints you have in place and it might not observe NULL values, instead putting in default values for the column type.  It also works within its own transaction (unless you provide a TransactionScope), but if you can work around these then you have a great little solution in your hands.

SQL Bulk Copy

I'm a lazy developer but I'm also a fascinated one, I wanted to know if I could still write the code using the System.Data.SqlClient.SqlBulkCopy class instead of relying on 3rd party packages or falling back to ADO.NET (which is an option, but not one I'm going to cover).

I already know that I can get the connection information from the context, and I've previously shown how to get the mapped table name for a given entity, so surely this is possible.  But I am going to be a little bit lazy and not implement an IDataReader for my collection, instead I'm going to load the entities into a DataTable and use that (note, this option really isn't going to scale well).

This is actually a fairly easy solution to implement with probably the most complicated piece being a fairly simple extension method which pulls out the entity properties and their types, then using this to create a DataTable and copy the data using reflection (again, this isn't going to scale well).  Once you have that you just need to write the data to the database for your chosen batch size.

Using System.Data.SqlClient.SqlBulkCopy

This solution isn't quite as fast as the EntityFramework.BulkInsert component, mostly for the reasons I mention, but it can still persist 100,000 records in about 1 second.

I've created a project which is available on GitHub under an MIT license for you to grab and look at.  I've done this because the code isn't really that difficult to follow and is pretty similar to my previous post on SQL FILESTREAM and me talking through lines of code is boring.  Also available is a LinqPad file which I used to create the input data files, just change the number of entities and run it.  But for convenience I've added a 1,000 and 10,000 entity files into the project anyway.