Wednesday, 24 September 2014

Working with Entity Framework Code First and SQL Bulk Copy

There are a few of these posts now, and it seems that you could keep a blog going pretty well with just "Working with Entity Framework and ..." posts. That's not because Entity Framework is particularly bad (I really quite like it myself), but because if you're writing an application of any considerable size or complexity you will at some point reach the limits of what is possible "out of the box". I'm also aware of the number of "ORMs are evil" posts circulating at the moment, and possibly you're someone who thinks the same, but for what I'm working on now an ORM makes perfect sense, and I'm all for using the right tool for the job.

So what's the problem this time?

General purpose ORMs are great when you're working with transactions; it's what they're designed for. A user interacts with the system and you persist the data from their most recent transaction. The problem this time is that, as the system grows, you suddenly find yourself with a couple of use cases where you need to persist a lot of data at once. In the example application I've put together for this post I'm using an XML data source with 10,000 records which need to be persisted.

The problem is that running with this size of data set (with automatic change tracking disabled) takes around 40 seconds. 10,000 records in 40 seconds is certainly more than I could process manually, but for a modern computer it's not so great. As you add more records to the context it gets bloated: it has to track more and more entities, and then each entity is persisted individually. That last point is important, because each insert in Entity Framework inserts the new record, pulls the newly created ID back out, and updates the entity in memory with that ID, which is not a trivial amount of work.

So what are the solutions?

Disable features: The first thing to check is that you are telling the context not to automatically detect changes. It's a small thing, but it gives the context less work to do, and performance isn't only about making your code faster, it's about making it do less.

If you were to run this again you would find that you've taken a few seconds off the total run time, which is better, but it's still nowhere near fast enough.

Disabling auto-detect changes
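In sketch form, assuming a context and entity along these lines (the names are illustrative, not from the sample project):

    using (var context = new ProductContext())
    {
        // Stop the context calling DetectChanges on every Add; newly
        // added entities are tracked as Added regardless.
        context.Configuration.AutoDetectChangesEnabled = false;

        foreach (var product in products)
        {
            context.Products.Add(product);
        }

        context.SaveChanges();
    }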


Commit early, commit often: A fairly obvious option, but rather than waiting until we've attached all of the entities to the context before saving, save more frequently (e.g. every 100 entities). Again this reduces the amount of work the context has to perform when it figures out which entities need persisting, and it makes a more significant impact on our figures, but there's still a way to go.

You might also remember that I mentioned the context getting bloated; well, we can do something about that as well by re-creating the context each time we save changes. This stops the context from getting too big and again reduces the effort needed to work out which entities need to be persisted. We've added in some work for context initialisation, but that's typically cheaper. It does take a bit more effort to maintain, and to make sure we're not breaking the system by doing anything stupid, but it takes another chunk out of the run time.

Committing frequently
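A sketch of the pattern, saving and renewing the context every 100 entities (illustrative names again, and assuming the records are in a list):

    const int batchSize = 100;
    var context = new ProductContext();
    context.Configuration.AutoDetectChangesEnabled = false;

    try
    {
        for (var i = 0; i < products.Count; i++)
        {
            context.Products.Add(products[i]);

            if ((i + 1) % batchSize == 0)
            {
                context.SaveChanges();

                // Throw the bloated context away and start afresh.
                context.Dispose();
                context = new ProductContext();
                context.Configuration.AutoDetectChangesEnabled = false;
            }
        }

        context.SaveChanges(); // the final partial batch
    }
    finally
    {
        context.Dispose();
    }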


Get to the DbSet differently: The typical route to adding an entity is to add it to the context's DbSet collection. Bizarrely this collection doesn't expose an AddRange method, but there is a way to get at one by asking the context for the set directly. By using AddRange we can skip all of the tedious foreach looping and adding the entities one at a time: a single call to AddRange followed by a call to SaveChanges is much more performant than the previous solutions, getting us down to some slightly more reasonable numbers.

Using AddRange
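Which looks something like this:

    using (var context = new ProductContext())
    {
        context.Configuration.AutoDetectChangesEnabled = false;

        // Ask the context for the set directly to get at AddRange.
        context.Set<Product>().AddRange(products);

        context.SaveChanges();
    }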


But what about SQL Bulk Copy?

So the title of the post is about bulk copying, and having read through the above you're probably wondering why I didn't just jump straight to it. Well, Entity Framework has no out-of-the-box support for SQL bulk copy because it's database agnostic. But I've done this before when working with SQL FILESTREAM, so why can't I do it again?

Being a lazy developer, the first thing I did was look for an existing solution, and one turned up in the shape of EntityFramework.BulkInsert. It seems pretty popular online for this kind of problem, is available through NuGet and is pretty easy to use. After adding the package and creating a method to try it out, I ran the sample application and waited for it to finish. It took me a while to realise that it already had! For 10,000 records it ran in under 1 second.

Using EntityFramework.BulkInsert
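Usage boils down to a single extension method on the context; from memory it's something like this (check the package documentation for the exact namespace):

    using EntityFramework.BulkInsert.Extensions;

    using (var context = new ProductContext())
    {
        // Streams the entities to the server using SqlBulkCopy under the hood.
        context.BulkInsert(products);
    }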


So surely EntityFramework.BulkInsert is the answer then? Well, if you want to stop reading here and go and download it then please do, it's a great little package. Naturally there are a few things that you need to take into consideration. First of all, bulk copying doesn't bring the generated IDs back, so if you need those values you will have to think of a way around it (think SQL Server 2012 sequences and sp_sequence_get_range). Next, you have to think about how bulk copying works and make sure you get the bulk copy options correct: by default it won't check any constraints you have in place, and it might not observe NULL values, instead putting in default values for the column type. It also works within its own transaction (unless you provide a TransactionScope). But if you can work around these then you have a great little solution on your hands.

SQL Bulk Copy

I'm a lazy developer, but I'm also a fascinated one, and I wanted to know if I could still write the code using the System.Data.SqlClient.SqlBulkCopy class myself, without relying on third-party packages or falling back to plain ADO.NET (which is an option, but not one I'm going to cover).

I already know that I can get the connection information from the context, and I've previously shown how to get the mapped table name for a given entity, so surely this is possible. I am going to be a little bit lazy, though, and not implement an IDataReader for my collection; instead I'm going to load the entities into a DataTable and use that (note: this option really isn't going to scale well).

This is actually a fairly easy solution to implement, with probably the most complicated piece being a fairly simple extension method which pulls out the entity properties and their types, uses them to create a DataTable, and copies the data across using reflection (again, this isn't going to scale well). Once you have that, you just need to write the data to the database in your chosen batch size.

Using System.Data.SqlClient.SqlBulkCopy
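The bulk copy part itself is small; a sketch, minus the reflection-based DataTable helper (which is in the GitHub project):

    using System.Data;
    using System.Data.SqlClient;

    public static void BulkWrite(string connectionString, string tableName, DataTable table)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = tableName;
            bulkCopy.BatchSize = 1000;

            // Map the DataTable columns to the destination columns by name.
            foreach (DataColumn column in table.Columns)
            {
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            }

            bulkCopy.WriteToServer(table);
        }
    }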

This solution isn't quite as fast as the EntityFramework.BulkInsert component, mostly for the reasons I mention, but it can still persist 100,000 records in about 1 second.

I've created a project which is available on GitHub under an MIT license for you to grab and look at. I've done this because the code isn't really that difficult to follow, is pretty similar to my previous post on SQL FILESTREAM, and me talking through lines of code is boring. Also available is a LINQPad file which I used to create the input data files; just change the number of entities and run it. For convenience I've added 1,000 and 10,000 entity files to the project anyway.

Tuesday, 8 July 2014

Where does your FILESTREAM data live?

This is hopefully just a short one but follows up on a previous post about using FILESTREAM with Entity Framework.

After implementing the solution almost as posted, we ran into some problems in the environment it was being deployed to (for reference, do NOT disable NetBIOS!). Whilst investigating these issues we needed to figure out where on the server the FILESTREAM data was being stored. Looking around the internet I found many posts about getting the path name from a SELECT statement, but that's useless if you want to know "my data is at C:\<path>" so that you can check disk space, permissions and so on. The DBA who installed the server wasn't available either, and there doesn't appear to be a way to get this information back from SQL Server Management Studio.

But there is a way to find it from the system tables, so I put together a SQL statement which pulls the information out. It worked for me, but you may want to tweak it to give you the information you want, to filter things out and so on.
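A query along these lines against the system catalog should get you the physical locations (FILESTREAM containers show up in sys.master_files as their own file type):

    SELECT db.name          AS database_name,
           mf.name          AS logical_name,
           mf.physical_name AS physical_path
    FROM   sys.master_files mf
           JOIN sys.databases db ON db.database_id = mf.database_id
    WHERE  mf.type_desc = 'FILESTREAM'
    ORDER  BY db.name;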

Thursday, 30 January 2014

Working with Entity Framework Code First and SQL FILESTREAM

Whilst looking around at these two technologies I couldn't find any particularly useful information about how to get a solution working, so I wanted to put this up so that, should anyone else want to try something similar, there's some (hopefully) useful information out there.

What am I trying to do?

The system that I am involved with at the moment needs to store files with some of its entities. We've started out by using Entity Framework 5 Code First (although in this post I'll be working with Entity Framework 6) to create the data model and database mappings, which is working nicely and is proving itself a very useful framework.

When looking at saving file data along with an entity there are a few choices:
  1. Store the file contents directly in the database
  2. Save the file contents to the file system
  3. Use a database table with FILESTREAM enabled
  4. Use a FILETABLE
The first option is fine for small files, but I don't really want to load all of the file data each time I query the table and the files might get pretty big.

The next option is reasonable, and with recent versions of Windows Server we even have a transactional file system. So far so good, but it would be nice not to have to think about two persistence mechanisms.

FILESTREAMs were introduced in SQL Server 2008 and allow you to store unstructured data on the file system, so it feels like we're using the right tools for the job but they're all nicely in the same package.  The problem here is that Entity Framework doesn't support FILESTREAM.

Lastly there's FILETABLE which was introduced with SQL Server 2012.  This is like FILESTREAM, but rather than defining it at a column level you get a table created for you which provides information from the file system.  It is a really nice system, but it didn't quite fit with how we're structuring data and it's also not supported by Entity Framework.

So the option that I would ideally like to work with here is FILESTREAM, as it gives me all of the database goodness but with the performance of the file system; there's just that minor sticking point of it not being supported by Entity Framework. After a fair amount of playing around with the technology, and research into the problem, I figured that I could probably make it work by falling back to basic ADO.NET for the FILESTREAM part of the requests. Whilst this was an option, I didn't really want different technology choices for doing database work, so the goal was to see how much I could get away with in Entity Framework.

Setting up the test solution

The database server

With SQL Server the default install options will not give you a FILESTREAM-enabled server, but you can enable it. I'm not going to go into how in this post, as Microsoft have some pretty good documentation available on the subject.

This also means that we can't let Entity Framework create the database for us, so you will need to create an empty, FILESTREAM enabled database and point to that.

The outline of the project

I created the solution in Visual Studio 2013, and delving into the most creative parts of my mind I came up with a solution that has hotels, with rooms and multiple pictures of each room (okay, not my most creative moment, but it gets the job done).

So in this solution I have my data model which is pretty simple.  I have some locations, at each location there are a number of hotels, each hotel has rooms and each room has photos.

Solution data model: the Location, Hotel, Room and Photo entities
Of all of these the important one is Photo. This entity has some basic properties, Title and Description, which describe the photo; then there are the navigation properties for getting back to the room; and lastly there's the Data property, which is intended to hold the contents of the file. Normally Entity Framework would see this property and its type (a byte array) and map it to an appropriately named column of type VARBINARY(MAX). Whilst we could still let it do this, it would somewhat defeat the purpose of the exercise, as we'd be storing the contents of the file directly in the database, so we need to add some configuration to tell Entity Framework to ignore this property when mapping.
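The entity itself is nothing special; in sketch form:

    public class Photo
    {
        public int Id { get; set; }

        public string Title { get; set; }
        public string Description { get; set; }

        // Navigation back to the owning room.
        public int RoomId { get; set; }
        public virtual Room Room { get; set; }

        // The file contents; this must NOT end up as a VARBINARY(MAX) column.
        public byte[] Data { get; set; }
    }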

Photo entity configuration
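A minimal sketch of the configuration:

    public class PhotoConfiguration : EntityTypeConfiguration<Photo>
    {
        public PhotoConfiguration()
        {
            // Keep Data out of the generated table; the FILESTREAM column
            // will be added by a migration instead.
            Ignore(p => p.Data);
        }
    }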
I'm using the Fluent API here, but you should be able to do this using Data Annotations as well.

At this point if we were to deploy the database we would get a table with no data column and a blank property in our entity. What we need to do next, before any of this is useful, is somehow get a FILESTREAM column into the Photo table. The solution is to use Entity Framework migrations, the basics of which I'll not cover here and will leave as an exercise for the reader.

Migrations provide us with a migration class for each migration added, to uplift and roll back the changes to the database. The useful method for us in this class is Sql, which allows us to execute SQL commands; using this we can add our ROWGUID column and our FILESTREAM column with all the constraints we need (and, of course, the appropriate commands to remove it all again for the Down method).

Migrations SQL commands
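Reconstructed, the migration looks something along these lines (table and constraint names illustrative):

    public partial class AddPhotoFileStream : DbMigration
    {
        public override void Up()
        {
            // FILESTREAM requires a ROWGUIDCOL column with a unique constraint.
            Sql("ALTER TABLE dbo.Photos ADD RowId UNIQUEIDENTIFIER NOT NULL ROWGUIDCOL " +
                "CONSTRAINT DF_Photos_RowId DEFAULT NEWID() " +
                "CONSTRAINT UQ_Photos_RowId UNIQUE");
            Sql("ALTER TABLE dbo.Photos ADD Data VARBINARY(MAX) FILESTREAM NULL");
        }

        public override void Down()
        {
            Sql("ALTER TABLE dbo.Photos DROP COLUMN Data");
            Sql("ALTER TABLE dbo.Photos DROP CONSTRAINT UQ_Photos_RowId");
            Sql("ALTER TABLE dbo.Photos DROP CONSTRAINT DF_Photos_RowId");
            Sql("ALTER TABLE dbo.Photos DROP COLUMN RowId");
        }
    }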
Now if we run the Update-Database command from the Package Manager Console we get a table with all the right columns of the right types for being able to use FILESTREAM.

So that's half the battle won; the next challenge is being able to write to and read from the table.

Storing and retrieving file data

So how do we query data in a FILESTREAM column? Well, this is the bit where we fall back to the System.Data.SqlTypes namespace, specifically the SqlFileStream class. We use this class to read the contents of the file back from the server as a stream, but it only works in the context of a SQL transaction.

So the first thing we need to do is get the file path and the SQL transaction context, which we can then pass to the SqlFileStream constructor to get our stream; after that it's just a case of reading from the byte array in our entity and writing to the SqlFileStream stream. To get this information we need to run a custom SQL statement. We could do this using a SqlCommand object, but I still want to stick to Entity Framework a bit more, and fortunately there's the DbContext.Database.SqlQuery<TElement> method, which we can use to run raw SQL statements. It also handles parameters, so we can parametrize the query (great for guarding against SQL injection attacks), and it returns an enumerable collection mapped to TElement (which does not have to be a part of our data model).
Raw Data Query
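The query uses the PathName() function on the FILESTREAM column together with GET_FILESTREAM_TRANSACTION_CONTEXT(); in sketch form:

    var row = context.Database.SqlQuery<FileStreamRowData>(
        "SELECT Data.PathName() AS Path, " +
        "GET_FILESTREAM_TRANSACTION_CONTEXT() AS TransactionContext " +
        "FROM dbo.Photos WHERE Id = @id",
        new SqlParameter("@id", photoId)).First();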
The FileStreamRowData class here is a custom class with a string property for the path, and a byte array for the transaction context.

Running all of this inside a transaction scope will get the information required (the call to "First" will enumerate the collection) to pass to the SqlFileStream constructor, and we can then use the stream to write our data.
Writing to the FILESTREAM
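Putting it together (GetFileStreamRow here is a hypothetical helper wrapping the query above; error handling omitted):

    using (var scope = new TransactionScope())
    {
        // The query above must run inside the transaction, otherwise the
        // transaction context comes back as NULL.
        FileStreamRowData row = GetFileStreamRow(context, photo.Id);

        using (var stream = new SqlFileStream(row.Path, row.TransactionContext, FileAccess.Write))
        {
            stream.Write(photo.Data, 0, photo.Data.Length);
        }

        scope.Complete();
    }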
The same applies when reading from the database, just with the source and destination reversed. Also, when writing, you need to save the entity first so that there's a row to query against. Wrapping up the Entity Framework bit in the same transaction scope means that even if you call "SaveChanges" on your context, the changes are still rolled back if the transaction does not complete successfully.

So does it work?

Well, yes it does, and it works pretty nicely as well. It's maybe not the final solution that I'll use, as I'm still investigating a couple of other options, but it's certainly not a solution that I would be upset at using, and by hiding the complexity in the data services the client need never know how the file information is being held in the database, or using which technologies.

How do I play with it?

You could probably work out most of what you need from this post, but for convenience's sake I've also put the whole solution up on GitHub, so feel free to head over and take a look. If you come up with any suggestions or improvements then feel free to contribute.

Tuesday, 30 July 2013

Playing with CoffeeScript

I've been playing around with CoffeeScript lately and, as I have a tendency to do, I decided to crack open a prime number challenge and see what the solution looked like.

The Challenge

I recently set this up as a challenge at work as a bit of fun. It goes as follows:

"Calculate the first 10,000 prime numbers, output the largest prime number and the sum of all palindromic primes"

I've implemented the solution a number of times using C#, C++, Python, Go and JavaScript, and it is the latter that I wanted to compare the CoffeeScript solution to. The JavaScript solution I created typically ran in about 15ms, after a fair amount of tweaking and profiling to squeeze as much out of it as I could (within my own abilities).
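Not the gist itself, but my C# take on the challenge boils down to something like this:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class PrimeChallenge
    {
        static void Main()
        {
            var primes = new List<long> { 2 };
            for (long candidate = 3; primes.Count < 10000; candidate += 2)
            {
                // Trial division by known primes up to the square root.
                var limit = (long)Math.Sqrt(candidate);
                if (primes.TakeWhile(p => p <= limit).All(p => candidate % p != 0))
                    primes.Add(candidate);
            }

            Console.WriteLine("Largest prime: {0}", primes.Last());
            Console.WriteLine("Palindromic sum: {0}", primes.Where(IsPalindrome).Sum());
        }

        static bool IsPalindrome(long value)
        {
            var text = value.ToString();
            return text.SequenceEqual(text.Reverse());
        }
    }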

In all I found writing the solution a pleasant experience, with a very simple and expressive syntax which abstracts away some of the ugliness that is JavaScript (not that I particularly dislike JavaScript, it's quite fun actually). List comprehensions are as incredibly useful and powerful as they are in other languages, and writing iterations and checks was very simple.

Anyway, here's my solution.  I'm sure it's not perfect, and there are probably some CoffeeScript tricks that I've not picked up yet.  But the solution is as fast as my JavaScript implementation and much easier to read, so all positives in my book :)

My JavaScript solution along with a couple of others is available on the gist as well.



EDIT: Since posting this I tweaked the solution a little bit after noticing that I was evaluating all of the values in the array of candidates, instead of working only up to the square root of the upper value.

Wednesday, 3 July 2013

SQL Injection with Entity Framework 5 and Static Code Analysis

It's interesting times here at the moment, having started a new, big project which is using MVC 4 Web API, Entity Framework 5 and some other cool stuff. There is also a big push for secure coding at the moment, so I'm evaluating various tools to help out with code and solution analysis.

Thinking about exploits


So the project is using Entity Framework 5, Code First.  It's a really nice technology even if it does have some limitations and quirks, but then that's the trade-off you make when you choose not to do everything manually.

Whilst using it, and looking at static code analysis tools, I got to wondering how easy it would be to create a SQL injection exploit and, more importantly, how well the code analysis tools would pick it up. The reason this kind of attack came to mind was that I was reading the updated 2013 OWASP Top 10 at the time, and injection is pretty close to the top (actually, I think it might be number 1).

Typically when using Entity Framework you would use LINQ to query your data source. This helps in that the framework will create parametrized queries, a great line of defence (maybe; more on this later). But what if you want to do something funky? Well, EF5 has you covered and allows you to run any SQL you want.

So I created a basic Web API project to retrieve data from a books database (imaginative!). It included a couple of entity POCOs for the books and authors with some referential integrity, a DbContext class and, of course, the API controller. The method I implemented simply takes a string value and searches for all books with that text in the title. What could possibly go wrong? On to the code.
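Reconstructed from memory, the controller method looked roughly like this (names illustrative):

    public IEnumerable<Book> GetByTitle(string title)
    {
        using (var context = new BooksContext())
        {
            // DANGER: user input concatenated straight into the SQL text.
            return context.Database
                          .SqlQuery<Book>("SELECT * FROM Books WHERE Title LIKE '%" + title + "%'")
                          .ToList();
        }
    }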


Exploiting the code


So if I run the code from Visual Studio 2012 and point my web browser to the resource method at "/api/books/?title=shadow" I get the single book in my limited database back.

Great, working so far. So what else can we provide as an API request? Well, how about the following:

/api/books/?title=' drop database test --'

Well, what does the response say?


That looks fine, doesn't it? Well yes, until I look at my database server and realise that my "test" database has suddenly vanished!

Looking at the exploit


As you can see, it's a fairly obvious exploit and one that should be picked up quite easily during a code review, although frankly it should never get to that point anyway, and I would hope that most developers wouldn't think this acceptable. But with deadlines looming and managers breathing down your neck, even the best of us can make silly mistakes, so what can we do?

Well, one option is static code analysis. This is where a tool looks at your code and determines whether you are breaking any rules. The most obvious choice here is the tool which comes with Visual Studio (in 2012 this is now available in the Professional edition), also known as FxCop. It comes with a built-in rule set called "Microsoft Security Rules", and running it is as simple as going to the Code Analysis window, clicking the settings icon, setting the rule set and then running it.

Unfortunately it doesn't detect this exploit, so having code analysis running on your build environment and reporting issues isn't going to save you here. I also tried the version of Code Analysis which comes with the Visual Studio 2013 Preview; still no luck.

Next up was CAT.NET, a tool with a very limited set of rules which hasn't seen much attention from Microsoft since 2010, but it's still available so I tried it. As expected, it didn't find anything!

Last up was a commercial product which I'm not going to name here, but which is in the NIST list. It touts a large list of security rules and is very impressive in demonstrations, but here it failed to detect the issue. It did detect that I should be using column names instead of "SELECT *", but that's not going to save the day really.

So let's make it parametrized


So will making it a parametrized query help? Well, let's see what it looks like in code (thanks to Troy Hunt for highlighting this in the first place).
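The C# side now passes the value as a proper parameter to a stored procedure; something like:

    public IEnumerable<Book> GetByTitle(string title)
    {
        using (var context = new BooksContext())
        {
            // Looks safe: the user input travels as a parameter, not as SQL.
            return context.Database
                          .SqlQuery<Book>("EXEC SearchBooks @title",
                                          new SqlParameter("@title", title))
                          .ToList();
        }
    }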


This uses a stored procedure (even safer, right?), the code for which looks something like this.
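In sketch form, with illustrative names:

    CREATE PROCEDURE SearchBooks
        @title NVARCHAR(100)
    AS
    BEGIN
        -- The parameter is concatenated into a new SQL string and executed,
        -- which re-introduces the injection vulnerability inside the database.
        DECLARE @sql NVARCHAR(MAX);
        SET @sql = 'SELECT * FROM Books WHERE Title LIKE ''%' + @title + '%''';
        EXEC (@sql);
    END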


So is this better?  Well, no.

The problem here is that I've shifted the injection vulnerability into the database. In a large development team, where some people look after the database and others write the C# code, without proper oversight of the full product you might well find that you run into this problem.


So what do we do


Well, the only way to prevent this kind of simple mistake is education. Make sure that developers don't think that using an ORM like Entity Framework protects them completely from exploits, and make sure that they're aware of coding defensively.

For now we're going to have to keep a closer eye on the code, but keep on pushing the vendors to support the latest technologies so that some of these checks can be automated, then we can turn our attention to less obvious problems.

Update

Thanks to Rowan Miller in the comments below for suggesting that I update the post to show how to use the above pattern safely with Entity Framework. When I wrote this post I wanted to highlight how easily you could misuse a good framework, and how the out-of-the-box tools failed to detect it. The trouble is that it's easy to forget to show how to do something positively when you're looking at a negative.

So, with that in mind, here is an example using the above scenario, but this time using Database.SqlQuery<T> with a parametrized query and a SqlParameter (although you can pass the value in directly in an object array if you don't want to create a parameter).
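A sketch of the safe version:

    public IEnumerable<Book> GetByTitle(string title)
    {
        using (var context = new BooksContext())
        {
            // The parameter value travels separately from the SQL text,
            // so it can't change the shape of the statement.
            return context.Database
                          .SqlQuery<Book>("SELECT * FROM Books WHERE Title LIKE '%' + @title + '%'",
                                          new SqlParameter("@title", title))
                          .ToList();
        }
    }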


It's perhaps still not the best way of doing it, but if you have to write code this way then it's better that you do it right and reduce the risk of attack, rather than bypassing all the secure features in place for you and opening up your application to exploits.

Tuesday, 18 December 2012

Reversing a string in C# - is it really that difficult!

I've been interviewing people for a C# developer position recently, which has taken the usual "technical questions, chat, random questions, any questions for us" kind of format that I'm sure most people who've been looking for a job recently will be familiar with.

One of the technical questions we've been asking recently is "Can you describe how you would reverse a string in C#". Note, we're not asking anyone to code a full answer, just describe how they would do it.

This is one of those pretty standard interview questions, much like the FizzBuzz question, which is presented to see if candidates can demonstrate a basic understanding of a programming language and show that they can understand simple requirements. And yet, it seems to throw so many candidates!

The requirements are simple.

1. Take a string such as "ignoring the voices"
2. Reverse the characters of the string
3. Output the result (i.e. "seciov eht gnirongi")

As with most things there are a number of ways to achieve this goal, some more efficient than others, some geekier than others, so I wanted to demonstrate a couple of methods here to show that it is possible. To date I've only received a single correct answer; the other answers have been either non-existent ("I don't know how") or far from the mark.

Method 1 - The agnostic approach

This method is the agnostic approach, in that it shows an algorithm rather than a platform-specific solution. This is a good answer in an interview as it shows that you understand the problem and can devise algorithms to solve it, although it's worth saying that there may be a solution more relevant to the platform. The algorithm is simple: work your way in towards the middle of the string (character array) from the start and end, swapping the characters as you go. The important thing is to stop when you get half way through the string, otherwise you'll reverse the same characters again and arrive back at the original string. The example is presented as an extension method.
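In sketch form (my version, not a candidate's):

    public static class StringExtensions
    {
        public static string ReverseCharacters(this string input)
        {
            var characters = input.ToCharArray();

            // Swap pairs working inwards, stopping at the middle so we
            // don't swap everything back again.
            for (int start = 0, end = characters.Length - 1; start < end; start++, end--)
            {
                var temp = characters[start];
                characters[start] = characters[end];
                characters[end] = temp;
            }

            return new string(characters);
        }
    }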


Method 2 - The C# (.NET 4) approach

The .NET framework provides a large number of methods on all of the standard types, and these can help solve the problem in a way which is specific to C# and the .NET platform. The important thing to remember is that a string is essentially an array of characters, so methods which can be applied to arrays (or any sequence of characters) can be applied to strings too; conveniently, one of them is the Reverse method. That's not to say that you can just reverse a string directly, as Reverse returns an IEnumerable object, but the IEnumerable can be turned back into an array, and the string constructor can take a character array to initialize a string object with, so we get the following method.
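Roughly:

    using System.Linq;

    public static string ReverseCharactersLinq(this string input)
    {
        // Reverse() yields an IEnumerable<char>; turn it back into an
        // array and feed it to the string constructor.
        return new string(input.Reverse().ToArray());
    }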


This is a vastly simpler implementation than the first, but is it any better? Well, it shows an understanding of the language, which is a positive, but not of algorithm implementation, so it's neither better nor worse than the first method. The best answer to give would be the first followed by the second, to show an understanding of algorithms and of the required platform.

Of course, from these you can devise variations on the problem, such as reversing the words in a string whilst keeping the characters of each word in the correct order.
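Something like:

    public static string ReverseWordOrder(this string input)
    {
        // "ignoring the voices" -> "voices the ignoring"
        return string.Join(" ", input.Split(' ').Reverse());
    }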


Or even reversing the characters of each word, but keeping the words themselves in the correct order.
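Which might look like:

    public static string ReverseEachWord(this string input)
    {
        // "ignoring the voices" -> "gnirongi eht seciov"
        return string.Join(" ", input.Split(' ')
                                     .Select(word => new string(word.Reverse().ToArray())));
    }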


So why do some programmers struggle with this kind of question? Is this indicative of programmers as a whole, or is it just a lull? I can only guess at the reason, but nonetheless it feels concerning at the moment.

Wednesday, 24 October 2012

Playing around with Go, prime numbers and ancient Greece

I've been playing around with Google Go (golang if you want to Google it) a lot lately, as I've been off sick and needed something to stop me from going nuts. It's been a while since I played with the language properly, so I started off by reading through a new book by Caleb Doxsey entitled An Introduction to Programming in Go, which you can read for free on-line (or download as a PDF) or purchase from Amazon. I read it for free on-line but purchased a Kindle copy from Amazon afterwards, as it's a really good book and really well priced.

So after going through the book I needed something to practice on, and not wanting to throw myself into an active project I decided to head over to Project Euler. It's a great resource if you want to practice a language without resorting to made-up scenarios or over-elaborate "Hello World" applications, with a number of problems (399 at the time of writing) which are mathematical/programming in nature. If you're not too good at maths, or it's been a while since you were in full-time education, you may need to Google a bit to get an idea of a problem, but they're not beyond the average person.

The Problem

The one I want to look at here is problem 7, which (at the time of writing) asks you to find the 10,001st prime number. There are two obvious ways to solve this. The first is a prime number generator: a function which keeps incrementing a counter, checks whether the next number is prime, and returns it from the function if it is. The second method is to use a sieve; this takes a sequence of numbers and "removes" all of the ones which are not prime. It does this by:

1. Create a sequence of numbers, marking each as a "prime candidate"
2. Starting from 2, check to see if the entry is marked as a prime candidate
  a. If the entry is a prime candidate then mark each multiple in the sequence as non-prime
3. Return the numbers still marked as prime candidates as the final set of prime numbers

This is an over-simplified explanation of the process, for a more detailed explanation see the Wikipedia entry on the Sieve of Eratosthenes.

The issue with creating a sieve is that you must specify an upper limit, that is to say "find all prime numbers less than N", whereas the problem poses the question as "find the first N prime numbers". This means that in order to use a sieve we have to know roughly where the Nth prime will be, or use a limit high enough to generate at least N primes. Because of this it would seem that using a generator is the better option. One thing to keep in mind is that, although there is no time limit on the problems, generally speaking if a solution takes longer than a minute then it may not be the best solution.

The Solution

Before I start

One thing I should probably mention here is that I'm only checking whether a number is divisible by prime numbers, instead of checking all possible divisors. The theory here is prime factorization. A prime number is a number which is only divisible by 1 and itself; other positive integers (other than 1, which is special) are composite numbers, meaning that they are divisible by integers other than 1 and themselves. Consider the number 20, which can be decomposed as follows:

20 / 2 = 10
10 / 2 = 5
5 / 5 = 1

So the number 20 can be represented as 2 x 2 x 5 (or 2^2 x 5).

Because a number which is not prime must be evenly divisible by at least one prime number less than itself, this reduces the number of checks needed to see whether a number is prime.

Solution 1

Because I wanted to play with Go I took the opportunity to create a prime number generator using goroutines (think light-weight threads). This was done by creating a number "emitter" at one end and a "listener" at the other; when a number leaves the emitter it passes through a series of checkers, each of which checks whether the number is evenly divisible by a given prime. So the first checker tests for divisibility by 2; if the number passes, it moves on to the next checker, which tests for 3, and so on. The listener watches the output from the last checker: if a number appears here then it is not divisible by any of the currently known primes, and so must be a prime number itself. In that event the listener creates a new checker for the new prime, hooked up to the output of the previous last checker, and listens to that instead; it also puts the new prime number into a channel and waits for it to be read by the client.

Emitter > Check For 2 > Check For 3 > Check For 5 > Listener

In implementation I thought this a rather elegant solution; for production use, however, it turns out to be quite a time-consuming exercise. It allowed me to play with channels and goroutines, though, and so was incredibly useful. Next up I decided to implement a similar generator again, but without goroutines and channels.


Solution 2

This solution is simpler, as it just builds up a list of prime numbers and runs in a single goroutine (the main one). The theory is almost identical though: keep checking numbers to see if they are evenly divisible by the current list of known primes, and if not then it is a prime number, so add it to the list. The implementation this time takes a number which is the Nth prime to find, meaning that the function will keep running until it has found that many prime numbers, and will return the last one found.

This was straightforward to implement and ran so much faster than the first solution (something along the lines of 38 seconds for solution 1 down to 6 seconds for solution 2). But I still felt that it was too slow.

Solution 3 - The Prime Sieve

So I investigated the sieve solution. I gave a brief outline of what is required above, but there are a couple of optimizations to be made. The first is that for a given sequence of numbers you only need to sieve with primes up to the square root of the limit; numbers after that point are either multiples of the primes discovered so far, or prime numbers themselves. Say we wanted to find all of the prime numbers up to 16, then we would limit our search to 4 (the square root of 16):


 1   N/A
 2   Prime Number
 3   Prime Number
 4   Divisible by 2
 --- search limit ---
 5   Prime Number
 6   Divisible by 2
 7   Prime Number
 8   Divisible by 2
 9   Divisible by 3
10   Divisible by 2
11   Prime Number
12   Divisible by 2
13   Prime Number
14   Divisible by 2
15   Divisible by 3
16   Divisible by 2

The second optimization is fairly obvious from the table above: every even number, with the exception of 2, is non-prime. This means that we can initially mark every even candidate as non-prime (except 2), then, starting from 3, check only the odd numbers (e.g. 3, 5, 7, 9...), which halves the number of candidates that need checking.

In order to get around the "known limit" problem I implemented the solution to sieve out all primes up to 1,000,000 (a suitably large number, I thought). There are ways in which you can determine the upper limit, but the maths is a bit beyond me right now; this is the Prime Number Theorem, feel free to take a look. If this were implemented then we would be checking fewer potential candidates, meaning less work again.

(Edit: thanks to compoasso in the comments, I implemented a function to estimate the upper limit, so it is no longer sieving for primes all the way up to 1 million.)
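The real sources are in Go on GitHub (linked below); purely as an illustration of the approach, a sieve with both optimizations and an estimated upper bound looks something like this in C#:

    using System;

    class PrimeSieve
    {
        // Prime Number Theorem bound: the nth prime is below
        // n * (ln n + ln ln n) for n >= 6.
        static int UpperBound(int n)
        {
            double nd = n;
            return (int)(nd * (Math.Log(nd) + Math.Log(Math.Log(nd))));
        }

        static int NthPrime(int n)
        {
            int limit = UpperBound(n);
            var composite = new bool[limit + 1];
            int count = 1; // 2 is the first prime
            int last = 2;

            // Even numbers (other than 2) are never prime, so only odd
            // candidates are visited at all.
            for (int candidate = 3; candidate <= limit && count < n; candidate += 2)
            {
                if (composite[candidate]) continue;

                count++;
                last = candidate;

                // Mark the odd multiples, starting from candidate squared;
                // smaller multiples were already marked by smaller primes.
                for (long multiple = (long)candidate * candidate; multiple <= limit; multiple += 2L * candidate)
                {
                    composite[multiple] = true;
                }
            }

            return last;
        }

        static void Main()
        {
            Console.WriteLine(NthPrime(10001)); // problem 7: the 10,001st prime
        }
    }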

I didn't hold out much hope for this solution, as I am sieving for primes in a large data set to find one specific number, so I figured that I wasn't going to do much better than the ~6 seconds from solution 2. So the result surprised me a little: after putting together the solution and running it with the "time" command on my HP Mini 210 (Intel Atom N450, dual core 1.6GHz) running Ubuntu Linux, I got the correct result in 98 milliseconds!


So, what now?

Well, if you want to check out the full source files you can find them on GitHub along with the above snippets.

I absolutely love playing with Go; it is very much like creating a C++ application, but with the convenience of something like Python or Ruby. Project Euler is a great playground for experimenting with new languages as well, as it allows you to think about how to create a solution which is appropriate to the language (instead of making a generic solution work across all languages).

Now I just need to remember to keep on posting :-)