But enough about the site for now. I will write future posts about the site and the technology I am using, as well as the obstacles I have had to overcome, both to give credit to those who have made what I want to achieve easier and as a way for me to log things I don't want to forget.
The first thing I would like to write about is how to import large data sets into a MySQL database. For cyclebrain I have some database tables that store information letting a user select the country, state, and city they are from as part of their user preferences. This is great, except one of the import files is over 300MB, which caused issues with the interface my hosting provider offers. I tried a few different tactics: running the scripts through MySQL Query Browser, which took far too long and kept dropping my connection to the database; uploading the files and connecting to the database through an SSH client, which also did not work; and splitting the file into smaller files and running those. The last option worked best, but I was still getting timeout errors, and I was tired of tracking which files I had run and which had errors.
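For anyone who wants to try the file-splitting route, here is a minimal sketch of a splitter in Python. It assumes the dump has one complete statement per line (true for simple INSERT dumps like mine); the chunk size and the filename scheme are my own choices, not anything tied to MySQL or my host.

```python
CHUNK_SIZE = 5000  # statements per output file; tune to avoid timeouts


def split_dump(path, chunk_size=CHUNK_SIZE):
    """Split a large SQL dump into numbered smaller files.

    Assumes one complete SQL statement per line, each ending with ';'.
    Returns the number of part files written.
    """
    part, count, out = 0, 0, None
    with open(path, encoding="utf-8") as src:
        for line in src:
            if out is None:
                part += 1
                # e.g. dump.sql -> dump.sql.part001.sql
                out = open(f"{path}.part{part:03d}.sql", "w", encoding="utf-8")
            out.write(line)
            if line.rstrip().endswith(";"):
                count += 1
                if count >= chunk_size:
                    out.close()
                    out, count = None, 0
    if out:
        out.close()
    return part
```

Each part file can then be run separately, which is essentially what I was doing by hand before finding a better tool.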
I happened to be browsing dzone and came across a post where someone was running into the same issue, and in the comments there was a link to BigDump. BigDump is great. It takes a large SQL file and breaks it up into smaller sessions, so issues like timeouts are much less likely. You simply upload a PHP file to your server, put the SQL files in the same web-accessible directory, and bring up the page in your browser. It lists all of the SQL files and gives you the option to run them. It couldn't be simpler. BigDump does have some restrictions on the files and formats it can handle, but for my files, which contain simple SQL inserts, it worked great.

One problem I did have was BigDump stopping on any error in my SQL file. I ran into primary key constraint violations, but after checking my file I could not find any duplicates, so I suspect it has something to do with the way BigDump breaks up the requests: perhaps some queries were duplicated at the end of one request and the beginning of the next. No matter. I just commented out the code that stops the process on this error and ran it again. No problems. This worked for me, since I can live with a few missing rows. I may go back and log these errors so I can verify the imports, but for now this works, and BigDump has saved me some headaches.
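Since I suspected duplicate keys somewhere in the file, a quick script to scan for them would have saved me some eyeballing. Here is a minimal sketch in Python, assuming simple single-row INSERT statements with the primary key as the first value in the VALUES list; the function name and the regex are my own invention, not something BigDump provides.

```python
import re

# Matches "INSERT INTO <table> VALUES (<first value>" and captures the
# first value, which this sketch assumes is the primary key.
INSERT_RE = re.compile(
    r"INSERT\s+INTO\s+\S+\s+VALUES\s*\(\s*([^,)]+)", re.IGNORECASE
)


def duplicate_keys(lines):
    """Return primary key values that appear more than once.

    `lines` is any iterable of SQL statements, e.g. an open dump file.
    """
    seen, dupes = set(), []
    for line in lines:
        m = INSERT_RE.search(line)
        if not m:
            continue  # skip comments, DDL, blank lines
        key = m.group(1).strip()
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes
```

Running something like this over the dump (or over a log of the failed statements) would confirm whether the duplicates are really in the file or are an artifact of how the requests get split up.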