Storing images in the database – good or bad idea?

As many of you know, I write real estate web applications for fun and profit. I have to admit it really is an interesting area for innovative web applications. After all, writing a modern IDX solution requires taking advantage of AJAX technologies, mastering a web based mapping platform (like Microsoft Virtual Earth or Google Maps), integration w/ 3rd party APIs and web sites (such as Zillow or Trulia), pretty dynamic SQL, and handling large amounts of graphical images. Needless to say, it can pose some interesting technical challenges at times.

I'm in the midst of an application architecture upgrade; I'm currently debating how I want store my MLS images. I currently store broker specific images in a database (currently just agent photos and property management photos). My biggest customer (Real Property Associates) has about 70 agent photos and 2000 property images stored in a database, (I currently use ASP.NET HTTP handlers to extract images out of the DB and re-size said image on the fly before I send the response stream to the browser). The NWMLS images are on the file system scattered in one of a thousand subdirectories.

Although this arrangement is more efficient than storing all 474,589 or so NWMLS images in a single directory, it's painful to deal with. Deleting photos for sold or expired listings is like going spelunking instead of doing a DELETE * FROM … WHERE … SQL query. SELECT queries are a lot more flexible that DIR commands. Modifying data only requires one set of DB permissions instead of setting file system permissions on a thousand directories. I don't have to worry about the physical location of my data, since it's just a SELECT query away (regarding of where SQL ends up putting it). Adding meta data is just an ALTER TABLE … ADD COLUMN … SQL command away. In short, even if SQL Server is slower than the file system, the benefits of dealing with it instead are worth the price, so far.

Conventional wisdom is that you don't store images in SQL server. Unfortunately, my good experience with smaller data sets is leading me to believe that conventional wisdom is wrong. Then again, there's a world of difference between 2000 images and a number hovering 25K shy of half a million images. In my case, the images tend to be smaller JPEGs (about 400 x 300 in size) usually hovering around 50K, and I suspect (though I don't know), as long as the size of the images are small-ish, the performance penalty you'll pay appears to be comparatively is minor. My DBA guru friends have told me as long as the image sizes are small, it should be fine. SQL finding a BLOB via a primary key should be pretty quick.

Of course, the only way to find out for sure is to go for it (and see if you regret it later). I suppose, I already know the answer I want to hear. I just want to feel good about it before I go ahead and database-tize my NWMLS image set. Rumor has it that SQL 2008 has a new FileStream datatype, that stores BLOBs in the file system, but yet remain an integral part of the database with transactional consistency, so perhaps I should wait? Ironically, doing nothing (and waiting for Microsoft to fix the problem) is sometimes the smartest design decision you can make.

So does anybody have any good or bad experiences with storing images in their DB / Web applications? How big of a dataset are you storing in the DB? How big of a dataset are you displaying in the web browser window? How do the MLS Vendors deal with this problem?

Print | posted on Tuesday, June 26, 2007 9:17 AM

Feedback

# re: Storing images in the database – good or bad idea?

Gravatar left by Michael Wurzer at 6/26/2007 2:28 PM
Robbie, we don't store our images in a database. We're currently managing about 25 million images and that count is growing very rapidly. That's not to say that storing them in the database is a bad idea, we just don't do it. In fact, I was just speaking with a systems engineer last week from one of the largest MLSs in the country and they are considering moving their imagery set into the database to make backups and redundancy easier. We use NetApp boxes, which cost lots of money, but they are very, very good at handling lots and lots of files in a redundant and easy to backup fashion. So, I really think you're right, it's just one of those things you have to try to see if it works in your environment.

# re: Storing images in the database – good or bad idea?

Gravatar left by Robbie at 6/26/2007 5:18 PM
In my experience, the flexibility gained has more been more important than performance lost. Granted, I'm blessed with a small number of customers, smaller datasets, enterprise class tools (my server is just a mini-tower machine w/ desktop hardware & 4 GB RAM) but so far the approach has worked great for me.

I'll probably go for it at some point. Apparently,the Microsoft TerraServer team did it this way, so apparently it's not a scaling nightmare.

I assume that the 25 million images you manage are usually small-ish JPEGS (100K or less each)? You don't usually have multi-megabyte images do you?

BTW, how to do deal with thumbnails? Do you store 2 sets of images, or create thumbnails on the fly from the full size image? What are your biggest image related headaches? Are you or the MLS, responsible for managing the systems IDX vendors use for data / image download? The NWMLS has had image scalability issues that recently got resolved.

# re: Storing images in the database – good or bad idea?

Gravatar left by Michael Wurzer at 6/26/2007 6:07 PM
We store three copies of every image, a high-res version (~1 MB max), a report size (about 300x200), and a thumbnail. We have about 6TB of image storage available. We host the system for all but two of our customers, and so we're responsible for IDX and other data and image distribution needs. The MLS, of course, deals with approvals and all that, but, once that's covered, we take care of the delivery.

# re: Storing images in the database – good or bad idea?

Gravatar left by Galen at 7/22/2007 7:13 PM

Robbie, there is so little advantage to storing images in the database that I just don't see the point. Your tables quickly get bigger, meaning they don't fit in memory and there is more thrashing on your hard drive to execute queries.

Think programatically, not databasically (no, not a word). When you want to destroy a record, create a function that looks up and destroys its images as well. You should have (non-db) functions that track photo count and the like anyways (so you know how many to download), right? So what's an extra 3 lines of code that look up and delete each photo.

Now I understand why you (you!) might do it - you're a database guy and you like trying out new features. But the only other people who will tell you it's a good idea are the people selling you the databases and the upgrades and the people who did it on their own to show off the power of their database (Microsoft Terraserver) because the licenses were free. Google and (I would wager) Microsoft Live Maps almost definitely store their images in a glorified file system of sorts; file systems are made for finding and delivering files, databases are made for a host of other things, most of which are lost on images.

# re: Storing images in the database – good or bad idea?

Gravatar left by Robbie at 7/24/2007 3:58 PM
Galen,

I'm assuming that your search uses a DB and your listings data isn't in HTML/XML files scattered on your hard drive? So I'll assume we agree that DBs are the best tool for searching/manipulating small pieces of textual/numeric data.

If your already using a DB, you've already "paid" for it, and there's no added financial cost to storing the images in the DB. Sure, newer DB's are better/faster. However, most commercial database vendors charge per CPU or per user, not per MB stored so adding image data to the DB will not increase your licensing fees. (Although if Oracle did charge more per MB it wouldn't surprise me) And since you're probably using Postgres or MySQL, cost is even less of an issue for you.

If your already using a DB, your already taking up CPU, memory and disk resources for it. Why not use it to it's full advantage? What's the difference between the file system thrashing and database system thrashing? Disk thrashing is disk thrashing right? Storing images in either location requires CPU cycles, memory, disk space and disk seeking. Furthermore, databases are designed for fast seeking, so a well indexed DB should beat the crap out the filesystem in terms of data finding.

DB's win on data management (it's easier to back up one database, than one database + 500,000 images). DB's win on security (one set of credentials vs potentially different directory/file security settings for each file). DB's win on meta-data searching (SELECT ... FROM ... WHERE ... beats ls or dir any day of the week). DB's win on flexibility (i don't need to worry about where the data is physically stored, I just need a way to get at it). DB's have advantages, but perhaps not enough for you to consider using it for image serving.

The problem is that databases aren't usually known for fast BLOB (aka Binary Large OBjects) retrieval (which is what file systems usually do better) and getting BLOB data out of a database to a web browser is different (and potentially more work) than getting BLOB data out of a file system to a web browser.

However, since most DB's get data in chunks (MS SQL uses 8K pages - dunno about the others) and cache it, the DBMS is probably getting more data off the disk than what you are asking for. Actually, file systems act much the same way, they just don't manage meta-data as effectively as a DBMS can. So in either case, the respective storage engine has overhead involved. It's merely a question of how much overhead they impose.

I've heard that SQL 2005 outscales the NTFS file system. Granted, I don't have any hard proof, other than a good experience w/ images in the DB on a smaller scale and a mediocre experience w/ images on the file system on a large scale. The SQL experts I've talked with have basically said "it's probably a little slower, but as long as you properly index the tables, and the images are small-ish, you'll probably be fine". Right now, it's pure conjecture, but I'm thinking about it.

As we've seen with the NWMLS, they put all the images into a1000 different directories and they set up the FTP server to prohibit ftp ls / dir commands to avoid file system scaling issues. Perhaps in the Linux world, the file system scales better and the OSS databases don't scale as well as their commercial counterparts?

If you have a modern DBMS, the idea isn't as crazy as it sounds.

# re: Storing images in the database – good or bad idea?

Gravatar left by Rob Wack at 10/15/2007 10:15 AM
Robbie,

Those SQL experts you've talked with gave you good advice...but you might want to read between the lines. What they mean is this: it IS a little slower, but as long as you properly index the tables, and the images are small, AND YOU DON'T HAVE TOO MANY OF THEM, you MIGHT be fine.

Modern filesystems, in general, are almost always faster than the best DBs for moderate to large chunks of binary data. A full set of images (say, 500,000) for all the listings from a large MLS could conceivably generate a 50-plus GB table full of jpeg BLOBs, and there hasn't yet been a DB written which handles large numbers of BLOBs as quickly as any modern filesystem. Why? Primarily because the filesystem search for a few files in a directory of moderate size (1000 files or less) carries with it less disk/processor overhead than the DB search through a table of 10,000 or more rows for a few BLOBs. If you need to collect many smaller pieces of data from several rows of a table or three, the database beats the filesystem every time. Otherwise, the filesystem wins.

That would be okay, given the easier coding task, as long as the DB didn't hiccup; but problems begin to arise when you delete and add large numbers of images - problems which don't happen with a filesystem-based image solution. That image table will need to be optimized each time you delete a large number of images, in order to avoid a noticeable reduction in performance; and that optimization carries with it a considerable expense in disk thrashing not to mention the somewhat increased probability of table corruption, which WILL eventually happen with a very large table of BLOBs which is frequently updated.

You would probably find that everything worked reasonably well, if slightly slowly, for a while; then, suddenly and unexpectedly, you'd have to do a table restore to fix a corruption issue...then rinse and repeat periodically at unpredictable times.

The DB solution does make the coding a bit simpler and the speed issue isn't a deal breaker if you don't have frequent updates or if you're happy with optimization of such a large table after each one. The nightmare, though, is the eventual table corruption, unless you've got an employee whose job it is to monitor that and deal with it. It WILL happen, just as surely as you've probably never seen it happen on smaller tables or even large tables without BLOBs.

A hint to negate much of what I just wrote:

You could improve your table corruption odds (though the BLOBs themselves will still, very occasionally, be a corruption causing issue) and increase performance appreciably by breaking up your image BLOBs into multiple tables, much as is done by limiting a file directory to 1000 files, for example. The DB will handle large numbers of tables, each with a smaller number of BLOBs, much better than one big table with a large number of BLOBs - and it will do it more quickly.

# re: Storing images in the database – good or bad idea?

Gravatar left by Matt Lavallee at 1/5/2008 4:17 PM
Hi Robbie,

Just chiming in here... we were all-but-decided on storing our images in the database for a variety of reasons, not the least of which is bypassing the CIFS calls involved in allocating each individual file from multiple head units, checking perms, and so on.

However, I (fortunately) wound up finding a few analyses online of this exact issue (some even from MS themselves). The reality is that the curve favors SQL heavily in the <1MB range, where it's twice as fast as FS access; but, the curves diverge very quickly when you hit 1MB and beyond, where NTFS beat SQL by 10%, then 20%, all the way up to 50% with 10MB and above.

So, our decision was to store image metadata in the database, five known sizes on-disk, then also have a call for cached dynamic sizing (for some specific vendors). We'll be using iSCSI MPIO targets to bypass as much of the SMB traffic as we can, and moving to 10GBe for the media network.

One curveball to consider at your next revision, as we will be, is that SQL 2008 supports the FileStream datatype, allowing the file to actually reside on the filesystem, bypassing the permission-based RPC/SMB calls and resolving the fragmentation and table-size issues in one fell swoop.

Next time!

-Matt

# re: Storing images in the database – good or bad idea?

Gravatar left by Robbie at 1/6/2008 11:44 AM
Wow, great comment, Matt! Thanks.

I had to read it twice to actually understand half of it (I don't have any real experience with NAS systems and I'm also part Polish :). If you can post links to the things you've read, I'd be very appreciative. I think you've pushed me over the edge.

Since, I'm still small scale when wearing my RE.net hat, I think I'll end up storing everything in the DB. Since all the images in the DB are going to be (or at least should be) under 1 MB. (It's for web sites, not print), So I think that'll work well for me. The question is to VARBINARY(max) now, or wait for 2008 FileStreams.

However, when I wear my e-commerce hat, it's a tougher fight since I have convince the other engineers on my team, and we probably 10 million images we need to deal with (instead of the 475K images I deal with in real estate land).

My question with new SQL 2008 FileStream, is can you change the files outside the DB, or can you use SQL INSERT/UPDATEs? How do queries w/ transactions work on FileStream. Any way, stay in touch cause I'd love find out how it all works for you.

# re: Storing images in the database – good or bad idea?

Gravatar left by Matt Lavallee at 1/12/2008 7:17 PM
Alright, first up, here's a succinct review by some smart folks (academic whitepaper and all):
http://www.scribd.com/doc/264408/Fragmentation-in-Large-Object-Repositories-2007-wisc-conference

Second, here's the FileStream digs from the Softies:
http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

To paraphrase: SQL will use a special area of the filesystem as a filegroup for BLOBs, leveraging NTFS' ability to sequentially stream IO, maintain journaling, fragmentation, etc. Those files should not be touched outside the database, but otherwise the database stays mean'n'lean while NTFS remains oblivious of the permission checks. Win-win.

Finally, iSCSI lets you connect drives with SCSI-like interfacing (LUNs, etc.) over ethernet, and the cost of 10Gbit ethernet is dropping by ~50% annually. Compare it to 4Gb Fibre that costs about 3x more than iSCSI and you can see the writing on the wall. Add to this MPIO (Multi-Path Input/Output) that allows you connect the same volume with multiple targets and you get serious fault tolerance at now fractions of big hardware costs. Love it.

-Matt
Title  
Name
Email
(never displayed)
Url
Comments   
Please add 1 and 8 and type the answer here: