Thursday, May 08, 2008

Infobright Puts a Clever Twist on the Columnar Database

It took me some time to form a clear picture of analytical database vendor Infobright, despite an excellent white paper that seems to have since vanished from their Web site. [Note: Per Susan Davis' comment below, they have since reposted it; her comment gives the link.] Infobright’s product, named BrightHouse, confused me because it is a SQL-compatible columnar database, which makes it sound similar to systems like Vertica and ParAccel (see my earlier ParAccel entry).

But it turns out there is a critical difference: while those other products rely on massively parallel (MPP) hardware for scalability and performance, BrightHouse runs on conventional (SMP) servers. The system gains its performance edge by breaking each database column into chunks of 65K values called “data packs”, and reading relatively few packs to resolve most queries.

The trick is that BrightHouse stores descriptive information about each data pack and can often use this information to avoid loading the pack itself. For example, the descriptive information holds the minimum and maximum values within the pack, plus summary data such as totals. A query involving a certain value range can therefore often determine that all or none of the records within a pack qualify. If all values are out of range, the pack can be ignored; if all values are in range, the summary data may suffice. Only when some but not all of the records within a pack are relevant must the pack itself be loaded from disk and decompressed. According to CEO Miriam Tuerk, this approach can reduce data transfers by up to 90%.

The data is also highly compressed when loaded into the packs, by ratios as high as 50:1, although 10:1 is average. This reduces hardware costs and yields even faster disk reads. By contrast, data in MPP columnar systems often takes up as much storage space as the source files, or more.
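To make the pack-skipping idea concrete, here is a minimal Python sketch of a range query over data packs. It is an illustration only, not Infobright's implementation: the names `DataPack`, `build_packs`, and `sum_in_range` are made up for this example, the pack size of 65,536 is my reading of "65K", and a plain Python list stands in for the compressed pack on disk.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 65_536  # assumption: "65K" taken to mean 65,536 values per pack


@dataclass
class DataPack:
    """Hypothetical pack: raw values plus the small summary kept alongside them."""
    values: List[int]  # stands in for the compressed data on disk
    minimum: int
    maximum: int
    total: int


def build_packs(column: List[int], pack_size: int = PACK_SIZE) -> List[DataPack]:
    """Split a column into fixed-size packs, in load order, with no sorting."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), sum(chunk)))
    return packs


def sum_in_range(packs: List[DataPack], low: int, high: int) -> Tuple[int, int]:
    """Compute SUM(col) WHERE low <= col <= high.

    Returns (total, packs_opened) so the saving is visible.
    """
    total = 0
    packs_opened = 0
    for pack in packs:
        if pack.maximum < low or pack.minimum > high:
            continue  # no value can qualify: ignore the pack entirely
        if low <= pack.minimum and pack.maximum <= high:
            total += pack.total  # every value qualifies: the summary suffices
            continue
        packs_opened += 1  # mixed pack: must be read and scanned
        total += sum(v for v in pack.values if low <= v <= high)
    return total, packs_opened


if __name__ == "__main__":
    packs = build_packs(list(range(1000)), pack_size=100)  # 10 small packs
    print(sum_in_range(packs, 150, 449))
```

In this toy run only the two packs that straddle the range boundaries are opened; the other eight are either skipped or answered from their stored totals.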

This design is substantially more efficient than conventional columnar systems, which read every record in a given column to resolve queries involving that column. The small size of the BrightHouse data packs means that many packs will be totally included or excluded from queries even without their contents being sorted when the data is loaded. This lack of sorting, along with the lack of indexing or data hashing, yields load rates of up to 250 GB per hour. This is impressive for an SMP system, although MPP systems are faster.

You may wonder what happens to BrightHouse when queries require joins across tables. It turns out that even in these cases, the system can use its summary data to exclude many data packs. In addition, the system watches queries as they execute and builds a record of which data packs are related to other data packs. Subsequent queries can use this information to avoid opening data packs unnecessarily. The system thus gains a performance advantage without requiring a single, predefined join path between tables, a requirement that some (though not all) other columnar systems impose. The net result of all this is great flexibility: users can load data from existing source systems without restructuring it, and still get excellent analytical performance.
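The join behavior can be sketched in the same spirit. The Python below is a guess at the general idea, not a description of BrightHouse internals: `JoinTracker` and its methods are hypothetical names, each pack is a plain list of join-key values, and the remembered pack pairs are only valid for as long as the underlying data does not change.

```python
from typing import List, Optional, Set, Tuple

Pack = List[int]  # one data pack of join-key values (assumed representation)


def ranges_overlap(a: Pack, b: Pack) -> bool:
    """True if the min/max ranges of two packs intersect.

    A real engine would read min/max from stored metadata, not recompute them.
    """
    return min(a) <= max(b) and min(b) <= max(a)


class JoinTracker:
    """Hypothetical helper that remembers which pack pairs produced matches."""

    def __init__(self, left: List[Pack], right: List[Pack]) -> None:
        self.left = left
        self.right = right
        self.related: Optional[Set[Tuple[int, int]]] = None  # learned on first run
        self.pairs_opened = 0

    def candidate_pairs(self) -> List[Tuple[int, int]]:
        if self.related is not None:
            # Later queries: open only pairs already known to be related.
            return sorted(self.related)
        # First query: min/max ranges alone rule out many pairs.
        return [
            (i, j)
            for i, a in enumerate(self.left)
            for j, b in enumerate(self.right)
            if ranges_overlap(a, b)
        ]

    def join(self) -> List[int]:
        """Return the key values present on both sides of the join."""
        matches: List[int] = []
        found: Set[Tuple[int, int]] = set()
        self.pairs_opened = 0
        for i, j in self.candidate_pairs():
            self.pairs_opened += 1  # both packs must be read
            common = set(self.left[i]) & set(self.right[j])
            if common:
                found.add((i, j))
                matches.extend(common)
        self.related = found  # record observed relationships for next time
        return sorted(matches)


if __name__ == "__main__":
    tracker = JoinTracker(
        left=[[1, 5, 9], [20, 25, 30]],
        right=[[2, 4, 8], [9, 21, 25], [100, 200]],
    )
    print(tracker.join(), tracker.pairs_opened)  # first run
    print(tracker.join(), tracker.pairs_opened)  # second run uses the record
```

With two packs on one side and three on the other there are six possible pairs. The range check cuts the first run to three, and the learned record cuts the second run to the two pairs that really share values.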

BrightHouse uses the open source MySQL database interface, allowing it to connect with any data source that is accessible to MySQL. According to Tuerk, it is the only version of MySQL that scales beyond 500 GB. Its scalability is still limited, however, to 30 to 50 TB of source data, which would be a handful of terabytes once compressed. The system runs on any Red Hat Linux 5 server; for example, a 1 TB installation runs on a $22,000 Dell. A Windows version is planned for later this year. The software itself costs $30,000 per terabyte of source data (one-time license plus annual maintenance), which puts it toward the low end of the price range for analytical systems.
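For a rough sense of these figures, the arithmetic can be written out as a few lines of Python. This is back-of-the-envelope only: the function names are mine, the 10:1 ratio is the average quoted earlier rather than a guarantee, and I am assuming the $30,000 per terabyte price scales linearly with no volume discount.

```python
AVERAGE_RATIO = 10.0       # average compression quoted by Infobright (10:1)
PRICE_PER_SOURCE_TB = 30_000  # dollars per TB of source data, assumed linear
DELL_SERVER_1TB = 22_000   # dollars, the 1 TB example server from the post


def compressed_tb(source_tb: float, ratio: float = AVERAGE_RATIO) -> float:
    """Estimated on-disk size for a given volume of source data."""
    return source_tb / ratio


def software_cost(source_tb: float, price_per_tb: int = PRICE_PER_SOURCE_TB) -> float:
    """Software cost, assuming the per-terabyte price scales linearly."""
    return source_tb * price_per_tb


if __name__ == "__main__":
    for source_tb in (30, 50):
        print(f"{source_tb} TB source is about {compressed_tb(source_tb):.0f} TB compressed")
    total_1tb = software_cost(1) + DELL_SERVER_1TB
    print(f"1 TB installation, software plus server: ${total_1tb:,.0f}")
```

At the average ratio, the 30 to 50 TB ceiling works out to about 3 to 5 TB on disk, which matches the "handful of terabytes" above.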

Infobright was founded in 2005, although development of the BrightHouse engine began earlier. Several production systems were in place by 2007. The system was officially launched in early 2008 and now has about a dozen production customers.

2 comments:

Susan Davis said...

Hi David,

I'm glad you liked our "excellent" white paper and sorry I took it off our site! I wanted to update it to reflect our current release of Brighthouse, but as it is taking a bit longer than I planned, I have re-posted it until I'm done. Your readers can find it at http://www.infobright.com/products.php

Thanks for the great write-up as well. We certainly believe we have done something unique in the industry - created a data warehouse solution that can be implemented and managed without a boatload of rocket scientists, and doesn't require lots of servers and storage arrays. We let our software do the hard work so IT doesn't have to...

Susan Davis said...

David - by the way, Miriam's last name is spelled Tuerk, not Turek but we appreciate the post anyway!