csv – Scalable, fast, text file backed database engine?

csv – Scalable, fast, text file backed database engine?

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a csv file, your .tsv storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers plus a delimiter = 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.

I would consider taking the existing data, and instead of storing raw, processing it in the following two ways:

  1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
  2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then youre probably in good shape. Otherwise, a NoSQL database might be better (see Stefans answer).

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesnt need to be concerned with losing the source data, since you can always re-read it into the database from the archive.

You should be able to reduce your storage space and should not need twice as much storage space, as you state.

Indexing is going to be the hard part; youd better have a good idea of what subset of the data you need to be able to query efficiently.

One of these nosql dbs might work. I highly doubt any are configurable to sit on top of flat, delimited files. You might look at one of the open source projects and write your own database layer.

csv – Scalable, fast, text file backed database engine?

Scalability begins at a point beyond tab-separated ASCII.

Just be practical – dont academicise it – convention frees your fingers as well as your mind.

Leave a Reply

Your email address will not be published. Required fields are marked *