Document Stores (DODS)

Document-oriented database systems are schemaless. Rather than rows in a fixed table structure, each record is stored as a single document, usually represented as JSON. Different records can have different keys, a value can be a single value or an array of values, and data can be nested.
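To make the idea concrete, here is a minimal sketch of two records that could live side by side in one collection. The customer data and field names are invented for illustration; no particular database is assumed, only Python's standard `json` module.

```python
import json

# Two hypothetical customer records stored in the same collection.
# They share only the "name" key; the second one has an array value
# and a nested object that the first one lacks.
customers = [
    {"name": "Ada", "email": "ada@example.com"},
    {
        "name": "Grace",
        "emails": ["grace@example.com", "g.hopper@example.com"],
        "address": {"city": "Arlington", "country": "US"},
    },
]

# Each record serializes to its own self-contained JSON document.
documents = [json.dumps(customer) for customer in customers]

# Reading the documents back shows that the key structure differs per record.
key_sets = [set(json.loads(document)) for document in documents]
```

Nothing forces the two documents to agree on their keys, which is exactly what a relational table definition would do.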

Popular document stores are MongoDB, Elasticsearch, and Couchbase.

Scalable

Document databases ushered in the era of NoSQL databases and aimed to solve the problems of high availability and load balancing that relational databases struggle with. Database clusters that scale to petabytes became not only possible, but quite performant as well. At a social network analytics company, we benchmarked Postgres vs. Elasticsearch for a reporting application. Elasticsearch could return our Top-N aggregate queries in under 100ms, whereas Postgres took over a minute to do the same thing.

Schemaless

Maintaining a schemaless data store is a compelling feature for applications that don’t necessarily know what data they are going to have to deal with in the future. An example might be a SaaS CRM application where users can define the data they wish to collect about their customers.

It’s very easy to add new keys or simply stop saving deprecated keys on new records, but this flexibility comes at a cost: without a defined structure, data becomes harder to manage. Ancillary features such as reports or alerts need additional work to handle the variety of keys that may be present.
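As a rough sketch of that additional work, a report over schemaless records first has to discover which keys exist and then tolerate records that lack some of them. The helper names `key_inventory` and `to_report_rows` and the sample records are hypothetical, made up for this example.

```python
from collections import Counter


def key_inventory(documents):
    """Count how many documents contain each top-level key."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    return counts


def to_report_rows(documents, columns, default=None):
    """Flatten documents into rows, filling missing keys with a default."""
    return [[doc.get(column, default) for column in columns] for doc in documents]


records = [
    {"name": "Ada", "plan": "pro"},
    {"name": "Grace", "plan": "free", "referrer": "newsletter"},
    {"name": "Linus", "tier": "pro"},  # older record that still uses a deprecated key
]

inventory = key_inventory(records)
columns = sorted(inventory)  # the report's columns are the union of all keys seen
rows = to_report_rows(records, columns)
```

With a fixed schema the column list would be known up front; here it has to be derived from the data, and every consumer has to decide what a missing key means.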

Use for reports, search, and event logging

Document stores really shine when writing mostly append-only, denormalized documents for things such as aggregated reports, full-text search, and event logging. Fast search over these documents is achieved through the use of inverted indices.[1]
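For readers who have not met the term, an inverted index maps each term to the documents that contain it, so a search becomes a lookup rather than a scan of every document. The following is a toy sketch in plain Python, assuming naive whitespace tokenization; it is not how any particular engine implements its index, and the function names and sample events are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


events = {
    1: "user login failed",
    2: "user login succeeded",
    3: "payment failed",
}
index = build_inverted_index(events)
matches = search(index, "login failed")
```

Appending a new event only adds entries to the index, which is part of why append-only workloads suit this structure.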

Denormalize (i.e. avoid joining data)

Document stores traditionally haven’t supported joining data together well, which makes it hard to normalize data. If a common value needs to be updated, the update has to be written to every document that contains that value. MongoDB now supports joins via the $lookup aggregation stage, and Elasticsearch supports nested queries and parent-child documents.
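The update cost of denormalized data can be sketched in a few lines. In this illustration the company name is embedded in every order, so renaming the company means touching every order that carries it. The `rename_company` helper, the order data, and the collection and field names in the pipeline are all hypothetical; the pipeline is shown only as a plain dictionary to illustrate the general shape of a `$lookup` stage and is not executed against a database.

```python
def rename_company(orders, old_name, new_name):
    """Rewrite the embedded company name in every order that carries it."""
    updated = 0
    for order in orders:
        if order["customer"]["company"] == old_name:
            order["customer"]["company"] = new_name
            updated += 1
    return updated


# Denormalized: each order embeds a copy of the customer's company name.
orders = [
    {"id": 1, "customer": {"name": "Ada", "company": "Acme"}},
    {"id": 2, "customer": {"name": "Grace", "company": "Acme"}},
    {"id": 3, "customer": {"name": "Linus", "company": "Initech"}},
]
touched = rename_company(orders, "Acme", "Acme Corp")  # one write per matching order

# Normalized alternative: orders would store only a company id, and a
# $lookup stage would join the company in at query time.
# "companies", "company_id", and "company" are assumed names.
lookup_pipeline = [
    {
        "$lookup": {
            "from": "companies",
            "localField": "company_id",
            "foreignField": "_id",
            "as": "company",
        }
    }
]
```

The trade-off is between paying at write time, by rewriting many documents, and paying at read time, by joining on every query.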

“The join field shouldn’t be used like joins in a relational database. In Elasticsearch the key to good performance is to denormalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.” - Elastic.co [2]

There is a debate raging about how performant a join is in MongoDB.

“The OLTP benchmark was based on a teaching example for Python users written by a MongoDB developer advocate. Ongres ported it to Java and then built benchmarking on top of that. This led to unnecessary uses of $lookup (JOIN) aggregation and other relational traits in MongoDB which are known to impact performance simply because MongoDB is not a relational database.” - MongoDB [3]

Though I personally haven’t tried to do a join in MongoDB, MongoDB’s own response suggests that joining documents together has significant performance implications.


  1. Indexing for Beginners, Part 3 

  2. Parent-join and performance 

  3. Benchmarking: Do it with transparency or don’t do it at all