ruk·si

🍃 MongoDB

Updated at 2012-10-24 22:25

MongoDB is a schemaless, document-oriented NoSQL database.

Avoid using MongoDB before you know how it works. It will cause more problems than it solves if you do not know what you are doing.

Consider using MongoDB to store frequently accessed data that is not business-critical, e.g. location data.

General

While relational databases store data in specific columns and rows defined by a schema, MongoDB stores data in binary JSON (BSON) documents with no restrictions on what a collection can contain. You still need to do some data modelling to get optimal performance, though.

Each document has a BSON ObjectId identifier, and it is awesome: the id itself encodes the creation time.

// Will return the time the ObjectId was created
ObjectId("505bd76785ebb509fc183733").getTimestamp();

// The ObjectId also increments, so sorting by _id sorts by creation date.
// The _id field is indexed automatically.
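Outside the shell you can recover the same timestamp by hand; a minimal sketch in plain JavaScript, relying only on the documented fact that the first 4 bytes of an ObjectId are a big-endian Unix timestamp in seconds:

```javascript
// Decode the creation time embedded in an ObjectId hex string.
// Bytes 0-3 (the first 8 hex characters) are a big-endian
// Unix timestamp in seconds.
function objectIdTimestamp(hex) {
    var seconds = parseInt(hex.substring(0, 8), 16);
    return new Date(seconds * 1000);
}

objectIdTimestamp("505bd76785ebb509fc183733");
// => a Date in late September 2012
```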

A single document can be up to 16 MB in size. You need to restructure your documents (or use GridFS for large files) if you want to get past that.
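You can check how close a document is to the limit in the shell; Object.bsonsize() is a built-in shell helper (the collection and query here are hypothetical):

```javascript
// Check the BSON size of a stored document, in bytes.
Object.bsonsize(db.people.findOne({name: "Russell"}))
```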

Prefer the 64-bit MongoDB builds. MongoDB's architecture can be either 32-bit or 64-bit, and a single 32-bit MongoDB process is limited to 2 GB of data. This is because the storage engine uses memory-mapped files for performance.

MongoDB's size on disk is large compared to other databases. You can compact the database to reduce its size, but compaction locks out all other operations.
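Compaction of a single collection can be triggered from the shell; a sketch (collection name is a placeholder):

```javascript
// Compact one collection; this blocks other operations on the
// database while it runs, so do it during a maintenance window.
db.runCommand({compact: "people"})
```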

MongoDB writes very fast but uses unsafe writes by default. You can use safe writes, but you also need to check getLastError if you want to confirm that writes went through.
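A sketch of what checking the write looks like in the shell, assuming a hypothetical people collection:

```javascript
// Fire-and-forget write: errors are silently dropped by default.
db.people.insert({name: "Russell"});

// Ask the server whether the last write on this connection
// succeeded; returns null on success, an error otherwise.
db.getLastError();

// With an argument, also wait until the write has replicated
// to that many replica set members.
db.getLastError(2);
```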

MongoDB does not enforce data types. It raises no error if you store "duck" in a "price" field.
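A sketch of how that bites you later (collection and fields are hypothetical):

```javascript
// Both inserts succeed; MongoDB does not validate field types.
db.products.insert({name: "widget", price: 9.99});
db.products.insert({name: "widget", price: "duck"});

// The mismatch surfaces later: comparisons are bracketed by BSON
// type, so this range query finds only the numeric price.
db.products.find({price: {$lt: 10}});
```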

By default, an update targets a single document, whereas relational databases update every row that matches the query.

// Multi-update example; the fourth argument (multi) must be true
// to update every matching document:
db.people.update(
    {age: {$gt: 30}},        // query
    {$set: {past_it: true}}, // update
    false,                   // upsert
    true                     // multi
)

MongoDB only supports single-document transactions. There are no relational-database-style multi-statement transactions or rollbacks.
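What you do get is atomicity within one document; a sketch with a hypothetical inventory collection:

```javascript
// Within a single document, operators like $inc and $push are
// atomic, so no read-modify-write round trip is needed. The query
// part guards against overselling.
db.inventory.update(
    {_id: "widget", stock: {$gt: 0}},
    {$inc: {stock: -1}, $push: {log: "sold one"}}
);
```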

MongoDB does not support joins like relational databases do. This can only be worked around with good schema design.
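In practice that means either embedding related data or doing the "join" in application code; a hypothetical sketch of both:

```javascript
// Option 1: embed the related data in the same document.
db.posts.insert({
    title: "Hello",
    comments: [{author: "Russell", text: "First!"}]
});

// Option 2: "join" client-side with two queries.
var post = db.posts.findOne({title: "Hello"});
var author = db.users.findOne({_id: post.author_id});
```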

MongoDB is case-sensitive by default. Good to keep in mind if searches do not find your data.

// These are a different thing
db.people.find({name: 'Russell'})
db.people.find({name: 'russell'})

// This matches both but is slow: a case-insensitive
// regex cannot use the index efficiently.
db.people.find({name: /russell/i})

Selective counts are slow even when the field is indexed.

// As of 17.10.2012
db.collection.count( {username: "my_username"} );

// Consider maintaining a counter document with $inc instead.
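A sketch of the counter-document approach (names are placeholders):

```javascript
// Keep a precomputed counter up to date with $inc;
// upsert: true creates the counter document on first use.
db.counters.update(
    {_id: "users:my_username"},
    {$inc: {count: 1}},
    true   // upsert
);

// Reading the count is now a fast _id lookup.
db.counters.findOne({_id: "users:my_username"}).count;
```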

Range queries use indexes differently. When using a range operator like $in, MongoDB applies the sort before it applies the range.

db.collection.find({_id: {$in : [
    ObjectId("505bd76785ebb509fc183733"),
    ObjectId("505bd76785ebb509fc183734"),
    ObjectId("505bd76785ebb509fc183735"),
    ObjectId("505bd76785ebb509fc183736")
]}}).sort({last_name: 1});

Unindexed queries cause performance problems for indexed queries too. Index all your queries at least partially, even those rarely run cron jobs.
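Creating the indexes is a one-liner per query shape; a sketch with hypothetical fields:

```javascript
// Index the fields your queries filter and sort on.
db.people.ensureIndex({last_name: 1});

// A compound index also covers queries on its prefix
// (here, queries on last_name alone).
db.people.ensureIndex({last_name: 1, age: -1});
```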

You can take a database-wide write lock in MongoDB, but there is no collection-level locking.

MongoDB has no authentication by default. You need to enable it yourself.
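A sketch of enabling it: start mongod with --auth, then create a user from the shell (credentials here are placeholders):

```javascript
// Create an admin user, then authenticate on connect.
use admin
db.addUser("admin", "strong-password-here");

db.auth("admin", "strong-password-here");
```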

If you connect over a public network, be aware that the data is unencrypted by default.

Journaling is configured to write to disk every 100ms. Do not disable journaling.

Process limits (ulimit) on the Linux machine running MongoDB should be set above 4000.

Replication

With replica sets, data is replicated between all the nodes and one is elected as the primary. If the primary fails, the other nodes will vote between themselves and one will be elected the new primary. So you must use an odd number of replica set members or arbiters that can vote.
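Setting one up is a single shell command; a sketch with placeholder hostnames:

```javascript
// Initiate a three-member replica set from the mongo shell.
rs.initiate({
    _id: "rs0",
    members: [
        {_id: 0, host: "db1.example.com:27017"},
        {_id: 1, host: "db2.example.com:27017"},
        {_id: 2, host: "db3.example.com:27017"}
    ]
});

// Check election state and member health.
rs.status();
```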

If the primary fails and comes back online later, the database rolls back to the last common point. You can restore the lost data from the rollback directory.

If the replica set is getting too slow, use sharding. Shard before you reach 80% of your estimated capacity. Note that you cannot update a shard key; you have to remove and reinsert the document.
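The sharding setup itself is short; a sketch run against mongos (database, collection, and key are placeholders):

```javascript
// Enable sharding for a database, then shard one collection
// on the chosen shard key.
sh.enableSharding("mydb");
sh.shardCollection("mydb.people", {user_id: 1});
```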

Shards do not enforce unique keys automatically; see the MongoDB documentation on enforcing unique keys and on choosing a shard key.

Note that reads from secondaries in a replica set can be inconsistent.

// As of 17.10.2012

// Writes the object to the primary
db.collection.insert(
    {_id: ObjectId("505bd76785ebb509fc183733"), key: "value"}
);

// This find is routed to a read-only secondary, and finds no results
db.collection.find({_id: ObjectId("505bd76785ebb509fc183733")});

// Replication lag can range from minutes to hours.

Benchmarking

You can easily test queries. Always ask MongoDB to explain what a query is doing by using explain().

db.collection.find(query).explain()
// =>
{
    // BasicCursor means no index used
    // BtreeCursor would mean this is an indexed query
    "cursor" : "BasicCursor",

    // The bounds of the index that were used,
    // see how much of the index is being scanned
    "indexBounds" : [ ],

    // Number of documents or indexes scanned
    "nscanned" : 57594,

    // Number of documents scanned
    "nscannedObjects" : 57594,

    // The number of times the read/write lock was yielded
    "nYields" : 2 ,

    // Number of documents matched
    "n" : 3 ,

    // Duration in milliseconds
    "millis" : 108,

    // True if the results can be returned using only the index
    "indexOnly" : false,

    // If true, a multikey index was used
    "isMultiKey" : false
}

MongoDB has a good profiler. The profiler adds overhead, but it is essential for optimization; it helps you improve performance more than it costs.

// Consider recording all queries that take over 100 ms.
db.setProfilingLevel(1, 100);

// Will profile all queries.
db.setProfilingLevel(2);

// Will disable the profiler.
db.setProfilingLevel(0);

// Usage

// Find the most recent profile entries.
db.system.profile.find().sort({$natural:-1});

// Find all queries that took more than 5ms.
db.system.profile.find( { millis : { $gt : 5 } } );

// Find only the slowest queries.
db.system.profile.find().sort({millis:-1});

Monitoring your production is relatively easy.

// Monitor index sizes
// MongoDB really needs your working set to fit in RAM.
// Helps you decide when to scale the machine, drop an index, etc.

// Monitor index misses
// When MongoDB has to hit the disk to load an index.
// Index did not fit to memory.
// Ideally 0.

// Monitor number of current operations
// When number of current operations spike, see what caused it.

// Monitor replication lag
// If you use replication as backup.
// Should be only minutes.

// Monitor I/O performance
// Disk performance can help to identify current operation spikes.

// Monitoring help
mongotop - shows how much time was spent reading or writing each
           collection over the last second
mongostat - brilliant live debug tool, gives a view on all your
            connected MongoDB instances

MMS - 10gen’s hosted MongoDB monitoring service. Good starting point.
Kibana - Logstash frontend. Trend analysis for Mongo logs.
         Pretty useful for good visibility.

Useful Commands

db.currentOp()          // shows you all currently running operations
db.killOp( opid )       // lets you kill long running queries
db.serverStatus()       // shows you stats for the entire server
db.stats()              // shows you stats for the selected db
db.collection.stats()   // stats for the specified collection

Sources

- Things I wish I knew about MongoDB a year ago