A poorly performing MongoDB database can often be traced to a shard key that doesn't distribute the data load evenly across servers. While you can't change the shard key of an existing collection, the process of dumping the collection, re-importing its data, and selecting a more appropriate shard key is generally straightforward. Just make sure your new shard key improves the database's query and write speeds.
It isn't uncommon to find yourself with a MongoDB database that needs a new shard key. For any number of reasons, your existing shard key may be slowing queries and write operations, or otherwise impeding performance.
The problem was presented in a Google Groups post from December 2012: a 150GB collection can no longer use the _id field as its shard key. There are many instances where _id is not a good choice as a shard key. As the MongoDB Manual explains, the default _id value increases monotonically, so all insert operations on such a shard key land on a single shard. This is acceptable only if your insert rate is low, or if most write operations are update() operations distributed evenly throughout the data set.
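To see why a monotonically increasing key concentrates inserts, here is a minimal simulation in plain Python. It does not talk to MongoDB: the four-chunk layout, the boundary values, and the `hashed` helper (which merely stands in for a better-distributed key and is not MongoDB's own hash function) are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter
import hashlib

# Hypothetical cluster: four range-based chunks over integer keys.
# Chunk 0 holds keys < 1000, chunk 1 holds [1000, 2000), and so on.
BOUNDARIES = [1000, 2000, 3000]


def chunk_for(key, boundaries=BOUNDARIES):
    """Return the index of the range-based chunk that owns `key`."""
    return bisect_right(boundaries, key)


def hashed(key, space=4000):
    """Map a key to a pseudo-random point in [0, space).

    Illustrative stand-in for a well-distributed shard key.
    """
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % space


# New documents receive ever-larger keys, as with a default _id.
new_keys = range(3000, 3500)

monotonic = Counter(chunk_for(k) for k in new_keys)
spread = Counter(chunk_for(hashed(k)) for k in new_keys)

print("monotonic key:", dict(monotonic))
print("distributed key:", dict(sorted(spread.items())))
```

With the monotonic key, every one of the 500 inserts is routed to the last chunk, while the distributed key spreads them over all four.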
The database designer in this case was asking whether he could avoid having to dump and re-import the collection -- in essence, starting over. The short answer is "no". Fortunately, starting over need not be difficult, and with a little thinking about how your database is being used, the result can be a big improvement in performance.
Re-key your DB in three steps: Dump, import, reset
The succinct version of the process of replacing your database's shard key is presented in a Stack Overflow post from 2011:
The MongoDB FAQ goes into more detail on the process:
- Dump all data from MongoDB into an external format.
- Drop the original sharded collection.
- Configure sharding using a more ideal shard key.
- Pre-split the shard key range to promote even distribution.
- Restore the dumped data into MongoDB.
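The five steps above can be sketched as an ordered plan. The Python below only assembles the shell commands and admin-command documents; it does not run anything. The database name `shop`, the collection `orders`, the key field `customer_id`, the split points, and the `dump` directory are hypothetical, and the `mongodump`/`mongorestore` invocations use the basic `--db`, `--collection`, and `--out` options, which you should check against the tool version you have installed.

```python
def rekey_plan(db, coll, new_key, split_points, dump_dir="dump"):
    """Return the re-keying steps as (name, payload) pairs, in order.

    Strings are shell commands; dicts are command documents meant to be
    sent through a mongos (the shardCollection and split documents are
    admin commands).
    """
    namespace = f"{db}.{coll}"
    field = next(iter(new_key))  # first (here: only) field of the shard key
    return [
        ("dump", f"mongodump --db {db} --collection {coll} --out {dump_dir}"),
        ("drop", {"drop": coll}),
        ("shard", {"shardCollection": namespace, "key": new_key}),
        ("pre-split", [{"split": namespace, "middle": {field: p}}
                       for p in split_points]),
        ("restore", f"mongorestore --db {db} --collection {coll} "
                    f"{dump_dir}/{db}/{coll}.bson"),
    ]


plan = rekey_plan("shop", "orders", {"customer_id": 1}, [1000, 2000])
for name, payload in plan:
    print(f"{name}: {payload}")
```

Keeping the plan as data makes it easy to review the order before touching a live cluster; the restore must come last so that the data lands in chunks that already exist.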
The FAQ explains that pre-splitting the chunk ranges in an empty sharded collection lets you load data into a collection that has already been partitioned. Pre-splitting also lets you partition an existing collection stored on a single shard. It helps, too, when you import a large amount of data into an imbalanced cluster, or when ingesting the data as is would leave the cluster imbalanced.
The split command is applied to the collection's empty chunks to split them manually, as illustrated in this example:
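A minimal sketch of such a manual split, assuming a numeric shard key field named `user_id`, a namespace of `shop.orders`, and a key range of 0 to 1,000,000 (all hypothetical). The Python below computes evenly spaced split points and builds one document per point for MongoDB's `split` admin command, using its `middle` field. It only constructs the documents; sending them to a mongos through a driver is left out.

```python
def even_split_points(low, high, chunks):
    """Return chunks - 1 boundaries dividing [low, high) into equal-width ranges."""
    if chunks < 2:
        raise ValueError("need at least two chunks to have a split point")
    if high - low < chunks:
        raise ValueError("range is too narrow for that many chunks")
    width = (high - low) / chunks
    return [low + round(i * width) for i in range(1, chunks)]


def split_commands(namespace, field, points):
    """Build one `split` admin-command document per split point."""
    return [{"split": namespace, "middle": {field: p}} for p in points]


points = even_split_points(0, 1_000_000, 4)
for command in split_commands("shop.orders", "user_id", points):
    print(command)
```

Equal-width ranges only balance the load if the key values themselves are roughly uniform; for skewed data, split points taken from a sample of the real keys would be a better fit.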
Match the shard key to the way the database will be used
In a response to the Google Groups post cited above, Alexis Okuwa offered another shard-key strategy. He adds a single-letter prefix to his collections' pre-split chunks, which lets the replica sets create their own chunks as necessary. The letters are selected randomly to avoid low cardinality, but if you notice that data isn't reaching particular nodes, you can edit the code manually to rebalance the servers.
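The post does not include code, so the following is only one possible reading of the prefix strategy, simulated in plain Python without a MongoDB connection. The `letter:id` key format, the 26 lowercase letters, and the one-chunk-per-letter boundaries are assumptions made for illustration.

```python
import random
import string
from bisect import bisect_right
from collections import Counter

LETTERS = string.ascii_lowercase

# One pre-split boundary per letter after "a", so each letter owns a chunk.
BOUNDARIES = list(LETTERS[1:])


def prefixed_key(doc_id, rng):
    """Prepend one randomly chosen letter to the document's id (hypothetical format)."""
    return f"{rng.choice(LETTERS)}:{doc_id}"


def chunk_for(key):
    """Return the index of the pre-split chunk whose range contains `key`."""
    return bisect_right(BOUNDARIES, key)


rng = random.Random(42)  # fixed seed so the run is repeatable
keys = [prefixed_key(i, rng) for i in range(10_000)]
per_chunk = Counter(chunk_for(k) for k in keys)

print("chunks used:", len(per_chunk))
print("smallest / largest chunk:", min(per_chunk.values()), max(per_chunk.values()))
```

Because the prefix is random, the full key has to be stored with the document to find it again by shard key. If one letter range turns out to be starved or overloaded, the letter set and boundaries are the values you would adjust by hand.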
Database maintenance is often a one-click procedure with the BitCan cloud storage service. BitCan stores your heterogeneous MySQL and MongoDB databases, as well as Unix/Linux systems and files. You can set up backups in seconds, and your data is encrypted at both the communication and storage layers, while BitCan sends you alerts about the status of your backups.
With BitCan's elastic storage, you pay for only the space you use, and you can access your data in the public cloud or behind a firewall. Visit the BitCan site to register for a free 30-day trial account.