Don’t always rely on machines MongoDB extensions demonstrate sharding capabilities-1024programmer

Johannes Brandstettercs is currently the Director of Development Operations at omSysto GmbH. He once served as 1&1 IT Manager; he is committed to project research on MongoDB, Hadoop, and AWS.

Looking at the design of the most advanced application modeling and infrastructure best implementations, it is easy to find a problem: if you need rapid growth in performance and adapt to free scaling from 0 to 10 million requests, there is only one way to go – Scale Out!

Scaling usually describes a method of adding more resources to a system. It needs to be distinguished from two aspects:

Scale vertically/up

Vertical expansion means adding resources to a single node in the system, typically increasing the CPU or memory of a single computer.

Scale horizontally/out

Horizontal scaling means increasing the number of nodes in the system, such as adding computers to a distributed software application. To give a simple example, increase the number of servers in the Web server system from 1 to 3.

With MongoDB, you can scale in two ways according to your needs: focus on read operations or improve write operations.

10gen tells us some extension options:

“With built-in scale-out capabilities, MongoDB allows users to quickly build and scale their applications. With automatic sharding, you can easily distribute data across different nodes. Replica sets provide high reliability across data centers. Automatic failover and recovery.”

Sounds good, actually? Now start importing 24 million pieces of data into a MongoDB cluster consisting of 6 nodes. Automatic sharding is used here:

From the picture above, the results are not very good, showing very uneven writing performance. If we focus on the performance of a single node, the situation is even worse than it seems above:

Click each to view a larger image

Not only are you looking at uneven write speeds across all nodes, but you’re also seeing huge differences in write operations per second between nodes. At one moment Node2 reached 4000 insert/s, but at that moment node5 only had 740 insert/s!

Why is this happening? We used mongoimport to monitor the entire import process and found that: everything ran very well for a period of time in the beginning, and our cluster also provided a good throughput; but as time went by, the speed gradually slowed down. Through mongostat, we can easily find that sometimes many nodes do not allocate any data within a few seconds.

To understand why, we have to look at MongoDB’s sharding mechanism. Here we have to look at 3 basic components:

1. MongoDB Sharding Router (mongos)

All connections to the cluster are selected by mongos, and sharding is completely transparent to the application layer – routing is done through Transparent Query Routing.

2. Mongo Config Server

Determine which corresponding node a certain part of the metadata should be allocated to.

3. Mongo Shard Node

A normal mongod process used to support data.

So to store data in such a cluster, you must do a metadata query to check which node the current data should be written to. However, even if there is a certain overhead in determining the target node before each write, the performance loss here should not be so large.

But the reality is so cruel! What I want to mention here is the Mongo Config Server, which means you must tell MongoDB which range of data needs to be used for sharding. This is the legendary “shard key”! Similarly, if you make a wrong decision on the shard key, it will cause a series of troubles. Of course, for the sake of simplicity and clarity, here we choose a shard key that we think is suitable for our data type.

So what MongoDB does now is internally put data into so-called “chunks” – ��Like a fixed size data bucket. In this way, each chunk contains a certain range of data and only a certain capacity of data (default 64MB). When we import data into the database, these chunks will be injected with data, and once the data is full, they will be split by MongoDB. In this case, the following two situations will inevitably occur:

New metadata must be written to the configured server
Balancer must consider whether a chunk should be transferred to another node

Now we need to pay attention to the second point! Because we did not tell MongoDB how any shard key is related or distributed, MongoDB must make the best estimate before sharding and try to distribute the data evenly among the shards. It ensures even distribution of data by maintaining the same number of chunks on all nodes at any given time. In MongoDB 2.2, new migration thresholds were introduced and successfully minimized the impact of balancing the cluster. However, Mongo also does not know the range and distribution of shard keys, so hot spots are likely to occur – a large part of the data is written to the same shard, and therefore only on the same node.

Now that we know what the problem is, can we do something about it?

Data pre-splitting

If you want to pre-split the data, you must have a full understanding of the data you are about to import, so instead of letting MongoDB complete the chunk selection, you need to do it yourself. Depending on your shard key choice, you can either use a script to generate the relevant commands (a good use case) or use the methods provided in the MongoDB documentation. Either way you’ll be done telling MongoDB where to allocate chunks:

 
 				
 
 db.runCommand( { split : "mydb.mycollection" , middle : { shardkey : value } } );

Of course, you can also force MongoDB to allocate corresponding nodes to chunks. Again this depends on your use case and how the application reads the data. The important thing to note here is to preserve the locality of the data.

Increase chunk size

There are specific chunk size instructions in the document:

1. Small chunks obtain a more even distribution of data at the expense of frequent migration, which will add more overhead to the Query Routing Layer.

2. Large chunks will significantly reduce migration, which will bring benefits both in terms of network design and internal Query Routing Layer overhead. Of course these gains come at the expense of potential data uneven distribution.

In order to make MongoDB run fast enough, the chunk size is increased to 1GB. Of course, this is based on the understanding of your own data, so you must make the size of the chunk more valuable.

Close Balancer

Since we have specified a specific node for the chunk and do not want new chunks to be created during this process; we can use the following command to shut down the Balancer:

 
 				
 
 use config

 
 				
 
 db.settings.update( { _id: "balancer" }, { $set : { stopped: true } } , true );

Choose the correct shard key

Shard key is the core of node distribution. Click here for details to get shard key related advice. In MongoDB 2.4, MongoDB also supports the option of independently selecting the hash shard key.

Through these efforts, the import situation is as follows:

The picture above shows the rate on one node in the cluster. Of course, there are still some gaps, because the data is not absolutely uniform. But overall this is a big improvement, with import speeds reaching 4608 inserts per second across the cluster; import speed is now only limited by the network interface between nodes.

Summary

There is no free lunch in the world! If you want great performance, you have to do it yourself.

Don’t always rely on machines MongoDB extensions demonstrate sharding capabilities

author: admin

Leave a Reply Cancel reply

Contact us

Scan wechat and follow us

给这篇文章的作者打赏

author: admin

Related recommendations

Leave a Reply Cancel reply

Contact us

Scan wechat and follow us