MongoDB Sharding: A Comprehensive Guide

Aug 9, 2023

In our data-driven society, where the volume and complexity of data continue to expand at an unprecedented rate, the need for scalable and robust databases has become crucial. According to estimates, 180 zettabytes of data will be created by 2025. These are huge numbers that are difficult to comprehend.

In this comprehensive guide, we dive deep into MongoDB sharding, demystifying its benefits, components, best practices, common mistakes, and how you can get started.

What Is Database Sharding?

Database sharding is a database management technique that involves splitting a growing database horizontally into smaller, more manageable units called shards.

As your database expands, it becomes practical to split it into several smaller parts and store each part on a separate machine. These smaller parts, known as shards, are independent subsets of the overall database. This process of splitting and distributing data is what constitutes database sharding.

If you're considering implementing a sharded database, there are two main approaches: developing a custom sharding solution or paying for an existing one. This raises the question of whether building a sharded solution or buying one is the better option.

To make this choice, weigh the cost of third-party integration, keeping in mind the following factors:

  • Developer skills and learnability: The learning curve that comes with the product and how well it aligns with your developers' skills.
  • The data model and API offered by the system: Every data system has its own way of representing its data. How easily and quickly you can integrate your application with the system is a key factor to consider.
  • Customer support and online documentation: When you run into issues or need help during integration, the quality and availability of customer support and comprehensive online documentation become crucial.
  • Availability of cloud deployment: As more companies move to the cloud, it's important to determine whether the third-party solution can be deployed in a cloud-based environment.

Based on these factors, you can decide whether to build a sharding solution yourself or pay for a system that does the heavy lifting.

What Is Sharding in MongoDB?

The main reason to use a NoSQL database is its ability to handle the storage and computing demands of massive amounts of data.

Generally, a MongoDB database contains a large number of collections. Each collection is made up of documents that hold data in the form of key-value pairs. With MongoDB sharding, you can split a huge collection into smaller pieces spread across multiple servers. This lets MongoDB run queries without putting too much strain on any single server.

For example, Telefonica Tech manages over 30 million IoT devices across the globe. To keep up with ever-increasing device usage, they needed a platform that could scale elastically and handle a rapidly growing data environment. MongoDB sharding was the ideal choice for them, as it best fit their budget and capacity needs.

With MongoDB sharding, Telefonica Tech runs well over 115,000 queries per second. That's 30,000 database inserts every second with just one millisecond of latency!

Benefits of MongoDB Sharding

Here are some of the benefits of MongoDB sharding for large-scale data:

Storage Capacity

Sharding distributes data across the shards in the cluster, so each shard holds a subset of the total cluster data. As the dataset grows, adding shards increases the cluster's storage capacity.

Reads/Writes

MongoDB distributes the read and write workload across the shards in a sharded cluster, allowing each shard to process a subset of cluster operations. Both workloads can be scaled horizontally across the cluster by adding more shards.

High Availability

Deploying shards and config servers as replica sets provides greater availability. Even if one or more shard replica sets become completely unavailable, the sharded cluster can still perform partial reads and writes.

Protection From Outages

Many users are affected when a machine goes down due to an unplanned outage. In an unsharded system, the entire database would go down, so the impact would be massive. With MongoDB sharding, the blast radius of a bad user experience can be contained.

Geo-Distribution and Performance

Replicated shards can be placed in different regions. This means customers get low-latency access to their data, since their requests can be redirected to the shard nearest to them. Depending on a region's data governance policy, specific shards can be configured to reside in a specific region.

Components of a MongoDB Sharded Cluster

Now that we've explained the concept of a MongoDB sharded cluster, let's explore the components that make up such clusters.

1. Shard

Each shard holds a subset of the sharded data. As of MongoDB 3.6, shards must be deployed as a replica set to provide high availability and redundancy.

Each database in a sharded cluster has a primary shard that holds all the unsharded collections for that database. The primary shard has no relation to the primary in a replica set.

To change the primary shard for a database, use the movePrimary command. Migrating the primary shard can take significant time to complete.

During that time, you shouldn't attempt to access the collections associated with the database until the migration is complete. Depending on the amount of data being migrated, this process might affect overall cluster operations.

You can use the sh.status() method in mongosh to see an overview of the cluster. The method returns the primary shard for the database along with the chunk distribution across the shards.

2. Config Servers

Deploying config servers for sharded clusters as replica sets improves consistency across the config servers, because MongoDB can use the standard replica set read and write protocols for the config data.

To deploy config servers as replica sets, you'll need to run the WiredTiger storage engine. WiredTiger uses document-level concurrency control for write operations, so multiple clients can modify different documents of a collection at the same time.

Config servers store the metadata for a sharded cluster in the config database. To access the config database, you can use the following command in the mongo shell:

use config

There are a few restrictions to keep in mind:

  • A replica set configuration used for config servers must have zero arbiters. An arbiter participates in elections for a primary, but it doesn't hold a copy of the dataset and can't become a primary.
  • The replica set can't have any delayed members. Delayed members hold copies of the replica set's dataset, but a delayed member's dataset reflects an earlier, or delayed, state of the data.
  • Config servers must build indexes. Simply put, no member should have the members[n].buildIndexes setting set to false.
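
The three restrictions above can be sketched as a small pre-flight check. This is only an illustration in plain Python, not part of MongoDB: the function name and the example hostnames are hypothetical, while arbiterOnly, buildIndexes, and the delay settings (secondaryDelaySecs, or slaveDelay in older releases) are the member fields the rules refer to.

```python
def check_config_server_replset(config):
    """Return a list of rule violations for a config server replica set."""
    problems = []
    for member in config.get("members", []):
        host = member.get("host", "<unknown>")
        # Rule 1: no arbiters.
        if member.get("arbiterOnly", False):
            problems.append(f"{host}: arbiters are not allowed")
        # Rule 2: no delayed members (newer and older field names).
        delay = member.get("secondaryDelaySecs", member.get("slaveDelay", 0))
        if delay > 0:
            problems.append(f"{host}: delayed members are not allowed")
        # Rule 3: every member must build indexes.
        if member.get("buildIndexes", True) is False:
            problems.append(f"{host}: buildIndexes must not be false")
    return problems


# Hypothetical configuration with two violations.
example = {
    "_id": "configReplSet",
    "configsvr": True,
    "members": [
        {"_id": 0, "host": "cfg1.example.net:27019"},
        {"_id": 1, "host": "cfg2.example.net:27019", "arbiterOnly": True},
        {"_id": 2, "host": "cfg3.example.net:27019", "secondaryDelaySecs": 3600},
    ],
}

for problem in check_config_server_replset(example):
    print(problem)
```

Running a check like this before initiating the replica set is one way to catch a misconfigured member early.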

If the config server replica set loses its primary and can't elect a new one, the cluster's metadata becomes read-only. You'll still be able to read from and write to the shards, but no chunk splits or migrations will occur until the replica set can elect a primary.

3. Query Routers

MongoDB mongos instances act as query routers, providing the interface between client applications and the sharded cluster.

Starting in MongoDB 4.4, mongos can support hedged reads to reduce latency. With hedged reads, a mongos instance sends a read operation to two replica set members for each shard queried. It then returns the result from the first respondent per shard.
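
The idea behind a hedged read can be sketched in a few lines of Python. This is a simulation under stated assumptions, not how mongos is implemented: the member names and latencies are made up, and a thread pool stands in for the network requests.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def read_from_member(name, latency_s, value):
    """Simulate one replica set member answering after latency_s seconds."""
    time.sleep(latency_s)
    return name, value


def hedged_read(members):
    """Send the same read to every member given and return the first answer.

    members is a list of (name, latency_s, value) tuples.
    """
    pool = ThreadPoolExecutor(max_workers=len(members))
    futures = [pool.submit(read_from_member, *member) for member in members]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower member
    return next(iter(done)).result()


# Two hypothetical members of one shard; member-b answers much faster.
winner, document = hedged_read([
    ("member-a", 0.5, {"_id": 1}),
    ("member-b", 0.01, {"_id": 1}),
])
print(winner)
```

The caller sees the latency of the faster member, at the cost of sending the read twice.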

Here's how the three components interact within a sharded cluster:

A mongos instance routes a query to a cluster by:

  1. Determining the list of shards that must receive the query.
  2. Establishing a cursor on all targeted shards.

The mongos then merges the data from each targeted shard and returns the result document. Certain query modifiers, such as sorting, are performed on each shard before mongos retrieves the results.

If the shard key or a prefix of the shard key is part of a query, mongos can perform a targeted operation, routing the query to a subset of shards in the cluster.
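
The routing steps above can be sketched as a toy router. Everything here is an assumption made for illustration: the shard names, the range boundaries on a hypothetical user_id shard key, and the in-memory data. It shows a targeted query when the shard key is present, a broadcast otherwise, per-shard sorting, and a merge on the router.

```python
import bisect
import heapq

# Hypothetical chunk map on the shard key "user_id":
# below 100 -> shardA, 100 to 199 -> shardB, 200 and up -> shardC.
BOUNDS = [100, 200]
SHARDS = ["shardA", "shardB", "shardC"]

DATA = {
    "shardA": [{"user_id": 42, "name": "bob"}, {"user_id": 5, "name": "ann"}],
    "shardB": [{"user_id": 150, "name": "cy"}],
    "shardC": [{"user_id": 300, "name": "ann"}, {"user_id": 250, "name": "dee"}],
}


def shard_for(user_id):
    """Map a shard key value to the shard that owns its range."""
    return SHARDS[bisect.bisect_right(BOUNDS, user_id)]


def route(query):
    """Step 1: decide which shards must receive the query."""
    if "user_id" in query:  # shard key present: targeted operation
        return [shard_for(query["user_id"])]
    return list(SHARDS)  # no shard key: broadcast to every shard


def find(query, sort_key="user_id"):
    targets = route(query)
    per_shard = []
    for shard in targets:  # Step 2: one "cursor" per targeted shard
        matches = [
            doc for doc in DATA[shard]
            if all(doc.get(field) == value for field, value in query.items())
        ]
        # Sorting happens on each shard before results are returned.
        per_shard.append(sorted(matches, key=lambda doc: doc[sort_key]))
    # The router merges the already sorted per-shard results.
    merged = list(heapq.merge(*per_shard, key=lambda doc: doc[sort_key]))
    return targets, merged


print(find({"user_id": 150}))  # touches one shard
print(find({"name": "ann"}))   # touches all shards
```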

For a production cluster, ensure that your data is redundant and that your systems are highly available. Consider the following configuration for a production sharded cluster deployment:

  • Deploy each shard as a three-member replica set
  • Deploy config servers as a three-member replica set
  • Deploy one or more mongos routers

For a non-production cluster, you can deploy a sharded cluster with the following components:

  • A single shard replica set
  • A config server replica set
  • One mongos instance

How Does MongoDB Sharding Work?

Now that we've discussed the components of a sharded cluster, it's time to dive into the process.

To distribute data across multiple servers, you'll use mongos. When you connect and send a query to MongoDB, mongos looks up where the data resides. It then retrieves the data from the right server and merges everything together if it was split across multiple servers.

How To Set Up MongoDB Sharding Step-by-Step

Setting up MongoDB sharding is a process that involves several steps to ensure a stable and efficient database cluster. Here are step-by-step instructions on how to set it up.

Before we get started, keep in mind that to set up sharding in MongoDB, you'll need at least three servers: one for the config server, one for the mongos instance, and one or more for the shards.

1. Create a Directory for the Config Server

For the first step, we'll create a directory for the config server's data. You can do this by running the following command on the first server:

mkdir /data/configdb

2. Start MongoDB in Config Mode

Next, we'll start MongoDB in config mode on the first server using the following command:

mongod --configsvr --dbpath /data/configdb --port 27019

This starts the config server on port 27019 and stores its data in the /data/configdb directory. The --configsvr flag indicates that the server will be used as a config server.

3. Start the Mongos Instance

The next step is to start the mongos instance. This process routes queries to the correct shards based on the shard key. To start the mongos instance, use the following command:

mongos --configdb <config-server>:27019

Replace <config-server> with the IP address or hostname of the machine on which the config server is running.

4. Connect to the Mongos Instance

Once the mongos instance is running, we can connect to it using the MongoDB shell. This can be done with the following command:

mongo --host <mongos-server> --port 27017

In this command, replace <mongos-server> with the hostname or IP address of the server running the mongos instance. This opens the MongoDB shell, allowing us to interact with the mongos instance and add servers to the cluster.

5. Add Servers to the Cluster

Now that we're connected to the mongos instance, we can add servers to the cluster using the following command:

sh.addShard("<shard-server>:27017")

In this command, replace <shard-server> with the IP address or hostname of the server hosting the shard. The command adds the shard to the cluster and makes it available for use.

Repeat this step for every shard you'd like to add to the cluster.

6. Enable Sharding for a Database

Finally, we'll enable sharding for a database using the following command:

sh.enableSharding("<database>")

In this command, replace <database> with the name of the database you wish to shard. This enables sharding for the specified database, allowing its data to be distributed across multiple shards.

And that's it! After following these steps, you should have a fully functional MongoDB sharded cluster that can scale horizontally and handle high-traffic loads.

Best Practices for MongoDB Sharding

1. Find the Best Shard Key

The shard key is a critical component of MongoDB sharding, as it determines how data is distributed across the shards. It's important to choose a shard key that distributes data evenly across the shards and supports your most common queries. Avoid choosing a key that causes hotspots or uneven data distribution, since this can lead to performance issues.

When choosing the best shard key, you should look at your data and the kind of queries you'll run and select a key that meets those needs.
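
One rough way to look at your data before committing to a key is to compare candidate fields by how many distinct values they have and how dominant the most common value is. The sketch below uses made-up documents and hypothetical field names (user_id, country); it is a quick sanity check, not a complete shard key evaluation, since it says nothing about your query patterns.

```python
from collections import Counter


def key_stats(docs, field):
    """Summarize how well a candidate field could spread documents."""
    counts = Counter(doc[field] for doc in docs)
    _top_value, top_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),            # number of distinct values
        "top_share": top_count / len(docs),    # share of the most common value
    }


# Hypothetical sample: 1,000 users, 90% of them in one country.
docs = [
    {"user_id": i, "country": "DE" if i % 10 == 0 else "US"}
    for i in range(1000)
]

print("user_id:", key_stats(docs, "user_id"))
print("country:", key_stats(docs, "country"))
```

A field with few distinct values, or one value that dominates, can't be split into many evenly sized chunks, which is the kind of hotspot the advice above warns about.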

2. Plan for Data Growth

When setting up your sharded cluster, plan for future growth by starting with enough shards to handle your current workload and adding more as needed. Make sure your hardware and network infrastructure can support the number of shards and the amount of data you expect to have in the future.

3. Use Dedicated Hardware for Shards

Use dedicated hardware for each shard for optimal performance and reliability. Each shard should have its own server or virtual machine so it can use all resources without interference.

Utilizing shared hardware may lead to resource contention and performance loss, which could affect the overall system's reliability.

4. Use Replica Sets for Shard Servers

Using replica sets for shard servers provides high availability and fault tolerance for your MongoDB sharded cluster. Each replica set should have at least three members, and each member should reside on a separate machine. This setup ensures that your sharded cluster can survive the failure of a single member or server.

5. Monitor Shard Performance

Monitoring the performance of your shards is essential for identifying issues before they become major problems. You should monitor the CPU, memory, disk I/O, and network I/O of each shard server to ensure that the shard can handle the workload.

You can use MongoDB's built-in monitoring tools, such as mongostat and mongotop, or third-party monitoring tools like Datadog, Dynatrace, and Zabbix to track shard performance.

6. Plan for Disaster Recovery

Planning for disaster recovery is essential for maintaining the reliability of your MongoDB sharded cluster. You should have a disaster recovery plan that includes regular backups, testing of backups to ensure they're valid, and a plan for restoring backups in case of failure.

7. Use Hashed Sharding if Necessary

When applications issue range-based queries, ranged sharding is beneficial because operations can be limited to fewer shards, often a single shard. You need to understand your data and query patterns to implement this.

Hashed sharding ensures a uniform distribution of reads and writes. However, it doesn't provide efficient range-based operations.
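
The trade-off can be illustrated with a small simulation. The assumptions are loud here: four shards, integer keys from 0 to 999, contiguous blocks for ranged placement, and an MD5-based hash taken modulo the shard count as a stand-in for hashed placement. This is not MongoDB's actual chunk assignment.

```python
import hashlib

NUM_SHARDS = 4
KEY_SPACE = 1000  # keys run from 0 to 999


def ranged_shard(key):
    """Ranged placement: contiguous blocks of 250 keys per shard."""
    return key * NUM_SHARDS // KEY_SPACE


def hashed_shard(key):
    """Hashed placement: stand-in hash, taken modulo the shard count."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_touched(low, high, placement):
    """Shards a range query on [low, high] would have to visit."""
    return sorted({placement(key) for key in range(low, high + 1)})


print("ranged:", shards_touched(100, 180, ranged_shard))
print("hashed:", shards_touched(100, 180, hashed_shard))
```

With ranged placement the query for keys 100 to 180 stays on one shard, while the hashed placement scatters neighboring keys, so the same query has to visit several shards.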

What Are the Most Common Errors To Avoid When Sharding Your MongoDB Database?

MongoDB sharding is a powerful method that allows you to scale your database horizontally and spread data across several servers. However, there are several errors that you need to be aware of when sharding your MongoDB database. Below are some of the most frequently made mistakes and the best way to stay clear of them.

1. Choosing the Wrong Sharding Key

One of the most critical decisions you'll make when sharding your MongoDB database is choosing the sharding key. The shard key determines how data is distributed across shards, and choosing the wrong key can result in uneven data distribution, hotspots, and poor performance.

A common mistake is choosing a shard key value that only increases for new documents when using range-based sharding rather than hashed sharding. Examples include a timestamp (naturally) or anything that has a time component as its most significant part, such as ObjectId (its first four bytes are a timestamp).

If you choose such a shard key, all inserts go to the chunk that covers the highest range, and therefore to a single shard. Even if you keep adding new shards, your maximum write capacity won't increase.

If you plan to scale for write capacity, try using a hash-based shard key, which lets you use the same field while providing good write scalability.
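
The write hotspot described above can be reproduced with a short simulation. The shard names, split points, and timestamps are hypothetical, and the MD5-based hash is only a stand-in for hashed placement, not MongoDB's internal function.

```python
import bisect
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2"]
# Ranged sharding: split points derived from existing data.
# The last range is open-ended, so it owns every larger key.
SPLIT_POINTS = [1_000, 2_000]


def ranged_target(timestamp):
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, timestamp)]


def hashed_target(timestamp):
    digest = hashlib.md5(str(timestamp).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# 300 new documents whose timestamps are newer than all existing data.
new_timestamps = range(3_000, 3_300)

ranged = Counter(ranged_target(ts) for ts in new_timestamps)
hashed = Counter(hashed_target(ts) for ts in new_timestamps)

print("ranged:", dict(ranged))  # every insert lands on the last shard
print("hashed:", dict(hashed))  # inserts are spread across the shards
```

Adding a fourth shard to the ranged setup would not change the picture: new timestamps would still all fall into the open-ended last range.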

2. Trying To Change the Value of the Shard Key

In MongoDB versions earlier than 4.2, a document's shard key value is immutable, which means you can't change it. You can make certain updates before sharding, but not after. Trying to modify the shard key value of an existing document fails with the following error:

cannot modify shard key's value fieldid for collection: collectionname

Instead, you can remove and reinsert the document to change its shard key value.

3. Failure to Monitor the Cluster

Sharding adds complexity to the database environment, making it crucial to monitor the cluster closely. Failing to monitor the cluster can lead to performance problems, data loss, and other issues.

To avoid this mistake, set up monitoring tools to track key metrics such as CPU usage, memory usage, disk space, and network activity. You should also set up alerts for when certain thresholds are reached.
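
A threshold alert boils down to comparing each sampled metric against a limit. The sketch below is a minimal illustration with made-up metric names, limits, and sample values; in practice the numbers would come from your monitoring tool rather than a hard-coded dictionary.

```python
# Hypothetical limits, expressed as percentages.
THRESHOLDS = {"cpu_pct": 80, "memory_pct": 85, "disk_pct": 75}


def check_shard(name, metrics, thresholds=THRESHOLDS):
    """Return one alert message per metric that exceeds its limit."""
    alerts = []
    for metric, limit in thresholds.items():
        value = metrics.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{name}: {metric} at {value} exceeds {limit}")
    return alerts


# Made-up samples for two shards.
samples = {
    "shard0": {"cpu_pct": 45, "memory_pct": 60, "disk_pct": 50},
    "shard1": {"cpu_pct": 92, "memory_pct": 70, "disk_pct": 81},
}

for shard, metrics in samples.items():
    for alert in check_shard(shard, metrics):
        print(alert)
```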

4. Waiting Too Long To Add a New Shard (Overloaded)

A common mistake to avoid when sharding your MongoDB database is waiting too long to add a new shard. When a shard becomes overloaded with data or queries, it can cause performance problems and slow down the entire cluster.

Let's say you have a hypothetical cluster that consists of two shards, each with 20,000 chunks (5,000 are considered "active"), and you need to add a third shard. This third shard will eventually hold one third of the active chunks (and of the total chunks).
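
The arithmetic behind this example can be worked through in a few lines. One assumption is needed because the text leaves it open: the 5,000 active chunks are taken to be per shard. The function name is hypothetical, and the result ignores any data growth during the migration.

```python
def rebalance_targets(shards_before, chunks_per_shard, active_per_shard):
    """Estimate the end state after adding one shard to a balanced cluster."""
    shards_after = shards_before + 1
    total_chunks = shards_before * chunks_per_shard
    total_active = shards_before * active_per_shard
    target_chunks = total_chunks / shards_after  # even share per shard
    return {
        "chunks_on_new_shard": round(target_chunks),
        "active_on_new_shard": round(total_active / shards_after),
        "chunks_moved_per_old_shard": round(chunks_per_shard - target_chunks),
    }


# Two shards, 20,000 chunks each, 5,000 active each (assumed per shard).
print(rebalance_targets(2, 20_000, 5_000))
```

Under these assumptions, roughly 6,667 chunks have to leave each existing shard before the new shard carries its full third of the load, and those migrations compete with regular traffic on shards that are already busy.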

The challenge is figuring out when the new shard stops adding overhead and becomes an asset. We'd need to calculate the load the system produces while migrating the active chunks to the new shard, and determine when that load becomes negligible compared to the overall gain for the system.

It's easy to imagine these migrations taking even longer on an overloaded set of shards, which means the newly added shard takes much longer to cross the threshold and become a net gain. As such, it's best to be proactive and add capacity before it becomes necessary.

Some mitigation strategies include monitoring the cluster regularly and adding new shards during periods of low traffic so there's less competition for resources. Another option is to manually balance targeted "hot" chunks (those accessed more than others) to move the activity to the new shard more quickly.

5. Under-Provisioning Config Servers

If config servers are under-provisioned, the result can be performance issues and instability. Under-provisioning can occur due to an insufficient allocation of resources like CPU, memory, or storage.

This can result in slow query performance, timeouts, and even crashes. To avoid this, it's crucial to allocate enough resources to the config servers, especially in larger clusters. Monitoring the config servers' resource usage regularly can help you identify under-provisioning issues.

Another way to prevent this is to use dedicated hardware for the config servers rather than sharing resources with other components of the cluster. This ensures that the config servers have enough resources to handle their workload.

6. Failing To Back Up and Restore the Data

Backups are vital to ensure that data isn't lost in the event of a malfunction. Loss of data could be due to different reasons like the failure of hardware or human error. It can also be caused by malicious attacks.

7. Failing To Test the Sharded Cluster

Before deploying your sharded cluster to production, be sure to test it thoroughly to confirm that it can handle the expected load and queries. Failing to test your sharded cluster can result in poor performance or even crashes.

MongoDB Sharding vs. Clustered Indexes: Which Is More Effective for Large Datasets?

Both MongoDB sharding and clustered indexes are effective strategies for handling large datasets, but they serve different purposes. Choosing the right approach depends on the specific requirements of your application.

Sharding is a horizontal scaling technique that distributes data across multiple nodes, making it an effective solution for handling large datasets with high write rates. It's transparent to applications, allowing them to interact with MongoDB as if it were a single server.

On the other hand, clustered indexes improve the performance of queries that retrieve data from large datasets by allowing MongoDB to locate the data more efficiently when a query matches the indexed field.

So, which one is more effective for large datasets? It depends on the specific use case and workload requirements.

If your application requires high write and query throughput and needs to scale horizontally, MongoDB sharding is likely the better option. However, clustered indexes may be more effective if the application has a read-heavy workload and requires frequently queried data to be organized in a specific order.

Summary

A sharded cluster is a powerful architecture that can handle large amounts of data and scale horizontally to meet the demands of growing applications. The cluster consists of shards, config servers, mongos processes, and client applications, and data is partitioned based on a shard key that's chosen carefully to ensure efficient distribution and querying.

Utilizing the potential of sharding, applications can attain high availability, enhanced performance and efficient utilization of hardware resources. Selecting the correct sharding key is essential to ensure an even distribution of the data.

   What are your views on MongoDB and the process of database sharding? Is there any aspect of sharding that you feel we could have addressed? Please let us know by leaving a comment!

Jeremy Holcombe

Content and Marketing Editor at , WordPress Web Developer, and Content Writer. Outside of everything related to WordPress I like the ocean, golf and movies. Also, I have height problems ;).