Blueprint of a cloud data store inside for objects with state updated often (Azure and AWS)

This post will focus on how we shall design a cloud system inside Azure and AWS that needs to handle objects with the state changing very often.

Proposition

I want to have a system that can store objects in a storage with a flexible schema. A part of the objects are updated every day and queries are per object. Aggregation reports are done inside a data warehouse solution, not directly inside this storage.

Requirements

Let’s imagine a client that needs a cloud solution with the following requirements:

  • 500M objects stored
  • 1M new objects added every day
  • 10M objects state are changed every day
  • Update operation shall be under 0.3s
  • Query per objects shall be under 0.2s
  • 30M queries per day that check the object state
  • Dynamic schema of the objects
  • Except for object state, all other attributes are written one time (during insert operation)

Solution overview

We need a NoSQL solution to store the objects. Even so, the challenging part is to design a solution that enables us to do fast updates on the object state and keep the cost under control. By using a NoSQL solution the size of the storage is not a problem. Having 500M or 1000M objects is the same thing as long as we are doing the partitioning the right way from the beginning.

Because most of the updates and queries are on the state attribute of the object we can optimize the storage by adding an index on the state field if necessary.

Even if we have a NoSQL solution, having a high number of operations would create similar bottlenecks as for a relational database. Besides this, we need to take the cost into account and try to optimize the consumption as much as possible.

The proposed solution is a hybrid one that combines two different types of NoSQL solutions. Object attributes are stored inside a document DB storage except for the state attribute. The state attribute is stored inside key-value storage, that it is optimized for a high number of write and reads.

The latency may increase a little because if you need to load an object completely, you need to query two storages, but at the same time because of the key-value storage, you can retrieve easily the object state based on the object ID.

The cost of storing the data inside a key-value database with items that are very often updated is much lower in comparison with a document DB storage.

In the next part of the post will take a look on how the solution would look like inside AWS and Azure.

Azure Approach

The data layer would use two different types of storages from Azure Cosmos DB. The first type of storage is DocumentDB that would be used to store the objects information inside Azure Cosmos DB. All objects attributes are stored inside it except the object state attribute.

The object state attribute is stored inside Tables. This key-value stored is optimized for a high number of writes. To reduce the running cost, even more, we would replace the Tables, that are part of Azure Cosmos DB with Azure Tables. Azure Tables are a good option for us as long we limit our queries per objects ID (key) and we don’t try to run complex queries.

Inside Azure Cosmos DB we have a level of control at the partitioning level, but for Azure Table we might hit some limitations. Because of this, if we go with an approach where we use Azure Tables the Partition Key shall be the hash of the object ID and the key the object itself. Also, if the number of transactions per Azure Table is higher than 20K/second, multiple Storage accounts might be required. If you don’t want to manage these possible issues and reduce risk then you should go with Tables from Azure Cosmos DB.

Azure Cosmos DB can scale automatically and has a DR strategy that is very powerful and easy to use. It’s one of the best NoSQL solutions that are on the market, and when it is well configured is amazing. Automatic DR and data replication across regions is available, but everything comes with a cost, especially from operation part ($$$)

AWS Approach

The approach inside AWS is similar, but it is built on top of AWS DocumentDB to store the object attributes and AWS DynamoDB to store the object state. AWS DynamoDB is one of the best key-value data stores available on the market. When data consistency and DR are not top priorities AWS DynamoDB is your best choice. Besides scaling and speed capabilities, AWS DynamoDB is enabling us to push a stream of data to the AWS Redshift. Any update of the data is automatically pushed to the data warehouse, allowing us to have an out of the box system that sends the updates to the data warehouse.

AWS DocumentDB fulfils his job very good, being able to store 500M of objects without any issues.

Final thoughts

Splitting the data storage into 2 different types of storage can be a right choice when there only a small subset of the fields are updated very often and the rest of them are written only one time — during the insert operation. Combining the power of document DB storage with key-value pair storage enables us to design a system that can manage a high throughput easily.

Both cloud providers offer services that match our needs that are highly scalable and cheap from the operational perspective. Inside Azure, this can be achieved by combining DocumentDB and Tables from Azure Cosmos DB. For AWS ecosystem we would need to use AWS DocumentDB and AWS DynamoDB.

Source: http://vunvulearadu.blogspot.com/2019/09/blueprint-of-cloud-data-store-inside.html