The point of polyglot

Recently I spoke about polyglot persistence in one of the SQL Saturday events. The basic idea of this session revolved around the idea of not getting overwhelmed by the NoSQL boom, but at the same time understanding the modern application requirements which demand more features which side with the NoSQL features.

Enterprise application development is under massive shift than ever before. Enterprises look for more consumer application and social features in the enterprise software. Example – having a chat feature in a banking system, tags based image search, heavy blob handling features like bookmarking, read-resume-state and some go beyond the traditional limits and have AI features with cognitive services.

So the NoSQL technologies would help us in mapping, modeling, designing and developing these applications, sure they would do. But the adaption of NoSQL technologies, how it happens and the mentality of the people is quite interesting to see.

In my opinion there are two major concerns prevail in the industry about the adaption of NoSQL technology. They are

  • NoSQL for no reason  – People who believe that NoSQL is the way to go in all the projects. NoSQL is the ultimate savior.  NoSQL replaces the relational stores. World does not require relational databases. I often hear complains that the database table have more than 1 million rows or the database has grown more than 2 TB and now we think we need to move this to NoSQL, or they say it is very slow, so we need to move to NoSQL.
  • Fear of traditional relational database people – People who have relational database skills and think their skills do not match the NoSQL world and afraid of it. NoSQL is an alien technology that is going to replace relational databases. The fear of these people get worse by the group of people mentioned above who believe NoSQL for no reason.

Both parties miss the big picture. The better option is to use the right technology based on the requirement. The better case is – opting for polyglot persistence as the hybrid of both relational and NoSQL technologies.

Let’s name the decision point of when to make the move to polyglot persistence is point of polyglot. Below I have presented two real cases of polyglot persistence and mainly at which stage it happened.

Scenario of moving to polyglot from relational only – A product used in banking risk analysis, it handles many transactions and Azure SQL database running on premium tier. A feature came that users should be able to create their own forms and collect data (a custom surveys) we needed to store the HTML of the survey template and data filled by the users. At this point we thought about NoSQL, but we sided with relational. We stored the template as HTML and data as JSON in the SQL Database. We made this decision because there is no search required to be performed and the new feature seemed less likely to be used frequently. Later another feature rich chat module came with the ability to send attachments and group conversations. This is the point we decided to use Document DB (Azure based document type NoSQL). The user related data is in SQL Databases and the chat messages are in Document DB leading to a polyglot persistence.

Things to note : We were reluctant to move to NoSQL when the survey requirement came because, though it is dynamic during creation very much static after creation. And we didn’t want to add up NoSQL just because of this feature which is a part of a big module. But we readily made the decision of using Document DB for chat because it is a replacement of internal email system and not a good candidate to model using the relational schema.

Scenario of moving to polyglot from NoSQL only – This is a backend service and persistence of an emerging mobile app. Loads of unstructured data about places and reviews. Started with Azure Document DB. Later the app expanded and wanted the places and restaurants to be able to login using a portal and adjust their payment plans for promotions. We required to persist meta data and payment information – that’s the point we set up a Azure SQL Database and everything is smooth.

Things to note : It’s not that a NoSQL database cannot handle those transaction / accounting based information but it is  not a natural fit for any reporting and auditing purpose.

As you see there’s no strict rule on when one should decide to move to a NoSQL or to relational schema. I mention this balance as the natural fit.

Having strict demarcations of relational and NoSQL wouldn’t help to achieve the best use cases. As it’s hard to define the crossing point but it easy to see the overall business case and decide.

The below figure shows, the point of polyglot (author’s concept)

image

Natural fit plays a major role in deciding the point of polyglot. But it doesn’t mean it is always somewhere in the middle, it can be anywhere based on the product features, roadmaps and team skill. There are products which have polyglot persistence from the beginning of the implementation.

Though point of polyglot can be mapped like above, the implementation of polyglot is influenced by two major factors – they are cost of implementation and the available skills. The below figure shows the decision matrix (author’s concept).

image

Conclusion – There are two groups of people with opposing mindsets in adapting either NoSQL or relational stores. At some point most of the projects would go through the point of polyglot but this is not the implementation point. In the general ground, implementation decision is highly influenced by the decision matrix.

Advertisement

Design an online forum on Azure Document DB

Couple of weeks back I posted an article on how to design an online forum application on top of Azure Table Storage. This post is about how to design the same application using Azure Document DB. Same as the previous article, I want to stress the point that the way we design an application and the thinking behind the design completely differ based on the NoSQL technology we select.

These are the basic requirements / functionalities of the application.

  • Forum members can post questions under categories.
  • Forum members can reply to posts.
  • Users have points based on their forum contribution.

Document type NoSQL databases are handy in storing data as documents, most of the modern document databases support JSON as the document storage format.

Also I assume that you have the understanding of Azure Document DB about indexing, consistency levels and how it is structured as databases, collections, documents and more.

Basic Idea

Based on the above requirements, if we design a single document it would look similar to this.

image

As you see we can have a single document structure to cover everything the application requires, but it has some drawbacks too.

Mainly user data is redundant and if we want to update the points of the user we have to go through all the documents and update it, we use other data operations like map reduce to perform these operations in a large scale document type implementations.

Design for Azure Document DB

It is recommended and straight forward to have a dedicated collection for each identified entitiy. Thinking on that base, we would require four main collections they are users, categories, posts and replies.

This design is easy, highly scalable but expensive, because Document DB databases are containers for the collections. Collections are the containers for the documents and also a single collection is the billing entity, meaning that if you create a database and two collections within that database in a tier priced $25 per month, then you will be billed $50 per month as you have two collections.

In this design will be having 4 collections.

User document

image

Category document

image

Posts document

image

Reply document

image

 

But this would not be the ideal design for the solution in terms of the best tradeoff between the cost and the solution design.

Because having a dedicated collection for the category is not necessary, we can simply have the category as an attribute in the posts. Having a dedicated collection for users might sound too much. I do not totally offend this – because, sometimes it is a good idea to have a dedicated collection for the users, especially when the number of users grow in large scale.

Also remember the design using the Azure table storage where we used bucketing strategies to partition the users, we can use the same strategy here if we have millions of users. We can put them in different collections rather than keeping them in one single collection.

But say that we have only few hundreds of users and few categories, then we do not want to have separate collection for each. So we need a mechanism to put them in the same collection and query them.

The idea is simple, again this is not the technology but it is the best decision we make on top of the technologies we use.

Have two documents with their Ids represent the entities or you can have an attribute called type which represents the document.

image

image

When you change the design like this, there is a significant change that you should do in your queries.

But again the idea here is to give you the possibilities how you can design the solution on top of Document DB.

Also thinking about the posts and replies, better practice is to keep the posts and replies in separate collections as designed earlier. Because not only that you can scale them individually but also it is not a best practice to have unbounded attribute in a document, meaning an attribute who’s values have theoretically no limits. Replies is an unbounded array, so we will have a dedicated collection for that.

Conclusion

This is the second series of the post in designing applications on Azure NoSQL offerings, however one of the main point I want to clarify is the design decisions we make vary based on the NoSQL technology we pick.

Design an online forum application on Azure Table Storage

NoSQL technologies provide solutions for issues that relational databases cannot provide. At the same time, designing an application on top of a NoSQL technology requires specific technology dependent design decisions and architecture.

This post addresses the issue and explains how to model a real world problem using Azure Table Storage. This is neither an introduction to Azure Table storage nor a code sample, but this post provides the thinking behind designing applications on Azure Table Storage.

Designing in Azure Table Storage

Azure Table Storage is a column store NoSQL data store, it has 4 types of querying practices.

  1. Point Query – Query based on Partition Key and Row Key, retrieves single entity.
  2. Range Query – Query based on Partition Key and range of Row Keys, retrieves multiple entities.
  3. Partition Scan – Partition Key is used but Row Key is not known / not used in the in the query, other non-key fields might be used.
  4. Table Scan – Partition Key is not used in the query, other key fields might be used.

Scenario

Think something similar to StackOverflow or MSDN forum. (Be mindful that developing a forum in that scale requires lot more technologies and strategies other than NoSQL). But as a scenario let’s assume we’re going to build a small scale forum with the following features.

    • Forum members can post questions under categories.
    • Forum members can reply to posts.
    • Users have points based on their forum contribution.

Design

In modeling our application in the Azure Table Storage, we need to identify the tables first. Users, Posts, Replies and Categories are the main tables.

Categories table can have single partition or may be two partitions – Active and Archived.

1

Row Key has been used to store the category name, in the entity class CategoryName has IgnoreProperty attribute, which makes it virtual and there will not be a physical column called CategoryName in the table. Since category name is the Row Key under a partition there won’t be duplicates in category names within the partition.

image

Keep the fixed Partition Keys as enums, this avoids mistakes (mostly typing mistakes in dealing with strings) in defining Partition Keys.

image

A simple query (partition scan) to retrieve all Active categories.

image

 

Users table has a special design, email address and password are used as credentials. So email address should be unique across the entire Users table regardless of the Partition Key – Row Key combination. So are we going to design the Users table in a single partition with email being the Row Key ?

This is possible but it is not a good design practice, dumping millions of user records under single partition.

The strategy is simple bucketing, I define 6 partitions for the Users table with Partition Key simply being a single number, like 1 to 6. And allocate email addresses based on their first letter.

Consider, that any email address starting from ‘a’ to ‘d’ go to partition 1, email addresses starting from ‘e’ to ‘h’ go to partition 2 like shown in the table below. This achieves both the uniqueness of the email address across the table and gives the partition scalability.

aa

A simple method like below would decide the Partition Key.

image

 

Posts table would be a straight forward design with Partition Key being the category name and PostId (GUID) would be the Row Key. Posts of each category live in a separate parition.

image

Like the Category entity, Post entity class will link Partition Key and Row Key using two properties CategoryName and PostId respectively marked with the IgnoreProperty attribute. See the code snippet given below.

image

If you think, using category names as Partition Keys would outgrow the rows in a single partition since one category can have hundreds of thousands of rows, you can concatenate the category name along with the year and create partitions like Azure-2015, Azure-2016 or use any other possible variable.

But the point is, making sure that you can calculate the Partition Keys from a formula gives you the ability to limit your queries maximum to Partition Scans.

 

In this scenario, Replies table can take two highly possible designs.

First, there is no separate table for Replies, use the Posts table with an additional column called ParentId. Posts will have an empty ParentId and replies will have values for ParentId of the post they are made to. Replies also go to the same partition as Posts.

Second design is having a separate table for Replies – I would personally go for this design as we can have more detailed information specific to replies.

Partition Key would be the category name and Row Key would be the Reply ID. PostId would be another column. So in order to find the replies of a Post we would trigger a Partition Scan.

Conclusion

Designing an application on top of any NoSQL technology requires specific planning and architecture based on the domain and the selected NoSQL platform. The knowledge of the underlying NoSQL technology is very essential in order to make the design efficient.

For example, in the above design if we get a requirement to show the recent 20 posts in the home page, regardless of the category, this would definitely trigger a Table Scan and also we have to bring all the posts and sort it based on the TimeStamp property.

So a good decision would be having another temporary table to keep the top 20 posts, when a new post is added the Id of the post will be updated in that table and removing the last old one. We can use write behind strategies in the application to do this.

So make sure that you design the application for the technology in a correct and efficient way.

The biggest misleading point I always here in the industry is, NoSQL development is easy and takes less time. Those two arguments are subjective and also you need to compare it with some other technology, commonly they do the comparison with relational database technologies. But in my experience I don’t see any significant time savings in using a NoSQL technology. But there are other benefits for sure.