Design an online forum on Azure Document DB

A couple of weeks back I posted an article on how to design an online forum application on top of Azure Table Storage. This post is about how to design the same application using Azure Document DB. As in the previous article, I want to stress that the way we design an application, and the thinking behind that design, differ completely based on the NoSQL technology we select.

These are the basic requirements / functionalities of the application.

  • Forum members can post questions under categories.
  • Forum members can reply to posts.
  • Users have points based on their forum contribution.

Document-type NoSQL databases are handy for storing data as documents; most modern document databases support JSON as the document storage format.

I also assume that you have a working understanding of Azure Document DB: indexing, consistency levels, and how it is structured into databases, collections, documents and so on.

Basic Idea

Based on the above requirements, if we design a single document it would look similar to this.

[Image: a single forum document embedding the post, its replies and the related user data]
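To make that concrete, here is a minimal sketch of how such a denormalized document could be created with the DocumentDB .NET SDK. All attribute, database and collection names below are my own placeholders, not taken from the original diagram.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class SingleDocumentSketch
{
    // "forumdb" and "forum" are placeholder database / collection names.
    static async Task CreatePostAsync(DocumentClient client)
    {
        var post = new
        {
            id = Guid.NewGuid().ToString(),
            category = "Azure",
            title = "How do I scale DocumentDB?",
            // user data is embedded in the post document (this is the redundancy discussed below)
            postedBy = new { userId = "user-001", name = "Thuru", points = 1250 },
            replies = new[]
            {
                new
                {
                    replyId = Guid.NewGuid().ToString(),
                    text = "Use a dedicated collection per entity.",
                    repliedBy = new { userId = "user-002", name = "Bob", points = 300 }
                }
            }
        };

        await client.CreateDocumentAsync(
            UriFactory.CreateDocumentCollectionUri("forumdb", "forum"), post);
    }
}
```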

As you can see, a single document structure can cover everything the application requires, but it has some drawbacks too.

The main one is that user data is redundant: if we want to update a user's points, we have to go through every document that embeds that user and update each of them. In large-scale document database implementations, data operations like map-reduce are typically used to perform such updates.

Design for Azure Document DB

The recommended and straightforward approach is to have a dedicated collection for each identified entity. On that basis we would require four main collections: users, categories, posts and replies.

This design is easy and highly scalable, but expensive, because in Document DB databases are containers for collections, collections are containers for documents, and a single collection is the billing entity. If you create a database with two collections in a tier priced at $25 per month, you will be billed $50 per month because you have two collections.

In this design we will have four collections.
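As a sketch of the setup for this design, the four collections could be created with the DocumentDB .NET SDK roughly as below; the database name "forumdb" and the collection names are my placeholders.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class CollectionSetupSketch
{
    // Each collection created here is a separate billing unit.
    static async Task CreateCollectionsAsync(DocumentClient client)
    {
        await client.CreateDatabaseAsync(new Database { Id = "forumdb" });

        foreach (var name in new[] { "users", "categories", "posts", "replies" })
        {
            await client.CreateDocumentCollectionAsync(
                UriFactory.CreateDatabaseUri("forumdb"),
                new DocumentCollection { Id = name });
        }
    }
}
```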

User document

[Image: User document JSON]

Category document

[Image: Category document JSON]

Posts document

[Image: Post document JSON]

Reply document

[Image: Reply document JSON]

 

But this is not the ideal design in terms of the tradeoff between cost and solution complexity.

A dedicated collection for categories is not necessary; we can simply have the category as an attribute of each post. A dedicated collection for users might also sound like too much. I do not completely dismiss that option, because sometimes it is a good idea to have a dedicated collection for users, especially when the number of users grows very large.

Also, remember the Azure Table Storage design, where we used a bucketing strategy to partition the users; we can use the same strategy here if we have millions of users, putting them in different collections rather than keeping them in one single collection.

But say we have only a few hundred users and a few categories; then we do not want a separate collection for each. We need a mechanism to put them in the same collection and still query them.

The idea is simple; again, this is not about the technology itself but about the best decision we make on top of the technology we use.

Have the document Ids represent the entity types, or add an attribute called type which identifies what kind of document it is.

[Images: sample documents for a user and a category stored in the same collection, distinguished by their Ids / a type attribute]

When you change the design like this, you must also change your queries accordingly.
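For example, a query over a shared collection now needs an extra filter on the discriminator. This is only a sketch: the Lookup class, the type attribute and the collection name "lookups" are my assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure.Documents.Client;

// Hypothetical shape for documents stored in a shared "lookups" collection.
public class Lookup
{
    public string id { get; set; }
    public string type { get; set; }   // e.g. "category" or "user"
    public string name { get; set; }
}

class SharedCollectionQuerySketch
{
    static List<Lookup> GetCategories(DocumentClient client)
    {
        // The extra 'type' filter is the significant change this design forces on queries.
        return client.CreateDocumentQuery<Lookup>(
                   UriFactory.CreateDocumentCollectionUri("forumdb", "lookups"))
               .Where(d => d.type == "category")
               .ToList();
    }
}
```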

But again, the idea here is to show you the possible ways you can design the solution on top of Document DB.

Thinking about posts and replies, the better practice is to keep them in separate collections, as designed earlier. Not only can you scale them individually, but it is also not a best practice to have an unbounded attribute in a document, meaning an attribute whose values have no theoretical limit. Replies form an unbounded array, so we will have a dedicated collection for them.

Conclusion

This is the second post in the series on designing applications on Azure NoSQL offerings. The main point I want to emphasize is that the design decisions we make vary based on the NoSQL technology we pick.


Design an online forum application on Azure Table Storage

NoSQL technologies provide solutions to problems that relational databases cannot solve. At the same time, designing an application on top of a NoSQL technology requires specific technology-dependent design decisions and architecture.

This post addresses the issue and explains how to model a real world problem using Azure Table Storage. This is neither an introduction to Azure Table storage nor a code sample, but this post provides the thinking behind designing applications on Azure Table Storage.

Designing in Azure Table Storage

Azure Table Storage is a column-store NoSQL data store. It supports four types of querying practices (a short code sketch follows the list below).

  1. Point Query – query based on Partition Key and Row Key; retrieves a single entity.
  2. Range Query – query based on Partition Key and a range of Row Keys; retrieves multiple entities.
  3. Partition Scan – the Partition Key is used but the Row Key is not known / not used in the query; other non-key fields might be used.
  4. Table Scan – the Partition Key is not used in the query; other fields might be used.
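To make these practices concrete, here is a hedged sketch using the Azure Storage .NET SDK, with illustrative table, partition and row key values; it shows a Point Query and a Partition Scan.

```csharp
using Microsoft.WindowsAzure.Storage.Table;

class QueryTypesSketch
{
    static void Run(CloudTable postsTable)
    {
        // Point Query: both PartitionKey and RowKey are known - retrieves a single entity.
        var point = TableOperation.Retrieve<DynamicTableEntity>("Azure", "post-id-123");
        var single = postsTable.Execute(point);

        // Partition Scan: PartitionKey is known, RowKey is not - scans within one partition.
        var scan = new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Azure"));
        var many = postsTable.ExecuteQuery(scan);
    }
}
```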

Scenario

Think of something similar to StackOverflow or the MSDN forums. (Be mindful that developing a forum at that scale requires many more technologies and strategies than just NoSQL.) As a scenario, let's assume we're going to build a small-scale forum with the following features.

    • Forum members can post questions under categories.
    • Forum members can reply to posts.
    • Users have points based on their forum contribution.

Design

In modeling our application in the Azure Table Storage, we need to identify the tables first. Users, Posts, Replies and Categories are the main tables.

The Categories table can have a single partition, or perhaps two partitions – Active and Archived.

[Image: Categories table design]

The Row Key is used to store the category name. In the entity class, CategoryName has the IgnoreProperty attribute, which makes it virtual, so there will not be a physical column called CategoryName in the table. Since the category name is the Row Key, there won't be duplicate category names within a partition.

[Image: Category entity class]

Keep the fixed Partition Keys as enums; this avoids mistakes (mostly typos when dealing with strings) in defining Partition Keys.

[Image: Partition Key enum]

A simple query (partition scan) to retrieve all Active categories.

[Image: partition scan query for Active categories]
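Since the original snippets appear only as images, here is a rough reconstruction of the idea; the exact property and enum names are my guesses.

```csharp
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

public enum CategoryPartition { Active, Archived }   // fixed Partition Keys kept as an enum

public class CategoryEntity : TableEntity
{
    // Virtual property mapped to the RowKey - no physical CategoryName column is created.
    [IgnoreProperty]
    public string CategoryName
    {
        get { return RowKey; }
        set { RowKey = value; }
    }
}

class CategoryQueries
{
    static IEnumerable<CategoryEntity> GetActiveCategories(CloudTable categoriesTable)
    {
        // Partition scan over the Active partition.
        var query = new TableQuery<CategoryEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal,
                CategoryPartition.Active.ToString()));
        return categoriesTable.ExecuteQuery(query);
    }
}
```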

 

The Users table has a special design: email address and password are used as credentials, so the email address should be unique across the entire Users table regardless of the Partition Key – Row Key combination. So are we going to design the Users table as a single partition with the email being the Row Key?

This is possible, but dumping millions of user records into a single partition is not good design practice.

The strategy is simple bucketing: I define six partitions for the Users table, with the Partition Key simply being a single number from 1 to 6, and allocate email addresses to partitions based on their first letter.

Consider that any email address starting with 'a' to 'd' goes to partition 1, email addresses starting with 'e' to 'h' go to partition 2, and so on, as shown in the table below. This preserves the uniqueness of the email address across the table and gives partition-level scalability.

[Table: first letter of the email address mapped to partition numbers 1–6]

A simple method like below would decide the Partition Key.

[Image: Partition Key calculation method]
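A sketch of such a method is shown below. Only the first two letter ranges ('a' to 'd' and 'e' to 'h') come from the post; the remaining ranges are an assumed continuation of the same scheme.

```csharp
class UserPartitionResolver
{
    // Buckets email addresses into 6 partitions by their first letter.
    static string GetPartitionKey(string email)
    {
        char first = char.ToLowerInvariant(email.Trim()[0]);

        if (first >= 'a' && first <= 'd') return "1";
        if (first >= 'e' && first <= 'h') return "2";
        if (first >= 'i' && first <= 'l') return "3";   // assumed continuation
        if (first >= 'm' && first <= 'p') return "4";
        if (first >= 'q' && first <= 't') return "5";
        return "6";   // 'u'-'z', digits and anything else fall into the last bucket
    }
}
```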

 

The Posts table is a straightforward design, with the Partition Key being the category name and a PostId (GUID) as the Row Key. Posts of each category live in a separate partition.

[Image: Posts table design]

Like the Category entity, the Post entity class links the Partition Key and Row Key to two properties, CategoryName and PostId respectively, marked with the IgnoreProperty attribute. See the code snippet given below.

[Image: Post entity class]
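The original snippet is only available as an image, so the following is an approximate reconstruction; the non-key properties are illustrative.

```csharp
using Microsoft.WindowsAzure.Storage.Table;

public class PostEntity : TableEntity
{
    [IgnoreProperty]
    public string CategoryName          // maps to the PartitionKey
    {
        get { return PartitionKey; }
        set { PartitionKey = value; }
    }

    [IgnoreProperty]
    public string PostId                // maps to the RowKey (a GUID string)
    {
        get { return RowKey; }
        set { RowKey = value; }
    }

    public string Title { get; set; }
    public string Body { get; set; }
    public string PostedBy { get; set; }
}
```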

If you think that using category names as Partition Keys would let a single partition grow too large (one category can have hundreds of thousands of rows), you can concatenate the category name with the year and create partitions like Azure-2015 and Azure-2016, or use any other suitable variable.

The point is that making sure you can calculate Partition Keys from a formula gives you the ability to limit your queries to, at worst, Partition Scans.

 

In this scenario, the Replies table can take two likely designs.

First, have no separate table for Replies: use the Posts table with an additional column called ParentId. Posts will have an empty ParentId, while replies will carry the ParentId of the post they are made to. Replies also go to the same partition as Posts.

The second design is a separate table for Replies. I would personally go for this design, as we can keep more detailed information specific to replies.

The Partition Key would be the category name and the Row Key would be the Reply ID, with PostId as another column. So in order to find the replies of a post we would trigger a Partition Scan.
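As a sketch (the entity shape and the PostId column name are assumed), finding the replies of a post becomes a Partition Scan on the category partition with an additional filter on PostId:

```csharp
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

class ReplyQueries
{
    static IEnumerable<DynamicTableEntity> GetReplies(CloudTable repliesTable, string category, string postId)
    {
        // Partition scan: PartitionKey = category, filtered further by the non-key PostId column.
        var filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, category),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PostId", QueryComparisons.Equal, postId));

        return repliesTable.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter));
    }
}
```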

Conclusion

Designing an application on top of any NoSQL technology requires specific planning and architecture based on the domain and the selected NoSQL platform. Knowledge of the underlying NoSQL technology is essential to make the design efficient.

For example, in the above design, if we get a requirement to show the 20 most recent posts on the home page regardless of category, this would definitely trigger a Table Scan; we would have to bring back all the posts and sort them by the Timestamp property.

A good decision would be to have another small table that keeps the 20 most recent posts: when a new post is added, its Id is inserted into that table and the oldest entry is removed. We can use write-behind strategies in the application to do this.

So make sure that you design the application for the technology in a correct and efficient way.

The most misleading claim I keep hearing in the industry is that NoSQL development is easy and takes less time. Both arguments are subjective, and they also need a point of comparison, which is commonly relational database technologies. In my experience I don't see any significant time savings in using a NoSQL technology, but there are certainly other benefits.

The remote server returned an error: (412) The append position condition specified was not met – Azure Append Blob issue

Azure Blob storage recently got a new addition: the Append Blob. Based on the Microsoft documentation, this is an ideal blob type for data that is frequently appended to, which makes it a suitable candidate for logging.

It is a natural tendency to see the AppendText method and go for it for simple text-based logging, and then you end up with this exception – The remote server returned an error: (412) The append position condition specified was not met.

This blogger has addressed the issue and provided a workaround in his post.

If you're looking for a solution, the above link has it. If you want to understand the reason for the issue, continue reading.

Investigation

I started the investigation by digging directly into the source code of the Azure Storage SDK on GitHub.

The documentation mentions that the AppendText method can be used only in single-writer scenarios.

See this section: CloudAppendBlob Methods.

The SDK is very explicit that AppendText should be used only in single-writer scenarios; check line number 1739 of this file.

CloudAppendBlob methods that cannot be used in concurrent access scenarios

  • AppendFromByteArray
  • AppendFromFile
  • AppendFromStream
  • AppendText

The async counterparts of the above methods cannot be used under concurrent access either.

The only method that can be used in concurrent scenarios is AppendBlock / AppendBlockAsync.
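For reference, here is a minimal sketch of appending log lines through AppendBlockAsync instead of AppendText; the blob is assumed to exist already, and the helper name is mine.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

class AppendBlobLogger
{
    // Appends a line by wrapping it in a stream and using AppendBlockAsync,
    // which is the member intended for multiple-writer scenarios.
    static async Task AppendLineAsync(CloudAppendBlob blob, string line)
    {
        var bytes = Encoding.UTF8.GetBytes(line + "\n");
        using (var stream = new MemoryStream(bytes))
        {
            await blob.AppendBlockAsync(stream);
        }
    }
}
```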

So why is AppendBlock special?

Investigating the source on GitHub, it is clear that the call chain goes like this:

AppendText calls AppendFromByteArray, which calls AppendFromStream, which in turn calls the internal method UploadFromStreamHelper.

UploadFromStreamHelper cannot handle concurrent scenarios.

Continuing the investigation…..

We still need to investigate how AppendBlock handles concurrent requests. We pass a System.IO.Stream object to the AppendBlock method, and AppendBlock calls WriteToSync on the passed Stream object.

Does System.IO.Stream have a WriteToSync method? No.

The Storage SDK implements WriteToSync as an extension method. See this file, line number 65.

It is clear that WriteToSync makes a synchronized call to the blob using a ManualResetEvent, which is why AppendBlock can handle concurrent access; but remember, this is a blocking call.

 

Hidden gems of Azure Web Apps

Azure Web Apps is a component of Azure App Service. Developers love Web Apps because it offers a very simple web-server-as-a-service model. People often ask what the difference is between VMs, Cloud Services and Web Apps; I think this post explains the difference in greater detail.

Accessing KUDU

Azure Web Apps run with Kudu, an open-source engine which is responsible for management and deployment operations of the Web App.

You can reach the Kudu environment of your Azure Web App by simply modifying its URL. For example, I have the following site to manage Azure Key Vault (in dev environments): https://keyvaultmanager.azurewebsites.net/. In order to access the Kudu environment of this site, simply add the .scm part after the web app name in the URL, like this: https://keyvaultmanager.scm.azurewebsites.net/

You should be logged in to the Azure Portal for this to work. You can configure many settings in Kudu. You can also access Kudu from the Azure Preview Portal, under the Tools section of your web application.

Server Affinity

Web Apps has Server Affinity enabled by default, meaning that the Web Apps service will direct a client's requests to the same server in scaled-out environments. You can leverage this for local in-memory session state. Please note that modern cloud application design principles do not recommend local session state; instead they promote centralized session management.

Still, Server Affinity comes enabled out of the box in Web Apps.


You can simply set the clientAffinityEnabled value to false in the web app's resource template to disable server affinity. You can access this template using Azure Resource Explorer in the Azure Preview Portal.

If you want a simpler solution, add this in your Web App's App Settings section: clientAffinityEnabled = false

Live Monitoring

This is a great management tool for production Web Apps. Azure Web Apps lets you monitor your app live with very low latency, and it also provides a comprehensive log of your application.

You can access it from the Azure Preview Portal: go to the Tools section of your Web App, click the Troubleshoot option, then click Live HTTP Traffic.


Auto Healing

You can reach this option using the Azure Preview Portal: click the Tools section of your web app, click the Troubleshoot option and then click Mitigate.

In this option you can set an alert or logging rule, but the interesting part is that you can enable a recycle rule for your web app. When a condition is met, your web app will be recycled (in IIS, the application pool is restarted).




Above I have set a rule to recycle the web app every time it receives 100 requests in 10 seconds. You can add more sophisticated rules and chain them together.


Cache-Aside Pattern using Redis on Azure

Cache-Aside is a common pattern in modern cloud applications. It is a very simple and straightforward one. The following are the characteristics of the pattern.

  • When the application needs data, it first looks in the cache.
  • If the data is available in the cache, the application uses it from there; otherwise the data is retrieved from the data store and the cache entry is updated.
  • When the application writes data, it first writes to the data store and then invalidates the cache.

How to handle lookups and other cache properties and events is left open, meaning the pattern does not enforce any rules on that. The following steps summarize the idea.

Read path:

  1. The application checks the cache for the data; if the data is in the cache, it is served from the cache.
  2. If the data is not available in the cache, the application looks for it in the data store.
  3. The application then updates the cache with the retrieved data.

Write path:

  1. The application writes the data to the data store.
  2. It then sends an invalidate request to the cache (see the sketch after these steps).
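Before jumping into the full template, here is a stripped-down sketch of the two paths using StackExchange.Redis and Json.NET. The key naming, the delegates standing in for the data store, and the connection string are my own simplifications, not the template's API.

```csharp
using System;
using System.Threading.Tasks;
using Newtonsoft.Json;
using StackExchange.Redis;

class CacheAsideSketch
{
    // Placeholder connection string for an Azure Redis Cache instance.
    static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("<cache-name>.redis.cache.windows.net,ssl=true,password=<key>");

    // Read path: cache first, fall back to the data store, then populate the cache.
    static async Task<T> GetAsync<T>(string key, Func<Task<T>> loadFromStore, TimeSpan ttl)
    {
        IDatabase cache = Redis.GetDatabase();

        RedisValue cached = await cache.StringGetAsync(key);
        if (cached.HasValue)
            return JsonConvert.DeserializeObject<T>(cached);

        T value = await loadFromStore();
        await cache.StringSetAsync(key, JsonConvert.SerializeObject(value), ttl);
        return value;
    }

    // Write path: write to the data store first, then invalidate the cache entry.
    static async Task SetAsync<T>(string key, T value, Func<T, Task> saveToStore)
    {
        await saveToStore(value);
        await Redis.GetDatabase().KeyDeleteAsync(key);
    }
}
```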

Implementation

Project : https://github.com/thuru/CloudPatterns/tree/master/CloudPatterns/CacheAsidePatternWebTemplate

The above project has an implementation of this pattern.

Data objects implement an interface ICacheable, and an abstract class CacheProvider<ICacheable> holds the base implementation of the cache provider. You can implement any cache provider by extending CacheProvider<ICacheable>. The GitHub sample contains code for Azure Redis and AWS ElastiCache implementations.

Implementation of ICacheable : https://github.com/thuru/CloudPatterns/blob/master/CloudPatterns/CacheAsidePatternWebTemplate/Cache/ICacheable.cs

Implementation of CacheProvider<ICacheable>: https://github.com/thuru/CloudPatterns/blob/master/CloudPatterns/CacheAsidePatternWebTemplate/Cache/CacheProvider.cs

Implementation of AzureRedisCacheProvider : https://github.com/thuru/CloudPatterns/blob/master/CloudPatterns/CacheAsidePatternWebTemplate/Cache/AzureRedisCacheProvider.cs

The template also includes cache priming in Global.asax, which can be used to prime your cache (loading the most frequently accessed data at application start).

What you need to know about SQL Database Dynamic Data Masking before the GA

Please read this follow-up post after reading this one, as some critical changes have been made to the DDM feature of SQL Database in GA.
SQL Database Dynamic Data Masking will be Generally Available (GA) very soon, by the end of this month.

For more information about Dynamic Data Masking, please read this article.

As per the article, you can configure dynamic data masking on columns which hold sensitive data, and you also have the option to specify SQL logins which should be excluded from masking, meaning any user connecting to the database using one of the specified SQL logins will see the data unmasked. The following are the key changes coming with GA:

  • SQL Logins will be removed from the exclusion list and you have to specify the SQL users or Azure AD users directly in the exclusion list.
  • SQL users with administrative privileges are always excluded from dynamic data masking

Example:

In a SQL Database server with a SQL Database named Db1 there are 3 SQL logins.

SQL Login   SQL User   Privilege
thuru       Thuru      Admin (sa) for the server
login1      User1      db_owner for Db1
login2      User2      db_owner for Db1

The first thing to note is that after the GA of the feature you cannot use SQL login names in the exclusion list in the portal, meaning you cannot specify login1 or login2 in the list. Instead of the SQL logins, you should specify the SQL users: user1 and user2.

And SQL users with administrative privileges always see the data unmasked.

Look at the image below: I have specified only user1 in the exclusion list.

Logging in to the database as thuru and executing a SELECT statement gives this result.

As you can see, even though I haven't specified the admin user in the exclusion list, admins are automatically excluded from dynamic data masking. If you're using an admin user in your application and have enabled dynamic data masking, you have to create other SQL logins to access the database.

Logging in to the database as user1 and user2 gives the following results, respectively.

How to create other SQL Logins in SQL Database

Log in to the server using the admin user and follow the steps below; a T-SQL sketch of the commands follows the list.

  1. Select the master database and create the logins using this command.

  2. Connect to the database you want to add the user to and add the user using the created SQL login in step 1

  3. Add roles to the user
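A hedged T-SQL sketch of those three steps, using the names from the example above (adjust the login, user, role and password to your own):

```sql
-- Step 1: run in the master database
CREATE LOGIN login1 WITH PASSWORD = '<strong password>';

-- Step 2: run in the user database (Db1), mapping a user to the login created in step 1
CREATE USER User1 FOR LOGIN login1;

-- Step 3: add roles to the user
ALTER ROLE db_owner ADD MEMBER User1;
```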

Azure Elastic Database Pool – Managing SQL Databases in a breeze

If you have multiple SQL databases in Azure and are looking for a centralized management solution, then the Elastic Database Pool is the one you've been looking for.

Elastic Database Pool is a feature that you can enable on SQL Database servers that have the V12 update. It is still in preview as of this writing, so you must accept the preview terms in order to use it.

The Azure Elastic SQL Database Pool allows you to manage different databases from a single, centralized user interface. This is especially handy if you have a SaaS application running on Azure with a single-tenant strategy for your database layer. You will have a different database for each customer, and rolling out updates to each database is not only time consuming but also tedious and error prone. Most developers use scripted tools, either purchased from third parties or custom built, to manage collections of SQL databases. With the introduction of the Azure Elastic Database Pool, you can manage your SQL databases centrally as a group.

You can add or remove any database to or from the pool whenever desired. Apart from centralized management, which is a very important feature for devs, cost benefits are also possible with the Elastic SQL Database Pool, because the pool has a DTU allowance known as eDTUs (elastic database throughput units) that is shared across the databases in the pool; there is no fixed DTU allocation per database.

For example, consider that your application has customers across the globe and database usage varies by time zone. If you're using SQL databases without adding them to a pool, you have to allocate a fixed DTU count (let's say 50) to each customer database. Assume you have two databases, each with 50 DTUs allocated, serving different time zones. Your application has traffic only during the daytime; during the night the DTUs sit unused, but you're still paying for that tier.

But if you put those databases in an Elastic Database Pool with 50 eDTUs, or a little more, both databases will run smoothly at a reduced cost, because when one database is idle the pool allocates the eDTUs to the other database. The Elastic Database Pool also has the flexibility to set minimum and maximum eDTUs per database.

Creating an Elastic SQL Database Pool

  • Go to the new Azure portal and select the SQL Database server on which you want to create a pool. One server can have multiple pools, but one database can be attached to only one pool. In the server blade, click the Add pool button.

  • Give the pool a name and select the pricing tier; you can add databases to the pool now or later. Note the 'Configure Performance' section, where you can adjust the eDTUs and the storage capacity for the pool and set the minimum and maximum eDTU usage per database. You can change these settings later. The number of databases per pool, the total number of eDTUs and the maximum size of the pool are determined by the pool's pricing tier.

Managing the Elastic Database Pool

Now that we have a pool, let's see how to manage the databases in it. Managing the pool is done through jobs, which require some installation in your Azure subscription: a cloud service, a SQL database, a service bus and a storage account. Click the Manage or Create jobs button in the Elastic Pool blade; if you haven't set up Jobs yet, you will see the screen below the first time. Jobs are in preview, so you must accept the preview terms and also set up a credential, which will be used as the administrator credential for the Jobs SQL database.

After setting up the required resources for jobs, you can create jobs. The screen below shows the job creation blade. Jobs are scripted in T-SQL; note that here I have provided the credential of the SQL server I want my script to run against, not the credential of the Jobs SQL database.

I have the option to run jobs immediately or to save them and run them later. You can see the jobs in the Manage Jobs blade.

You can start a job by clicking Run and it will show real time updates about the job status.

All my databases are updated.

The Elastic Database Pool is a very convenient and powerful approach to managing multiple databases as a group.

Elastic Database Pool Jobs is an application hosted as a cloud service, with a SQL database, a service bus and a storage account; these resources are charged separately, in addition to the Elastic Database Pool cost.

Securing your Azure Web Apps FTP Endpoints

Web Apps are my favorite PaaS offering in Azure. They are simple yet powerful, but the way Azure handles Web App FTP deployment credentials is not that nifty, and you had better know and understand how it works.

If you feel too lazy to read the entire post, you can jump to the summary section to grab the findings.

You can set up FTP Deployment Credentials for your Web App, and most developers do, even when they use fully integrated CI/CD, because it is very handy at times. You can enable the FTP Deployment Credentials (FDC) for the Web App in the portal, under the PUBLISHING section of your Web App.

Click on Deployment credentials and you will see the blade where you can enter the FTP username and password. First I entered a common name (wonder what it is? Bob), typed my usual password (wonder what it is? **********) and hit Save. I got the error message below. It is very clear that FTP usernames must be unique across all Azure customers.

Then I entered a username which I assumed no other Azure customer had taken (wonder what it is? nickiminaj), entered the usual password and hit Save. It worked, and I got the success message. So now I can enter my FTP credentials when I browse to the FTP host name of the site. But this FTP deployment credential is shared among all your Web Apps, regardless of which resource group, hosting plan, pricing tier or even subscription they are in. This is generally known as Web Deployment Credentials.

The FTP deployment credential includes a username and a password. The username is in the format <Web App name>\<username>. Look at the images below, showing two different Web Apps from two different subscriptions.

CAUTION

So sharing the FTP deployment credentials of one Web App puts you in danger of exposing access to all the Web Apps that the particular Microsoft Account / Azure AD account has access to. This can be disastrous when you share the credentials with third-party developers: they only have to guess the names of your other sites to derive the full usernames, and then they can access those Web Apps easily, since they already know the password.

The question is: how do we get different FTP credentials for each Web App?

When you set up a Web App, each one gets its own FTP credential assigned by Kudu, the project that provides the deployment infrastructure for Azure Web Apps. You can get this credential by downloading the publish profile of the site.

The publish profile is a simple XML file. Open the file and look for the <publishProfile> element whose publishMethod attribute has the value 'FTP'.

This credential is known as the site-level credential of your Web App and is applicable only to that particular Web App. The three green dots in the image show the required information.
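If you prefer to pull that entry out programmatically, here is a small sketch using LINQ to XML; the attribute names publishUrl, userName and userPWD reflect the publish settings format as I recall it, so verify them against your own file.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class PublishProfileReader
{
    static void PrintFtpCredentials(string publishSettingsPath)
    {
        XDocument doc = XDocument.Load(publishSettingsPath);

        // Find the profile whose publishMethod is FTP.
        XElement ftpProfile = doc.Descendants("publishProfile")
            .First(p => (string)p.Attribute("publishMethod") == "FTP");

        Console.WriteLine("FTP endpoint : " + (string)ftpProfile.Attribute("publishUrl"));
        Console.WriteLine("Username     : " + (string)ftpProfile.Attribute("userName"));
        Console.WriteLine("Password     : " + (string)ftpProfile.Attribute("userPWD"));
    }
}
```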

You cannot set this password yourself, but you can regenerate it by clicking Reset Publish Profile.

You can share these credentials with anyone, and they can access only that particular Web App.

Summary

  • The FTP deployment credential username must be unique across all Azure customers.
  • The FTP deployment credential is shared across all the Web Apps the current Microsoft Account has access to, regardless of resource group, hosting plan, pricing tier and subscription; the username is common to the Microsoft Account.
  • Each Web App exposes the FTP deployment username as WebAppName\username.
  • Example: if you have two Web Apps (webapp1 and webapp2) and you create the username kevin, they will have the FTP deployment usernames webapp1\kevin and webapp2\kevin respectively, with the same password.
  • You can obtain the site-level credential for each Web App, with a username and password generated uniquely per Web App, from the publish profile.