In today’s post, we talk about the flavours of Redis Cache for Microsoft Azure and how to decipher the undocumented errors that we can receive from Redis during the provisioning phase.
When using Microsoft Azure, we have two main options for using Redis Cache:
- Azure Cache for Redis: a SaaS service provided by Microsoft that uses OSS Redis (Redis open-source)
- Azure Cache for Redis Enterprise: a managed service built on Redis Enterprise, offered by Microsoft together with Redis
In 90% of cases, Azure Cache for Redis is the best managed cache solution available in Microsoft Azure, offering a 99.9% availability SLA and supporting up to 120 GB of memory and 40k client connections. I have had a great experience with it, as long as you understand the Redis connection model.
Azure Cache for Redis Enterprise provides more power: up to 13 TB of memory, 2M client connections, a 99.999% availability SLA, 1M operations per second, and all the features of Redis Enterprise, such as active-geo, modules, time series support and Redis on Flash.
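To make the gap between the two SLAs more tangible, here is a minimal sketch that converts an availability percentage into the downtime it still allows. It assumes a 365-day year as the measurement window, which is my simplification (check the official SLA terms for the actual window), and the function name is only illustrative.

```python
def allowed_downtime_minutes(sla_percent: float, days: float = 365.0) -> float:
    """Return the maximum downtime (in minutes) that still meets the SLA.

    Assumption: the SLA is evaluated over `days` full days.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    tiers = [
        ("Azure Cache for Redis", 99.9),
        ("Azure Cache for Redis Enterprise", 99.999),
    ]
    for name, sla in tiers:
        minutes = allowed_downtime_minutes(sla)
        print(f"{name}: {sla}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Under this assumption, 99.9% allows roughly 8.76 hours of downtime per year, while 99.999% allows a little over 5 minutes.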
Going with the Redis Enterprise tiers comes with a price, but it is a good offer, especially when you need active-geo replication. You need to consider that active-geo replication requires 2 instances of Redis Enterprise. The pricing model includes both replicas, because active-geo is the most common motivation to go with the Enterprise tier.
From the performance point of view, you should expect up to 70–75% more operations per second and 40% better latency when you compare the Premium tier of Azure Cache for Redis with Azure Cache for Redis Enterprise.
From the cost point of view, it is hard to compare. If we compare the P5 tier of the Premium offer of Azure Cache for Redis with E100, which is similar in cache size, the running cost is almost double, but you get two data nodes. The real cost hit comes when you are on the Standard C5 tier and need to move to the Enterprise tier, for example for active-geo: the running cost is then 7–10 times higher.
An important difference between the two services is who provides support for it. Azure Cache for Redis is fully managed by Microsoft and well documented. Azure Cache for Redis Enterprise is managed by Microsoft, and you get good support from the Redis team, but you need to consider that it is not directly from Microsoft.
When should I use the Enterprise tier?
The no. 1 feature that makes customers move to the Enterprise tier is active-geo (active geo-replication), together with the JSON and time series features. The performance provided by the Premium tier of Azure Cache for Redis is very good. So far, I have not been involved in a project where the migration to the Enterprise tier was driven by performance. We have been running 2, 4 or 6 instances of the Premium tier deployed across regions without issues. But when geo-replication was required, the Enterprise tier was the best option, even in comparison with other solutions available on the market.
When you consider active geo-replication of Redis, there are 3 main use cases:
(1) Geo-distributed applications: where you want to replicate content across multiple locations in near-real time
(2) Handling region failures: where you keep a fully replicated failover node
(3) Roaming user sessions: where a session spans 2 different locations and the user can be served from either one
Issues with Enterprise tier
A few weeks ago, we had an interesting experience with Redis Enterprise. For a few days, the team received the error below whenever they tried to spin up an instance of Redis Enterprise.
"message": "The resource operation completed with terminal provisioning state 'Failed'."
The error message is cryptic and tells you very little. You can make a lot of assumptions, and you don’t know whether the problem is on your side or on the Redis Enterprise side.
No additional information was provided, and the ARM scripts were correct. The same error message appeared when a new Redis Enterprise instance was created from the Azure Portal. We were trying to do a PoC targeting the active-geo feature, and it was not the best experience for the technical team, which was stuck. We opened an incident ticket with Redis about it.
The cause of the incident was a lack of resources available to Redis in the given region. After a few days, we were able to create a new instance without a problem, but I still have a concern: what if this happened in the production environment during an incident? Would the customer’s solution be down for a few days?
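One way to reduce this risk is to avoid depending on a single region at provisioning time. The sketch below is not what we did in the project; it only illustrates the idea of trying a list of candidate regions in order. The `provision` callable, the `ProvisioningFailed` exception and the region names are all hypothetical placeholders for whatever ARM, CLI or SDK call your pipeline actually uses.

```python
from typing import Callable, Dict, Iterable, Tuple


class ProvisioningFailed(Exception):
    """Hypothetical error raised when a deployment ends in state 'Failed'."""


def provision_with_fallback(
    provision: Callable[[str], str],
    regions: Iterable[str],
) -> Tuple[str, str, Dict[str, str]]:
    """Try each candidate region in order.

    Returns (region, result, errors_from_earlier_regions) for the first
    region that succeeds, or raises ProvisioningFailed if all of them fail.
    """
    errors: Dict[str, str] = {}
    for region in regions:
        try:
            return region, provision(region), errors
        except ProvisioningFailed as exc:
            # Keep the message so the final report shows what failed where.
            errors[region] = str(exc)
    raise ProvisioningFailed(f"all candidate regions failed: {errors}")


def fake_provision(region: str) -> str:
    """Stand-in for a real deployment call; fails in one region on purpose."""
    if region == "westeurope":
        raise ProvisioningFailed("terminal provisioning state 'Failed'")
    return f"redis-enterprise-{region}"


if __name__ == "__main__":
    region, instance, errors = provision_with_fallback(
        fake_provision, ["westeurope", "northeurope"]
    )
    print(f"created {instance} in {region}; earlier failures: {errors}")
```

In a real setup, the list of candidate regions would have to respect data residency and latency constraints, so this is a starting point for a discussion with the customer rather than a ready-made answer.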