Should the Exchange 2019 Metacache Database Actually Be Implemented?

At Ignite 2018, Microsoft outlined several details related to Exchange Server 2019 that I previously wrote about. Included was the new Metacache Database (MCDB). The purpose is fairly clear. Exchange has been optimized to offer really large mailboxes on the cheapest storage possible, which was achieved by being intelligent about read/write operations and by caching mailbox data in memory, and Microsoft has wrung the towel dry. 100GB mailboxes on SATA HDDs are about the extent of what we can get. To improve search capabilities (which are also moving to “always online” operations for consistency between OWA, mobile, and Outlook in cached and online modes), we need more disk performance. Also, as disks increase in size, it would be great to add more mailboxes per disk. We just cannot get there any longer with SATA HDDs.

Another thing to recognize is that Microsoft has already deployed MCDB within Exchange Online, but they also announced that the codebases for Exchange Online and Exchange Server have forked. This does not mean code will be written from scratch, but some of these new features will not be as rigorously tested before making it into the on-premises product as features have been in the past.

NOTE: This argument does not consider whether Microsoft should implement MCDB within Exchange Online. According to the last reported information, there are at least 600K Exchange Servers in Exchange Online. The scale and other factors make things unique for Microsoft, and it is important that they operate at the highest density possible in terms of mailboxes per disk and mailboxes per server for the purposes of maintenance and even physical space.

MCDB Overview

If we review the Preferred Architecture, an Exchange Server 2019 system should have 12 SATA HDD with the maximum storage space (around 10TB each at the time of writing) for the Exchange mailbox databases, plus a spare for Auto Reseed. For every 3 active disks used for mailbox databases, there should be a single SSD (around 2TB at the time of writing), for a total of 4 SSD. The two classes of disk are at the core of the MCDB.

The MCDB will cache the in-mailbox search index and recent messages.

Other guidelines recommend four (4) database replicas per HDD where only one will be active when all replicas are in a healthy state throughout a DAG. The recommended limit is 2TB for a mailbox database which allows for proper overhead on a 10TB disk and the space consumed by the single lagged replica that will be resident per HDD.
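The sizing in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text (a 10TB disk, four replicas, a 2TB database limit); the function name is made up for illustration.

```python
# Per-disk capacity check using the figures quoted in the text above.
def disk_headroom_tb(disk_tb, replicas_per_disk, db_limit_tb):
    """Free space left on a disk when every replica has reached its size limit."""
    return disk_tb - replicas_per_disk * db_limit_tb

headroom = disk_headroom_tb(disk_tb=10, replicas_per_disk=4, db_limit_tb=2)
print(f"Headroom on a 10TB HDD: {headroom}TB ({headroom / 10:.0%} of the disk)")
```

With every database at the 2TB limit, the four replicas consume 8TB, which leaves 2TB of the disk as overhead.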

The Challenges

As I noted before, this strategy makes sense for Microsoft based on their scale. However, even for the largest of on-premises deployments, this creates a number of challenges.

Mailbox Database Size

Microsoft supports a 2TB mailbox database, which is huge. The smaller number of organizations that operate medium to large Exchange deployments are unlikely to want to operate mailbox databases of this size. Further, if the organization chooses to back up the databases, such large databases will be even more difficult to handle. 400GB seems significantly more reasonable. With a smaller mailbox database, we no longer need a 10TB SATA HDD to support 4 replicas at 2TB each; we can get by with a 2TB SSD holding 4 replicas at 400GB each.

Complexity

Simplicity was the name of the game when the Database Availability Group was originally released with Exchange Server 2010. Sure, it is not the simplest solution, and the complexity of operating a single Exchange server without consideration for high availability has increased dramatically, but operating a large deployment became so much easier compared to operating clusters in Exchange Server 2003.

With Exchange Server 2013, we introduced Auto Reseed. This was a great addition because it helped to address some of the reliability concerns of operating without RAID in the local storage. However, it added some complexity because we had to lay out our volumes and map them in such a way that Exchange could determine what to do when an entire volume and its replicas were no longer available.

While MCDB is opportunistic (all requests still go to the SATA HDD concurrently with the requests to the cache, so if the cache fails you are still operational, but with degraded performance), it is yet another system that has to be properly designed and maintained. Further, while it should be resilient given that it is opportunistic, there have been plenty of cases where resiliency features have caused the exact downtime they were meant to stave off.

Cost

The cost of technology continues to decline, and we will have larger and faster storage over time. A 2TB NVMe SSD is somewhat expensive by today’s standards, but larger options will come on to the market in the coming years and the price per TB will diminish.

When considering the server design for the Preferred Architecture, we need eleven (11) 10TB SATA HDD ($400 each, or $4,400 per server) and four (4) 2TB SSD. SATA SSD cost $350 each, or $1,400 per server. NVMe SSD cost $3,000+ each, or $12,000 per server, and I could only reliably find 1.6TB models at the time of writing. The total server storage price (just for the Exchange mailbox databases) would be $16,400 when using NVMe, and the server would store 40 mailbox database replicas at 2TB each (80TB).
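The totals above are easy to reproduce. The sketch below uses only the prices and disk counts quoted in this section; the variable and function names are illustrative, and it assumes one of the eleven HDDs is the Auto Reseed spare, which is what makes the 40-replica figure work out.

```python
# Prices and counts are the ones quoted in the text (at the time of writing).
HDD_COUNT, HDD_PRICE = 11, 400        # 10TB SATA HDD
SSD_COUNT = 4                         # SSDs used for MCDB
SATA_SSD_PRICE, NVME_SSD_PRICE = 350, 3000

def storage_cost(ssd_price):
    """Mailbox-database storage cost per server for a given SSD unit price."""
    return HDD_COUNT * HDD_PRICE + SSD_COUNT * ssd_price

# Assumption: one HDD is the spare, so 10 disks hold 4 replicas each.
replicas = (HDD_COUNT - 1) * 4
capacity_tb = replicas * 2            # 2TB per mailbox database replica

print("With SATA SSD:", storage_cost(SATA_SSD_PRICE))
print("With NVMe SSD:", storage_cost(NVME_SSD_PRICE))
print("Replicas:", replicas, "| replica capacity (TB):", capacity_tb)
```

Changing the unit prices at the top is enough to rerun the comparison as hardware prices move.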

The Counter Argument

Simplicity. Do not deploy MCDB. Do not deploy SATA HDD + SSD. Deploy an SSD only solution without MCDB.

An optimal solution with MCDB would have high quality SATA HDD with NVMe SSD (as NVMe is roughly 4x the performance of SATA SSD).

A good performance compromise would be to deploy only SATA SSD. They are slower than their NVMe counterparts, but they are more affordable, and since they will be used for all storage, the storage will be faster than SATA HDD. Further, there will be fewer mailboxes per disk, which lowers the performance requirements per disk.

Deploying SATA SSD will allow for a total of 15 SATA SSD to be deployed per server, with one being reserved for Auto Reseed. The total cost for this would be $5,250 per server, and the server would store 56 mailbox database replicas at 400GB each (22TB).
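As a rough check of these figures, the sketch below recomputes the cost and capacity for 15 disks at the $350 SATA SSD price quoted earlier, with four replicas per disk and 400GB per database. The names are illustrative, and the cost-per-TB line is my own derived figure rather than one from a vendor quote.

```python
# SSD-only layout: 15 disks at the SATA SSD price, one reserved for Auto Reseed.
DISK_COUNT, DISK_PRICE = 15, 350
REPLICAS_PER_DISK = 4
DB_SIZE_GB = 400

cost = DISK_COUNT * DISK_PRICE
replicas = (DISK_COUNT - 1) * REPLICAS_PER_DISK   # the spare holds no replicas
capacity_tb = replicas * DB_SIZE_GB / 1000
cost_per_tb = cost / capacity_tb                  # derived, list-price arithmetic only

print(f"Cost per server: ${cost}")
print(f"Replicas: {replicas}, replica capacity: {capacity_tb:.1f}TB")
print(f"Cost per TB of replica capacity: ${cost_per_tb:.0f}")
```

The replica capacity works out to 22.4TB, which the text rounds to 22TB.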

It is clear that we are compromising mailbox-to-server density. However, having such a high density is a significantly greater risk for a smaller deployment (even the largest on-premises deployments could not approach Exchange Online).

This also means that the memory and processor per server could be lowered. I have written in the past about the cascading effects of server failures related to Active Directory performance. Scaling out rather than scaling up has long been the mantra for Exchange. More servers with a lower mailbox density means fewer mailboxes impacted by replica failover events.