
RabbitMQ, backing stores, databases and disks

From time to time, on our mailing list and elsewhere, the idea comes up of using a different backing store within RabbitMQ. The backing store is the component responsible for writing messages to disk (a message can be written to disk for a number of reasons), and it is frequently suggested that we look at what RabbitMQ would be like if its own backing store were replaced with another storage system.

Such a change would permit functionality that is not currently possible, such as out-of-band queue browsing or distributed storage, but there is a fundamental difference between the nature of data storage and access patterns in a message broker such as RabbitMQ and those in a generic database. Indeed, RabbitMQ deliberately does not store messages in such a database.

Firstly we need to discuss what properties RabbitMQ itself expects of any backing store. RabbitMQ writes messages to disk in two cases: either the message has been published in such a way that it must be written to disk (e.g. published with delivery_mode = 2), or memory pressure is causing RabbitMQ to start running out of RAM and so it is pushing messages to disk in order to free up RAM. In the first case, just because we've written the message to disk does not mean that we're going to drop it from RAM: if memory is abundant then there's no reason to incur the cost of a subsequent disk read.
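As a concrete illustration of the first case, here is a minimal sketch of publishing a persistent message to a durable queue, using the Python pika client (the queue name and connection details are arbitrary examples, not anything prescribed by the broker):

    # Illustrative sketch using the Python "pika" client; names are arbitrary.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # A durable queue survives broker restarts; the queue name is an example.
    channel.queue_declare(queue="task_queue", durable=True)

    # delivery_mode = 2 marks the message as persistent, so the broker must
    # write it to disk (though, as described above, it may well keep a copy
    # in RAM too if memory is abundant).
    channel.basic_publish(
        exchange="",
        routing_key="task_queue",
        body=b"hello",
        properties=pika.BasicProperties(delivery_mode=2),
    )

    connection.close()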

In the second case, it means that any backing store that keeps everything in RAM all the time is immediately not a good fit: RabbitMQ writes messages to disk in order to free up RAM, thus if the "writing to disk" bit actually just moves the message from one area of RAM to another without freeing up RAM then nothing has been gained. Using such a backing store might work and it might achieve the improvements in functionality desired, but such a change would have substantial impact on the scalability of RabbitMQ: it would no longer be able to absorb more messages than can be kept in RAM, which was one of the raisons d'être for the new persister work that led to RabbitMQ's current default backing store.

Some databases or key-value stores write disk contents by initially writing a snapshot of their entire data set, and then writing deltas to that data set. After a while, either based on time, the number of deltas, or the ratio of deltas to snapshot size, a new snapshot is written, and then the previous snapshot and all its deltas can be thrown away. This is how RabbitMQ's old persister worked. The problem with this is that it can repeatedly cause vast amounts of data to be unnecessarily rewritten. Imagine you have two queues, one of which is entirely static: no one is publishing messages to it and no one is consuming messages from it; it's just sitting there, but it contains several million messages, all of which have been written to disk. The other queue is almost always empty, but is moving very quickly -- thousands of messages a second are being published and consumed from it. Every message sent to that queue has to be written to disk, but they're all being consumed as soon as they've been written to disk. Consider the effect of this scenario on the backing store: the second queue will cause a rapid stream of deltas, but whenever the snapshot is rewritten it will cause the entire contents of the first queue to be rewritten too, even though there has been no change to that queue's contents. So again, backing stores that write messages to disk in this way are likely to be a poor fit for RabbitMQ's needs.
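To make the rewrite amplification concrete, here is a toy sketch of a snapshot-plus-delta store (plain Python, not RabbitMQ code; the queue names, thresholds and message counts are made up). The busy-but-almost-empty queue forces snapshot after snapshot, and every snapshot rewrites the large static queue in full:

    # Toy sketch of a snapshot-plus-delta store; not RabbitMQ code.
    class SnapshotDeltaStore:
        def __init__(self, max_deltas=1000):
            self.data = {}           # queue name -> list of message payloads
            self.deltas = []         # operations recorded since the last snapshot
            self.max_deltas = max_deltas
            self.bytes_written = 0   # crude stand-in for disk traffic

        def publish(self, queue, msg):
            self.data.setdefault(queue, []).append(msg)
            self._append_delta(("publish", queue, msg))

        def consume(self, queue):
            msg = self.data[queue].pop(0)
            self._append_delta(("ack", queue, msg))
            return msg

        def _append_delta(self, op):
            self.deltas.append(op)
            self.bytes_written += len(repr(op))
            if len(self.deltas) >= self.max_deltas:
                self._snapshot()

        def _snapshot(self):
            # The snapshot rewrites everything, including queues that never changed.
            self.bytes_written += sum(
                len(m) for msgs in self.data.values() for m in msgs)
            self.deltas.clear()

    store = SnapshotDeltaStore()
    for i in range(100_000):
        store.publish("static_queue", "payload-%d" % i)  # written once, then untouched

    for i in range(10_000):                              # fast, almost-empty queue
        store.publish("busy_queue", "job-%d" % i)
        store.consume("busy_queue")

    # bytes_written ends up dominated by repeated rewrites of static_queue's contents.
    print(store.bytes_written)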

So suitable backing stores (assuming that RabbitMQ's performance and scalability properties need to be kept, which is by no means certain in all scenarios) would be able to store a volume of data bounded only by disk size rather than by RAM, and would also have a reasonably sophisticated means of storing data on disk such that unchanged data won't be rewritten indefinitely.

There are a couple of further aspects of RabbitMQ's default backing store that are worth mentioning. Queues themselves decide when and whether to write a message to disk. But a single message can be sent to multiple queues and it is obviously advantageous to make sure each message only gets written to disk once. However, there are two distinct pieces of information here: firstly, the message content itself. This is the same in every queue that the message has been sent to, and should only be written to disk once, regardless of the number of queues it goes to; note that subsequent writes of this do not need to do a value comparison: if the ID of the message is known to the backing store then the message body will match what is already on disk -- message content is never altered by the broker. The second piece of information is the existence of the message in each queue: where in the queue it lies, what its neighbours are, and what its queue-specific status is. This second piece of information is what allows RabbitMQ to start up, recover messages and queues from disk and ensure that the messages in each queue are in the same order as when RabbitMQ was shut down.

Thus RabbitMQ's default backing store consists of a node-global message store, which is concerned only with writing message contents to disk, and a per-queue queue index, which uses a very different format for writing per-message, per-queue data to disk. Because these two needs are very specific, there are an awful lot of optimisations that can be applied (and we have!).
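As a rough illustration of that split, here is a simplified, in-memory sketch (the class names, reference-counting detail and status values are illustrative; this is not the actual on-disk format):

    # Simplified in-memory sketch of the split described above; not the real formats.
    class MessageStore:
        """Node-global store: each message body is written exactly once, keyed by
        message ID, with a count of how many queues still reference it."""
        def __init__(self):
            self.bodies = {}     # msg_id -> body
            self.refcounts = {}  # msg_id -> number of queues referencing it

        def write(self, msg_id, body):
            if msg_id in self.bodies:
                # Same ID means same body (the broker never alters message content),
                # so no value comparison is needed; just bump the reference count.
                self.refcounts[msg_id] += 1
            else:
                self.bodies[msg_id] = body
                self.refcounts[msg_id] = 1

        def remove(self, msg_id):
            self.refcounts[msg_id] -= 1
            if self.refcounts[msg_id] == 0:
                del self.bodies[msg_id]
                del self.refcounts[msg_id]

    class QueueIndex:
        """Per-queue record of which messages the queue holds, in what order,
        and their queue-specific status."""
        def __init__(self):
            self.entries = []  # ordered list of [msg_id, status]

        def publish(self, msg_id):
            self.entries.append([msg_id, "ready"])

        def ack(self, msg_id):
            self.entries = [e for e in self.entries if e[0] != msg_id]

    # Routing one message to two queues: the body is stored once (with a
    # reference count of two), but each queue index records its own entry.
    store, q1, q2 = MessageStore(), QueueIndex(), QueueIndex()
    store.write("msg-1", b"payload")
    q1.publish("msg-1")
    store.write("msg-1", b"payload")
    q2.publish("msg-1")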

Generic database benchmarks normally show that read performance vastly outperforms write performance. If it doesn't, then that normally means the writes aren't actually going to disk (with fsync), or there's a bug which is crippling read performance. And indeed, databases have historically been optimised for read-heavy workloads. This matches their general use case: there is a slowly expanding data set which must be queried in various different ways. Deletions tend to be quite rare: if you think about the typical website shopping basket on top of a relational database, then unless a customer deletes their account, there are very few reasons to ever issue deletions -- even if a product is discontinued, you're probably just going to set a flag on that product row, because otherwise you risk stopping customers from being able to see their order history (assuming it's normalised).

So the vast volume of data in most databases is fairly static. This is the exact opposite of data in message brokers: for us, reading data is the rarest operation, and writing and deleting data are the common cases. Ideally, if RabbitMQ is running with plenty of memory, there will never be any reads from disk at all. There will only be writes for messages that are published in such a way that they have to be written to disk, and even then, provided we can get the message out to a consumer quickly enough, there are many ways in which we can optimise out those writes. We only ever read data when memory pressure has forced us to write messages to disk and then drop them from RAM. Read performance is certainly important: we work hard to make sure RabbitMQ gets rid of data as fast as possible (without utilising /dev/null), and being able to read messages from disk quickly is part of that. But avoiding the write in the first place is the goal.

In fact, as far as message brokers are concerned, it's best to think of RAM as a large write-back cache for the disk; the task then is to optimise the management of this cache to maximise the elimination of writes, by delaying them for as long as possible in the hope that the corresponding deletion occurs before the write has really gone to disk. This is quite obviously very different from normal databases, which do not try to exploit the fact that the lifespan of data is frequently as short as it is in a message broker.
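A toy sketch of that idea (plain Python, not RabbitMQ's implementation; the flush delay and names are made up) shows how a delete that arrives before the flush eliminates the disk write entirely:

    # Toy write-back cache sketch; not RabbitMQ's implementation.
    import time

    class WriteBackCache:
        def __init__(self, flush_after=5.0):
            self.pending = {}          # msg_id -> (body, time the write was requested)
            self.flush_after = flush_after
            self.writes_to_disk = 0

        def write(self, msg_id, body):
            # Buffer the write in RAM rather than going straight to disk.
            self.pending[msg_id] = (body, time.monotonic())

        def delete(self, msg_id):
            # If the write is still pending, it never needs to hit the disk at all.
            if self.pending.pop(msg_id, None) is None:
                pass  # already flushed: a real store would now delete it from disk

        def flush_due(self):
            now = time.monotonic()
            for msg_id, (body, queued_at) in list(self.pending.items()):
                if now - queued_at >= self.flush_after:
                    self.writes_to_disk += 1   # stand-in for a real fsync'd write
                    del self.pending[msg_id]

    cache = WriteBackCache(flush_after=5.0)
    cache.write("msg-1", b"payload")
    cache.delete("msg-1")   # consumed and acknowledged before the flush: write eliminated
    cache.flush_due()
    assert cache.writes_to_disk == 0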

None of this is meant to deter efforts to make RabbitMQ work with alternative backing stores, but merely to explain why we decided to do our own thing when writing the new persister for RabbitMQ (which first came out with RabbitMQ version 2.0.0) rather than use an off-the-shelf data store. It explains why building a high performance message broker directly on top of a normal database is tricky at best, and why the nature of data in a message broker is very different from the nature of data in a database.

6 Responses to “RabbitMQ, backing stores, databases and disks”

  1. Dmitriy Samovskiy Says:

    Great post, thanks! I would only like to point out one thing.

    You mostly look at the issue from the perspective of optimizing the throughput of a queuing system, as measured by the number of messages. There is another aspect of this problem - fault tolerance.

    There are many workloads where throughput is not a concern and no significant pressure on RAM is expected, but instead there is a significant issue that could be caused by the loss of the host on which the rabbitmq process is running. In this case, a delay in message delivery of, say, even 1 minute (the time it takes to reboot a host) could be quite bad.

    The reason I have always wanted alternate backing stores is to do replication of messages in scenarios where throughput is less important than fault tolerance and reliable delivery. Detaching a queue from the host on which it's running could be huge as well, especially in IaaS cloud environments. Potentially being able to consume from the same queue on two different hosts could be beneficial as well.

    Just wanted to mention this. I hope you guys consider alternate backing stores in the context of scenarios where fault tolerance is more important than throughput.

  2. Matthew Sackman Says:

    Hi Dmitriy, thanks for the comment.

    You are absolutely right that there is more to all this than just throughput: being able to assert that a message or messages are safely written to disk and can't be lost in the event of many failures is very important too. RabbitMQ and AMQP support transactions which, in combination with durable queues and persistent messages, allow the broker to confirm back to the client when messages have been fully fsync'd to disk.
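    A minimal sketch of that combination, using the Python pika client (the queue name and connection details are arbitrary examples):

        # Sketch: durable queue + persistent message + AMQP transaction (pika client).
        import pika

        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="important", durable=True)

        channel.tx_select()                  # put the channel into transactional mode
        channel.basic_publish(
            exchange="",
            routing_key="important",
            body=b"must not be lost",
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
        )
        channel.tx_commit()                  # broker confirms once it has the message safely
        connection.close()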

    Within a RabbitMQ cluster, you can consume from the same queue on two different hosts, and if you use a SAN or NAS or some such then you can ensure that the disk to which RabbitMQ writes messages is fault tolerant and highly available. Our Pacemaker guide explains how to achieve one form of HA.

    We are actively working not just on other forms of HA, but also on improvements to the way in which clients can be informed that messages have been written to disk. Many of these improvements will be generally available fairly soon, so watch this space!

    Using a different backing store to achieve redundancy in storage is not a magic bullet: being able to replicate messages written to disk is only part of the challenge. And as mentioned above, we already support redundant storage -- all we rely on for the underlying storage is some sort of filesystem that Erlang can talk to, and that's hardly a limiting constraint!

  3. Dmitriy Samovskiy Says:

    Thanks for your thoughts.

    I certainly know about RabbitMQ clustering, but it's not a viable solution (at least for me) to achieve fault tolerance, due to mnesia and partitioned networks. Please see http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2008-September/001620.html (whoa, been a long time since I posted to rabbitmq-discuss). Clustering is also not recommended for WAN deployments.

    A SAN or NAS does not fit the cloud deployment model very well, and I try to avoid shared disk as much as I can. Plus, they are usually not cheap.

    I also know about your Pacemaker scenario, but it looks like it's active-passive (please correct me if I am wrong). A single active node represents a potential bottleneck.

    What I am after is something that resembles riak_core - multiple storage nodes, no special nodes, N, R, W values that can be set by application, hinted handoff, etc.

  4. Matthew Sackman Says:

    Clustering is certainly not recommended for WAN links or anything that's really going to be less reliable than a LAN. That said, I certainly have read the recent articles pointing out that whilst small LANs may be fairly predictable and partitions rare, as soon as you get to datacentre size, with high-end routers, the odds of something occasionally going wrong and not really knowing where the packets are ending up turn out to be quite high.

    Any distributed system has serious problems when quorum can't be reached. Depending on your point of view, with respect to mnesia, it could be said to be a fault that it doesn't shut itself down when a 2-node cluster splits. But maybe clusters with an even number of nodes should never be tolerated, as in the case of a split there's no clear winner. These problems are complex and are the subject of a large body of research. Also, horses for courses: mnesia, by its nature, goes for C and A from the CAP triangle. Just because other products pick a combination involving P doesn't mean that what you're left with is easier to work with in practice. See, for example, this blog entry.

    The pacemaker setup is very much active/passive, but can actually be run in a cluster.

  5. personalni agentury Says:

    Thanks for providing such a useful article. Thx

  6. Java Developer Says:

    Thanks for the nice article. I have been searching for provider options for RabbitMQ, found this useful, and my search stops here.

    We are, though, thinking about an approach for large message sizes, which are not a good fit for typical RabbitMQ processing.

    In this case we would store the message in a key-value store and then just pass the key through a RabbitMQ queue. Hopefully this gives us the best of both worlds (in theory).

    Would you like to share some thoughts/approaches on getting good throughput with large message sizes like 4K - 10K?