In push architectures, one of the main challenges is delivering data reliably to receivers. There are many reasons for this:
- Most push architectures (including those developed by our company) use the publish-subscribe messaging pattern, which is unreliable.
- TCP’s built-in reliability is not enough to ensure delivery, as modern network sessions span multiple connections.
- Receivers can’t tell the difference between data loss and intentional silence.
- There is no one size fits all answer.
The last point trips up developers new to this problem space, who may wish for push systems to provide “guaranteed delivery.” If only it were that simple. Like many challenges in computer science, there isn’t a best answer, just trade-offs you are willing to accept.
Below we’ll go over the various issues and recommended practices around reliable push.
Publish-subscribe is unreliable by design
If you rely on a publish-subscribe broker as your source of truth then you’ll most likely end up with an application that loses data. Publishers might crash before having a chance to publish, and brokers don’t guarantee delivery anyway.
The only way a publish-subscribe broker could conceivably deliver data 100% reliably would be to do one of two things:
- Implement publisher back pressure based on the slowest subscriber.
- Store all published messages in durable storage for all of time.
Both of these options are terrible, so in general nobody does them. Instead, what you see are various degrees of best-effort behavior, for example large queues, or queues with a time limit (e.g. messages deliverable for N hours/days).
This isn’t to say publish-subscribe is unsuitable for push; in fact it’s an essential tool in your toolbox. Just know that it’s only one piece of the puzzle. The ZeroMQ guide has a section on publish-subscribe that’s worth reading even if you’re not using ZeroMQ.
TCP is about flow control, not reliability
In the modern world of mobile devices and cloud services, a single user session can easily span multiple client/server IP address pairs. TCP will retransmit data as necessary within a single connection, but it won’t retransmit across connections, for example if a client has to reconnect. Further, even if IP addresses are stable, TCP doesn’t attempt retransmissions forever. It tries really hard for awhile but will eventually give up if a peer is unresponsive.
In practice, this means TCP alone isn’t enough for reliable data transfer, and you’ll need to layer a reliable protocol on top of it.
If you learned long ago that TCP is for reliable communication (compared to, say, UDP), then you may be confused by this and wonder why developers continue to build systems using TCP. Well, TCP provides other useful features, such as back pressure, ordered delivery, and payloads of arbitrary size, so it is still an incredibly useful protocol even if we can’t depend on it for reliable transmission.
Error handling on the receiver
With request/response interactions, you can retry if a request fails, or just bubble errors up to the UI. With push, lost data looks about the same as no data, and users are left wondering why their screens aren’t updating.
What’s really obnoxious about this problem is that it’s not enough to just throw an error when a connection to the server is lost. You can lose data even when a connection seems to be working fine, due to some deeper failure within the server. This means the receiver usually cannot rely on the existence of a connection or subscription to be very meaningful, and will need to discover errors some other way.
Reliable transmission fundamentals
Before we get into the common practices, let’s go over some basics. Reliably exchanging data over a network requires two things:
- An ability to retransmit data, potentially for a long time if loss is not tolerable.
- An entity responsible for initiating a retransmission.
For example, when a mail server sends email to another mail server using SMTP, the sending mail server owns the data to be retransmitted and also acts as the responsible entity.
The entity responsible for initiating a retransmission doesn’t have to be the sender, though. For example, in web architectures, it’s common for a server to attempt to push data to a receiver, but if that fails then it’s up to the receiver to query the server for the missed data. In this case the receiver is the responsible entity.
Before you can build a reliable system, you need to determine whether the sender or receiver should be the responsible entity. This usually comes down to which side should care more about the transmission.
Alright, so how can applications ensure data is reliably delivered?
Use a regular database as a source of truth. If pushed data is important, the first thing you should do is write it to durable storage (on disk, with no automatic expiration), before kicking off any push mechanisms. This way if there’s a problem during delivery, the data can later be recovered.
The receiver should be the responsible entity. Receivers usually come and go, and it makes more sense for receivers to keep track of what they need rather than for servers to keep track of what has been sent.
Have a way to sync with the server. Receivers should be able to ask the server for new data, independent of any realtime push mechanisms. This way, publish workers can crash, queues can max out, and connections can get dropped, yet receivers can always catch up by making requests for updates.
Architect your system initially without push. Related to the previous recommendations, if you build a system where your data lives in durable storage, and receivers have a way to ask for incremental updates, then it will be easy to add a push layer on top. This may seem like a boring approach, but boring works.
Consider sending hints. This is where you push indications of new data rather than the actual data. Receivers react to hints by making requests for updates. See Andyet’s article about this approach. Hints are straightforward and work well if there aren’t too many recipients for the same data.
Include sequencing information. If data may arrive from potentially two sources (push subscription or request for updates), or data is a stream of deltas, then receivers will need a way to detect for out-of-sequence data. This can be done by including a sequence or version ID in the data payloads. If there is a sequencing problem when receiving a stream of deltas, then the receiver will need to make a request to the server for updates.
Periodically sync with the server. If data is sent infrequently, then receivers may want to periodically check in with the server to ensure nothing has been missed. For a typical web app, this interval could be something like a few minutes.
Building reliable applications and APIs is tricky. It’s easy to make the mistake of putting too much trust in your publish-subscribe system and TCP. We hope the above recommend practices will help you create bulletproof realtime experiences!