In push architectures, one of the main challenges is delivering data reliably to receivers. There are many reasons for this:
- Most push architectures use the publish-subscribe messaging pattern, which is unreliable.
- TCP’s built-in reliability is not enough, as modern network sessions span multiple connections.
- Receivers can’t tell the difference between data loss and intentional silence.
- There is no one-size-fits-all answer.
The last point trips up developers new to this problem space, who may wish for push systems to provide “guaranteed delivery.” If only it were that simple. Like many challenges in computer science, there isn’t a best answer, just trade-offs you are willing to accept.
Below we’ll go over the various issues and common practices around reliable push, plus a unique approach we developed at Fanout.
Publish-subscribe is unreliable by design
If you rely on a publish-subscribe broker as your source of truth, you’ll most likely end up with an application that loses data. Publishers might crash before having a chance to publish, and brokers don’t guarantee delivery anyway.
The only way a publish-subscribe broker could conceivably deliver data 100% reliably would be to do one of two things:
- Implement publisher back pressure based on the slowest subscriber.
- Store all published messages in durable storage for all of time.
Both of these options are terrible, so in general nobody does them. Instead, what you see are various degrees of best-effort behavior, for example large queues, or queues with a time limit (e.g. messages deliverable for N hours/days).
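To make the best-effort behavior concrete, here is a minimal sketch of a size-limited subscriber queue. The `BestEffortQueue` class and its method names are hypothetical and not taken from any particular broker; the point is only that once the queue is full, older messages are dropped without the subscriber ever knowing.

```python
from collections import deque


class BestEffortQueue:
    """Hypothetical per-subscriber queue that keeps only the newest messages."""

    def __init__(self, max_items):
        self._items = deque(maxlen=max_items)
        self.dropped = 0

    def publish(self, message):
        # A full deque with maxlen discards its oldest entry on append,
        # so count the loss before it happens.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(message)

    def drain(self):
        """Hand the subscriber whatever is still queued."""
        delivered = list(self._items)
        self._items.clear()
        return delivered


# A slow subscriber falls behind while five messages are published.
queue = BestEffortQueue(max_items=3)
for n in range(5):
    queue.publish(f"msg-{n}")

delivered = queue.drain()
print(delivered, "dropped:", queue.dropped)
```

The subscriber receives a gap-free-looking list, which is exactly why it cannot tell on its own that the first two messages were lost.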
This isn’t to say publish-subscribe is unsuitable for push; in fact it’s an essential tool in your toolbox. Just know that it’s only one piece of the puzzle. The ZeroMQ guide has a section on publish-subscribe that’s worth reading even if you’re not using ZeroMQ.
TCP is about flow control, not reliability
In the modern world of mobile devices and cloud services, a single user session can easily span multiple client/server IP address pairs. TCP will retransmit data as necessary within a single connection, but it won’t retransmit across connections, for example if a client has to reconnect. Further, even if IP addresses are stable, TCP doesn’t attempt retransmissions forever. It tries hard for a while but will eventually give up if a peer is unresponsive.
In practice, this means that TCP alone isn’t enough for reliable data transfer, and you’ll need to layer a reliable protocol on top of it.
If you learned long ago that TCP is for reliable communication (compared to, say, UDP), you may be confused by this and wonder why people still use TCP then. Well, TCP provides other useful features to build off of, such as back pressure, ordered delivery, and payloads of arbitrary size, so it is still an incredibly useful protocol even if we can’t depend on it for reliable transmission.
Error handling on the receiver
With request/response interactions, you can retry if a request fails, or just bubble errors up to the UI. With push, lost data looks about the same as no data, and users are left wondering why their screens aren’t updating.
What’s really obnoxious about this problem is that it’s not enough to just throw an error when a connection to the server is lost. You can lose data even when a connection seems to be working fine, due to some deeper failure within the server. This means the receiver usually cannot rely on the existence of a connection or subscription to be very meaningful, and will need to discover errors some other way.
Reliable transmission fundamentals
Before we get into the common practices, let’s go over some basics. Reliably exchanging data over a network requires two things:
- An ability to retransmit data, potentially for a long time if loss is not tolerable.
- An entity responsible for initiating a retransmission.
For example, when a mail server sends email to another mail server using SMTP, the sending mail server owns the data to be retransmitted and also acts as the responsible entity.
The entity responsible for initiating a retransmission doesn’t have to be the sender, though. For example, in web architectures, it’s common for a server to attempt to push data to a receiver, but if that fails then it’s up to the receiver to query the server for the missed data. In this case the receiver is the responsible entity.
Before you can build a reliable system, you need to determine whether the sender or receiver should be the responsible entity. This usually comes down to which side should care more about the transmission.
Alright, so how do we ensure data is reliably delivered?
Use a regular database as a source of truth. If pushed data is important, the first thing you should do is write it to durable storage (on disk, with no auto expiration), before kicking off any push mechanisms. This way if there’s a problem during delivery, the data can later be recovered.
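As a rough illustration of "store first, push second," the sketch below commits a message to a database before attempting delivery. The `save_then_publish` function, the `messages` table, and the `publish` callback are all made up for this example, and an in-memory SQLite database stands in for real durable storage (in production this would be a database on disk).

```python
import sqlite3


def save_then_publish(db, publish, channel, body):
    """Commit the message first, then attempt a best-effort push."""
    with db:  # commits on success, rolls back if the INSERT raises
        cursor = db.execute(
            "INSERT INTO messages (channel, body) VALUES (?, ?)",
            (channel, body),
        )
    message_id = cursor.lastrowid
    try:
        publish(channel, message_id, body)
        pushed = True
    except Exception:
        # A failed push is tolerable: the row is already stored, so
        # receivers can still recover it by asking for updates.
        pushed = False
    return message_id, pushed


# In-memory database for demonstration only; it is NOT durable.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "channel TEXT NOT NULL, body TEXT NOT NULL)"
)


def flaky_publish(channel, message_id, body):
    # Stand-in for a broker client that happens to be down.
    raise ConnectionError("broker unavailable")


message_id, pushed = save_then_publish(db, flaky_publish, "scores", "home 1, away 0")
stored = db.execute(
    "SELECT body FROM messages WHERE id = ?", (message_id,)
).fetchone()
print(message_id, pushed, stored)
```

Even though the push failed, the message is still in the table and can be served later.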
The receiver should be the responsible entity. In most web architectures, this is what you want. Receivers usually come and go, and it makes more sense for receivers to keep track of what they need rather than for servers to keep track of what has been sent.
Have a way to sync with the server. Receivers should be able to ask the server for new data, independent of any realtime push mechanisms. This way, publish workers can crash, queues can max out, and connections can get dropped, yet receivers can always catch up by making requests for updates. We even recommend building this part before introducing push components to your stack, to ensure your data sync is bulletproof before making it instant.
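One possible shape for such a sync mechanism is a "give me everything after ID X" query that the receiver calls in a loop until it is caught up. The `MessageStore` class, `updates_since` method, and `catch_up` helper below are hypothetical names, and a Python list stands in for the database and the HTTP request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: int
    body: str


class MessageStore:
    """Stand-in for the server-side source of truth."""

    def __init__(self):
        self._messages = []

    def append(self, body):
        message = Message(id=len(self._messages) + 1, body=body)
        self._messages.append(message)
        return message

    def updates_since(self, last_seen_id, limit=100):
        """Return up to `limit` messages newer than `last_seen_id`, in order."""
        newer = [m for m in self._messages if m.id > last_seen_id]
        return newer[:limit]


def catch_up(store, last_seen_id, page_size=2):
    """Receiver-side loop: keep requesting pages until nothing new is returned."""
    received = []
    while True:
        page = store.updates_since(last_seen_id, limit=page_size)
        if not page:
            return received, last_seen_id
        received.extend(page)
        last_seen_id = page[-1].id


store = MessageStore()
for body in ["a", "b", "c", "d", "e"]:
    store.append(body)

# The receiver last saw message 2 before its connection dropped.
missed, cursor = catch_up(store, last_seen_id=2)
print([m.body for m in missed], cursor)
```

Because the receiver supplies its own cursor, the server does not need to remember anything about individual receivers.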
Consider sending hints. This is where you push indications of new data rather than the actual data. Receivers react to hints by making requests for updates. See Andyet’s article about this approach. Hints are straightforward and work well if there aren’t too many recipients for the same data.
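A hint-driven receiver might look something like the sketch below. The `HintedReceiver` class and `fetch_updates` function are invented for illustration; in a real app, `on_hint` would be wired to a push subscription and `fetch_updates` would be an HTTP request.

```python
class HintedReceiver:
    """Receiver that treats pushes as 'something changed' signals only."""

    def __init__(self, fetch_updates):
        self._fetch_updates = fetch_updates
        self.last_seen_id = 0
        self.items = []

    def on_hint(self):
        # The hint carries no data; everything comes from the fetch.
        for item_id, body in self._fetch_updates(self.last_seen_id):
            self.items.append(body)
            self.last_seen_id = item_id


# Stand-in for the server's durable log and its update query.
server_log = [(1, "alice joined"), (2, "bob joined")]


def fetch_updates(after_id):
    return [(item_id, body) for item_id, body in server_log if item_id > after_id]


receiver = HintedReceiver(fetch_updates)
receiver.on_hint()
server_log.append((3, "carol joined"))
receiver.on_hint()
receiver.on_hint()  # a duplicate hint just fetches nothing
print(receiver.items)
```

Since the data only ever arrives through one path, duplicated or reordered hints are harmless, and a lost hint is repaired by the next one.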
Include sequencing information. If data may arrive from two sources (push subscription or request for updates), or data is a stream of deltas, then receivers will need a way to detect out-of-sequence data. This can be done by including a sequence or version ID in the data payloads. If there is a sequencing problem when receiving a stream of deltas, then the receiver will need to make a request to the server for updates.
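Here is a minimal sketch of gap detection on the receiver, assuming each delta carries an integer sequence number that increases by one. The `DeltaReceiver` class and the `resync` callback are hypothetical; `resync` represents a request to the server for everything after a given sequence number.

```python
class DeltaReceiver:
    """Applies deltas in order and falls back to the server on a gap."""

    def __init__(self, resync):
        self._resync = resync  # callable(last_seq) -> list of (seq, delta)
        self.last_seq = 0
        self.applied = []

    def on_push(self, seq, delta):
        if seq <= self.last_seq:
            return  # duplicate or stale, e.g. already received via a sync
        if seq > self.last_seq + 1:
            # Gap detected: at least one delta was lost, so ask the server
            # for everything we are missing instead of applying this one.
            for missed_seq, missed_delta in self._resync(self.last_seq):
                self._apply(missed_seq, missed_delta)
            return
        self._apply(seq, delta)

    def _apply(self, seq, delta):
        if seq == self.last_seq + 1:
            self.applied.append(delta)
            self.last_seq = seq


# Stand-in for the server's stored deltas.
server_log = {1: "a", 2: "b", 3: "c"}


def resync(after_seq):
    return [(seq, server_log[seq]) for seq in sorted(server_log) if seq > after_seq]


receiver = DeltaReceiver(resync)
receiver.on_push(1, "a")
# Delta 2 is lost in transit; delta 3 arrives and exposes the gap.
receiver.on_push(3, "c")
receiver.on_push(3, "c")  # a late duplicate is ignored
print(receiver.applied, receiver.last_seq)
```

This relies on the server having stored a delta before pushing it, so that the resync response includes the delta that triggered the gap check.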
Periodically sync with the server. If data is sent infrequently, then receivers may want to periodically check in with the server to ensure nothing has been missed. For a typical web app, this interval could be something like a few minutes.
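One simple way to decide when a check-in is due is to track the time of the last successful contact with the server. The `SyncSchedule` class below is a hypothetical sketch; the 180-second interval is an assumed value standing in for "a few minutes," and timestamps are passed in explicitly so the logic is easy to follow (real code would likely use `time.monotonic()`).

```python
class SyncSchedule:
    """Tracks whether the receiver has been quiet long enough to warrant a sync."""

    def __init__(self, interval_seconds, now):
        self.interval_seconds = interval_seconds
        self.last_contact = now

    def record_contact(self, now):
        # Call this after a completed sync or after in-sequence pushed data.
        self.last_contact = now

    def sync_due(self, now):
        return now - self.last_contact >= self.interval_seconds


schedule = SyncSchedule(interval_seconds=180, now=0)

checks = [schedule.sync_due(now=60)]        # only 60s of silence so far
schedule.record_contact(now=100)            # pushed data arrived at t=100
checks.append(schedule.sync_due(now=200))   # 100s since last contact
checks.append(schedule.sync_due(now=280))   # 180s since last contact
print(checks)
```

Resetting the timer on pushed data is a design choice that keeps busy receivers from making redundant requests; a stricter receiver could reset it only after an actual sync.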
Pushpin reliable streaming
Since it’s burdensome for receivers to have to manage both push and pull data sources in order to receive data reliably, we devised a way to merge the two using Pushpin.
Pushpin is our open source proxy server that makes building realtime APIs easy. As a proxy server, it’s uniquely positioned to recover data from the backend server on a client’s behalf, and so we implemented a feature to do just that. Basically the way it works is the backend server provides a special URL that Pushpin can use to retrieve missing data.
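As a rough sketch of what the backend's side of this might look like, the helpers below build the response headers and publish payload for a recoverable stream. The `Grip-Hold`, `Grip-Channel` (with `prev-id`), and `Grip-Link` (with `rel=next`) names follow our reading of the Pushpin documentation and should be verified there before use; the function names, the `fruit` channel, and the `/fruit/` recovery path are made up for this example.

```python
def reliable_stream_headers(channel, last_id, recover_path):
    """Headers a backend might return to open a recoverable stream.

    Assumption: header names and parameters match Pushpin's documented
    reliability feature; check the documentation for the exact format.
    """
    return {
        "Content-Type": "text/plain",
        "Grip-Hold": "stream",
        # Tells the proxy which item ID the response content ends at.
        "Grip-Channel": f"{channel}; prev-id={last_id}",
        # The special URL the proxy can request to retrieve missing data.
        "Grip-Link": f"<{recover_path}?after={last_id}>; rel=next",
    }


def publish_item(channel, item_id, prev_id, content):
    """Publish payload that carries sequencing IDs along with the data."""
    return {
        "items": [
            {
                "channel": channel,
                "id": item_id,
                "prev-id": prev_id,
                "formats": {"http-stream": {"content": content}},
            }
        ]
    }


headers = reliable_stream_headers("fruit", last_id="3", recover_path="/fruit/")
payload = publish_item("fruit", item_id="4", prev_id="3", content="apple\n")
print(headers["Grip-Link"])
print(payload["items"][0]["prev-id"])
```

The idea is that if the proxy sees a published item whose `prev-id` does not match the last ID it knows about, it can request the recovery URL instead of forwarding a stream with a hole in it.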
This feature is described in the Pushpin documentation and in an accompanying video.
The end result is the client always sees a perfect stream. The overall architecture is still mostly publish-subscribe; we just moved the recovery logic up one hop from the client to the edge.
There are some interesting benefits of this from the client’s perspective:
- The client doesn’t need to worry about sequencing.
- The client doesn’t need to periodically sync with the server.
- As long as a connection exists, the client can assume there are no gaps in the data.
- A single request can be used to receive historical data and reliable pushed data going forward, so there is a single source of truth.
Most realtime APIs today don’t work this way, but we think our approach could greatly improve the developer experience around such APIs.