
Idea: Simplified Change Tracking

Jens Alfke edited this page Feb 13, 2014 · 2 revisions

This is a brainstorm, not a real spec or feature! —Jens, 13 Feb 2014

I've had some ideas recently on how to greatly simplify change-tracking and changes-feed generation. These take advantage of the fact that we're listening on a TAP feed.

The gist of it is: Get rid of the channel log documents. Instead, observe new documents from the TAP feed and keep an in-memory log of recent per-channel changes ordered by sequence number.
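As a rough illustration of what "an in-memory log ordered by sequence number" buys us, here is a minimal Python sketch of one channel's log with a "changes since sequence N" lookup. The class name `ChannelLog` and its methods are hypothetical, invented for this example; nothing here is existing gateway code.

```python
from bisect import bisect_right


class ChannelLog:
    """Hypothetical in-memory change log for a single channel."""

    def __init__(self):
        # Parallel lists, kept sorted by sequence because entries are
        # only ever appended in increasing sequence order.
        self.sequences = []
        self.doc_ids = []

    def append(self, sequence, doc_id):
        # Refuse out-of-order appends so the log stays binary-searchable.
        if self.sequences and sequence <= self.sequences[-1]:
            raise ValueError("sequences must be appended in increasing order")
        self.sequences.append(sequence)
        self.doc_ids.append(doc_id)

    def since(self, sequence):
        """Return (sequence, doc_id) pairs with a sequence greater than `sequence`."""
        start = bisect_right(self.sequences, sequence)
        return list(zip(self.sequences[start:], self.doc_ids[start:]))
```

Because the log is sorted, a feed that reconnects with its last-seen sequence can find its resume point with a binary search instead of scanning the whole log.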

Revision Creation

When creating a revision, everything's the same as now, except that there are no channel log documents to append to. (Yay!)

Watching The Tap Feed

  • When receiving an update of a key corresponding to a database document, parse the JSON value and determine the current sequence number and channels.
  • If the sequence number does not immediately follow the last-seen sequence number (i.e. there's a gap), hold onto the revision for "a few seconds" or until the previous sequence number is received, whichever comes first. This will ensure that sequence numbers are ordered. (See below for details.)
  • Add a change entry for this revision to the in-memory change log of each channel.
  • Notify all active changes feeds that are listening on these channels.
  • Periodically remove "old-enough" entries from the change logs (see below for details).
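
The parse, log, and notify steps above could look roughly like the following Python sketch. It assumes updates have already been put in sequence order, and it assumes a simplified document shape with top-level `sequence` and `channels` fields; those field names, the `ChannelLogs` class, and the callback signature are all made up for illustration and are not the gateway's real schema or API.

```python
import json
from collections import defaultdict


class ChannelLogs:
    """Hypothetical per-channel in-memory change logs fed by tap updates."""

    def __init__(self):
        self.logs = defaultdict(list)       # channel -> [(sequence, doc_id), ...]
        self.listeners = defaultdict(list)  # channel -> [callback, ...]

    def listen(self, channel, callback):
        """Register a changes feed's callback for one channel."""
        self.listeners[channel].append(callback)

    def on_tap_update(self, key, value):
        """Handle one tap update whose key is a database document."""
        doc = json.loads(value)
        sequence = doc["sequence"]   # assumed field name
        channels = doc["channels"]   # assumed field name (list of channel names)

        # Add a change entry to the log of every channel the revision is in.
        for channel in channels:
            self.logs[channel].append((sequence, key))

        # Then wake up every feed listening on one of those channels.
        for channel in channels:
            for callback in self.listeners[channel]:
                callback(channel, sequence, key)
```

In this sketch a feed that listens on several channels gets one callback per matching channel; a real implementation would probably coalesce those into a single wake-up.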

Possible Issues

The only tricky thing here is ensuring that the channel logs are always ordered by sequence number, which is necessary so that a client can resume reading the changes feed from where it left off. Sequence numbers are already allocated in order via an INCR call. But the documents containing those sequences may not be saved in order, or the tap notifications may not arrive in order at every gateway node. However, they should only be out of order by a small time interval, unless the database is hugely overloaded or a gateway crashes before it can save a document. I think adding a delay for out-of-order sequences will work well.
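
Here is one way the "delay for out-of-order sequences" might work, as a hedged Python sketch. The `SequenceReorderBuffer` class, its `max_wait` parameter, and the 5-second default are assumptions for illustration ("a few seconds" is not pinned down anywhere). The clock is injectable so the timeout can be exercised without sleeping.

```python
import time


class SequenceReorderBuffer:
    """Hypothetical buffer that releases entries in sequence order,
    giving up on a missing sequence after max_wait seconds."""

    def __init__(self, last_seen, max_wait=5.0, clock=time.monotonic):
        self.next_seq = last_seen + 1
        self.max_wait = max_wait
        self.clock = clock
        self.pending = {}  # sequence -> (arrival time, entry)

    def add(self, sequence, entry):
        """Accept one entry; return the (sequence, entry) pairs now releasable."""
        if sequence < self.next_seq:
            # Arrived after we stopped waiting for it. What to do here is an
            # open question; this sketch just passes it through, out of order.
            return [(sequence, entry)]
        self.pending[sequence] = (self.clock(), entry)
        return self.release()

    def release(self):
        """Release entries that are next in line or have waited long enough."""
        released = []
        now = self.clock()
        while self.pending:
            lowest = min(self.pending)
            arrived, entry = self.pending[lowest]
            if lowest != self.next_seq and now - arrived < self.max_wait:
                break  # still a gap, and still worth waiting for it to fill
            del self.pending[lowest]
            released.append((lowest, entry))
            self.next_seq = lowest + 1
        return released
```

Something would need to call `release()` on a timer as well, so that held entries are flushed even when no further tap updates arrive.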

The in-memory change log can't be unbounded. Since incoming entries are timestamped (as part of handling ordering), a task can periodically purge the oldest entries. The only restriction is that it can't purge entries that aren't yet available in the changes view. We can either set an expiration time that's greater than any expected view latency, or we can periodically check the view to see what its newest sequence is and only purge sequences older than that.
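
A purge pass could be as simple as the Python sketch below. It applies both safeguards at once (an age threshold and the view's newest sequence), though either alone might be enough. The function name, the `(sequence, timestamp, doc_id)` entry layout, and the `view_high_seq` parameter are hypothetical; how the view's newest sequence would actually be queried is left out.

```python
import time
from collections import deque


def purge_old_entries(log, max_age, view_high_seq, now=None):
    """Remove entries from the front of `log` that are older than max_age
    seconds AND have a sequence below the view's newest sequence.

    `log` is a deque of (sequence, timestamp, doc_id), ordered by sequence.
    Returns the number of entries removed.
    """
    if now is None:
        now = time.time()
    removed = 0
    while log:
        sequence, timestamp, _doc_id = log[0]
        # Stop at the first entry that is too recent, or that the view
        # may not have caught up to yet.
        if now - timestamp < max_age or sequence >= view_high_seq:
            break
        log.popleft()
        removed += 1
    return removed
```

Since the log is ordered by sequence, the pass only ever looks at the front of the deque and stops at the first entry it has to keep.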

This scheme takes advantage of getting the full values of all documents from the tap feed. As already noted, this consumes a fair amount of bandwidth. (It'll be considerably less than today, though, because we're not creating those large channel-log docs.)

All incoming doc revisions need to be JSON-parsed. I don't think this will be too expensive, though; the rate of change isn't super high, and CPU hasn't been a bottleneck thus far.
