jducoeur edited this page Oct 26, 2012 · 5 revisions

Okay, let's tackle the great architectural heresy right off the bat. Querki is, at least to start with, highly stateful. That is, the user isn't interacting directly with the database most of the time; instead, she is working primarily with an in-memory representation of the data.

Why?

Part of my response is, "Why not?" While "statelessness" has some real benefits, it has been turned a bit too much into a religion in recent years, and is often a knee-jerk requirement without careful thought. But let's take the question seriously. There are several interlocked reasons why I'm doing things this way.

There is no such thing as "stateless"

"Stateless" apps usually aren't anything of the sort. Nearly every app has oodles of state, ranging from user sessions to history to the information you're trying to serve.

What the shorthand of "stateless" usually means is "stateless front end": every web server is an equal peer, able to handle any request with aplomb. There's something very attractive about that, but it's misleading to call it "stateless". What it really means is that you've kicked the state can down the road to the database.

It's lovely when the database is well-suited to serving all of your state...

... but that's really not the case for Querki. Our data isn't tabular, as a conventional RDBMS wants, but it's also not a bunch of relatively simple, discrete documents, as most of the NoSQL databases want. Instead, our data is highly hierarchical and mapped, cross-linked in ways that are poorly suited to any of the major databases.

Mind, that doesn't mean we can't represent that data: storage is fairly easy. But that brings us to the real rub:

Performance

I want Querki to be lightning-fast; indeed, I think it's necessary for commercial success, since users aren't going to put up with pokey performance on the modern Web. Doing all of the (fairly complex) queries that I anticipate for Querki is likely to be just plain slow on any of the databases I know, because they just aren't designed for it. Nothing's really designed to handle complex joins across heterogeneous data, and that's essentially what we're doing. The data will be fairly small, but even so, I expect very slow response times -- prohibitively slow for some of the interactive features we have designed.

Interacting with the Data

Finally, there's the simple fact that Querki is trying a lot of radically different things, many of which map poorly to traditional SQL. There's a reason why I'm writing my own programming language for this system. Doing that with in-memory structures is pretty easy; trying to do it on-disk would be a lot more challenging, and I don't have time for those challenges quite yet.

Write-Through Cache

So instead, we're going to take most of the load off of the database engine, and just use it for storage. Everything will be contained in the DB, so that we have fast, easy and reliable recovery. But all the real interactions will be with an in-memory cache. Changes will go through that cache, and then be written out to disk. Queries always come through the cache: when you begin to work in a Space, we sweep that Space into memory, and work with the result.
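To make the cache discipline described above concrete, here is a minimal write-through cache sketch. This is purely illustrative -- Querki itself is not written in Python, and every name here (`FakeDb`, `SpaceCache`, `read`, `write`) is hypothetical -- but it shows the shape of the idea: the whole Space is swept into memory on first access, all queries are served from memory, and every change updates the cache and is then immediately written through to storage.

```python
class FakeDb:
    """Stands in for the real storage layer: a dumb key/value store."""
    def __init__(self):
        self.rows = {}

    def load_space(self, space_id):
        # Sweep the entire Space into memory at once.
        return dict(self.rows.get(space_id, {}))

    def save_thing(self, space_id, thing_id, value):
        self.rows.setdefault(space_id, {})[thing_id] = value


class SpaceCache:
    """All reads and writes go through the in-memory copy; writes are
    also pushed straight through to the database for durability."""
    def __init__(self, db):
        self.db = db
        self.spaces = {}  # space_id -> {thing_id: value}

    def _space(self, space_id):
        if space_id not in self.spaces:
            # First access: load the whole Space into memory.
            self.spaces[space_id] = self.db.load_space(space_id)
        return self.spaces[space_id]

    def read(self, space_id, thing_id):
        # Queries never touch the database directly.
        return self._space(space_id).get(thing_id)

    def write(self, space_id, thing_id, value):
        # Update the cache first, then write through to storage, so the
        # DB always holds everything needed for recovery.
        self._space(space_id)[thing_id] = value
        self.db.save_thing(space_id, thing_id, value)
```

The key property is that the database is never consulted on the query path -- it exists only so that a crashed node can rebuild its in-memory Spaces from durable storage.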

Distributed State

We need to scale horizontally, of course. Suffice it to say, there will be one "master" copy of each Space, on some node; usually, this will be the node managing the session of the user who is working with it. Most interactions will be with immutable snapshots of the Space, which can be copied around the network as needed. This will be discussed in considerably more detail later in the Architecture docs.
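The single-master-plus-snapshots idea can be sketched in a few lines. Again, this is an illustration, not the real design (which is covered later in the Architecture docs), and the names (`SpaceMaster`, `snapshot`, `update`) are invented: one authoritative copy serializes all mutations, while readers anywhere on the network hold immutable, versioned snapshots that never change underneath them.

```python
from types import MappingProxyType

class SpaceMaster:
    """The single authoritative copy of a Space. Mutations happen only
    here; every other node sees read-only snapshots."""
    def __init__(self, things=None):
        self._things = dict(things or {})
        self._version = 0

    def update(self, thing_id, value):
        # Only the master mutates state, so updates are serialized.
        self._things[thing_id] = value
        self._version += 1

    def snapshot(self):
        # A read-only view over a fresh copy: safe to ship to any node
        # and read without coordination, and unaffected by later updates.
        return (self._version, MappingProxyType(dict(self._things)))
```

Because a snapshot is a copy wrapped in a read-only proxy, a node that took snapshot version 3 keeps seeing version 3 even while the master moves on -- which is exactly what lets most interactions avoid talking to the master at all.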

Nothing's hard and fast

Mind, all of this is Querki v1, and I fully expect it to evolve. This architecture has many advantages, especially early on, but it does come with one big cost: it takes a lot of RAM to run. I expect RAM to be the main expense in running Querki -- we're neither I/O-bound nor CPU-bound; we're memory-bound.

So in the long run, we'll probably revisit this decision and move more out to the database. I fully expect that doing that well, with good enough performance, will require writing our own database, or building significant extensions into one that is capable of it. (E.g., an extension hanging off of Postgres' hstore.) When we have a big enough company with some money, that will be a fun project. For now, though, with a one-man show doing the programming, we'll go with something that works and doesn't distract me from the important stuff.
