-
Interesting ideas! The slowest individual operations are generally the complex geometry ones in boost::geometry.

My gut feeling is that the best way to optimise for frequent updates is to capitalise on the fact that most data doesn't change frequently. Coastlines change barely at all, rivers and railways infrequently; roads change a bit more often, and POIs more often still. Because vector tiles are organised into layers, it might be possible to update particular layers and/or features.

A naive, but I suspect effective, implementation might just have a "volatile" flag in the layer JSON which indicates whether that layer should be reprocessed during a partial update (see the sketch below). Happily, the objects that change most often - roads and POIs - are probably pretty quick to update, as they're either linestrings, points or simple polygons. (Compare to landuse and coastlines, which are often expensive polygons.)

A smarter implementation would look at the revision date of individual OSM objects, and flag tiles up as needing rewriting as part of that. But that might lead into a massive rabbit-hole.

But this is only a 15-second brain dump!
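To make the "volatile" flag idea concrete, here's a minimal C++ sketch of how a partial update might pick which layers to reprocess. Everything here (the `LayerConfig` struct, the `isVolatile` field, the layer names) is made up for illustration and isn't Tilemaker's actual config handling.

```cpp
// Hypothetical sketch: choose which layers to redo during a partial update,
// assuming each layer's JSON config carried a made-up "volatile" flag.
#include <iostream>
#include <string>
#include <vector>

struct LayerConfig {
    std::string name;
    bool isVolatile;  // hypothetical: true for layers that change often
};

std::vector<std::string> layersToReprocess(const std::vector<LayerConfig>& layers,
                                           bool partialUpdate) {
    std::vector<std::string> out;
    for (const auto& layer : layers) {
        // Full runs process everything; partial updates only redo volatile layers.
        if (!partialUpdate || layer.isVolatile) out.push_back(layer.name);
    }
    return out;
}

int main() {
    std::vector<LayerConfig> layers = {
        {"coastline", false}, {"waterway", false},
        {"transportation", true}, {"poi", true}};
    for (const auto& name : layersToReprocess(layers, /*partialUpdate=*/true))
        std::cout << name << "\n";  // prints: transportation, poi
}
```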
-
Oh, that reframes things and refines my understanding of the problem space. Very clever. I now think a transparent function cache could still be useful for the case where a user is iterating on their Lua profile, but I agree it may not be as useful for the case where the only input that's changing is the PBF file, not the Lua script. Can I pester you with some followup questions?
When does that expensive geometry processing actually get called? I guess right now it's eagerly called -- it might be interesting to make it lazy and see if that buys some improvements.
I think right now simplification happens repeatedly on the base z14 geometry: at z6 we simplify the z14 geometry, at z7 we simplify the z14 geometry again, and so on. Would it make sense if tiles were written out in the opposite zoom order, and simplification re-used the output of the next highest zoom's simplification (sketched below)? E.g.:

- z14 writes the raw geometry
- z13 simplifies the z14 geometry and writes it
- z12 simplifies the z13 geometry and writes it

I haven't looked very closely at the output phase of Tilemaker yet. My hope is that it's cheaper to simplify an already-simplified geometry, but I don't know if (a) that's true or (b) it would introduce weird artifacts / oversimplify.
Related, if a z14 geometry is valid, and we simplify it, is the simplified geometry guaranteed to be valid?
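A rough, self-contained sketch of the two ideas above using boost::geometry (which Tilemaker already depends on): each zoom simplifies the previous zoom's output instead of the base z14 geometry, and checks validity after each pass. The tolerances, the loop structure, and the use of a linestring are illustrative assumptions, not Tilemaker's actual output code.

```cpp
// Cascaded simplification sketch: z13 simplifies z14's output, z12 simplifies
// z13's output, and so on, with a validity check after each step.
#include <boost/geometry.hpp>
#include <cmath>
#include <iostream>
#include <utility>

namespace bg = boost::geometry;
using Point = bg::model::d2::point_xy<double>;
using Linestring = bg::model::linestring<Point>;

int main() {
    Linestring z14;  // stand-in for a full-detail z14 geometry
    for (double x = 0.0; x <= 10.0; x += 0.01)
        bg::append(z14, Point(x, std::sin(x)));

    Linestring previous = z14;
    for (int zoom = 13; zoom >= 6; --zoom) {
        // Tolerance roughly doubles per zoom level out (purely illustrative).
        double tolerance = 0.001 * (1 << (14 - zoom));

        Linestring current;
        bg::simplify(previous, current, tolerance);  // Douglas-Peucker

        // For polygons, simplification can produce self-intersections, so the
        // validity question is real; for a linestring the check is cheap.
        std::cout << "z" << zoom << ": " << current.size() << " points, valid="
                  << std::boolalpha << bg::is_valid(current) << "\n";

        previous = std::move(current);  // reuse this output at the next zoom down
    }
}
```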
Is the purpose of an opt-in volatile flag a performance optimization? If it were possible to cheaply compute the set of features that had changed, and only process geometries for those, it might be more convenient to just process all the changed features rather than putting the burden on the user to know how to set the config knobs correctly.
Oh, interesting, I see. For versioning purposes you wouldn't even need the full fidelity of the 8-byte timestamp -- just a byte tracking whether it changed in the last 256 days would be enough to discriminate which things need to be reprocessed, so long as you were reprocessing on a cadence faster than 256 days. I suspect the majority of items would not have changed in the last 256 days, so you could skip storing those and treat absence as "not changed recently". You'd have to store the revision dates in the mbtiles somehow, I guess?
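As an illustration of that one-byte scheme (all names here are hypothetical, not Tilemaker or OSM library code): store days-since-last-edit saturated at 255, skip storing anything old, and treat absence as "not changed recently".

```cpp
// Sketch: one byte of staleness per object, with absence meaning "old".
#include <cstdint>
#include <ctime>
#include <unordered_map>

using ObjectID = uint64_t;

// Days since the object's last edit, saturating at 255.
uint8_t daysSinceChange(std::time_t lastEdit, std::time_t now) {
    double days = std::difftime(now, lastEdit) / 86400.0;
    if (days < 0.0) days = 0.0;
    return days >= 255.0 ? 255 : static_cast<uint8_t>(days);
}

struct ChangeTracker {
    // Only recently-changed objects are stored; everything else is implicit.
    std::unordered_map<ObjectID, uint8_t> recentlyChanged;

    void record(ObjectID id, std::time_t lastEdit, std::time_t now) {
        uint8_t age = daysSinceChange(lastEdit, now);
        if (age < 255) recentlyChanged[id] = age;  // skip storing stale objects
    }

    // Needs reprocessing only if it changed within the last windowDays days;
    // absence from the map means it can be skipped.
    bool needsReprocessing(ObjectID id, uint8_t windowDays) const {
        auto it = recentlyChanged.find(id);
        return it != recentlyChanged.end() && it->second <= windowDays;
    }
};
```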
But there'd be rabbits at the end!
-
Potentially answering my own question: yes, it's a performance optimization. The cost isn't so much in tracking what has changed, but in materializing those changes into the tileset. For example, small changes to very large relations cause a tile write-amplification issue. If someone makes a small adjustment to the borders of their waterfront town in Florida, it's cheap for us to flag the entire USA relation as invalidated. But that would then mean that hundreds of thousands of tiles are invalidated, and, upon rematerializing them, we'd likely find that only a handful have actually changed. This probably implies that you'd want to track staleness (and its age) at the per-layer, per-tile level -- something like the sketch below.
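For illustration only, per-layer, per-tile staleness tracking could be as simple as a set keyed by (zoom, x, y, layer); the types here are hypothetical, and the question of how dirty tiles get discovered is left out.

```cpp
// Sketch: mark only the tiles a change actually touches as dirty, per layer,
// rather than invalidating every tile a huge relation covers.
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

struct TileLayerKey {
    uint8_t zoom;
    uint32_t x, y;
    std::string layer;
    bool operator<(const TileLayerKey& o) const {
        return std::tie(zoom, x, y, layer) < std::tie(o.zoom, o.x, o.y, o.layer);
    }
};

struct DirtyTracker {
    std::set<TileLayerKey> dirty;

    // Call this for each tile/layer the changed portion of a geometry covers.
    void markDirty(uint8_t zoom, uint32_t x, uint32_t y, const std::string& layer) {
        dirty.insert({zoom, x, y, layer});
    }

    // Only tiles in the dirty set need rematerializing on the next run.
    bool needsRewrite(uint8_t zoom, uint32_t x, uint32_t y, const std::string& layer) const {
        return dirty.count({zoom, x, y, layer}) > 0;
    }
};
```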
-
Oh, I guess the ability to do partial updates also gets you 90% of the way to being able to do a layer-by-layer creation of a tileset. If you lacked RAM, that'd be useful, as your peak memory usage would just be the node store (all of it) plus the data for whichever single layer you're currently processing. The node store's a big burden for the entire planet, but I have a few ideas that ought to help there.
-
I'm considering adding an optional feature to tilemaker: a cache of expensive function calls. This would not affect the initial runtime of Tilemaker on a PBF, but instead would decrease the runtime of subsequent runs. This would be of interest to people who want to maintain ongoing updates of their vector tiles.
By default, the cache would be disabled.
A user would opt in via a CLI flag:

```
tilemaker --function-cache /tmp/some.db
```
Tilemaker would then create a SQLite database to hold the cache.
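As a sketch only -- not the schema actually being proposed -- the cache could be a single table keyed by function name plus a hash of the input geometry. Roughly, opening the database and doing a lookup via the sqlite3 C API might look like this (table and column names are assumptions, and the write path and error handling are omitted):

```cpp
// Sketch of a persistent function cache in SQLite; names are illustrative.
#include <sqlite3.h>
#include <cstdint>
#include <string>

static const char* kSchema =
    "CREATE TABLE IF NOT EXISTS function_cache ("
    "  func      TEXT NOT NULL,"      // e.g. "is_valid"
    "  input_key BLOB NOT NULL,"      // hash of the input geometry
    "  result    BLOB NOT NULL,"      // cached return value
    "  PRIMARY KEY (func, input_key)"
    ");";

bool openCache(const std::string& path, sqlite3** db) {
    if (sqlite3_open(path.c_str(), db) != SQLITE_OK) return false;
    return sqlite3_exec(*db, kSchema, nullptr, nullptr, nullptr) == SQLITE_OK;
}

// Look up a cached boolean result (e.g. an is_valid verdict) by geometry hash.
// Returns true and fills `result` on a cache hit.
bool lookup(sqlite3* db, const std::string& func, uint64_t geomHash, bool& result) {
    sqlite3_stmt* stmt = nullptr;
    const char* sql =
        "SELECT result FROM function_cache WHERE func = ? AND input_key = ?;";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return false;
    sqlite3_bind_text(stmt, 1, func.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_blob(stmt, 2, &geomHash, sizeof(geomHash), SQLITE_TRANSIENT);
    bool hit = false;
    if (sqlite3_step(stmt) == SQLITE_ROW && sqlite3_column_bytes(stmt, 0) > 0) {
        result = static_cast<const uint8_t*>(sqlite3_column_blob(stmt, 0))[0] != 0;
        hit = true;
    }
    sqlite3_finalize(stmt);
    return hit;
}
```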
The first thing I have my eye on is `geom::is_valid`. It's an attractive thing to memoize.

Adding persistent storage feels a bit against the spirit of Tilemaker, though. I rationalize it to myself by saying it'd be optional, and it'd have minimal knobs (just absent or present).
I started down the path of doing a proof of concept, and ran into some fiddly threading issues with SQLite. I have some ideas on how to work around them, but figured I'd double-check that you'd be interested in such a feature before spending the time to resolve the issues.
On Great Britain, I think the best-case improvement from caching `is_valid` would be perhaps a 10% decrease in PBF reading time. There might be other places such a cache could be gainfully employed -- I think `is_valid` is now also called on output. It may also be reasonable to cache `make_valid` calls, although I haven't looked deeply into that yet to understand the storage implications.