Possible improvements for new ram middle #1466

Open
4 of 7 tasks
joto opened this issue Apr 29, 2021 · 4 comments

@joto
Collaborator

joto commented Apr 29, 2021

The new ram middle (#1461) is better than the previous implementations but there are several places where it could be made even better. Here are some ideas that we can work on as time permits:

Note that many of these items are interrelated, and also related to other upcoming changes in the code. So in many cases it makes sense to defer the decision on what to do until we have a better idea of the environment this code will run in. (In fact, that is one reason I am writing this issue instead of improving the code directly; the other is limited time...)

  • The storage for node locations and way node lists is, in the end, a std::string (node_locations_t::m_data and middle_ram_t::m_way_nodes_data, respectively). As more data arrives it will be resized again and again. Copying the memory around when that happens isn't that expensive, so that part isn't too bad. But while the resizing takes place we temporarily need roughly twice the memory. So when you get near your memory limit this can fail, even though enough memory would still be available if we broke the data into smaller chunks (see the first sketch after this list).
  • The node_locations_t store writes new locations into a temporary cache (m_block) which gets written out to the real data storage (m_data) when full. This could be done directly instead, and then we wouldn't need the freeze() function any more.
  • The node_locations_t store writes out all ids of one block and then all locations of that block. It might be better to write out all (id, location) pairs in order.
  • The node_locations_t::block_size = 32 should be evaluated and tuned.
  • Reading node locations back from node_locations_t is expensive, because we have to read through a lot of varints to find the data we need. We should probably use some kind of cache so that we can reuse decoded blocks (see the second sketch after this list). I have not implemented this, because we might want to run the reading code in multiple threads later. If that's the case, we might want to keep the cache outside the node_locations_t class so threads don't trample on each other, or we might need a mutex.
  • When two-stage processing is requested by the flex output, the middle needs to store tags (and possibly attributes) of objects (currently only way objects, but in the future also nodes and possibly relations) in addition to the node locations and way node lists. The current implementation simply stores the complete objects as they are in the osmium::memory::Buffer which takes a lot of memory. This was the simplest implementation I could think of and two-stage processing isn't used widely, so it is a reasonable compromise for the time being. But we can do much better here, details TBD.
  • The ram middle could also use the existing disk flat node store or any future disk-based node stores we work on. Or some implementation where the data is stored on disk but we keep the index in memory. This can be improved once we have the new version of the pgsql middle and see better where we can share code.
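As an illustration of the first item: storing the encoded bytes in fixed-size chunks, instead of one growing std::string, could look roughly like the sketch below. This is not the actual osm2pgsql code; the name chunked_storage_t and the chunk size are made up, and the sketch assumes individual appends (one encoded block) are much smaller than a chunk.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: append-only byte storage split into fixed-size
// chunks. Growing adds a new chunk and never touches existing ones, so
// the whole buffer never needs to exist twice in memory.
class chunked_storage_t
{
public:
    explicit chunked_storage_t(std::size_t chunk_size = 1024 * 1024)
    : m_chunk_size(chunk_size)
    {}

    // Assumes len is much smaller than the chunk size, which holds for
    // varint-encoded node entries.
    void append(char const *data, std::size_t len)
    {
        if (m_chunks.empty() ||
            m_chunks.back().size() + len > m_chunk_size) {
            m_chunks.emplace_back();
            m_chunks.back().reserve(m_chunk_size);
        }
        m_chunks.back().append(data, len);
    }

    std::size_t byte_size() const noexcept
    {
        std::size_t sum = 0;
        for (auto const &chunk : m_chunks) {
            sum += chunk.size();
        }
        return sum;
    }

private:
    std::vector<std::string> m_chunks;
    std::size_t m_chunk_size;
};
```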
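And as an illustration of the caching idea from the fifth item: a decode cache shared between reader threads and protected by a mutex might look roughly like the following. All names and types are illustrative (the real code stores ids and osmium::Location values; plain integers stand in here), and whether such a cache should live inside node_locations_t, outside it, or once per thread is exactly the open question from the list above.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// One decoded block: (id, location) pairs.
using decoded_block = std::vector<std::pair<uint64_t, int64_t>>;

// Unbounded cache shared between reader threads and protected by a
// mutex. A real implementation would want an eviction policy and might
// prefer one cache per thread instead of the mutex.
class block_cache_t
{
public:
    template <typename DECODE_FUNC>
    decoded_block get(std::size_t block_index, DECODE_FUNC &&decode)
    {
        std::lock_guard<std::mutex> const guard{m_mutex};
        auto const it = m_cache.find(block_index);
        if (it != m_cache.end()) {
            return it->second; // reuse the already decoded block
        }
        return m_cache.emplace(block_index, decode(block_index))
            .first->second;
    }

private:
    std::mutex m_mutex;
    std::unordered_map<std::size_t, decoded_block> m_cache;
};
```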
@mmd-osm
Contributor

mmd-osm commented May 3, 2021

Regarding "Reading node location back from node_locations_t": Some SIMD-based LEB128 decoders claim to provide faster speeds than the current byte-by-byte approach used in protozero. It might be fast enough to avoid the additional caching of decoded blocks. Lots of research on this topic by https://lemire.me/blog.
Potentially interesting Rust implementation: https://github.com/as-com/varint-simd
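For context, the decoding that would be sped up is plain unsigned LEB128, which a byte-by-byte decoder (the style protozero uses) handles roughly like the sketch below. This is an illustration, not protozero's actual implementation; SIMD decoders replace the one-byte-at-a-time loop with wider loads.

```cpp
#include <cstdint>
#include <stdexcept>

// Decodes one unsigned varint (LEB128) and advances *data past it.
inline uint64_t decode_varint(char const **data, char const *end)
{
    uint64_t value = 0;
    unsigned int shift = 0;
    while (*data != end && shift < 64) {
        auto const byte = static_cast<uint8_t>(**data);
        ++*data;
        value |= static_cast<uint64_t>(byte & 0x7fU) << shift;
        if ((byte & 0x80U) == 0) {
            return value; // high bit clear: this was the last byte
        }
        shift += 7;
    }
    throw std::runtime_error{"truncated or overlong varint"};
}
```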

Besides, the current approach already turned out to be about 10% faster than the old one in my tests. My impression was that fetching node locations isn't one of the major bottlenecks in the overall process anymore. This might be different for other test cases, though.

@joto
Collaborator Author

joto commented May 3, 2021

When I worked on protozero I looked at all those schemes to improve varint decoding speed but decided against them. I don't want to maintain something like that. That applies even more here.

And yes, it is unclear how much any of these possible improvements would actually help; we'd need to look at the resulting code and benchmarks to see whether they are worth it.

joto added a commit to joto/osm2pgsql that referenced this issue May 13, 2021
Store ids and locations as they come in without going through a
temporary block first. Slightly simpler code and we don't need the
freeze() function any more.

See also osm2pgsql-dev#1466.
@joto
Collaborator Author

joto commented May 15, 2021

The node_locations_t::block_size = 32 turns out to be okay. With = 16 the memory use is about twice as big; with = 64 we save only about 1-5% memory. The runtime difference between 32 and 64 is minimal and vanishes in the noise. Let's stick with 32.

@joto
Collaborator Author

joto commented Nov 1, 2023

The first item above (that node_locations_t::m_data is a std::string which will be resized) has an additional problem: if a std::string runs out of capacity, it is typically grown by doubling that capacity. This can mean that, in the worst case, we use twice as much memory for the cache as was configured. That's not good.

To fix this we either need to implement the resizing of m_data ourselves, i.e. get a new std::string of the size we want, copy the data over, and swap it with m_data (a sketch of this is below), or we need to reserve the data in chunks.
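A minimal sketch of the first option (illustrative only, not the actual code; the helper name grow_to is made up):

```cpp
#include <cstddef>
#include <string>

// Grow a string to exactly the capacity we ask for: allocate a fresh
// string, copy the contents over, swap. This avoids the implicit
// capacity doubling, although the old and new buffers still coexist
// for the duration of the copy (the separate problem from the first
// bullet of the issue).
void grow_to(std::string &data, std::size_t new_capacity)
{
    if (new_capacity <= data.capacity()) {
        return; // current allocation is already big enough
    }
    std::string larger;
    larger.reserve(new_capacity); // one allocation of the wanted size
    larger.append(data);          // copy existing contents
    data.swap(larger);            // old buffer freed when larger goes
                                  // out of scope
}
```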
