Arena #309
base: v1.0.0
Conversation
This is based on statistical analysis of ~60K webpages from one segment of the CommonCrawl corpus. As a result, the initial attribute vector size goes from 4 to 1, and the initial stringbuffer size goes from 10 to 5 (the initial child vector size was found to already be at the optimum value of 1). The program I used to analyze memory usage & allocations is here: https://github.com/nostrademons/GumboStats

Benchmark results, using first that program and then ./benchmark:

v0.10.0 (current master):
parse_time: mean=4936.51, median=3395.50, 95th%=14145.60, max=167992.00
allocations: mean=19782.47, median=14190.00, 95th%=55512.45, max=805162.00
bytes_allocated: mean=697919.74, median=498341.00, 95th%=1962711.40, max=50036916.00
high_water_mark: mean=572190.85, median=204060.00, 95th%=832980.15, max=4294967294.00
hacker_news.html: 4015 microseconds.
html5_spec.html: 690645 microseconds.
xinhua.html: 27348 microseconds.
baidu.html: 5443 microseconds.
yahoo.html: 24486 microseconds.
google.html: 23592 microseconds.
bbc.html: 7975 microseconds.
arabic_newspapers.html: 5998 microseconds.
wikipedia.html: 36205 microseconds.

With attributes = 0, string buffer size = 3: I tried this first as a "good enough" solution, but it lost a couple of percentage points in speed in exchange for halving typical memory consumption.
parse_time: mean=5227.30, median=3591.00, 95th%=14954.30, max=170137.00
allocations: mean=22808.84, median=16349.00, 95th%=64060.30, max=849723.00
bytes_allocated: mean=634346.26, median=453030.50, 95th%=1805861.60, max=49914829.00
high_water_mark: mean=5833772.92, median=108432.50, 95th%=472434.40, max=4294967295.00
hacker_news.html: 6021 microseconds.
html5_spec.html: 723033 microseconds.
xinhua.html: 20712 microseconds.
baidu.html: 5426 microseconds.
yahoo.html: 25010 microseconds.
google.html: 23691 microseconds.
bbc.html: 8555 microseconds.
arabic_newspapers.html: 6216 microseconds.
wikipedia.html: 36706 microseconds.

With attributes = 1, string buffer size = 3: This is the final commit. Speed is statistically unchanged, but memory usage is down 30%.
parse_time: mean=4924.18, median=3421.50, 95th%=14013.30, max=170302.00
allocations: mean=21730.23, median=15597.00, 95th%=61035.05, max=852718.00
bytes_allocated: mean=645380.09, median=461001.00, 95th%=1828158.55, max=49884481.00
high_water_mark: mean=3985020.84, median=139070.00, 95th%=591329.45, max=4294967295.00
hacker_news.html: 4670 microseconds.
html5_spec.html: 697663 microseconds.
xinhua.html: 24023 microseconds.
baidu.html: 5576 microseconds.
yahoo.html: 25100 microseconds.
google.html: 24000 microseconds.
bbc.html: 8214 microseconds.
arabic_newspapers.html: 6377 microseconds.
wikipedia.html: 36961 microseconds.
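As a rough illustration of what these tuned defaults amount to, a minimal sketch is below. The constant names and the Vector struct are hypothetical stand-ins, not Gumbo's actual identifiers or init paths.

```c
#include <stdlib.h>

/* Hypothetical constants reflecting the tuned defaults described above. */
enum {
  kInitialAttributeCapacity = 1,    /* was 4: most elements carry few attributes */
  kInitialChildCapacity = 1,        /* already the optimum value */
  kInitialStringBufferCapacity = 5  /* was 10: most text fragments are short */
};

typedef struct {
  void** data;
  unsigned int length;
  unsigned int capacity;
} Vector;

/* Allocate only the small initial capacity; growth happens on demand. */
static void vector_init(Vector* vector, unsigned int initial_capacity) {
  vector->data =
      initial_capacity > 0 ? malloc(sizeof(void*) * initial_capacity) : NULL;
  vector->length = 0;
  vector->capacity = initial_capacity;
}
```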
… stringbuffer (it fails if there are fewer than 7 characters available and doesn't explicitly reserve space).
…g routines. vsnprintf may modify the va_list state, and reusing that state introduced potential memory corruption in the buffer it writes to.
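For reference, the portable way to walk the same variadic arguments twice is to duplicate the va_list with va_copy before the first vsnprintf call. This is a generic sketch of that pattern, not Gumbo's actual formatting routine.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a freshly sized buffer without reusing a consumed va_list. */
char* format_string(const char* format, ...) {
  va_list args;
  va_list args_copy;
  va_start(args, format);
  va_copy(args_copy, args);  /* duplicate before the first vsnprintf consumes it */

  int needed = vsnprintf(NULL, 0, format, args);  /* first pass: measure */
  char* buffer = NULL;
  if (needed >= 0) {
    buffer = malloc((size_t) needed + 1);
    if (buffer) {
      vsnprintf(buffer, (size_t) needed + 1, format, args_copy);  /* second pass: write */
    }
  }

  va_end(args_copy);
  va_end(args);
  return buffer;
}
```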
…a string_buffer_clear that just zeros out the vector instead of reallocating.

Benchmarks:
parse_time: median=3217.00, 95th%=13199.00
traversal_time: median=51000.00, 95th%=343000.00
allocations: median=12459.50, 95th%=49388.75
bytes_allocated: median=441710.00, 95th%=1756603.10
high_water_mark: median=194527.00, 95th%=796959.05
hacker_news.html: 4574 microseconds.
html5_spec.html: 654969 microseconds.
xinhua.html: 19776 microseconds.
baidu.html: 5675 microseconds.
yahoo.html: 27398 microseconds.
google.html: 30849 microseconds.
bbc.html: 7762 microseconds.
arabic_newspapers.html: 5798 microseconds.
wikipedia.html: 35138 microseconds.

This gives a roughly 8-10% improvement in CPU performance and reduces the total number of bytes allocated by about 5%. The cost, however, is that it eliminates virtually all of the memory savings from changing the initial vector sizes (it's maybe 5% over master, vs. about 30%). The reason is that the temporary vectors can now grow unbounded, so a large string continues consuming memory until parsing is complete. I'm investigating some approaches to mitigate that.
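A minimal sketch of this kind of clear operation is below, assuming a string buffer struct with data/length/capacity fields; treat the struct layout and function name as illustrative rather than Gumbo's exact definitions.

```c
#include <stddef.h>

typedef struct {
  char* data;
  size_t length;
  size_t capacity;
} GumboStringBuffer;

/* Reset the buffer for reuse without freeing or reallocating its storage. */
void gumbo_string_buffer_clear(GumboStringBuffer* buffer) {
  buffer->length = 0;
}
```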
…its reinitialized to a fresh buffer. This reduces the median high water mark from 195K to ~180K (still well above the 140K that we'd get by always reallocating the temporary buffers), but at the cost of about 50% of the CPU gains we got from gumbo_string_buffer_clear. I'm curious how it interacts with original-buffer-return and arenas, though.
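Building on the string buffer sketch above, the threshold-based reinitialization might look roughly like this; kMaxReusableCapacity and the fresh buffer size are made-up values for illustration, not the ones used in the commit.

```c
#include <stdlib.h>

enum { kMaxReusableCapacity = 1024 };  /* hypothetical threshold */

/* Clear the buffer for reuse; if it has grown very large, release it and
 * start over with a small allocation so one huge string doesn't pin memory
 * for the rest of the parse. */
void string_buffer_clear_bounded(GumboStringBuffer* buffer) {
  if (buffer->capacity > kMaxReusableCapacity) {
    free(buffer->data);
    buffer->capacity = 8;  /* small fresh buffer; exact size is illustrative */
    buffer->data = malloc(buffer->capacity);
  }
  buffer->length = 0;
}
```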
…es. The effect of this change is actually (and counterintuitively) starkly negative: CPU time increases by ~5%, total bytes allocated increases by almost 50%, and memory used increases by 25%. The reason for this is that most text strings are very small, 1-2 chars, and so moving a 6-char buffer to the parse tree and allocating another one instead of just allocating the 1-2 chars necessary and resetting the original buffer is a big loss. I'm keeping this commit solely to serve as a base for an arena implementation, which may be able to win back much of this.
Seems to save about 10-20% on runtime over the previous performance fixes. I'm trying to get memory-usage numbers, but since this removes the custom allocator machinery, it'll take some invasive temporary functions to let the stats program query it.

Benchmarks:
hacker_news.html: 4153 microseconds.
html5_spec.html: 588906 microseconds.
xinhua.html: 15219 microseconds.
baidu.html: 5563 microseconds.
yahoo.html: 23156 microseconds.
google.html: 23392 microseconds.
bbc.html: 7011 microseconds.
arabic_newspapers.html: 4687 microseconds.
wikipedia.html: 28795 microseconds.
… sense using less, as we always return word-aligned chunks) and arena size to 200K.
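As a rough illustration of the word alignment mentioned here, an arena bump allocator typically rounds each request up to pointer size before carving it off the current chunk. The names below are made up, and the 200K chunk size is taken from the numbers above; this is a sketch, not the PR's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

enum { kArenaChunkSize = 200 * 1024 };  /* default chunk size from the commit message */

typedef struct ArenaChunk {
  struct ArenaChunk* next;
  size_t used;
  uint8_t data[kArenaChunkSize];
} ArenaChunk;

/* Round a request up to pointer alignment so every returned pointer is word-aligned. */
static size_t round_up(size_t size) {
  const size_t align = sizeof(void*);
  return (size + align - 1) & ~(align - 1);
}

/* Bump-allocate from the current chunk; a real implementation would chain a
 * new chunk (or report out-of-memory) when the current one is exhausted. */
static void* arena_alloc(ArenaChunk* chunk, size_t size) {
  size = round_up(size);
  if (chunk->used + size > kArenaChunkSize) {
    return NULL;  /* caller would start a new chunk here */
  }
  void* result = chunk->data + chunk->used;
  chunk->used += size;
  return result;
}
```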
…ffer at the maximum arena chunk size and continues parsing, setting the out_of_memory flag.
…This adds an out_of_memory flag to GumboOutput, which is set if malloc fails or a vector is requested that exceeds the maximum chunk size. The allocator will longjmp back to gumbo_parse, and then return the partial parse tree with the flag set. It also adds an arena_chunk_size member to GumboOptions, for sizing the arena.
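The longjmp-based recovery described above would look roughly like the sketch below. The Output struct, helper names, and the global jmp_buf are stand-ins for illustration, not Gumbo's actual internals.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal stand-in; the real GumboOutput has a document node, root, errors, etc. */
typedef struct {
  bool out_of_memory;
} Output;

static jmp_buf out_of_memory_handler;

/* Arena side: an allocation failure (or over-large request) jumps back to the
 * recovery point instead of threading NULL checks through the whole parser. */
static void* arena_malloc_or_jump(size_t size) {
  void* result = malloc(size);
  if (!result) {
    longjmp(out_of_memory_handler, 1);
  }
  return result;
}

/* Parser side: set the recovery point, then run. On OOM we land back here and
 * hand the caller whatever partial tree was built, with the flag set. */
Output* parse_with_oom_recovery(const char* buffer) {
  Output* output = calloc(1, sizeof(Output));
  if (output == NULL) {
    return NULL;
  }
  if (setjmp(out_of_memory_handler) != 0) {
    output->out_of_memory = true;
    return output;  /* partial parse tree */
  }
  /* ... tokenization and tree construction would go here, using
     arena_malloc_or_jump for all allocations ... */
  (void) buffer;
  (void) arena_malloc_or_jump;
  return output;
}
```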
…ist fields on GumboArenaChunk when allocating memory, causing code to blow past the end of the allocation.
…resh, dedicated block for them), and remove arena_chunk_size as a configurable parameter.
+1 from me, FWIW. The current parse->traverse->destroy API seems like the perfect use case for arena allocation. I've already ventured down the path of adding "just a bit of mutability" (as per #311) in lua-gumbo and it's laden with subtle traps and edge cases (not to mention feature creep), even when working in a language with garbage collection.
I've been using a custom allocator for a while, but only as a hack for out-of-memory handling. Since this PR fixes that too, I wouldn't really miss custom allocator support.
FYI, https://github.com/user-none/Sigil/blob/master/internal/gumbo/gumbo_edit.h

As you can see, it is extremely simple and uses the original code from the parser and attributes wherever possible. We only use it for simple modifications such as changing one attribute for another, adding an attribute, removing an attribute, adding a child node, deleting a node and all of its children, etc., just before re-serializing. Most of the code just wraps or calls the exact same code as the original parser (of course with vmg's simplifications).

Under C with malloc/free it is not error-prone or "full of hidden corner-cases". You just need to be careful to know what is malloced and what is not, and to clean up after yourself, just as in the original parser code. Tying this to a garbage-collected language may be an issue. We (Sigil) wrap each invocation of gumbo in C++ and are always careful to keep the original source around as long as the parsed tree is active. We use it in a heavily threaded way to clean/fix multiple epub chapters at once, each with its own parse tree, with no issues. https://github.com/user-none/Sigil/blob/master/src/Misc/GumboInterface.cpp

The only real additional downside of allowing editing, other than not being easily adaptable to an arena allocator (depending on how it is implemented, how large the arena footprint is, and how threadsafe the arena implementation is), is that we must use a larger set of tags, so that they can be used with the edit functions and the tag names are recognized without depending on the original source code always existing.

Again, we are not asking that code like gumbo_edit.h/cpp be added, just that @vmg's changes be accepted to allow things like this to be maintained outside the gumbo tree.

Kevin
I'm still undecided about this patch - my main concern is that some other API change or feature request before 1.0 will require that it be undone, much like the next/prev changes. So my plan for the immediate future is to sit on it and see which direction the library evolves in. In the meantime, thanks for the links to Sigil's client code - that's exactly the sort of data I need. I'd also appreciate it if any silent lurkers could post links to their usage of Gumbo; one problem with open-source software is that I have no idea how it's being used by many of the folks who download it off the website, which makes smart future development a bit more challenging.
Replace the customizable memory allocators with an arena, with the arena being stored in the GumboOutput and freed en masse when the parse tree is destroyed. This also adds out-of-memory handling to the library: an OOM error causes the parser to stop parsing immediately but return the parse tree that it does have to the caller. It builds on the earlier 'performance' pull request and currently includes the commits from it.
Benchmarks are quite promising for this; the exact numbers are in the commits, but seem to indicate a 15-20% performance increase and memory usage that's under 800K for >50% of webpages and under 2M for >95%. There'll need to be some public discussion about whether people are okay with removing the customizable allocators, since this is not backwards-compatible. My early impression is that they're not used by many people, though.
There will likely be a few more cleanup commits added to this - I have to remove the stat-gathering function that was only used for benchmarking, and am tempted to remove the GumboOptions argument from gumbo_destroy_output now that it is no longer needed.
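For context, caller code under this design would look roughly like the sketch below. The out_of_memory field and the single-argument gumbo_destroy_output are the changes proposed (or contemplated) in this PR, so treat this as a sketch of the proposed API rather than the released one.

```c
#include <stdio.h>
#include <gumbo.h>

void parse_document(const char* html) {
  GumboOutput* output = gumbo_parse(html);

  if (output->out_of_memory) {
    /* Proposed behavior: parsing stopped early, but a partial tree is still returned. */
    fprintf(stderr, "warning: parse ran out of memory; tree is partial\n");
  }

  /* ... traverse output->root here ... */

  /* Proposed signature: the arena lives in the output, so no GumboOptions argument. */
  gumbo_destroy_output(output);
}
```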