#Centipede RiegeliWriter: Increase effective throughput by 10-15x
`RiegeliWriter`:

* Switched from flushing after each written blob to maximal leveraging of Riegeli's internal buffering and flushing after N blobs / M bytes / X time (whichever comes first), while maintaining the critical property that only complete records are ever committed to the output file.

The buffered flushing parameters are currently fixed. It might be nice to make them configurable, but right now that would require exposing Riegeli-specific knobs in the generic `BlobFileWriter` API. Once Riegeli becomes the only blob writer in Centipede (maybe soon), we'll be able to parameterize.

Measured effect:

* A real-life distillation run concurrently reading 50 large input shards and mutually exclusively writing them to a single output shard:
  - wall clock time: 5500 sec -> 360 sec
  - effective CPU core utilization: 0.3 -> 2.8
  - no measurable change in RSS usage profile
* Features file in the run above (large blobs):
  - write throughput: ~340 KB/sec -> ~5300 KB/sec
  - file compression: ~6x -> ~7x
* Corpus file in the run above (small blobs):
  - write throughput: ~11 KB/sec -> ~9000 KB/sec
  - file compression: ~2x -> ~12x

PiperOrigin-RevId: 607818600
1 parent d1f5dbb, commit cc15acf
Showing 2 changed files with 132 additions and 14 deletions.