Simpler, character-state-machine based "parser" #4

dhdaines · 2024-09-18T12:09:33Z

Relies on `io` to do the buffering instead of the parser's own fragile buffering logic. EOF handling is still difficult, though.
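
To illustrate the general idea (this is not the code in this PR), here is a minimal sketch of a character-state-machine tokenizer that reads one byte at a time and leaves all buffering to the `io` layer. The names `tokenize` and `WHITESPACE`, the three states, and the simplified token grammar (whitespace-separated words plus `%` comments) are assumptions made for the example. EOF shows up as an empty read and has to be handled in every state, which hints at why it is awkward.

```python
import io
from typing import BinaryIO, Iterator, Tuple

# Hypothetical, simplified character class; not taken from the PR.
WHITESPACE = b" \t\r\n\f\x00"


def tokenize(fp: BinaryIO) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, token) pairs, skipping whitespace and % comments."""
    state = "start"
    start = 0
    token = b""
    while True:
        pos = fp.tell()
        c = fp.read(1)  # no buffering of our own: the io layer handles it
        eof = c == b""  # an empty read is the only EOF signal we get
        if state == "start":
            if eof:
                return
            if c == b"%":
                state = "comment"
            elif c not in WHITESPACE:
                state, start, token = "word", pos, c
        elif state == "comment":
            if eof:
                return
            if c in b"\r\n":
                state = "start"
        elif state == "word":
            if eof or c in WHITESPACE or c == b"%":
                yield start, token  # flush the pending token, even at EOF
                if eof:
                    return
                state = "comment" if c == b"%" else "start"
            else:
                token += c


if __name__ == "__main__":
    data = io.BytesIO(b"1 0 obj % comment\n<< >>\nendobj")
    for offset, tok in tokenize(data):
        print(offset, tok)
```

The same function accepts a file opened with `open(path, "rb")`, in which case the reads go through `io.BufferedReader`.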

This is much faster on actual files (because letting CPython do the buffering is a winning strategy), but for fairly obvious reasons, it's also much slower on BytesIO, and it turns out that most of the "parsing" going on is over BytesIO objects. So there will be a separate PR to add a regex-based "parser" for in-memory data.
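
As a rough sketch of what a regex-based approach to in-memory data could look like (the follow-up PR may well differ), the example below scans a `bytes` object with a single compiled pattern. The names `TOKEN` and `tokenize_bytes` and the simplified grammar (whitespace-separated words plus `%` comments) are assumptions for illustration only.

```python
import re
from typing import Iterator, Tuple

# Hypothetical pattern: either a % comment running to end of line,
# or a run of bytes that are neither whitespace nor '%'.
TOKEN = re.compile(rb"%[^\r\n]*|[^ \t\r\n\f\x00%]+")


def tokenize_bytes(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, token) pairs from in-memory data, skipping comments."""
    for m in TOKEN.finditer(data):
        tok = m.group(0)
        if tok.startswith(b"%"):
            continue  # comments are matched so they can be skipped
        yield m.start(), tok


if __name__ == "__main__":
    for offset, tok in tokenize_bytes(b"1 0 obj % comment\n<< >>\nendobj"):
        print(offset, tok)
```

Here the scanning loop runs inside the `re` engine rather than in Python bytecode, which is the usual motivation for preferring a regex when the whole input is already in memory.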

dhdaines added 15 commits September 17, 2024 20:07

feat: rewrite the parser to not do its own buffering

0cc245f

test: fix tests

e5c5192

fix: miscellaneous

90f7f80

fix: go back to using fp/BinaryIO

abc8ddb

test: fix tests

a6bc6f6

fix(test): BufferedReader not necessary

6661305

fix: read_header broke PDFStreamParser

087e852

fix: address mypy issues

4f286af

chore: format

e37ca93

ci: add basic ci

c58a91c

fix: clean up and correct get_inline_data

204952b

fix: better benchmark

75f9c25

fix: better better benchmark (we are slower still)

a2c3921

feat: comparisons in benchmark

e6dfe40

feat: distinguish BinaryIO and BytesIO in benchmark

b7a7537

dhdaines merged commit 1394c50 into main Sep 18, 2024
1 check passed

dhdaines deleted the simple_parser branch September 18, 2024 16:41
