Releases: VikParuchuri/marker
Fix math delimiter issue
Handle mismatched delimiters.
Additional bugfix
Merge pull request #456 from VikParuchuri/dev (tqdm fix)
Bugfix from last release
Fix an import bug in the last release.
LLM mode, better OCR heuristics, faster
Overview
Significant improvements to quality and speed. There is now LLM mode, which will optionally leverage LLMs to boost output quality. OCR heuristics are significantly improved, and marker will now make good decisions about when to re-OCR the document. Layout model is faster and more accurate.
Quality
- Optionally pass the `--use_llm` flag to improve tables, inline math, forms, complex pages, and general quality.
- Automatically detect bad OCR text and re-OCR the document. This consists of some PDF-level heuristics and a new OCR quality model.
- Pass the `--strip_existing_ocr` flag to always ignore existing OCR and redo it instead.
- Layout blocks are now detected more accurately when passing `--use_llm`.
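As a rough sketch of how these flags combine on the command line, the helper below assembles an argument list in Python. It assumes that `marker_single` is the entry point used for single-file conversion; `build_marker_command` is a hypothetical helper name, and only the two flags come from the notes above.

```python
import shlex


def build_marker_command(pdf_path, use_llm=False, strip_existing_ocr=False):
    """Assemble an argv list for a single-file conversion.

    Assumption: `marker_single` is the CLI entry point. The two flags
    are the ones named in the release notes.
    """
    cmd = ["marker_single", pdf_path]
    if use_llm:
        cmd.append("--use_llm")
    if strip_existing_ocr:
        cmd.append("--strip_existing_ocr")
    return cmd


cmd = build_marker_command("paper.pdf", use_llm=True, strip_existing_ocr=True)
print(shlex.join(cmd))  # marker_single paper.pdf --use_llm --strip_existing_ocr
```

With marker installed, the resulting list could be handed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a string avoids shell-quoting problems with unusual file names.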
Speed
- Layout model is now half the size and ~2x faster (most of the runtime in the general case is layout, so this should result in a big overall speedup). It's also more accurate.
Misc
- Pass the `--disable_image_extraction` flag to avoid extracting images.
- Pass `--use_llm` and `--disable_image_extraction` to automatically convert images to descriptions.
- Made it easy to extract individual block types from the document (for example, getting all tables out)
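To illustrate the block-extraction point, here is a minimal sketch of pulling every block of one type out of a nested JSON-style document tree. It assumes each node is a dict with a `block_type` string and an optional `children` list. That schema, the sample data, and the `collect_blocks` helper are illustrative assumptions, not a confirmed description of marker's output format or API.

```python
def collect_blocks(node, block_type):
    """Depth-first search returning every node whose block_type matches."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # `children` may be missing or None on leaf nodes.
    for child in node.get("children") or []:
        found.extend(collect_blocks(child, block_type))
    return found


# Hypothetical document tree shaped like the assumed schema.
document = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "html": "<p>Intro</p>"},
                {"block_type": "Table", "html": "<table>...</table>"},
            ],
        },
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Table", "html": "<table>...</table>", "children": None},
            ],
        },
    ],
}

tables = collect_blocks(document, "Table")
print(len(tables))  # 2
```

The same function returns an empty list for a type that never occurs, so callers do not need a separate existence check.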
Partial Changelog
- Add New OCR Heuristics Model by @tarun-menta in #427
- Vik dev by @VikParuchuri in #434
- High Quality Layout Builder and Text Processors by @iammosespaulr in #429
- Vik dev by @VikParuchuri in #438
- Vik dev by @VikParuchuri in #447
- Additional heuristics for bad PDF text extraction by @iammosespaulr in #446
- LLM based image captioning by @VikParuchuri in #454
New Contributors
- @tarun-menta made their first contribution in #427
Full Changelog: v1.1.0...v1.2.0
Marker Bugfixes and Improvements to `pdftext`
What's Changed
- Fix `chunk_convert.sh` to handle `output_dir` correctly by @Leon-Sander in #415
- `pdftext` Improvements and Misc Bugfixes by @VikParuchuri and @iammosespaulr in #422
- Blank page and TOC bugfixes
- Fix README.md and updated examples
- Update to the latest pdftext release, incorporating heuristic-based segmentation for enhanced performance and accuracy
- Update surya and tabled dependencies, incorporating various bugfixes.
New Contributors
- @Leon-Sander made their first contribution in #415
Full Changelog: v1.0.2...v1.1.0
Bugfixes - python 3.10 compatibility, quotes, images
- Fix issue with python 3.10
- Fix positions of quote characters
- Change default image output type to JPEG for speed and smaller filesize with minimal quality loss
Bugfixes and parsing improvements
- Fix lots of misc bugs, including encoding, empty page problems, and image rendering
- Improve list processing with joining and nesting
- Add in blockquotes
- Slightly improve performance
What's Changed
- Fix marker server by @VikParuchuri in #396
- Add `ListGroup` joining processor and refactor `Text` joining processor by @iammosespaulr in #402
- Misc fixes by @VikParuchuri in #397
- Add Blockquote Processor by @iammosespaulr in #404
- Add Nested Lists support to ListProcessor by @iammosespaulr in #410
- Marker Improvements and Bugfixes by @iammosespaulr in #403
Full Changelog: v1.0.0...v1.0.1
Marker v1!
This is the release of marker v1, a complete rewrite from scratch.
- 2x faster due to a new layout model
- Consistent internal schema for blocks and pages
- Modular architecture with processors and renderers that can easily be overridden
- JSON chunk and markdown output
- Lots of unit tests
- Much higher output quality
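The modular design can be pictured with a toy pipeline: each processor is a small callable applied in order, and a renderer turns the final blocks into output. Swapping behavior then means passing a different class. All names here (`Block`, `StripWhitespace`, `UppercaseHeadings`, `render_markdown`, `run_pipeline`) are hypothetical and do not mirror marker's actual classes.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_type: str
    text: str


class StripWhitespace:
    """Toy processor: trims surrounding whitespace from every block."""

    def __call__(self, blocks):
        return [Block(b.block_type, b.text.strip()) for b in blocks]


class UppercaseHeadings:
    """Toy processor: upper-cases blocks tagged as headings."""

    def __call__(self, blocks):
        return [
            Block(b.block_type, b.text.upper() if b.block_type == "Heading" else b.text)
            for b in blocks
        ]


def render_markdown(blocks):
    """Toy renderer: headings get a '# ' prefix, everything else is plain."""
    lines = [("# " + b.text) if b.block_type == "Heading" else b.text for b in blocks]
    return "\n\n".join(lines)


def run_pipeline(blocks, processors, renderer):
    """Apply each processor in order, then hand the result to the renderer."""
    for processor in processors:
        blocks = processor(blocks)
    return renderer(blocks)


blocks = [Block("Heading", "  intro "), Block("Text", " hello world ")]
output = run_pipeline(blocks, [StripWhitespace(), UppercaseHeadings()], render_markdown)
print(output)
```

Because the pipeline only requires that processors be callables taking and returning a list of blocks, replacing `UppercaseHeadings` with a custom class needs no change to `run_pipeline`.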
What's Changed
- feat: API server file upload support by @tjbck in #332
- Upgrade line joining by @iammosespaulr in #344
- Surya Layout model and batch multiplier updates by @iammosespaulr in #335
- Initial document skeleton by @VikParuchuri in #345
- Add PDF Provider by @iammosespaulr in #346
- Add Layout Merging by @iammosespaulr in #348
- Vik v2 by @VikParuchuri in #349
- Layout Merging fixes and tests by @iammosespaulr in #350
- Vik v2 by @VikParuchuri in #351
- Decouple Span from Line by @iammosespaulr in #352
- Vik v2 by @VikParuchuri in #353
- Add simple line and span renderer, add blocktype class by @VikParuchuri in #357
- Add markdown renderer, swap how ids are named by @VikParuchuri in #358
- Fix markdown output by @VikParuchuri in #359
- Add OCR Builder by @iammosespaulr in #356
- Output images, clean up other output formats by @VikParuchuri in #362
- Vik v2 by @VikParuchuri in #364
- Cleanup and speed up tests by @iammosespaulr in #363
- Add CI tests by @iammosespaulr in #366
- Add debug utils, fix output quality issues by @VikParuchuri in #367
- Allow Overriding Node Classes by @iammosespaulr in #368
- Reorganize tests by @VikParuchuri in #369
- Minor debugging and misc fixes by @iammosespaulr in #370
- Chunk JSON output by @VikParuchuri in #371
- Vik v2 by @VikParuchuri in #372
- Add code processor, fix issues with structure by @VikParuchuri in #375
- Add Line merging across Pages and Columns by @iammosespaulr in #373
- PDF Converter Initialization refactor + Tests by @iammosespaulr in #379
- Wire up convert_single by @VikParuchuri in #380
- Fix tests by @VikParuchuri in #381
- Add Docstrings for Processors, Builders and Converters and `-l` to list them from the `convert.py` CLI + Misc Fixes by @iammosespaulr in #382
- Fix broken text by @VikParuchuri in #383
- Fix marker app by @VikParuchuri in #384
- Fix marker server by @VikParuchuri in #385
- Misc Bugfixes by @iammosespaulr in #386
- Vik v2 by @VikParuchuri in #387
- Update tests by @iammosespaulr in #388
- Additional Fixes by @iammosespaulr in #390
- Vik v2 by @VikParuchuri in #391
- Marker v2 by @VikParuchuri in #392
- Improve comparison performance by @VikParuchuri in #394
- Dev by @VikParuchuri in #395
Full Changelog: v0.3.10...v1.0.0
Performance improvements, API server
- Improve performance by 10-15%
- Add a simple API server for local use-cases
Flatten PDF, fix page separators, fix torch/transformers bugs
- Fix issues with transformers 4.46 and torch 2.5
- Improve page separators - they now appear at the start of the page, and show the page number
- Flatten form fields into the PDF before extracting markdown