Releases: VikParuchuri/marker

Fix math delimiter issue

03 Jan 21:31
3a20621

Handle mismatched delimiters.

Additional bugfix

02 Jan 21:38
4f6a089
Merge pull request #456 from VikParuchuri/dev

tqdm fix

Bugfix from last release

02 Jan 21:08
1d2da0a

Fix an import bug in the last release.

LLM mode, better OCR heuristics, faster

02 Jan 20:20
ab22a43

Overview

Significant improvements to quality and speed. A new optional LLM mode uses LLMs to boost output quality. OCR heuristics are significantly improved, and marker now decides more reliably when to re-OCR a document. The layout model is faster and more accurate.

Quality

  • Optionally pass the --use_llm flag to improve tables, inline math, forms, complex pages, and general quality.
  • Automatically detect bad OCR text and re-OCR the document. This consists of some PDF-level heuristics and a new OCR quality model.
  • Pass the --strip_existing_ocr flag to always ignore existing OCR and redo it instead.
  • Layout blocks are now detected more accurately when passing --use_llm.
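
As an illustration of how the two quality flags above combine, here is a minimal Python sketch that assembles a command line. The flag names come from these notes; the `marker_single` executable name and the helper `build_marker_command` are assumptions made for the example, so check them against your installed version before running anything.

```python
import shlex
from typing import List


def build_marker_command(
    pdf_path: str,
    use_llm: bool = False,
    strip_existing_ocr: bool = False,
) -> List[str]:
    """Assemble an argument list for a marker conversion run.

    "marker_single" is an assumed entry point name, not confirmed by
    these release notes. The two flags are the ones listed above.
    """
    command = ["marker_single", pdf_path]
    if use_llm:
        # Optional LLM mode for tables, inline math, forms, complex pages.
        command.append("--use_llm")
    if strip_existing_ocr:
        # Always ignore OCR text already embedded in the PDF and redo it.
        command.append("--strip_existing_ocr")
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even where marker is not installed.
    print(shlex.join(build_marker_command("paper.pdf", use_llm=True)))
```

The list returned by the helper can be handed to `subprocess.run` unchanged, which avoids shell quoting problems with file names that contain spaces.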

Speed

  • Layout model is now half the size and ~2x faster (most of the runtime in the general case is layout, so this should result in a big overall speedup). It's also more accurate.

Misc

  • Pass the --disable_image_extraction flag to avoid extracting images.
  • Pass --use_llm and --disable_image_extraction to automatically convert images to descriptions.
  • Made it easy to extract individual block types from the document (for example, all tables).
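
To make the idea of pulling out one block type concrete, the sketch below walks a nested block tree and yields every block of a given type. It is not marker's own API: it assumes the document has been exported as nested dictionaries with `block_type` and `children` keys, and the helper name `iter_blocks` and the sample data are invented for the example.

```python
from typing import Any, Dict, Iterator


def iter_blocks(block: Dict[str, Any], block_type: str) -> Iterator[Dict[str, Any]]:
    """Yield every block of the requested type, depth first.

    Assumes each block is a dict with a "block_type" string and an
    optional "children" list (which may be missing or None).
    """
    if block.get("block_type") == block_type:
        yield block
    for child in block.get("children") or []:
        yield from iter_blocks(child, block_type)


# Made-up sample tree: two pages, two tables, one text block.
sample_document = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "html": "<p>Intro</p>", "children": None},
                {"block_type": "Table", "html": "<table>A</table>", "children": None},
            ],
        },
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Table", "html": "<table>B</table>", "children": None},
            ],
        },
    ],
}

tables = list(iter_blocks(sample_document, "Table"))
```

A generator keeps the traversal lazy, so a caller that only needs the first table can stop early without visiting the rest of the tree.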

Full Changelog: v1.1.0...v1.2.0

Marker Bugfixes and Improvements to `pdftext`

12 Dec 19:09
9185517

What's Changed

  • Fix chunk_convert.sh to handle output_dir correctly by @Leon-Sander in #415
  • pdftext Improvements and Misc Bugfixes by @VikParuchuri and @iammosespaulr in #422
    • Blank page and TOC bugfixes
    • Fix README.md and update examples
    • Update to the latest pdftext release, incorporating heuristic-based segmentation for enhanced performance and accuracy
    • Update surya and tabled dependencies, incorporating various bugfixes.

Full Changelog: v1.0.2...v1.1.0

Bugfixes - python 3.10 compatibility, quotes, images

03 Dec 21:12
6ded3b9
  • Fix issue with Python 3.10
  • Fix positions of quote characters
  • Change default image output type to JPEG for speed and smaller filesize with minimal quality loss

Bugfixes and parsing improvements

03 Dec 01:16
f446e56
  • Fix lots of misc bugs, including encoding, empty page problems, and image rendering
  • Improve list processing with joining and nesting
  • Add in blockquotes
  • Slightly improve performance

Full Changelog: v1.0.0...v1.0.1

Marker v1!

27 Nov 17:47
75091a0

This is the release of marker v1, a complete rewrite from scratch.

  • 2x faster due to a new layout model
  • Consistent internal schema for blocks and pages
  • Modular architecture with processors and renderers that can easily be overridden
  • JSON chunk and markdown output
  • Lots of unit tests
  • Much higher output quality

Full Changelog: v0.3.10...v1.0.0

Performance improvements, API server

31 Oct 15:20
b8a8736
  • Improve performance by 10-15%
  • Add a simple API server for local use-cases

Flatten PDF, fix page separators, fix torch/transformers bugs

25 Oct 17:04
b2cae2e
  • Fix issues with transformers 4.46 and torch 2.5
  • Improve page separators: they now appear at the start of the page and show the page number
  • Flatten form fields into the PDF before extracting markdown