03 Jan 21:31

VikParuchuri

3a20621

Fix math delimiter issue

Handle mismatched delimiters.

02 Jan 21:38

VikParuchuri

v1.2.2

4f6a089

Additional bugfix

Merge pull request #456 from VikParuchuri/dev

tqdm fix

02 Jan 21:08

VikParuchuri

v1.2.1

1d2da0a

Bugfix from last release

Fix an import bug in the last release.

02 Jan 20:20

VikParuchuri

v1.2.0

ab22a43

LLM mode, better OCR heuristics, faster

Overview

Significant improvements to quality and speed. There is now LLM mode, which will optionally leverage LLMs to boost output quality. OCR heuristics are significantly improved, and marker will now make good decisions about when to re-OCR the document. Layout model is faster and more accurate.

Quality

Optionally pass the --use_llm flag to improve tables, inline math, forms, complex pages, and general quality.
Automatically detect bad OCR text and re-OCR the document. This consists of some PDF-level heuristics and a new OCR quality model.
Pass the --strip_existing_ocr flag to always ignore existing OCR and redo it instead.
Layout blocks are now detected more accurately when passing --use_llm.
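
The quality flags above can be combined on a single invocation. The following is a minimal sketch of assembling such a command line from Python. It assumes `marker_single` is the CLI entry point for converting one file, and `build_marker_command` is a hypothetical helper introduced only for illustration. It prints the command rather than running it.

```python
import shlex


def build_marker_command(pdf_path, use_llm=False, strip_existing_ocr=False):
    """Assemble an argv list for a single-file conversion.

    Assumption: `marker_single` is the CLI entry point. The two flags
    are the ones named in these release notes.
    """
    cmd = ["marker_single", pdf_path]
    if use_llm:
        # Opt in to LLM mode for tables, inline math, forms, and complex pages.
        cmd.append("--use_llm")
    if strip_existing_ocr:
        # Ignore any OCR text already embedded in the PDF and redo it.
        cmd.append("--strip_existing_ocr")
    return cmd


if __name__ == "__main__":
    cmd = build_marker_command("paper.pdf", use_llm=True, strip_existing_ocr=True)
    # Print a copy-pasteable shell line instead of executing it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.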

Speed

Layout model is now half the size and ~2x faster (most of the runtime in the general case is layout, so this should result in a big overall speedup). It's also more accurate.

Misc

Pass the --disable_image_extraction flag to avoid extracting images.
Pass --use_llm and --disable_image_extraction to automatically convert images to descriptions.
Individual block types can now be extracted from the document easily (for example, pulling out all tables).
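
As an illustration of extracting a single block type, here is a minimal sketch that walks a nested block tree and collects every table. It is not marker's own API. It assumes the document has been loaded as nested dictionaries with `block_type` and `children` keys, and both the `collect_blocks` helper and the sample data are hypothetical.

```python
def collect_blocks(node, block_type):
    """Return every block of the given type in a nested tree, in document order."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # Leaf blocks may have no "children" key or a None value.
    for child in node.get("children") or []:
        found.extend(collect_blocks(child, block_type))
    return found


# Hypothetical sample data shaped like a small two-page document.
document = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "id": "p0-text"},
                {"block_type": "Table", "id": "p0-table"},
            ],
        },
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Table", "id": "p1-table", "children": None},
            ],
        },
    ],
}

tables = collect_blocks(document, "Table")
table_ids = [t["id"] for t in tables]
print(table_ids)
```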

Partial Changelog

Add New OCR Heuristics Model by @tarun-menta in #427
Vik dev by @VikParuchuri in #434
High Quality Layout Builder and Text Processors by @iammosespaulr in #429
Vik dev by @VikParuchuri in #438
Vik dev by @VikParuchuri in #447
Additional heuristics for bad PDF text extraction by @iammosespaulr in #446
LLM based image captioning by @VikParuchuri in #454

New Contributors

@tarun-menta made their first contribution in #427

Full Changelog: v1.1.0...v1.2.0

Contributors

VikParuchuri, iammosespaulr, and tarun-menta

12 Dec 19:09

iammosespaulr

v1.1.0

9185517

Marker Bugfixes and Improvements to `pdftext`

What's Changed

Fix chunk_convert.sh to handle output_dir correctly by @Leon-Sander in #415
pdftext Improvements and Misc Bugfixes by @VikParuchuri and @iammosespaulr in #422
- Blank page and TOC bugfixes
- Fix README.md and updated examples
- Update to the latest pdftext release, incorporating heuristic-based segmentation for enhanced performance and accuracy
- Update surya and tabled dependencies, incorporating various bugfixes.

New Contributors

@Leon-Sander made their first contribution in #415

Full Changelog: v1.0.2...v1.1.0

Contributors

VikParuchuri, iammosespaulr, and Leon-Sander

03 Dec 21:12

VikParuchuri

v1.0.2

6ded3b9

Bugfixes - python 3.10 compatibility, quotes, images

Fix issue with python 3.10
Fix positions of quote characters
Change default image output type to JPEG for speed and smaller filesize with minimal quality loss

03 Dec 01:16

VikParuchuri

v1.0.1

f446e56

Bugfixes and parsing improvements

Fix lots of misc bugs, including encoding, empty page problems, and image rendering
Improve list processing with joining and nesting
Add in blockquotes
Slightly improve performance

What's Changed

Fix marker server by @VikParuchuri in #396
Add ListGroup joining processor and refactor Text joining processor by @iammosespaulr in #402
Misc fixes by @VikParuchuri in #397
Add Blockquote Processor by @iammosespaulr in #404
Add Nested Lists support to ListProcessor by @iammosespaulr in #410
Marker Improvements and Bugfixes by @iammosespaulr in #403

Full Changelog: v1.0.0...v1.0.1

Contributors

VikParuchuri and iammosespaulr

27 Nov 17:47

VikParuchuri

v1.0.0

75091a0

Marker v1!

This is the release of marker v1, a complete rewrite from scratch.

2x faster due to a new layout model
Consistent internal schema for blocks and pages
Modular architecture with processors and renderers that can easily be overridden
JSON chunk and markdown output
Lots of unit tests
Much higher output quality

What's Changed

feat: API server file upload support by @tjbck in #332
Upgrade line joining by @iammosespaulr in #344
Surya Layout model and batch multiplier updates by @iammosespaulr in #335
Initial document skeleton by @VikParuchuri in #345
Add PDF Provider by @iammosespaulr in #346
Add Layout Merging by @iammosespaulr in #348
Vik v2 by @VikParuchuri in #349
Layout Merging fixes and tests by @iammosespaulr in #350
Vik v2 by @VikParuchuri in #351
Decouple Span from Line by @iammosespaulr in #352
Vik v2 by @VikParuchuri in #353
Add simple line and span renderer, add blocktype class by @VikParuchuri in #357
Add markdown renderer, swap how ids are named by @VikParuchuri in #358
Fix markdown output by @VikParuchuri in #359
Add OCR Builder by @iammosespaulr in #356
Output images, clean up other output formats by @VikParuchuri in #362
Vik v2 by @VikParuchuri in #364
Cleanup and speed up tests by @iammosespaulr in #363
Add CI tests by @iammosespaulr in #366
Add debug utils, fix output quality issues by @VikParuchuri in #367
Allow Overriding Node Classes by @iammosespaulr in #368
Reorganize tests by @VikParuchuri in #369
Minor debugging and misc fixes by @iammosespaulr in #370
Chunk JSON output by @VikParuchuri in #371
Vik v2 by @VikParuchuri in #372
Add code processor, fix issues with structure by @VikParuchuri in #375
Add Line merging across Pages and Columns by @iammosespaulr in #373
PDF Converter Initialization refactor + Tests by @iammosespaulr in #379
Wire up convert_single by @VikParuchuri in #380
Fix tests by @VikParuchuri in #381
Add Docstrings for Processors, Builders and Converters and -l to list them from the convert.py CLI + Misc Fixes by @iammosespaulr in #382
Fix broken text by @VikParuchuri in #383
Fix marker app by @VikParuchuri in #384
Fix marker server by @VikParuchuri in #385
Misc Bugfixes by @iammosespaulr in #386
Vik v2 by @VikParuchuri in #387
Update tests by @iammosespaulr in #388
Additional Fixes by @iammosespaulr in #390
Vik v2 by @VikParuchuri in #391
Marker v2 by @VikParuchuri in #392
Improve comparison performance by @VikParuchuri in #394
Dev by @VikParuchuri in #395

New Contributors

@tjbck made their first contribution in #332

Full Changelog: v0.3.10...v1.0.0

Contributors

VikParuchuri, tjbck, and iammosespaulr

31 Oct 15:20

VikParuchuri

v0.3.10

b8a8736

Performance improvements, API server

Improve performance by 10-15%
Add a simple API server for local use-cases

25 Oct 17:04

VikParuchuri

v0.3.9

b2cae2e

Flatten PDF, fix page separators, fix torch/transformers bugs

Fix issues with transformers 4.46 and torch 2.5
Improve page separators: they now appear at the start of the page and show the page number
Flatten form fields into the PDF before extracting markdown

Releases: VikParuchuri/marker
