Releases · dhdaines/playa · GitHub

07 Jan 17:21

dhdaines

PLAYA-PDF 0.2.7: Definitive 0.2.x release (Latest)

What's Changed

Remove most uses of typing.cast by @dhdaines in #37
Optimize text placement (some dare call it "rendering") by @dhdaines in #38
Fix font size and rotated/skewed bounding boxes by @dhdaines in #39
fix: deprecate layout in CLI right away and do other useful stuff by @dhdaines in #40
Correctly implement ToUnicode according to the PDF standard and not that bogus technical note (that the PDF standard refers to...) by @dhdaines in #41
feat: support slices and tuples in page list by @dhdaines in #42
Optimize text extraction a bit more by @dhdaines in #43
Make text less Lazy 😥 by @dhdaines in #47
Treat marked content sections (more) correctly
fix: recognize junk before header and compensate (fixes: #46) by @dhdaines in #48
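The last item above concerns files where stray bytes precede the `%PDF-` header. A minimal sketch of one way to compensate, assuming (this is not stated in the release notes) that byte offsets recorded in the file were computed relative to the header rather than to the true start of the file. The function names and the 1024-byte search window are illustrative choices, not PLAYA's API:

```python
def find_header_offset(data: bytes, search_limit: int = 1024) -> int:
    """Return the byte offset of the ``%PDF-`` header, or -1 if it is absent.

    Only the first ``search_limit`` bytes are scanned (an assumed window).
    """
    return data[:search_limit].find(b"%PDF-")


def adjust_offset(recorded: int, header_offset: int) -> int:
    """Convert a header-relative offset into an absolute file position."""
    return recorded + header_offset


# Six bytes of junk, then a header and one object.
data = b"junk\r\n%PDF-1.7\n1 0 obj\n"
header_offset = find_header_offset(data)
# The object sits 9 bytes after the start of the header.
object_position = adjust_offset(9, header_offset)
```

Whether offsets should be shifted at all depends on how the junk got into the file, so a real parser would likely verify that an object actually starts at the adjusted position before trusting it.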

Full Changelog: v0.2.6...v0.2.7
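For the "slices and tuples in page list" feature in this release, here is a rough sketch of how a page container can accept an integer, a slice, or a tuple of indices in `__getitem__`. The `PageList` class below is a hypothetical stand-in holding strings; it is not PLAYA's implementation, and the real return types may differ:

```python
from typing import Iterable, Iterator, List, Sequence, Union


class PageList:
    """Hypothetical page container indexed by int, slice, or tuple of ints."""

    def __init__(self, pages: Sequence[str]) -> None:
        self._pages: List[str] = list(pages)

    def __len__(self) -> int:
        return len(self._pages)

    def __iter__(self) -> Iterator[str]:
        return iter(self._pages)

    def __getitem__(self, key: Union[int, slice, Iterable[int]]):
        if isinstance(key, int):
            # A single index returns a single page.
            return self._pages[key]
        if isinstance(key, slice):
            # A slice returns a new, smaller page list.
            return PageList(self._pages[key])
        # Anything else is treated as an iterable of indices, e.g. pages[0, 3].
        return PageList([self._pages[i] for i in key])


pages = PageList(["p1", "p2", "p3", "p4"])
middle = list(pages[1:3])   # slice
picked = list(pages[0, 3])  # tuple of indices
```

Returning a new `PageList` for slices and tuples keeps the result iterable and sliceable in the same way as the original.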

Contributors

dhdaines



30 Dec 18:30

dhdaines

PLAYA-PDF 0.2.6: New year, new acronym

What's Changed

ci: test on windows and mac by @dhdaines in #33
Support parallel operations over pages by @dhdaines in #36
Partially correct the handling of some types of CMaps (not fully correct though)
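The release notes do not describe how parallel operations over pages are exposed, so the following is only a generic sketch of the idea using the standard library. `FAKE_DOCUMENT`, `count_words`, and `map_pages` are hypothetical names; the worker takes a page index rather than a page object on the assumption that small values are cheaper to send between workers than parsed pages:

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

# Stand-in for a parsed document: one string of "text" per page.
FAKE_DOCUMENT = ["first page text", "second page", "third page has more words"]


def count_words(page_index: int) -> int:
    """Per-page work: count whitespace-separated words on one page."""
    return len(FAKE_DOCUMENT[page_index].split())


def map_pages(
    func: Callable[[int], T], page_indices: Iterable[int], executor: Executor
) -> List[T]:
    """Apply ``func`` to each page index; results come back in page order."""
    return list(executor.map(func, page_indices))


# Threads keep this demo portable. For CPU-bound work, a ProcessPoolExecutor
# (created under an ``if __name__ == "__main__":`` guard) can be passed instead.
with ThreadPoolExecutor(max_workers=2) as pool:
    word_counts = map_pages(count_words, range(len(FAKE_DOCUMENT)), pool)
```

Because `Executor.map` preserves input order, the results line up with the page order regardless of which worker finishes first.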

Full Changelog: v0.2.5...v0.2.6

Contributors

dhdaines


15 Dec 18:08

dhdaines

PLAYA-PDF 0.2.5: Bug fixes and improvements

What's Changed

Fix various bugs in the lazy API
- Add specialized __len__ methods to ContentObject classes
- Clarify iteration over ContentObject
Fix installation of playa-pdf[crypto]
Fix attribute classes in structure tree elements
Deprecate "user" device space to avoid confusion with user space
Parse embedded CMaps (mostly)
Update pdfplumber support
Add parser for object streams and iterator over all indirect objects in a document

Full Changelog: v0.2.4...v0.2.5


03 Dec 04:08

dhdaines

v0.2.4

What's Changed

Add (and fix) 3rd party test suites, primarily pdf.js by @dhdaines in #26
Try much harder to read even very broken PDFs
Try somewhat harder to not produce empty TextObject (still a work in progress)

Full Changelog: v0.2.3...v0.2.4

Contributors

dhdaines


28 Nov 22:09

dhdaines

PLAYA-PDF 0.2.3: Release early and often (before vacation)

What's Changed

Require a newline before EI to fix various inline images by @dhdaines in #25
Refactoring the CMap parser missed a very important corner case (which somehow mypy did not flag?)
structtree property did not actually exist on Document and Page (oops!)
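The inline image fix in this release exists because the binary data between `ID` and `EI` can itself contain the bytes `EI`. A minimal sketch of the "require a newline before EI" heuristic follows; the regular expression and the function name are illustrative assumptions rather than PLAYA's actual code:

```python
import re
from typing import Optional

# A line ending, then "EI", then whitespace or end of input.
_END_IMAGE = re.compile(rb"(?:\r\n|\r|\n)EI(?=\s|$)")


def find_inline_image_end(buf: bytes, start: int) -> Optional[int]:
    """Return the offset where the image data ends, or None if no EI is found.

    ``start`` is the offset of the first data byte after the ``ID`` operator.
    """
    match = _END_IMAGE.search(buf, start)
    return match.start() if match else None


# The data deliberately contains "EI" with no newline in front of it.
stream = b"BI /W 2 /H 1 ID \x00EI\xff\nEI\nQ"
data_start = stream.index(b"ID ") + 3
data_end = find_inline_image_end(stream, data_start)
image_data = stream[data_start:data_end]
```

This is still a heuristic: image data can contain a newline followed by `EI`, and some producers write only a space before `EI`, so a robust parser needs a fallback for both cases.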

Full Changelog: v0.2.2...v0.2.3

Contributors

dhdaines


28 Nov 03:59

dhdaines

PLAYA-PDF 0.2.2: Make it go fast again

What's Changed

Resolve filters before checking if it isn't a list by @dhdaines in #22
Verify that we don't have pdfminer.six#1059 (and warn about it) by @dhdaines in #23
Optimize cmaps by @dhdaines in #24

Full Changelog: v0.2.1...v0.2.2

Contributors

dhdaines


27 Nov 04:01

dhdaines

PLAYA-PDF 0.2.1: Fix some bugs

What's Changed

Fix the RLE implementation by @dhdaines in #19 (originally pdfminer/pdfminer.six#1055 by @helpmefindaname)
Report the actual device space bounding box for rotated text by @dhdaines in #20
Prevent endless looping on bogus stream length and other EOFs by @dhdaines in #21
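For context on the RLE fix, this is a small sketch of the decoding rule for the PDF `RunLengthDecode` filter: a length byte from 0 to 127 copies the next length + 1 bytes literally, a length byte from 129 to 255 repeats the next byte 257 - length times, and 128 marks end of data. It is written from the filter's definition, not copied from PLAYA or pdfminer.six, and the handling of truncated input is an assumption:

```python
def rle_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data, stopping at EOD (128) or end of input."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # End-of-data marker.
            break
        if length < 128:
            # Literal run: copy the next length + 1 bytes as they are.
            out += data[i:i + length + 1]
            i += length + 1
        else:
            # Repeated run: the next byte appears 257 - length times.
            if i >= len(data):
                break  # truncated input, nothing left to repeat
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


encoded = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128])
decoded = rle_decode(encoded)
```

Every branch advances `i`, so the loop terminates even on malformed input, which is the same concern as the endless-looping fix listed above.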

Full Changelog: v0.2...v0.2.1

Contributors

dhdaines and helpmefindaname


26 Nov 03:36

dhdaines

PLAYA-PDF 0.2: Break all the APIs

What's Changed

Support TIFF predictor on image streams by @dhdaines in #18 (originally from pdfminer/pdfminer.six#1058 by @helpmefindaname)
Support different "device spaces" (screen, page, and default user space)
expose form XObjects on Page to allow getting only their contents
expose form XObject IDs in LayoutDict
make TextState conform to PDF spec (leading and line matrix) and document it
expose more of TextState in LayoutDict (render mode in particular)
do not try to map characters with no ToUnicode and no Encoding
properly support Pattern color space (uncolored tiling patterns) the way pdfplumber expects it to work
support marked content points as ContentObjects
document ContentObjects
make a proper schema for LayoutDict, document it, and communicate it to Polars
separate color values and patterns in LayoutDict
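The TIFF predictor support listed first in this release refers to Predictor 2, where each sample is stored as the difference from the sample to its left in the same color component. Below is a rough sketch of undoing it, assuming 8 bits per component (other bit depths need different handling); the function name and signature are hypothetical:

```python
def undo_tiff_predictor(data: bytes, columns: int, colors: int = 1) -> bytes:
    """Undo TIFF Predictor 2 for 8-bit samples.

    ``columns`` is the number of pixels per row and ``colors`` the number of
    components per pixel. Each row is decoded independently.
    """
    row_size = columns * colors
    out = bytearray(data)
    for row_start in range(0, len(out), row_size):
        row_end = min(row_start + row_size, len(out))
        # The first pixel of each row is stored as-is; later samples are deltas.
        for i in range(row_start + colors, row_end):
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)


# Two grayscale rows of four pixels each, stored as differences.
encoded = bytes([10, 2, 3, 0, 250, 10, 0, 1])
decoded = undo_tiff_predictor(encoded, columns=4)
```

The `& 0xFF` keeps the running sum within one byte, so a delta that overflows wraps around as it does in the second row of the example.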

Full Changelog: v0.1.2...v0.2

Contributors

dhdaines and helpmefindaname


20 Nov 05:20

dhdaines

PLAYA 0.1.2: Initial release

Here's a first release, in case you want to use this. Reasons you might do so include:

Faster than pdfminer.six (about 20% or so)
Much friendlier APIs than PDFPageAggregator, PDFResourceManager, PDFPage, etc, etc.
Many outstanding pdfminer.six bugs have been fixed

Why would you not want to use this?

PyPI package name is not actually playa because somebody else took that name 13 years ago.
May be more or less tolerant of broken PDFs than pdfminer.six, and has no "strict mode" to be absolutely intolerant.
Doesn't let you extract image data (this is not always useful since PDFs tend to use compositing and thus you should use a real PDF renderer like pypdfium2 if you want to reliably extract images)
Is not (or ain't) a layout analyzer, so no LAParams, TextBox, and so on.
API subject to change and refinement.
Does not have abstractions. You do not have the flexibility to subclass everything and build a PDF renderer on top of PLAYA.
Probably contains bugs.
Definitely lacks documentation.
