Releases: dhdaines/playa
PLAYA-PDF 0.2.7: Definitive 0.2.x release
What's Changed
- Remove most uses of Typing.cast by @dhdaines in #37
- Optimize text placement (some dare call it "rendering") by @dhdaines in #38
- Fix font size and rotated/skewed bounding boxes by @dhdaines in #39
- fix: deprecate layout in CLI right away and do other useful stuff by @dhdaines in #40
- Correctly implement ToUnicode according to the PDF standard and not that bogus technical note (that the PDF standard refers to...) by @dhdaines in #41
- feat: support slices and tuples in page list by @dhdaines in #42
- Optimize text extraction a bit more by @dhdaines in #43
- Make text less Lazy 😥 by @dhdaines in #47
- Treat marked content sections (more) correctly
- fix: recognize junk before header and compensate (fixes: #46) by @dhdaines in #48
Full Changelog: v0.2.6...v0.2.7
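To illustrate the "junk before header" fix listed above, here is a minimal sketch of the general idea, not PLAYA's actual code. The function names, the 1024-byte search window, and the notion that stored byte offsets should be shifted by the header position are all assumptions made for this example.

```python
def find_header_offset(data: bytes, window: int = 1024) -> int:
    """Return the position of the first b'%PDF-' marker within the
    leading `window` bytes, or -1 if no marker is found there.

    Hypothetical helper: the window size is an assumption.
    """
    return data[:window].find(b"%PDF-")


def compensate(stored_offset: int, header_offset: int) -> int:
    """Shift an offset recorded relative to the header so that it
    points at the right byte in a file with leading junk.

    Assumption: the file's offsets were written as if the header
    were at byte 0.
    """
    return stored_offset + header_offset


junk = b"\r\n<junk>\n"
data = junk + b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n"
header_at = find_header_offset(data)
# "1 0 obj" begins 9 bytes after the start of the header.
obj_at = compensate(9, header_at)
```

In this sketch a reader would locate the header once, then apply the shift whenever it follows a stored offset.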
PLAYA-PDF 0.2.6: New year, new acronym
What's Changed
- ci: test on windows and mac by @dhdaines in #33
- Support parallel operations over pages by @dhdaines in #36
- Partially correct the handling of some types of CMaps (not fully correct though)
Full Changelog: v0.2.5...v0.2.6
PLAYA-PDF 0.2.5: Bug fixes and improvements
What's Changed
- Fix various bugs in the lazy API
- Add specialized __len__ methods to ContentObject classes
- Clarify iteration over ContentObject
- Fix installation of playa-pdf[crypto]
- Fix attribute classes in structure tree elements
- Deprecate "user" device space to avoid confusion with user space
- Parse embedded CMaps (mostly)
- Update pdfplumber support
- Add parser for object streams and iterator over all indirect objects in a document
Full Changelog: v0.2.4...v0.2.5
v0.2.4
PLAYA-PDF 0.2.3: Release early and often (before vacation)
What's Changed
- Require a newline before EI to fix various inline images by @dhdaines in #25
- Refactoring the CMap parser missed a very important corner case (which somehow mypy did not flag?)
- structtree property did not actually exist on Document and Page (oops!)
Full Changelog: v0.2.2...v0.2.3
PLAYA-PDF 0.2.2: Make it go fast again
PLAYA-PDF 0.2.1: Fix some bugs
What's Changed
- Fix the RLE implementation by @dhdaines in #19 (originally pdfminer/pdfminer.six#1055 by @helpmefindaname)
- Report the actual device space bounding box for rotated text by @dhdaines in #20
- Prevent endless looping on bogus stream length and other EOFs by @dhdaines in #21
Full Changelog: v0.2...v0.2.1
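For readers unfamiliar with the RLE filter mentioned above, the following is a minimal sketch of RunLengthDecode as the PDF specification describes it, not the code from #19. The function name is hypothetical. The loop is bounded by the input length, so truncated input ends decoding instead of looping forever.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode a PDF RunLengthDecode stream (illustrative sketch).

    Each run starts with a length byte:
      0..127   copy the next (length + 1) bytes literally
      129..255 repeat the next byte (257 - length) times
      128      end of data
    """
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # EOD marker
            break
        if length < 128:
            # Literal run; slicing tolerates truncated input.
            out += data[i : i + length + 1]
            i += length + 1
        else:
            if i >= len(data):  # truncated: no byte left to repeat
                break
            out += data[i : i + 1] * (257 - length)
            i += 1
    return bytes(out)


encoded = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128])
decoded = run_length_decode(encoded)
```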
PLAYA-PDF 0.2: Break all the APIs
What's Changed
- Support TIFF predictor on image streams by @dhdaines in #18 (originally from pdfminer/pdfminer.six#1058 by @helpmefindaname)
- Support different "device spaces" (screen, page, and default user space)
- expose form XObjects on Page to allow getting only their contents
- expose form XObject IDs in LayoutDict
- make TextState conform to PDF spec (leading and line matrix) and document it
- expose more of TextState in LayoutDict (render mode in particular)
- do not try to map characters with no ToUnicode and no Encoding
- properly support Pattern color space (uncolored tiling patterns) the way pdfplumber expects it to work
- support marked content points as ContentObjects
- document ContentObjects
- make a proper schema for LayoutDict, document it, and communicate it to Polars
- separate color values and patterns in LayoutDict
Full Changelog: v0.1.2...v0.2
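As background for the TIFF predictor item above, here is a minimal sketch of undoing TIFF Predictor 2 (horizontal differencing), not the code from #18. It assumes 8 bits per component; other bit depths need bit-level unpacking that this example omits. The function name and parameter names are hypothetical, loosely following the Columns and Colors decode parameters.

```python
def undo_tiff_predictor(data: bytes, columns: int, colors: int = 1) -> bytes:
    """Reverse horizontal differencing for 8-bit components.

    Each component was stored as the difference from the same
    component of the pixel to its left, so decoding is a running
    sum (modulo 256) along each row.
    """
    row_len = columns * colors
    out = bytearray(data)
    for start in range(0, len(out), row_len):
        end = min(start + row_len, len(out))
        # The first pixel of each row is stored as-is.
        for i in range(start + colors, end):
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)


# Two rows of four grayscale pixels, stored as differences.
encoded_rows = bytes([10, 2, 3, 0, 5, 255, 0, 1])
pixels = undo_tiff_predictor(encoded_rows, columns=4, colors=1)
```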
PLAYA 0.1.2: Initial release
Here's a first release, in case you want to use this. Reasons you might do so include:
- Faster than pdfminer.six (about 20% or so)
- Much friendlier APIs than PDFPageAggregator, PDFResourceManager, PDFPage, etc, etc.
- Many outstanding pdfminer.six bugs have been fixed
Why would you not want to use this?
- PyPI package name is not actually playa because somebody else took that name 13 years ago.
- May be more or less tolerant of broken PDFs than pdfminer.six, and has no "strict mode" to be absolutely intolerant.
- Doesn't let you extract image data (this is not always useful since PDFs tend to use compositing and thus you should use a real PDF renderer like pypdfium2 if you want to reliably extract images)
- Is not (or ain't) a layout analyzer, so no LAParams, TextBox, and so on.
- API subject to change and refinement.
- Does not have abstractions. You do not have the flexibility to subclass everything and build a PDF renderer on top of PLAYA.
- Probably contains bugs.
- Definitely lacks documentation.