Releases: dhdaines/playa

PLAYA-PDF 0.2.7: Definitive 0.2.x release

07 Jan 17:21
024b797

What's Changed

  • Remove most uses of typing.cast by @dhdaines in #37
  • Optimize text placement (some dare call it "rendering") by @dhdaines in #38
  • Fix font size and rotated/skewed bounding boxes by @dhdaines in #39
  • fix: deprecate layout in CLI right away and do other useful stuff by @dhdaines in #40
  • Correctly implement ToUnicode according to the PDF standard and not that bogus technical note (that the PDF standard refers to...) by @dhdaines in #41
  • feat: support slices and tuples in page list by @dhdaines in #42
  • Optimize text extraction a bit more by @dhdaines in #43
  • Make text less Lazy 😥 by @dhdaines in #47
  • Treat marked content sections (more) correctly
  • fix: recognize junk before header and compensate (fixes: #46) by @dhdaines in #48
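
The last item above (junk before the header) can be illustrated with a small sketch. This is not PLAYA's actual code: the function names, the 1024-byte search window, and the idea of compensating by shifting recorded offsets by the header position are all assumptions made for illustration.

```python
HEADER_MARKER = b"%PDF-"


def find_header_offset(data: bytes, window: int = 1024) -> int:
    """Return the position of the PDF header marker.

    The search window is an assumed limit for this sketch, not a value
    taken from PLAYA.
    """
    pos = data.find(HEADER_MARKER, 0, window)
    if pos == -1:
        raise ValueError("no PDF header found in the search window")
    return pos


def adjust_offset(recorded: int, header_offset: int) -> int:
    """Shift an offset recorded relative to the header by the junk length."""
    return recorded + header_offset


# A file with 13 bytes of junk before the header (made-up sample data).
sample = b"\r\njunk bytes\n%PDF-1.7\n1 0 obj\n<<>>\nendobj\n"
header_offset = find_header_offset(sample)
# In the junk-free file, "1 0 obj" would start at byte 9.
object_position = adjust_offset(9, header_offset)
```

Here `sample[object_position:]` starts at the object definition even though the offset 9 was computed as if the file began at the header.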

Full Changelog: v0.2.6...v0.2.7

PLAYA-PDF 0.2.6: New year, new acronym

30 Dec 18:30

What's Changed

  • ci: test on windows and mac by @dhdaines in #33
  • Support parallel operations over pages by @dhdaines in #36
  • Partially correct the handling of some types of CMaps (not fully correct though)

Full Changelog: v0.2.5...v0.2.6

PLAYA-PDF 0.2.5: Bug fixes and improvements

15 Dec 18:08

What's Changed

  • Fix various bugs in the lazy API
    • Add specialized __len__ methods to ContentObject classes
    • Clarify iteration over ContentObject
  • Fix installation of playa-pdf[crypto]
  • Fix attribute classes in structure tree elements
  • Deprecate "user" device space to avoid confusion with user space
  • Parse embedded CMaps (mostly)
  • Update pdfplumber support
  • Add parser for object streams and iterator over all indirect objects
    in a document
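
For readers unfamiliar with object streams, here is a minimal sketch of how the index at the start of a decoded object stream can be read. It follows the layout described in the PDF standard (N pairs of object number and offset, with offsets relative to First), but the function name and return type are hypothetical and this is not PLAYA's parser. It assumes the stream data has already been decompressed.

```python
def object_stream_index(data: bytes, n: int, first: int) -> list[tuple[int, bytes]]:
    """Split a decoded object stream into (object number, raw bytes) pairs.

    `n` and `first` correspond to the /N and /First entries of the
    stream dictionary.
    """
    # The header holds N pairs of integers: object number, then offset.
    tokens = data[:first].split()
    numbers = [int(tok) for tok in tokens[: 2 * n]]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    result = []
    for i, (objnum, offset) in enumerate(pairs):
        start = first + offset
        # Each object runs until the next object's offset (or end of data).
        end = first + pairs[i + 1][1] if i + 1 < len(pairs) else len(data)
        result.append((objnum, data[start:end].strip()))
    return result


# Made-up stream containing objects 11 and 12.
decoded = b"11 0 12 9 <</A 1>> [1 2 3]"
objects = object_stream_index(decoded, n=2, first=10)
```

The raw bytes for each object would still need to be handed to a PDF object parser; the sketch only locates them.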

Full Changelog: v0.2.4...v0.2.5

v0.2.4

03 Dec 04:08

What's Changed

  • Add (and fix) 3rd party test suites, primarily pdf.js by @dhdaines in #26
  • Try much harder to read even very broken PDFs
  • Try somewhat harder to not produce empty TextObject (still a work in progress)

Full Changelog: v0.2.3...v0.2.4

PLAYA-PDF 0.2.3: Release early and often (before vacation)

28 Nov 22:09

What's Changed

  • Require a newline before EI to fix various inline images by @dhdaines in #25
  • Refactoring the CMap parser missed a very important corner case (which somehow mypy did not flag?)
  • structtree property did not actually exist on Document and Page (oops!)
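
To show why the first item above matters: inline image data is binary and can contain the bytes "EI" by chance, so a stricter end marker reduces false matches. The sketch below is an illustration only, not PLAYA's lexer; the regular expression and the function name are assumptions, and it accepts either CR or LF as the "newline".

```python
import re

# EI preceded by CR or LF and followed by whitespace or end of data.
END_INLINE_IMAGE = re.compile(rb"[\r\n]EI(?=\s|$)")


def split_inline_image(data: bytes) -> tuple[bytes, bytes]:
    """Split the bytes after ID into (image data, remaining content)."""
    match = END_INLINE_IMAGE.search(data)
    if match is None:
        raise ValueError("no EI operator found")
    return data[: match.start()], data[match.end():]


# The image data itself contains " EI ", which must not end the image.
content = b"\x00 EI \xff\x10\nEI\nQ"
image_data, rest = split_inline_image(content)
```

Image data can still contain a newline followed by "EI" by accident, so this remains a heuristic.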

Full Changelog: v0.2.2...v0.2.3

PLAYA-PDF 0.2.2: Make it go fast again

28 Nov 03:59

What's Changed

  • Resolve filters before checking if it isn't a list by @dhdaines in #22
  • Verify that we don't have pdfminer.six#1059 (and warn about it) by @dhdaines in #23
  • Optimize cmaps by @dhdaines in #24

Full Changelog: v0.2.1...v0.2.2

PLAYA-PDF 0.2.1: Fix some bugs

27 Nov 04:01

What's Changed

Full Changelog: v0.2...v0.2.1

PLAYA-PDF 0.2: Break all the APIs

26 Nov 03:36

What's Changed

  • Support TIFF predictor on image streams by @dhdaines in #18 (originally from pdfminer/pdfminer.six#1058 by @helpmefindaname)
  • Support different "device spaces" (screen, page, and default user space)
  • expose form XObjects on Page to allow getting only their contents
  • expose form XObject IDs in LayoutDict
  • make TextState conform to PDF spec (leading and line matrix) and document it
  • expose more of TextState in LayoutDict (render mode in particular)
  • do not try to map characters with no ToUnicode and no Encoding
  • properly support Pattern color space (uncolored tiling patterns) the
    way pdfplumber expects it to work
  • support marked content points as ContentObjects
  • document ContentObjects
  • make a proper schema for LayoutDict, document it, and communicate it to Polars
  • separate color values and patterns in LayoutDict
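
As background for the TIFF predictor item above: with this predictor each sample is stored as the difference from the sample to its left in the same row, and decoding adds the differences back up. The following is a minimal sketch for 8 bits per component only. It is not the code from #18; the function name and parameters are chosen for illustration (they mirror the Columns and Colors decode parameters).

```python
def undo_tiff_predictor(data: bytes, columns: int, colors: int = 1) -> bytes:
    """Reverse horizontal differencing for 8-bit samples.

    Assumes 8 bits per component; other bit depths need bit-level
    handling that this sketch omits.
    """
    row_size = columns * colors
    out = bytearray(data)
    # Process complete rows only; a trailing partial row is left as-is.
    for row_start in range(0, len(out) - row_size + 1, row_size):
        # The first pixel of each row is stored unchanged.
        for i in range(row_start + colors, row_start + row_size):
            # Add the same component of the pixel to the left, modulo 256.
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)


# Two rows of four grayscale samples (made-up data).
encoded = bytes([10, 1, 1, 1, 200, 100, 0, 0])
decoded = undo_tiff_predictor(encoded, columns=4)
```

The second row shows the wrap-around: 200 + 100 exceeds 255 and is reduced modulo 256.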

Full Changelog: v0.1.2...v0.2

PLAYA 0.1.2: Initial release

20 Nov 05:20

Here's a first release, in case you want to use this. Reasons you might do so include:

  • About 20% faster than pdfminer.six
  • Much friendlier APIs than PDFPageAggregator, PDFResourceManager, PDFPage, etc, etc.
  • Many outstanding pdfminer.six bugs have been fixed

Why would you not want to use this?

  • PyPI package name is not actually playa because somebody else took that name 13 years ago.
  • May be more or less tolerant of broken PDFs than pdfminer.six, and has no "strict mode" to be absolutely intolerant.
  • Doesn't let you extract image data. (Doing so is not always useful, since PDFs tend to use compositing; use a real PDF renderer such as pypdfium2 if you want to extract images reliably.)
  • Is not (or ain't) a layout analyzer, so no LAParams, TextBox, and so on.
  • API subject to change and refinement.
  • Does not have abstractions. You do not have the flexibility to subclass everything and build a PDF renderer on top of PLAYA.
  • Probably contains bugs.
  • Definitely lacks documentation.