Start making all APIs use iterator protocol instead of bespoke methods/classes ad infinitum (#11)

* feat!: always in-memory parser and use iterator protocol (mostly)

* fix: avoid error if x was a tuple for some reason

* test: fix tests

* fix: minor tweaks

* ci: benchmark

* chore: ruff

* ci: make benchmark a separate job

* ci: make benchmark a separate workflow

* ci: report coverage

* refactor!: make lines/revlines behave the same way

* refactor!: remove the utterly useless PDFResourceManager

* chore: ruff

* fix: tolerate mangled PDF headers

* refactor!: nexttoken redundant for lexer

* refactor!: PDFEliminate PDFExtra PDFCharacters PDFEverwhere PDFWe PDFHave PDFNamespaces PDFAfter PDFAll

* refactor!: there can be only one (parser)

* refactor!: page indices (0-based), PDFRemove PDFMore PDFPrefixes

* docs: describe the desired API

* fix: seek 0 in iter

* feat: iterator-based layout API

* chore: ruff it up

* fix(tests): test layout against pdfminer.six

* fix: error consistent with pdfminer

* fix: ensure xobjects actually work

* fix: validate against pdfminer

* fix: STRICT breaks things

* fix(test): extra-dependencies
dhdaines authored Nov 1, 2024
1 parent 6f80c3d commit 9c0e217
Showing 26 changed files with 1,977 additions and 2,567 deletions.
21 changes: 21 additions & 0 deletions .github/workflows/benchmarks.yml
@@ -0,0 +1,21 @@
name: Benchmark
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install Hatch
        uses: pypa/hatch@install
      - name: Run benchmarks
        run: |
          hatch run bench:all
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -1,4 +1,4 @@
-name: Run all tests
+name: Test
 on:
   push:
     branches: [ "main" ]
@@ -17,4 +17,4 @@ jobs:
       - name: Install Hatch
         uses: pypa/hatch@install
       - name: Run tests
-        run: hatch test
+        run: hatch test --cover
107 changes: 105 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
-# PLAYA Ain't a LAYout Analyzer 🏖️
+# **P**LAYA ain't a **LAY**out **A**nalyzer 🏖️

## About

@@ -28,7 +28,110 @@ Notably this does *not* include the largely undocumented heuristic
 to understand due to a Java-damaged API based on deeply nested class
 hierarchies, and because layout analysis is best done
 probabilistically/visually. Also, pdfplumber does its own, much
-nicer, layout analysis.
+nicer, layout analysis. Also, if you just want to extract text from a
+PDF, there are a lot of better and faster tools and libraries out
+there, see [benchmarks]() for a summary (TL;DR pypdfium2 is probably
+what you want, but pdfplumber does a nice job of converting PDF to
+ASCII art).

## Usage

Do you want to get stuff out of a PDF? You have come to the right
place! Let's open up a PDF and see what's in it:

```python
pdf = playa.open("my_awesome_document.pdf")
raw_byte_stream = pdf.buffer
a_bunch_of_tokens = list(pdf.tokens)
a_bunch_of_objects = list(pdf)
a_particular_indirect_object = pdf[42]
```

The raw PDF tokens and objects are probably not terribly useful to
you, but you might find them interesting.

It probably has some pages. How many? What are their numbers/labels?
(they could be things like "xviii", "a", or "42", for instance)

```python
npages = len(pdf.pages)
page_numbers = [page.label for page in pdf.pages]
```

What's in the table of contents?

```python
for entry in pdf.outlines:
    ...
```

If you are lucky it has a "logical structure tree". The elements here
might even be referenced from the table of contents! (or, they might
not... with PDF you never know)

```python
structure = pdf.structtree
for element in structure:
    for child in element:
        ...
```

Now perhaps we want to look at a specific page. Okay!
```python
page = pdf.pages[0] # they are numbered from 0
page = pdf.pages["xviii"] # but you can get them by label
page = pdf.pages["42"] # or "logical" page number (also a label)
a_few_content_streams = list(page.contents)
raw_bytes = b"".join(stream.buffer for stream in page.contents)
```
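
How can one `[]` accept both a 0-based index and a label? Here is a minimal sketch that assumes nothing about PLAYA's internals. `Page` and `PageList` are hypothetical stand-ins, not PLAYA classes, and the "first page with a given label wins" rule is also an assumption:

```python
from typing import Dict, Iterator, List, NamedTuple, Union


class Page(NamedTuple):
    """Hypothetical stand-in for a page: a 0-based index plus a label."""

    index: int
    label: str


class PageList:
    """Hypothetical container: look pages up by 0-based index or by label."""

    def __init__(self, labels: List[str]) -> None:
        self._pages = [Page(i, label) for i, label in enumerate(labels)]
        # Assumption: if a label repeats, the first page carrying it wins.
        self._by_label: Dict[str, Page] = {}
        for page in self._pages:
            self._by_label.setdefault(page.label, page)

    def __len__(self) -> int:
        return len(self._pages)

    def __iter__(self) -> Iterator[Page]:
        return iter(self._pages)

    def __getitem__(self, key: Union[int, str]) -> Page:
        if isinstance(key, str):
            return self._by_label[key]  # KeyError for an unknown label
        return self._pages[key]  # IndexError for an out-of-range index


pages = PageList(["xviii", "xix", "1", "42"])
first = pages[0]  # by 0-based index
answer = pages["42"]  # by label, even though it looks like a number
```

Dispatching on the key's type is what keeps `pages[42]` (the 43rd page) distinct from `pages["42"]` (the page labelled "42").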

This page probably has text, graphics, etc, etc, in it. Remember that
**P**LAYA ain't a **LAY**out **A**nalyzer! You can either look at the
stream of tokens or mysterious PDF objects:
```python
for token in page.tokens:
    ...
for object in page:
    ...
```

Or you can access individual characters, lines, curves, and rectangles
(if you wanted to, for instance, do layout analysis):
```python
for item in page.layout:
    ...
```

Do we make you spelunk in a dank class hierarchy to know what these
items are? No, we do not! They are just NamedTuples with a very
helpful field *telling* you what they are, as a string.
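
What might that look like in practice? This is a rough sketch only: the `LayoutItem` tuple and its field names (`object_type`, `x0`, and so on) are hypothetical and are not taken from PLAYA. The point is that a string tag lets you filter and count items without any `isinstance` checks:

```python
from collections import Counter
from typing import Iterable, NamedTuple


class LayoutItem(NamedTuple):
    """Hypothetical layout item; these field names are NOT PLAYA's."""

    object_type: str  # e.g. "char", "line", "curve", "rect"
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def count_by_type(items: Iterable[LayoutItem]) -> Counter:
    """Tally items by their string tag."""
    return Counter(item.object_type for item in items)


def chars_only(items: Iterable[LayoutItem]) -> str:
    """Join the text of character items, in the order given."""
    return "".join(item.text for item in items if item.object_type == "char")


items = [
    LayoutItem("char", 72.0, 700.0, 78.0, 712.0, "H"),
    LayoutItem("rect", 70.0, 698.0, 90.0, 714.0),
    LayoutItem("char", 78.0, 700.0, 82.0, 712.0, "i"),
]
counts = count_by_type(items)
text = chars_only(items)
```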

In particular you can also extract all these items into a dataframe
using the library of your choosing (I like [Polars]()) and I dunno do
some Artifishul Intelligents or something with them:
```python
```

Or just write them to a CSV file:
```python
```
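
Because the items are NamedTuples, the standard `csv` module should be enough for this. The following is a sketch under assumptions, not PLAYA's documented API: `LayoutItem` is a hypothetical stand-in with made-up field names, and in real use the rows would come from `page.layout`:

```python
import csv
import io
from typing import Iterable, NamedTuple, TextIO


class LayoutItem(NamedTuple):
    """Hypothetical layout item; these field names are NOT PLAYA's."""

    object_type: str
    x0: float
    y0: float
    text: str


def write_items_csv(items: Iterable[LayoutItem], out: TextIO) -> None:
    """Write a header row from the tuple's field names, then one row per item."""
    writer = csv.writer(out)
    writer.writerow(LayoutItem._fields)
    writer.writerows(items)


# An in-memory buffer keeps the example self-contained; for a real file,
# use open("items.csv", "w", newline="") instead.
buffer = io.StringIO()
write_items_csv(
    [
        LayoutItem("char", 72.0, 700.0, "H"),
        LayoutItem("char", 78.0, 700.0, "i"),
    ],
    buffer,
)
csv_text = buffer.getvalue()
```

Taking the header from `_fields` means the column names cannot drift from the tuple definition.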

Note again that PLAYA doesn't guarantee that these characters come at
you in anything other than the order they occur in the file (but it
does guarantee that). It does, however, put them in (hopefully) the
right absolute positions on the page, and keep track of the clipping
path and the graphics state, so yeah, you *could* "render" them like
`pdfminer.six` pretended to do.

Certain PDF tools and/or authors are notorious for using "whiteout"
(set the color to the background color) or "scissors" (the clipping
path) to hide arbitrary text that maybe *you* don't want to see
either. PLAYA gives you some rudimentary tools to detect this:
```python
```

For everything else, there's pdfplumber, pypdfium2, pikepdf, pypdf,
borb, pydyf, etc, etc, etc.

## Acknowledgement

2 changes: 1 addition & 1 deletion playa/__init__.py
@@ -10,7 +10,7 @@
 from os import PathLike
 from typing import Union

-from playa.pdfdocument import PDFDocument
+from playa.document import PDFDocument

 __version__ = "0.0.1"

11 changes: 4 additions & 7 deletions playa/cmapdb.py
@@ -32,8 +32,8 @@
 )

 from playa.encodingdb import name2unicode
-from playa.exceptions import PSEOF, PDFException, PDFTypeError, PSSyntaxError
-from playa.psparser import KWD, PSKeyword, PSLiteral, PSStackParser, literal_name
+from playa.exceptions import PDFException, PDFTypeError, PSSyntaxError
+from playa.parser import KWD, Parser, PSKeyword, PSLiteral, literal_name
 from playa.utils import choplist, nunpack

 log = logging.getLogger(__name__)
@@ -275,7 +275,7 @@ def get_unicode_map(cls, name: str, vertical: bool = False) -> UnicodeMap:
         return cls._umap_cache[name][vertical]


-class CMapParser(PSStackParser[PSKeyword]):
+class CMapParser(Parser[PSKeyword]):
     def __init__(self, cmap: CMapBase, data: bytes) -> None:
         super().__init__(data)
         self.cmap = cmap
@@ -284,10 +284,7 @@ def __init__(self, cmap: CMapBase, data: bytes) -> None:
         self._warnings: Set[str] = set()

     def run(self) -> None:
-        try:
-            self.nextobject()
-        except PSEOF:
-            pass
+        next(self, None)

     KEYWORD_BEGINCMAP = KWD(b"begincmap")
     KEYWORD_ENDCMAP = KWD(b"endcmap")
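
The change to `run()` relies on the two-argument form of the built-in `next()`, which returns the default instead of raising `StopIteration` when the iterator is exhausted. The sketch below illustrates that equivalence with a hypothetical `OneShotParser`. It assumes, as the iterator protocol requires, that the new `Parser` signals end of input with `StopIteration`; the diff itself does not show `Parser.__next__`:

```python
from typing import Iterator, List


class OneShotParser:
    """Hypothetical parser: yields a fixed list of objects, then stops."""

    def __init__(self, objects: List[str]) -> None:
        self._objects = iter(objects)

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        # Raises StopIteration once the underlying iterator is exhausted.
        return next(self._objects)


def run_verbose(parser: OneShotParser) -> None:
    """Pull one object and swallow end-of-input explicitly."""
    try:
        next(parser)
    except StopIteration:
        pass


def run_compact(parser: OneShotParser) -> None:
    """Same effect: the default argument absorbs end-of-input."""
    next(parser, None)


# Neither form raises on an empty parser.
run_verbose(OneShotParser([]))
run_compact(OneShotParser([]))

# Both forms consume exactly one object.
parser = OneShotParser(["a", "b"])
run_compact(parser)
remaining = list(parser)
```
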
2 changes: 1 addition & 1 deletion playa/pdfcolor.py → playa/color.py
@@ -1,7 +1,7 @@
 import collections
 from typing import Dict

-from playa.psparser import LIT
+from playa.parser import LIT

 LITERAL_DEVICE_GRAY = LIT("DeviceGray")
 LITERAL_DEVICE_RGB = LIT("DeviceRGB")