Start making all APIs use iterator protocol instead of bespoke methods/classes ad infinitum #11

Merged
merged 27 commits into from
Nov 1, 2024
fb85d4d
feat!: always in-memory parser and use iterator protocol (mostly)
dhdaines Oct 27, 2024
ec0785e
fix: avoid error if x was a tuple for some reason
dhdaines Oct 27, 2024
14644ef
test: fix tests
dhdaines Oct 27, 2024
19f448a
fix: minor tweaks
dhdaines Oct 27, 2024
88bf274
ci: benchmark
dhdaines Oct 27, 2024
400554f
chore: ruff
dhdaines Oct 27, 2024
61874f4
ci: make benchmark a separate job
dhdaines Oct 27, 2024
7bef7c7
ci: make benchmark a separate workflow
dhdaines Oct 27, 2024
5bedf50
ci: report ccoverage
dhdaines Oct 27, 2024
9b3d352
refactor!: make lines/revlines behave the same way
dhdaines Oct 28, 2024
f5bbaca
refactor!: remove the utterly useless PDFResourceManager
dhdaines Oct 28, 2024
9ea5c08
chore: ruff
dhdaines Oct 28, 2024
f5ab4bb
fix: tolerate mangled PDF headers
dhdaines Oct 28, 2024
1a12046
refactor!: nexttoken redundant for lexer
dhdaines Oct 28, 2024
2b375b8
refactor!: PDFEliminate PDFExtra PDFCharacters PDFEverwhere PDFWe PDF…
dhdaines Oct 29, 2024
8aaf9ab
refactor!: there can be only one (parser)
dhdaines Oct 29, 2024
3267b88
refactor!: page indices (0-based), PDFRemove PDFMore PDFPrefixes
dhdaines Oct 29, 2024
c04cd9f
docs: describe the desired API
dhdaines Oct 31, 2024
bb59a7b
fix: seek 0 in iter
dhdaines Oct 31, 2024
5453106
feat: iterator-based layout API
dhdaines Oct 31, 2024
31e2e9e
chore: ruff it up
dhdaines Oct 31, 2024
9e5ce2c
fix(tests): test layout against pdfminer.six
dhdaines Oct 31, 2024
7b962f7
fix: error consistent with pdfminer
dhdaines Oct 31, 2024
ab1160a
fix: ensure xobjects actually work
dhdaines Oct 31, 2024
e6867f6
fix: validate against pdfminer
dhdaines Nov 1, 2024
8400a85
fix: STRICT breaks things
dhdaines Nov 1, 2024
a782c3a
fix(test): extra-dependencies
dhdaines Nov 1, 2024
21 changes: 21 additions & 0 deletions .github/workflows/benchmarks.yml
@@ -0,0 +1,21 @@
name: Benchmark
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install Hatch
        uses: pypa/hatch@install
      - name: Run benchmarks
        run: |
          hatch run bench:all
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -1,4 +1,4 @@
name: Run all tests
name: Test
on:
  push:
    branches: [ "main" ]
@@ -17,4 +17,4 @@ jobs:
      - name: Install Hatch
        uses: pypa/hatch@install
      - name: Run tests
        run: hatch test
        run: hatch test --cover
107 changes: 105 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
# PLAYA Ain't a LAYout Analyzer 🏖️
# **P**LAYA ain't a **LAY**out **A**nalyzer 🏖️

## About

@@ -28,7 +28,110 @@ Notably this does *not* include the largely undocumented heuristic
to understand due to a Java-damaged API based on deeply nested class
hierarchies, and because layout analysis is best done
probabilistically/visually. Also, pdfplumber does its own, much
nicer, layout analysis.
nicer, layout analysis. Also, if you just want to extract text from a
PDF, there are a lot of better and faster tools and libraries out
there, see [benchmarks]() for a summary (TL;DR pypdfium2 is probably
what you want, but pdfplumber does a nice job of converting PDF to
ASCII art).

## Usage

Do you want to get stuff out of a PDF? You have come to the right
place! Let's open up a PDF and see what's in it:

```python
import playa

pdf = playa.open("my_awesome_document.pdf")
raw_byte_stream = pdf.buffer
a_bunch_of_tokens = list(pdf.tokens)
a_bunch_of_objects = list(pdf)
a_particular_indirect_object = pdf[42]
```

The raw PDF tokens and objects are probably not terribly useful to
you, but you might find them interesting.

It probably has some pages. How many? What are their numbers/labels?
(they could be things like "xviii", "a", or "42", for instance)

```python
npages = len(pdf.pages)
page_numbers = [page.label for page in pdf.pages]
```

What's in the table of contents?

```python
for entry in pdf.outlines:
    ...
```

If you are lucky it has a "logical structure tree". The elements here
might even be referenced from the table of contents! (or, they might
not... with PDF you never know)

```python
structure = pdf.structtree
for element in structure:
    for child in element:
        ...
```
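
The nested loop above only reaches two levels of the tree. Here is a minimal sketch of a full depth-first walk, assuming only that each element is iterable over its children. The `Element` class and its `type` field are hypothetical stand-ins for illustration, not PLAYA's actual classes:

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Element:
    """Hypothetical structure element: a type name plus child elements."""

    type: str
    children: List["Element"] = field(default_factory=list)

    def __iter__(self) -> Iterator["Element"]:
        # Iterating over an element yields its direct children.
        return iter(self.children)


def walk(element: Element, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, type) for the element and all of its descendants."""
    yield depth, element.type
    for child in element:
        yield from walk(child, depth + 1)


tree = Element("Document", [Element("H1"), Element("P", [Element("Span")])])
outline = list(walk(tree))
```

Because `walk` is a generator, a caller can stop early without visiting the rest of the tree.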

Now perhaps we want to look at a specific page. Okay!
```python
page = pdf.pages[0] # they are numbered from 0
page = pdf.pages["xviii"] # but you can get them by label
page = pdf.pages["42"] # or "logical" page number (also a label)
a_few_content_streams = list(page.contents)
raw_bytes = b"".join(stream.buffer for stream in page.contents)
```
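
One way a container could accept both a 0-based index and a label is to branch on the key type in `__getitem__`. This is only a sketch of that idea: `Page` and `PageList` below are hypothetical names, and nothing here is taken from PLAYA's implementation.

```python
from typing import Dict, Iterator, List, NamedTuple, Union


class Page(NamedTuple):
    """Hypothetical stand-in for a page: 0-based index plus a label."""

    index: int
    label: str


class PageList:
    """Hypothetical container, not PLAYA's actual class."""

    def __init__(self, labels: List[str]) -> None:
        self._pages = [Page(i, label) for i, label in enumerate(labels)]
        # Assumption: if a label repeats, the first page with it wins.
        self._by_label: Dict[str, Page] = {}
        for page in self._pages:
            self._by_label.setdefault(page.label, page)

    def __len__(self) -> int:
        return len(self._pages)

    def __iter__(self) -> Iterator[Page]:
        return iter(self._pages)

    def __getitem__(self, key: Union[int, str]) -> Page:
        if isinstance(key, str):
            return self._by_label[key]  # lookup by label
        return self._pages[key]  # lookup by 0-based index


pages = PageList(["xvii", "xviii", "1", "42"])
first = pages[0]
by_label = pages["xviii"]
logical = pages["42"]
```

Note that `pages["42"]` and `pages[42]` mean different things here: the string is a label, the integer is a position.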

This page probably has text, graphics, etc, etc, in it. Remember that
**P**LAYA ain't a **LAY**out **A**nalyzer! You can either look at the
stream of tokens or mysterious PDF objects:
```python
for token in page.tokens:
    ...
for obj in page:  # "obj" rather than "object", to avoid shadowing the builtin
    ...
```

Or you can access individual characters, lines, curves, and rectangles
(if you wanted to, for instance, do layout analysis):
```python
for item in page.layout:
    ...
```

Do we make you spelunk in a dank class hierarchy to know what these
items are? No, we do not! They are just NamedTuples with a very
helpful field *telling* you what they are, as a string.
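
As a rough illustration of what that buys you, the sketch below dispatches on a type string instead of using `isinstance` checks. The `LayoutItem` tuple and its field names (`object_type`, `x0`, and so on) are assumptions made for the example; the README does not specify the real field names.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class LayoutItem(NamedTuple):
    """Hypothetical item shape; the field names are assumptions."""

    object_type: str  # e.g. "char", "line", "curve", "rect"
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def count_by_type(items: Iterable[LayoutItem]) -> Counter:
    """Tally items by their type string; no class hierarchy involved."""
    return Counter(item.object_type for item in items)


def join_chars(items: Iterable[LayoutItem]) -> str:
    """Concatenate character items in the order they were given."""
    return "".join(item.text for item in items if item.object_type == "char")


items = [
    LayoutItem("char", 72.0, 700.0, 80.0, 712.0, "H"),
    LayoutItem("rect", 70.0, 698.0, 200.0, 714.0),
    LayoutItem("char", 80.0, 700.0, 86.0, 712.0, "i"),
]
counts = count_by_type(items)
text = join_chars(items)
```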

In particular you can also extract all these items into a dataframe
using the library of your choosing (I like [Polars]()) and I dunno do
some Artifishul Intelligents or something with them:
```python
```
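
The example above is left empty in the README. As a hedged sketch of the general idea, NamedTuples convert directly to the two shapes dataframe constructors usually accept: a list of dicts (rows) or a dict of lists (columns). Both `polars.DataFrame` and `pandas.DataFrame` take either. The `LayoutItem` fields below are hypothetical, and the sketch stops short of importing a dataframe library so that it runs with the standard library alone.

```python
from typing import Any, Dict, Iterable, List, NamedTuple


class LayoutItem(NamedTuple):
    """Hypothetical item shape; the field names are assumptions."""

    object_type: str
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def to_rows(items: Iterable[LayoutItem]) -> List[Dict[str, Any]]:
    """One dict per item, keyed by field name."""
    return [item._asdict() for item in items]


def to_columns(items: Iterable[LayoutItem]) -> Dict[str, List[Any]]:
    """One list per field, in field order."""
    rows = list(items)
    return {name: [getattr(row, name) for row in rows] for name in LayoutItem._fields}


items = [
    LayoutItem("char", 72.0, 700.0, 80.0, 712.0, "H"),
    LayoutItem("rect", 70.0, 698.0, 200.0, 714.0),
]
rows = to_rows(items)
columns = to_columns(items)
```

Either `rows` or `columns` can then be handed to the dataframe library of your choice.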

Or just write them to a CSV file:
```python
```
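
This example is also empty in the README. One plausible way to do it, assuming the items are NamedTuples that all share the same fields, is to write the field names as the header row and each tuple as a data row with the standard `csv` module. `LayoutItem` and its fields are hypothetical:

```python
import csv
import io
from typing import Iterable, NamedTuple, TextIO


class LayoutItem(NamedTuple):
    """Hypothetical item shape; the field names are assumptions."""

    object_type: str
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def write_csv(items: Iterable[LayoutItem], out: TextIO) -> None:
    """Write a header row of field names, then one row per item."""
    writer = csv.writer(out)
    writer.writerow(LayoutItem._fields)
    writer.writerows(items)  # each NamedTuple is already a sequence of values


items = [
    LayoutItem("char", 72.0, 700.0, 80.0, 712.0, "H"),
    LayoutItem("rect", 70.0, 698.0, 200.0, 714.0),
]
buffer = io.StringIO()
write_csv(items, buffer)
lines = buffer.getvalue().splitlines()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`, as the `csv` documentation recommends.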

Note again that PLAYA doesn't guarantee that these characters come at
you in anything other than the order they occur in the file (but it
does guarantee that). It does, however, put them in (hopefully) the
right absolute positions on the page, and keep track of the clipping
path and the graphics state, so yeah, you *could* "render" them like
`pdfminer.six` pretended to do.

Certain PDF tools and/or authors are notorious for using "whiteout"
(set the color to the background color) or "scissors" (the clipping
path) to hide arbitrary text that maybe *you* don't want to see
either. PLAYA gives you some rudimentary tools to detect this:
```python
```
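
The README leaves this example empty, so the tools it has in mind are not shown. Purely as an illustration of the two checks described, the sketch below flags an item whose fill color matches an assumed background color ("whiteout") or whose bounding box lies entirely outside a clipping rectangle ("scissors"). Everything here is hypothetical: the `PaintedItem` fields, the white background default, and the simplification of the clipping path to a rectangle.

```python
from typing import NamedTuple, Tuple

Color = Tuple[float, ...]
Box = Tuple[float, float, float, float]  # x0, y0, x1, y1


class PaintedItem(NamedTuple):
    """Hypothetical item carrying a bounding box and a fill color."""

    text: str
    bbox: Box
    fill: Color


def is_whiteout(
    item: PaintedItem, background: Color = (1.0, 1.0, 1.0), tol: float = 1e-3
) -> bool:
    """True if the fill color matches the (assumed) background color."""
    if len(item.fill) != len(background):
        return False  # different color spaces: not comparable in this sketch
    return all(abs(a - b) <= tol for a, b in zip(item.fill, background))


def is_clipped_out(item: PaintedItem, clip: Box) -> bool:
    """True if the item's box lies entirely outside the clip rectangle."""
    x0, y0, x1, y1 = item.bbox
    cx0, cy0, cx1, cy1 = clip
    return x1 <= cx0 or x0 >= cx1 or y1 <= cy0 or y0 >= cy1


clip = (0.0, 0.0, 200.0, 200.0)
visible = PaintedItem("ok", (10.0, 10.0, 20.0, 20.0), (0.0, 0.0, 0.0))
whited = PaintedItem("secret", (10.0, 30.0, 60.0, 40.0), (1.0, 1.0, 1.0))
snipped = PaintedItem("gone", (300.0, 300.0, 310.0, 310.0), (0.0, 0.0, 0.0))
hidden = [i.text for i in (visible, whited, snipped)
          if is_whiteout(i) or is_clipped_out(i, clip)]
```

A real clipping path can be an arbitrary shape and the background is not always white, so treat both checks as heuristics.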

For everything else, there's pdfplumber, pypdfium2, pikepdf, pypdf,
borb, pydyf, etc, etc, etc.

## Acknowledgement

2 changes: 1 addition & 1 deletion playa/__init__.py
@@ -10,7 +10,7 @@
from os import PathLike
from typing import Union

from playa.pdfdocument import PDFDocument
from playa.document import PDFDocument

__version__ = "0.0.1"

11 changes: 4 additions & 7 deletions playa/cmapdb.py
@@ -32,8 +32,8 @@
)

from playa.encodingdb import name2unicode
from playa.exceptions import PSEOF, PDFException, PDFTypeError, PSSyntaxError
from playa.psparser import KWD, PSKeyword, PSLiteral, PSStackParser, literal_name
from playa.exceptions import PDFException, PDFTypeError, PSSyntaxError
from playa.parser import KWD, Parser, PSKeyword, PSLiteral, literal_name
from playa.utils import choplist, nunpack

log = logging.getLogger(__name__)
@@ -275,7 +275,7 @@ def get_unicode_map(cls, name: str, vertical: bool = False) -> UnicodeMap:
        return cls._umap_cache[name][vertical]


class CMapParser(PSStackParser[PSKeyword]):
class CMapParser(Parser[PSKeyword]):
    def __init__(self, cmap: CMapBase, data: bytes) -> None:
        super().__init__(data)
        self.cmap = cmap
@@ -284,10 +284,7 @@ def __init__(self, cmap: CMapBase, data: bytes) -> None:
        self._warnings: Set[str] = set()

    def run(self) -> None:
        try:
            self.nextobject()
        except PSEOF:
            pass
        next(self, None)

    KEYWORD_BEGINCMAP = KWD(b"begincmap")
    KEYWORD_ENDCMAP = KWD(b"endcmap")
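
The change in `run` swaps a `try`/`except PSEOF` around `nextobject()` for `next(self, None)`. That works for any object implementing the iterator protocol, because the two-argument form of the builtin `next` returns the default instead of raising `StopIteration` at the end. A toy sketch of the pattern follows; `ToyParser` is a hypothetical whitespace tokenizer, not PLAYA's `Parser`:

```python
from typing import Iterator, List


class ToyParser:
    """Toy whitespace tokenizer that illustrates the iterator protocol."""

    def __init__(self, data: bytes) -> None:
        self._tokens: List[bytes] = data.split()
        self._pos = 0

    def __iter__(self) -> Iterator[bytes]:
        return self

    def __next__(self) -> bytes:
        if self._pos >= len(self._tokens):
            # End of input is signalled with StopIteration,
            # not with a bespoke end-of-file exception.
            raise StopIteration
        token = self._tokens[self._pos]
        self._pos += 1
        return token


parser = ToyParser(b"begincmap endcmap")
first = next(parser, None)   # first token
second = next(parser, None)  # second token
third = next(parser, None)   # input exhausted: default returned, nothing raised
empty = next(ToyParser(b""), None)
everything = list(ToyParser(b"1 0 obj"))
```

The same object also works in `for` loops, `list()`, and `itertools` without any extra methods, which is the point of moving away from bespoke `next...` calls.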
2 changes: 1 addition & 1 deletion playa/pdfcolor.py → playa/color.py
@@ -1,7 +1,7 @@
import collections
from typing import Dict

from playa.psparser import LIT
from playa.parser import LIT

LITERAL_DEVICE_GRAY = LIT("DeviceGray")
LITERAL_DEVICE_RGB = LIT("DeviceRGB")