docs: Update README.md
Clarify a few things
dhdaines authored Dec 11, 2024
1 parent 5c49d1b commit 8ba66dc
Showing 1 changed file with 16 additions and 6 deletions.
22 changes: 16 additions & 6 deletions README.md
@@ -135,8 +135,9 @@ If, on the other hand, **you** are lazy, then you can just use
`page.layout`, which will flatten everything for you into a friendly
dictionary representation (but it is a
[`TypedDict`](https://typing.readthedocs.io/en/latest/spec/typeddict.html#typeddict))
-which, um, looks a lot like what `pdfplumber` gives you, except in the
-coordinate space defined previously.
+which, um, looks a lot like what `pdfplumber` gives you, except possibly in
+a different
+coordinate space, as defined [below](#an-important-note-about-coordinate-spaces).

```python
for dic in page.layout:
@@ -195,7 +196,7 @@ device space, specifically:
 - `(0, 0)` is the top-left corner of the page, as defined by its
   `MediaBox` after rotation is applied.
 - Coordinates increase from the top-left corner of the page towards
-  the bottom-left corner.
+  the bottom-right corner.

However, for compatibility with `pdfminer.six`, you can also pass
`space="page"` to `playa.open`. In this case, `(0, 0)` is the
@@ -207,7 +208,10 @@ If you don't care about absolute positioning, you can use
`space="user"`, which may be somewhat faster in the future (currently
it isn't). In this case, no translation or rotation of the default
user space is done (in other words any values of `MediaBox` or
-`Rotate` in the page dictionary are simply ignored).
+`Rotate` in the page dictionary are simply ignored). This is **definitely**
+what you want if you wish to take advantage of the coordinates that
+you may find in `outlines`, `dests`, tags and logical structure
+elements.
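
To make the difference between the default user space and the top-left-origin space described above concrete, here is a minimal sketch of the arithmetic only. It is not PLAYA's code: the helper `user_to_device`, the `Box` alias and the sample `MediaBox` are all hypothetical, and the sketch assumes an unrotated page with no scaling.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in default user space


def user_to_device(x: float, y: float, mediabox: Box) -> Tuple[float, float]:
    """Map a default-user-space point to a top-left-origin space.

    Hypothetical helper: assumes Rotate == 0 and no scaling.
    """
    x0, _y0, _x1, y1 = mediabox
    # Translate so the MediaBox's left edge is x == 0, and flip the y axis
    # so the MediaBox's top edge is y == 0 with y growing downwards.
    return (x - x0, y1 - y)


letter = (0.0, 0.0, 612.0, 792.0)  # hypothetical US Letter MediaBox
top_left = user_to_device(0.0, 792.0, letter)
one_inch_in = user_to_device(72.0, 720.0, letter)
```

A point taken from an outline or destination is expressed in default user space, so it would need a mapping of this kind before being compared with top-left-origin coordinates; with `space="user"` no such step should be needed.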

## Lazy object API

@@ -342,7 +346,13 @@ don't access `obj.bbox` and it won't be computed. If you don't need
to know the position of each glyph but simply want the Unicode
characters, then just look at `obj.chars`.

-Also, a lot of PDFs, especially ones produced by OCR, don't organize
+It is important to understand that `obj.chars` may or may not correspond
+to the actual text that a human will read on the page. To
+actually extract *text* from a PDF necessarily involves Heuristics
+or Machine Learning (yes, capitalized, like that) and PLAYA does not do
+either of those things.
+
+This is because PDFs, especially ones produced by OCR, don't organize
text objects in any meaningful fashion, so you will want to actually
look at the glyphs. This becomes a matter of iterating over the item,
giving you, well, more items, which are the individual glyphs:
@@ -360,7 +370,7 @@ PDF has the concept of a *text state* which determines some aspects of
how text is rendered. You can obviously access this though
`glyph.textstate` - note that the text state, like the graphics state,
is mutable, so you will have to copy it or save individual parameters
-that you might care about.
+that you might care about. This may be a major footgun so watch out.
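
The footgun with a mutable state object can be illustrated without PLAYA at all. The sketch below uses a hypothetical `FakeTextState` dataclass as a stand-in (its field names are assumptions, not PLAYA's API) to show why keeping a bare reference is unsafe while a copy or a saved parameter is not.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeTextState:
    """Hypothetical stand-in for a mutable text state (not PLAYA's class)."""
    fontsize: float = 12.0
    leading: float = 0.0


state = FakeTextState()
alias = state                     # same object: will see later mutations
snapshot = copy.copy(state)       # independent shallow copy
saved_fontsize = state.fontsize   # or save just the parameter you need

state.fontsize = 24.0             # later processing mutates the shared state
# alias now reports 24.0, while snapshot and saved_fontsize still hold 12.0
```

A shallow copy is enough here because the fields are plain numbers; a state holding nested mutable values would need `copy.deepcopy` instead.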

PLAYA doesn't guarantee that text objects come at you in anything
other than the order they occur in the file (but it does guarantee