pdfindex

PDF-index is a command-line tool that finds important terms in a PDF document and generates a ready-to-print index.

It relies on the PyPDF and NLTK libraries to extract and mine text.

Output formats currently supported are HTML and Markdown.
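As a rough illustration of the two output formats, the sketch below renders a term-to-pages mapping as either a Markdown list or an HTML list. The function names `render_markdown` and `render_html` and the exact layout are assumptions made for this example; the markup that pdfindex actually emits may differ.

```python
from html import escape


def _sorted_entries(index):
    """Yield (term, 'p1, p2, ...') pairs, sorted case-insensitively by term."""
    for term in sorted(index, key=str.lower):
        pages = ", ".join(str(p) for p in sorted(set(index[term])))
        yield term, pages


def render_markdown(index):
    """Render {term: [page, ...]} as a Markdown bullet list (hypothetical layout)."""
    return "\n".join(f"- **{term}**: {pages}" for term, pages in _sorted_entries(index))


def render_html(index):
    """Render {term: [page, ...]} as an HTML unordered list (hypothetical layout)."""
    items = [f"<li><b>{escape(term)}</b>: {pages}</li>" for term, pages in _sorted_entries(index)]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    sample = {"tokenizer": [3, 1, 3], "Corpus": [2]}
    print(render_markdown(sample))
    print(render_html(sample))
```

Duplicate page numbers are collapsed and pages are listed in ascending order, which is what a printed index usually expects.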

It works with Python 3.

Example

To generate an HTML index from input.pdf into output.html, keeping only terms with a minimum score of 0.2:

$ python3 pdfindex.py --min-score 0.2 --format html input.pdf output.html
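To give an intuition for what a minimum score does, here is a minimal TF-IDF sketch that treats each page as one document and keeps only terms whose score on a page reaches the threshold. This is not the tool's actual implementation: the weighting formula, the pre-tokenized input, and the names `tfidf_by_page`, `build_index`, and `first_page` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_by_page(pages):
    """pages: list of token lists, one per page. Returns one {term: score} dict per page."""
    n_pages = len(pages)
    # Document frequency: on how many pages does each term occur?
    doc_freq = Counter()
    for tokens in pages:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in pages:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty pages
        scores.append({
            term: (count / total) * math.log(n_pages / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def build_index(pages, min_score, first_page=1):
    """Map each term to the page numbers where its score is at least min_score."""
    index = {}
    for i, page_scores in enumerate(tfidf_by_page(pages)):
        for term, score in page_scores.items():
            if score >= min_score:
                index.setdefault(term, []).append(i + first_page)
    return index


if __name__ == "__main__":
    pages = [["index", "pdf", "pdf"], ["pdf", "term"]]
    print(build_index(pages, min_score=0.2))
```

In this sketch a term that appears on every page scores zero, so raising the threshold progressively drops common words and leaves the terms that are specific to a few pages.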

Usage

Install the dependencies within a virtualenv:

$ pip3 install -r requirements.txt

Print usage:

$ python3 pdfindex.py -h
usage: pdfindex.py [-h] [-m MIN_SCORE] [-f {html,markdown}] [-p PAGE_OFFSET]
                   input_file output_file

Extract text from a PDF file and generate a ready-to-print index

positional arguments:
  input_file            the PDF file
  output_file           the output file

optional arguments:
  -h, --help            show this help message and exit
  -m MIN_SCORE, --min-score MIN_SCORE
                        the minimum tfidf score required to be included in the
                        index
  -f {html,markdown}, --format {html,markdown}
                        the output format
  -p PAGE_OFFSET, --page-offset PAGE_OFFSET
                        the start of page numbering
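For readers who want to script around the tool, the interface printed above can be approximated with argparse as follows. This is a reconstruction from the help text only: the option types are guesses, and no defaults are set because the help output does not document them. `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented pdfindex.py options (types are assumed)."""
    parser = argparse.ArgumentParser(
        prog="pdfindex.py",
        description="Extract text from a PDF file and generate a ready-to-print index",
    )
    parser.add_argument("input_file", help="the PDF file")
    parser.add_argument("output_file", help="the output file")
    parser.add_argument(
        "-m", "--min-score", type=float,
        help="the minimum tfidf score required to be included in the index",
    )
    parser.add_argument(
        "-f", "--format", choices=["html", "markdown"],
        help="the output format",
    )
    parser.add_argument(
        "-p", "--page-offset", type=int,
        help="the start of page numbering",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--min-score", "0.2", "--format", "html", "input.pdf", "output.html"]
    )
    print(args)
```

Options that are omitted on the command line come back as None here, so a caller can tell "not provided" apart from an explicit value.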
