The script convert_to_txt.py converts documents (pdf, djvu, epub, word) to txt.
This is a partial Python port of convert-to-txt.sh (minus OCR) from ebook-tools written in shell by na--.
⭐ Other related Python projects based on ebook-tools:
- find-isbns: find ISBNs from ebooks (pdf, djvu, epub) or any string given as input to the script
- ocr: run OCR on documents (pdf, djvu, and images)
- split-ebooks-into-folders: split the supplied ebook files into folders with consecutive names
- organize-ebooks: automatically organize folders with potentially huge amounts of unorganized ebooks. It leverages the previous Python scripts (minus split_into_folders).
This is the environment on which the script convert_to_txt.py was tested:
- Platform: macOS
- Python: version 3.7
- textutil or catdoc: for converting doc to txt
  NOTE: On macOS, you don't need catdoc since it has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file
- DjVuLibre: it includes djvutxt for converting djvu to txt
  ⚠️ To access the djvu command line utilities and their documentation, you must set the shell variable PATH and MANPATH appropriately. This can be achieved by invoking a convenient shell script hidden inside the application bundle:
  $ eval `/Applications/DjView.app/Contents/setpath.sh`
  Ref.: ReadMe from DjVuLibre
  You need to softlink djvutxt in /usr/local/bin (or add it in $PATH)
- poppler: it includes pdftotext for converting pdf to txt
ℹ️ epub can be converted to txt by using unzip -c {input_file} but it is not a clean conversion since HTML data are also included.
Optionally:
- calibre: for converting {pdf, djvu, epub, msword} to txt by using calibre's ebook-convert
  ⚠️ ebook-convert is slower than the other conversion tools (textutil, catdoc, pdftotext, djvutxt)
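Before installing the package, it can help to check which of the conversion tools listed above are already on your PATH. The following is only a minimal sketch and not part of convert_to_txt: the function name available_tools and the grouping by document type are assumptions made for illustration, and the tool names are the ones mentioned in this README.

```python
import shutil

# Candidate tools per document type, as named in the dependency list above.
# The grouping itself is an assumption made for this sketch.
CANDIDATE_TOOLS = {
    "pdf": ["pdftotext", "ebook-convert"],
    "djvu": ["djvutxt", "ebook-convert"],
    "epub": ["ebook-convert", "unzip"],
    "msword": ["textutil", "catdoc", "ebook-convert"],
}


def available_tools(candidates=CANDIDATE_TOOLS, which=shutil.which):
    """Map each document type to the candidate tools found on PATH."""
    return {
        doc_type: [tool for tool in tools if which(tool) is not None]
        for doc_type, tools in candidates.items()
    }


if __name__ == "__main__":
    for doc_type, tools in available_tools().items():
        status = ", ".join(tools) if tools else "no conversion tool found"
        print(f"{doc_type}: {status}")
```

The `which` parameter only exists so the lookup can be replaced in tests; by default the standard library's shutil.which is used.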
First install the dependencies.
Then you can install the convert_to_txt package:
$ pip install git+https://github.com/raul23/convert-to-txt#egg=convert-to-txt
Test installation
Test your installation by importing convert_to_txt and printing its version:
$ python -c "import convert_to_txt; print(convert_to_txt.__version__)"
You can also test that you have access to the convert_to_txt.py script by showing the program's version:
$ convert_to_txt --version
To uninstall the convert_to_txt package:
$ pip uninstall convert_to_txt
To display the list of options of the script convert_to_txt.py and their descriptions:
$ convert_to_txt -h
usage: convert_to_txt [OPTIONS] {input_file} [{output_file}]

Convert documents (pdf, djvu, epub, word) to txt.

General options:
  -h, --help            Show this help message and exit.
  -v, --version         Show program's version number and exit.
  -q, --quiet           Enable quiet mode, i.e. nothing will be printed.
  --verbose             Print various debugging information, e.g. print traceback when there is an exception.
  --log-level {debug,info,warning,error}
                        Set logging level. (default: info)
  --log-format {console,only_msg,simple}
                        Set logging formatter. (default: only_msg)

Convert-to-txt options:
  -p, --pages PAGES     "Specify which pages should be processed. When this option is not specified, the text of all pages of the documents is concatenated into the output file. The page specification PAGES contains one or more comma-separated page ranges. A page range is either a page number, or two page numbers separated by a dash. For instance, specification 1-10 outputs pages 1 to 10, and specification 1,3,99999-4 outputs pages 1 and 3, followed by all the document pages in reverse order up to page 4." Ref.: https://man.archlinux.org/man/djvutxt.1.en
  --djvu {djvutxt,ebook-convert}
                        Set the conversion method for djvu documents. (default: djvutxt)
  --epub {ebook-convert,epubtxt}
                        Set the conversion method for epub documents. (default: ebook-convert)
  --msword {textutil,catdoc,ebook-convert}
                        Set the conversion method for msword documents. (default: textutil)
  --pdf {pdftotext,ebook-convert}
                        Set the conversion method for pdf documents. (default: pdftotext)

Input/Output files:
  input                 Path of the file (pdf, djvu, epub, word) that will be converted to txt.
  output                Path of the output txt file. (default: output.txt)
ℹ️ Explaining some of the options/arguments
- The option -p, --pages is taken straight from the djvutxt option --page=pagespec.
  ⚠️ Things to watch out for when using the -p option
  - If the option -p is not used, then by default all pages from the given document will be converted.
  - If the given document is not a pdf or djvu file, then the option -p will be ignored.
- input and output are positional arguments. Thus they must directly follow each other. output is not required since by default the output txt file will be saved as output.txt directly under the working directory.
  ⚠️ output needs to have a .txt extension!
Here are the important steps that the script convert_to_txt.py follows when converting a given document to txt:
- If the given document is already in .txt, then no need to go further!
- According to the mime type, the corresponding conversion tool is called upon:
  - image/vnd.djvu: djvutxt
  - application/epub+zip: unzip
  - application/msword: catdoc or textutil
  - application/pdf: pdftotext
  - ebook-convert if the other conversion tools are not found
- The output txt file is checked to see whether it actually contains text. If it doesn't, the user is warned that the conversion failed.
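The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of convert_to_txt.py: the names pick_tool and contains_text are hypothetical, the order in which the tools are tried is a guess, and only the mime types and tool names come from the list above.

```python
import shutil

# Mime types and tools as listed in the steps above. The order inside each
# list is an assumption made for this sketch.
TOOLS_BY_MIME = {
    "image/vnd.djvu": ["djvutxt"],
    "application/epub+zip": ["unzip"],
    "application/msword": ["catdoc", "textutil"],
    "application/pdf": ["pdftotext"],
}
FALLBACK_TOOL = "ebook-convert"


def pick_tool(mime_type, which=shutil.which):
    """Return the name of the tool to run, or None if the file is already txt."""
    if mime_type == "text/plain":
        return None  # already txt, no need to go further
    if mime_type not in TOOLS_BY_MIME:
        raise ValueError(f"unsupported mime type: {mime_type}")
    # Try the dedicated tools first, then fall back to ebook-convert.
    for tool in TOOLS_BY_MIME[mime_type] + [FALLBACK_TOOL]:
        if which(tool) is not None:
            return tool
    raise RuntimeError(f"no conversion tool found for {mime_type}")


def contains_text(txt):
    """True when the converted output has at least one non-whitespace character."""
    return bool(txt.strip())
```

A caller would run the selected tool, read the resulting file, and warn the user when contains_text() returns False.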
These are the files that are supported for conversion to txt and the corresponding conversion tools used:
Files supported | Conversion tool #1 | Conversion tool #2 | Conversion tool #3 |
---|---|---|---|
pdf | pdftotext | ebook-convert (calibre) | |
djvu | djvutxt | ebook-convert (calibre) | |
epub | ebook-convert (calibre) | epubtxt | |
docx (Word 2007) | ebook-convert (calibre) | | |
doc (Word 97) | textutil (macOS) | catdoc | ebook-convert (calibre) |
rtf | ebook-convert (calibre) | | |
ℹ️ Some explanations about the table
- epubtxt is a fancy way to say unzip.
- By default, ebook-convert (calibre) is used for converting epub to txt because it does a better job than epubtxt since epubtxt also includes HTML data.
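To illustrate why a plain unzip leaves HTML data in the output, here is a minimal sketch that reads the (X)HTML members of an epub with the standard library and keeps only the text nodes. It is not how convert_to_txt or calibre work: the function epub_to_text is hypothetical, and it walks the archive in storage order rather than in the book's reading order.

```python
import zipfile
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def epub_to_text(epub_file):
    """Return the visible text of every (X)HTML member of an epub archive."""
    parts = []
    with zipfile.ZipFile(epub_file) as archive:
        # Assumption: archive order is used, not the reading order of the book.
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextOnly()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()  # flush any text still buffered by the parser
                parts.extend(parser.chunks)
    return "\n".join(parts)
```

epub_file can be a path or a file-like object, since zipfile.ZipFile accepts both.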
For comparison, here are the times taken to fully convert a 154-page PDF document to txt with each supported conversion method:
- pdftotext: 4.27s
- ebook-convert (calibre): 80.91s
Let's say you want to convert specific pages of a pdf file to txt. The following command will do the trick:
$ convert_to_txt ~/Data/convert/K.pdf K.txt -p 15-10,3,23-30
ℹ️ Explaining the command
- -p 15-10,3,23-30: specifies that pages 15 to 10 (reverse order), 3 and 23 to 30 from the given pdf document will be converted to txt.
  ⚠️ No spaces when specifying the pages.
- ~/Data/convert/K.pdf K.txt: these are the input and output files, respectively.
  NOTE: by default if no output file is specified, then the resultant text will be saved as output.txt directly under the working directory.
Sample output:
Starting document conversion to txt...
Conversion successful!
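To make the page specification used in the command above more concrete, here is a small sketch of how such a string could be expanded into a list of page numbers. The function expand_pages is hypothetical and is not taken from convert_to_txt; clamping page numbers to the document length is an assumption based on the 99999-4 example in the help text.

```python
def expand_pages(spec, num_pages):
    """Expand a page specification such as '15-10,3,23-30' into page numbers.

    Page numbers beyond the end of the document are clamped to `num_pages`
    (an assumption), which is how '99999-4' can mean "from the last page
    down to page 4".
    """
    pages = []
    for part in spec.split(","):
        first, dash, last = part.partition("-")
        start = min(int(first), num_pages)
        end = min(int(last), num_pages) if dash else start
        step = 1 if end >= start else -1  # descending ranges run in reverse
        pages.extend(range(start, end + step, step))
    return pages


print(expand_pages("15-10,3,23-30", num_pages=100))
```

Malformed parts such as `3-` raise a ValueError from int(), which is enough for a sketch.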
To convert a pdf file to txt using the API:
from convert_to_txt.lib import convert
txt = convert('/Users/test/Data/convert/B.pdf', convert_pages='10-12')
# Do something with `txt`
ℹ️ Explaining the snippet of code
- convert(input_file, output_file=None, convert_pages=CONVERT_PAGES):
  By default output_file is None and hence convert() will return the text from the conversion. If you set output_file to, for example, output.txt, then convert() will just return a status code (1 for error and 0 for success) and will write the text from the conversion to output.txt.
- The variable txt will contain the text from the conversion.
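Since convert() returns either the text or a status code depending on output_file, calling code may want to treat the two cases separately. The wrapper below is a hypothetical helper, not part of the package: it receives the conversion function as an argument and relies only on the behavior described above (text when output_file is None, otherwise 0 for success and 1 for error).

```python
def convert_or_raise(convert_func, input_file, output_file=None, **kwargs):
    """Call a convert()-like function and normalize its two return modes.

    Returns the converted text when `output_file` is None, otherwise the path
    of the written file. Raises RuntimeError on a non-zero status code.
    """
    result = convert_func(input_file, output_file=output_file, **kwargs)
    if output_file is None:
        return result  # the text from the conversion
    if result != 0:
        raise RuntimeError(f"conversion of {input_file} failed (status {result})")
    return output_file
```

With the package installed, it could be used as `convert_or_raise(convert, 'B.pdf', output_file='B.txt', convert_pages='10-12')`, where convert is imported from convert_to_txt.lib.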
By default when using the API, the loggers are disabled. If you want to enable them, call the function setup_log() (with the desired log level in all caps) at the beginning of your code before the conversion function convert():
from convert_to_txt.lib import convert, setup_log
setup_log(logging_level='DEBUG')
txt = convert('/Users/test/Data/convert/B.pdf', convert_pages='10-12')
# Do something with `txt`
Sample output:
Running /Users/test/miniconda3/envs/mlpy37/lib/python3.7/site-packages/convert_to_txt/lib.py v0.1.0
Verbose option disabled
mime type: application/pdf
Output text file already exists: output.txt
Full path of output text file: '/Users/test/convert_to_txt/test_installation/output.txt'
Starting document conversion to txt...
The file looks like a pdf, using pdftotext to extract the text
These are all the pages that need to be converted: 10-12
Pages to process: [10, 11, 12]
Processing page 1 of 3
Page number: 10
Using tmp file /var/folders/b8/k1ndbdn53zs1m078zwwrbc4w0000gn/T/tmpc9ma3mwr.txt
Result of 'pdftotext': stdout=, stderr=, returncode=0, args=['pdftotext', '/Users/test/Data/convert/B.pdf', '/var/folders/b8/k1ndbdn53zs1m078zwwrbc4w0000gn/T/tmpc9ma3mwr.txt', '-f', '10', '-l', '10']
Cleaning up tmp file
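The log above suggests that pdftotext is called once per page, with -f and -l both set to that page, and that the result goes through a temporary file. A minimal sketch of that pattern follows; the helper names are hypothetical, and the actual implementation in convert_to_txt may differ.

```python
import os
import subprocess
import tempfile


def pdftotext_page_command(input_file, output_file, page):
    """Build the pdftotext argument list for a single page (-f first, -l last)."""
    return ["pdftotext", input_file, output_file, "-f", str(page), "-l", str(page)]


def extract_pages(input_file, pages):
    """Run pdftotext once per page and concatenate the results in order."""
    texts = []
    for page in pages:
        # A fresh temporary directory per page is removed automatically.
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_file = os.path.join(tmp_dir, "page.txt")
            subprocess.run(
                pdftotext_page_command(input_file, tmp_file, page),
                check=True,
                capture_output=True,
            )
            with open(tmp_file, encoding="utf-8", errors="replace") as f:
                texts.append(f.read())
    return "".join(texts)
```

Calling extract_pages() requires pdftotext (poppler) to be installed; building the command with pdftotext_page_command() does not.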
Finally, just like you can set the conversion method via the command-line, you can also do it via the API:
from convert_to_txt.lib import convert
txt = convert('/Users/test/Data/convert/B.pdf', convert_pages='10-12', pdf_convert_method='ebook-convert')
ℹ️ The full signature for the function convert():
convert(input_file, output_file=None,
convert_pages=None,
djvu_convert_method='djvutxt',
epub_convert_method='epubtxt',
msword_convert_method='textutil',
pdf_convert_method='pdftotext', **kwargs)