Crash in TeXKeys extraction #40

kaplun · 2017-10-05T07:24:08Z

Given the PDF available at http://arxiv.org/pdf/1710.01077, refextract crashes inside PyPDF2 code:

Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
    references = extract_references(uri, source)
  File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
    return function(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
    reference_format=u'{title},{volume},{page}'
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
    texkeys = extract_texkeys_from_pdf(path)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
    pdf = PdfFileReader(pdf_stream, strict=False)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'

It should instead handle the exception and continue without extracting TeXKeys.

michamos · 2017-10-05T07:32:22Z

Looks like PyPDF2 is really brittle. The crash happens while it is parsing the PDF, so it's nothing we could easily fix on our side. Maybe we should wrap calls to PyPDF2 in a big try/except:

try:
    # call PyPDF2
except Exception as e:
    # log the exception
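
Spelled out a bit more, the idea could look like the sketch below. This is only an illustration, not the actual refextract patch: `safe_extract_texkeys` is a hypothetical helper, `extractor` stands in for whatever function ends up calling PyPDF2 (for example `extract_texkeys_from_pdf` from the traceback), and returning an empty list on failure is an assumption about what "continue without TeXKeys" should mean.

```python
import logging

LOGGER = logging.getLogger(__name__)


def safe_extract_texkeys(path, extractor):
    """Run extractor(path); on any failure, log it and return no TeXKeys.

    `extractor` is a stand-in for the function that calls PyPDF2.
    The empty-list fallback is an assumption, not refextract's API.
    """
    try:
        return extractor(path)
    except Exception:
        # PyPDF2 can raise many exception types on malformed PDFs
        # (here a ValueError), so catch broadly and record the traceback.
        LOGGER.exception(
            "TeXKeys extraction failed for %s; continuing without TeXKeys",
            path,
        )
        return []


def broken_extractor(path):
    # Hypothetical extractor that fails the same way as the traceback above.
    raise ValueError("invalid literal for int() with base 10: 'f'")


texkeys = safe_extract_texkeys("1710.01077.pdf", broken_extractor)
```

With a wrapper like this, the workflow would get an empty list instead of an uncaught `ValueError`, and reference extraction could carry on.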

kaplun · 2017-10-05T07:35:00Z

Yeah exactly.

michamos · 2017-10-05T07:40:42Z

We wouldn't lose much anyway: texkey extraction is useful only for articles using Inspire texkeys (and maybe those of other platforms like ADS in the future). In the vast majority of cases those articles are produced by a standard TeX pipeline, which we know works well with PyPDF2.

kaplun added the bug label Oct 5, 2017
