Crash in TeXKeys extraction #40

Open
kaplun opened this issue Oct 5, 2017 · 3 comments

kaplun commented Oct 5, 2017

Given the PDF available at http://arxiv.org/pdf/1710.01077, refextract crashes inside PyPDF2 code:

Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
    references = extract_references(uri, source)
  File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
    return function(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
    reference_format=u'{title},{volume},{page}'
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
    texkeys = extract_texkeys_from_pdf(path)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
    pdf = PdfFileReader(pdf_stream, strict=False)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'

It should instead handle the exception and continue without extracting TeXKeys.

kaplun added the bug label Oct 5, 2017

michamos commented Oct 5, 2017

Looks like PyPDF2 is really brittle. The crash happens while PyPDF2 is parsing the PDF, so it is nothing we could easily fix on our side. Maybe we should wrap calls to PyPDF2 in a big try/except:

try:
    # call PyPDF2
except Exception as e:
    # log the exception
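
A minimal sketch of that wrapper, written so it runs without PyPDF2 installed. The names `safe_extract_texkeys`, `LOGGER` and `broken_extractor` are hypothetical and not part of refextract; the `extract` argument stands in for `extract_texkeys_from_pdf`, and returning an empty list as the "no TeXKeys" value is an assumption.

```python
import logging

LOGGER = logging.getLogger(__name__)


def safe_extract_texkeys(extract, path):
    """Run a TeXKey extractor and fall back to an empty list if it raises.

    `extract` is any callable taking a path (assumed here to play the role
    of refextract's `extract_texkeys_from_pdf`).
    """
    try:
        return extract(path)
    except Exception:
        # The PDF parser can fail in many ways on a malformed file (a
        # ValueError in this issue), so catch broadly, log the traceback,
        # and continue without TeXKeys.
        LOGGER.exception("TeXKey extraction failed for %s", path)
        return []


def broken_extractor(path):
    # Simulates the failure at the bottom of the traceback above.
    return int("f")  # raises ValueError


texkeys = safe_extract_texkeys(broken_extractor, "1710.01077.pdf")
```

With a wrapper like this, reference extraction could carry on with whatever `texkeys` it gets back, and the logged traceback would still show which PDFs PyPDF2 cannot parse.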

kaplun commented Oct 5, 2017

Yeah exactly.

michamos commented Oct 5, 2017

We wouldn't lose much anyway: texkey extraction is useful only for articles that use Inspire texkeys (and maybe those of other platforms like ADS in the future). In the vast majority of cases, those articles will be produced by a standard TeX pipeline, which we know works well with PyPDF2.
