Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
    references = extract_references(uri, source)
  File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
    return function(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
    reference_format=u'{title},{volume},{page}'
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
    texkeys = extract_texkeys_from_pdf(path)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
    pdf = PdfFileReader(pdf_stream, strict=False)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'
refextract should instead handle the exception and continue without extracting TeXKeys.
Looks like PyPDF2 is really brittle. The crash happens while it is parsing the PDF, so it's nothing we could easily fix on our side. Maybe we should wrap calls to PyPDF2 in a big try/except:
try:
    # call PyPDF2
except Exception as e:
    # log the exception
We wouldn't lose much anyway: texkey extraction is useful only for articles using Inspire texkeys (and maybe those of other platforms like ADS in the future). Those will in the vast majority of cases be produced by a standard TeX pipeline, which we know works well with PyPDF2.
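A slightly fuller version of that idea, as a minimal sketch rather than the actual refextract code: the helper name `extract_texkeys_safely` and its `extractor` argument are hypothetical, with `extractor` standing in for the PyPDF2-backed `extract_texkeys_from_pdf` so the snippet runs without PyPDF2 installed. It assumes an empty list is an acceptable "no TeXKeys found" result for the caller.

```python
import logging

logger = logging.getLogger(__name__)


def extract_texkeys_safely(path, extractor):
    """Return the TeXKeys found by `extractor(path)`, or [] if parsing fails.

    `extractor` is a hypothetical injection point standing in for the
    PyPDF2-backed function; it is not part of the refextract API.
    """
    try:
        return list(extractor(path))
    except Exception:
        # Malformed PDFs can fail in many ways (here a ValueError), so catch
        # broadly, log the full traceback, and continue without TeXKeys.
        logger.exception("Could not extract TeXKeys from %s", path)
        return []
```

With this shape, a PDF that PyPDF2 cannot parse costs only the TeXKey matching, and the rest of reference extraction proceeds. Catching `Exception` rather than only `ValueError` is deliberate here, since the parser can fail in ways that are hard to enumerate in advance.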
The PDF that triggers this crash is available at http://arxiv.org/pdf/1710.01077; refextract crashes inside PyPDF2 code while parsing it.