Not able to parse files: HTTP ERROR 403 #199

telboth · 2024-12-17T10:18:37Z

**I did a pip install megaparse (tried on both Windows and Linux) in a clean Python 3.11 environment. I then tried to run the basic example from the home page to parse a PDF. The errors listed below occurred:

I believe there might be two separate issues here:
The first is the warning about "...conflict with protected namespace "model_"."

The second problem is that I get "...ERROR 403: Forbidden"**

python .\parse_shiptool.py
C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
warnings.warn(
Traceback (most recent call last):
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 89, in aload
parsed_document = await parser.convert(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\unstructured_parser.py", line 110, in convert
elements = partition(
^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\auto.py", line 438, in partition
elements = _partition_pdf(
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\documents\elements.py", line 593, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 429, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 385, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 208, in partition_pdf
return partition_pdf_or_image(
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 355, in partition_pdf_or_image
out_elements = _process_uncategorized_text_elements(elements)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 930, in _process_uncategorized_text_elements
new_el = element_from_text(cast(Text, el).text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text.py", line 295, in element_from_text
elif is_possible_narrative_text(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 80, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 276, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 225, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 136, in sent_tokenize
_download_nltk_packages_if_not_present()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 130, in _download_nltk_packages_if_not_present
download_nltk_packages()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 88, in download_nltk_packages
urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file_path)
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 241, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 525, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\telboth\OneDrive - Shearwater Geoservices Norway AS\Documents\Python\chatbot\MegaParse\parse_shiptool.py", line 31, in <module>
response = megaparse.load("./ylh-20240277-dch.pdf")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 109, in load
return loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\asyncio\base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 98, in aload
raise ParsingException(
megaparse.exceptions.base.ParsingException: Error while parsing file ./ylh-20240277-dch.pdf, file_extension: FileExtension.PDF: HTTP Error 403: Forbidden
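
The "model_name" warning at the top of the log appears to be independent of the 403: it is a UserWarning printed before the traceback starts, and it does not abort the run. Until the field is configured on the library side, one way to keep it out of the output is a warning filter installed before megaparse is imported. This is a minimal sketch using only the standard library; the function name silence_protected_namespace_warning is made up for illustration.

```python
import warnings


def silence_protected_namespace_warning():
    """Ignore only pydantic's 'protected namespace' UserWarning for model_name.

    The pattern is matched against the start of the warning message, so other
    UserWarnings are still shown.
    """
    warnings.filterwarnings(
        "ignore",
        message=r'Field "model_name" .* protected namespace',
        category=UserWarning,
    )


# Install the filter first, then import megaparse, so the filter is already
# in place when the warning would be raised.
silence_protected_namespace_warning()
```

The warning text itself suggests setting model_config['protected_namespaces'] = (), but that has to be done on the UploadFileConfig class, which lives in library code rather than in the user script.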


telboth · 2024-12-17T10:29:10Z

Here is the basic code:

from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.unstructured_parser import UnstructuredParser

parser = UnstructuredParser()
megaparse = MegaParse(parser)

response = megaparse.load("./ylh-20240277-dch.pdf") #this is the pdf I want to read/parse
print(response)
megaparse.save("./test.md")

koeckc · 2024-12-17T10:40:06Z

I got exactly the same issue on macOS Sonoma with Python 3.11, using the sample from the README.

jpiliukaitis · 2024-12-17T10:51:59Z

Same here, with the Dockerfile.gpu build and the /v1/file endpoint.

Tried both .pdf and .docx files.

{ "detail": "Error while parsing file <_io.BytesIO object at 0x761b4ab1b470>, file_extension: FileExtension.PDF: HTTP Error 403: Forbidden" }

telboth · 2024-12-17T11:34:39Z

Hmmm - something spooky is going on! I discovered that it works fine with some PDF files, but not with others.
That is: some files just run through, while others give the error mentioned above. I am not sure what the difference between these files is...

DimonLavron · 2024-12-17T21:56:21Z

I had the same issue

As far as I can tell, this is an issue with unstructured and the way it handles the download of the NLTK libraries.
Related issue: Unstructured-IO/unstructured#3795
Related MR: Unstructured-IO/unstructured#3796

So I think updating the unstructured version should resolve this issue.
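
To check whether the 403 comes from the download step rather than from the PDF itself, a small probe may help. The first traceback ends in urllib.request.urlretrieve(NLTK_DATA_URL, ...) inside unstructured/nlp/tokenize.py, and that call only seems to be reached when a text element goes through sentence tokenization, which might explain why some files pass and others fail. The sketch below uses only the standard library; probe_url and its opener parameter are hypothetical names, and the URL to pass is whatever NLTK_DATA_URL resolves to in the installed unstructured version (it is not reproduced here).

```python
import urllib.error
import urllib.request


def probe_url(url, opener=urllib.request.urlopen, timeout=10):
    """Try to open `url` and report whether the server allowed it.

    Returns a tuple (ok, detail). `opener` is injectable so the function
    can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # Same exception type as at the bottom of the first traceback.
        return False, f"HTTP {exc.code}: {exc.reason}"
    except urllib.error.URLError as exc:
        # DNS failure, refused connection, proxy problems, and so on.
        return False, f"network error: {exc.reason}"


# Possible usage, assuming the installed unstructured version still exposes
# the constant seen in the traceback:
# from unstructured.nlp.tokenize import NLTK_DATA_URL
# print(probe_url(NLTK_DATA_URL))
```

If the probe reports a 403 on its own, the failure is in fetching the NLTK data and not in megaparse's handling of the file.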

telboth · 2024-12-18T07:58:58Z

I tried pip install unstructured==0.16.11, but that did not solve the issue.
When running megaparse 0.0.49 and unstructured==0.15, I can parse some PDFs, but not all.
When running megaparse 0.0.52 (latest), nothing works, and I now get the following error messages:

C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
warnings.warn(

Traceback (most recent call last):
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 89, in aload
parsed_document = await parser.convert(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\megaparse_vision.py", line 142, in convert
self.parsed_chunks = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\megaparse_vision.py", line 114, in send_to_mlm
response = await self.model.ainvoke([message])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 307, in ainvoke
llm_result = await self.agenerate_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 796, in agenerate_prompt
return await self.agenerate(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 756, in agenerate
raise exceptions[0]
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 924, in _agenerate_with_cache
result = await self._agenerate(
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_openai\chat_models\base.py", line 825, in _agenerate
response = await self.async_client.create(**payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\resources\chat\completions.py", line 1720, in create
return await self._post(
^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1843, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1537, in request
return await self._request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1638, in _request
raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model gpt-4o does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\telboth\OneDrive - Shearwater Geoservices Norway AS\Documents\Python\chatbot\parse_shiptool.py", line 30, in <module>
response = megaparse.load("MegaParse/InterpolationMissingFrequencies.pdf") #working pdf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 109, in load
return loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\asyncio\base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 98, in aload
raise ParsingException(
megaparse.exceptions.base.ParsingException: Error while parsing file MegaParse/InterpolationMissingFrequencies.pdf, file_extension: FileExtension.PDF: Error code: 404 - {'error': {'message': 'The model gpt-4o does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

(And yes - I do have gpt-4o access - since the code runs with megaparse==0.0.49)
