Not able to parse files: HTTP ERROR 403 #199

Open
telboth opened this issue Dec 17, 2024 · 6 comments

telboth commented Dec 17, 2024

I ran pip install megaparse (tried on both Windows and Linux) in a clean Python 3.11 environment. I then tried to run the basic example from the home page to parse a PDF. The errors listed below occurred.

I believe there might be two separate issues here:
The first is the warning about "...conflict with protected namespace "model_"."

The second problem is that I get "...ERROR 403: Forbidden"

python .\parse_shiptool.py
C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
warnings.warn(
Traceback (most recent call last):
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 89, in aload
parsed_document = await parser.convert(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\unstructured_parser.py", line 110, in convert
elements = partition(
^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\auto.py", line 438, in partition
elements = _partition_pdf(
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\documents\elements.py", line 593, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 429, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 385, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 208, in partition_pdf
return partition_pdf_or_image(
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 355, in partition_pdf_or_image
out_elements = _process_uncategorized_text_elements(elements)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 930, in _process_uncategorized_text_elements
new_el = element_from_text(cast(Text, el).text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text.py", line 295, in element_from_text
elif is_possible_narrative_text(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 80, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 276, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 225, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 136, in sent_tokenize
_download_nltk_packages_if_not_present()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 130, in _download_nltk_packages_if_not_present
download_nltk_packages()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 88, in download_nltk_packages
urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file_path)
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 241, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 525, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\telboth\OneDrive - Shearwater Geoservices Norway AS\Documents\Python\chatbot\MegaParse\parse_shiptool.py", line 31, in <module>
response = megaparse.load("./ylh-20240277-dch.pdf")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 109, in load
return loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\asyncio\base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 98, in aload
raise ParsingException(
megaparse.exceptions.base.ParsingException: Error while parsing file ./ylh-20240277-dch.pdf, file_extension: FileExtension.PDF: HTTP Error 403: Forbidden
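
On the first of the two issues: the pydantic warning comes from a model class (`UploadFileConfig`) inside an installed dependency, so the calling script does not own the `model_config` that the message suggests changing. A minimal sketch, assuming the warning is harmless for this use case, is to filter it with the standard-library `warnings` module before importing megaparse. The helper name `silence_protected_namespace_warning` is hypothetical, and the message pattern is copied from the log above.

```python
import warnings


def silence_protected_namespace_warning() -> None:
    """Ignore pydantic's 'protected namespace' UserWarning (hypothetical helper).

    Install the filter before importing megaparse: the warning is emitted
    when the model class is defined, which normally happens at import time.
    """
    warnings.filterwarnings(
        "ignore",
        # Regex matched against the start of the warning text from the log.
        message=r'Field "model_name" .* has conflict with protected namespace',
        category=UserWarning,
    )


silence_protected_namespace_warning()
# from megaparse import MegaParse  # import only after the filter is installed
```

This only hides the message. It does not touch the 403 error, which is raised independently further down the traceback.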

telboth commented Dec 17, 2024

Here is the basic code:

from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.unstructured_parser import UnstructuredParser

parser = UnstructuredParser()
megaparse = MegaParse(parser)

response = megaparse.load("./ylh-20240277-dch.pdf")  # this is the PDF I want to read/parse
print(response)
megaparse.save("./test.md")


koeckc commented Dec 17, 2024

I got exactly the same issue on macOS Sonoma with Python 3.11, using the sample from the README.

@jpiliukaitis

Same here, with the Dockerfile.gpu build, using the /v1/file endpoint.

I tried both .pdf and .docx files.

{ "detail": "Error while parsing file <_io.BytesIO object at 0x761b4ab1b470>, file_extension: FileExtension.PDF: HTTP Error 403: Forbidden" }


telboth commented Dec 17, 2024

Hmmm, something spooky is going on! I discovered that it works fine with some PDF files, but not with others.
That is, some files just run through, while others give the error mentioned above. I am not sure what the difference between these files is...

@DimonLavron

I had the same issue.

As far as I can tell, this is an issue with unstructured and the way it handles downloading the NLTK data.
Related issue: Unstructured-IO/unstructured#3795
Related MR: Unstructured-IO/unstructured#3796

So I think updating the unstructured version should resolve this issue.
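
Since the suggested fix depends on which unstructured release is actually installed in the environment, it can help to confirm the versions before and after upgrading. A minimal sketch using only the standard library follows; the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the names mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["megaparse", "unstructured", "nltk"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the same environment that produces the traceback shows whether an upgrade of unstructured actually took effect there.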


telboth commented Dec 18, 2024

I tried pip install unstructured==0.16.11, but that did not solve the issue.
With megaparse 0.0.49 and unstructured==0.15, I can parse some PDFs, but not all.
With megaparse 0.0.52 (latest), nothing works, and I now get the following error messages:

C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
warnings.warn(

Traceback (most recent call last):
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 89, in aload
parsed_document = await parser.convert(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\megaparse_vision.py", line 142, in convert
self.parsed_chunks = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\megaparse_vision.py", line 114, in send_to_mlm
response = await self.model.ainvoke([message])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 307, in ainvoke
llm_result = await self.agenerate_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 796, in agenerate_prompt
return await self.agenerate(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 756, in agenerate
raise exceptions[0]
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_core\language_models\chat_models.py", line 924, in _agenerate_with_cache
result = await self._agenerate(
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\langchain_openai\chat_models\base.py", line 825, in _agenerate
response = await self.async_client.create(**payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\resources\chat\completions.py", line 1720, in create
return await self._post(
^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1843, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1537, in request
return await self._request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\openai\_base_client.py", line 1638, in _request
raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model gpt-4o does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\telboth\OneDrive - Shearwater Geoservices Norway AS\Documents\Python\chatbot\parse_shiptool.py", line 30, in <module>
response = megaparse.load("MegaParse/InterpolationMissingFrequencies.pdf") #working pdf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 109, in load
return loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\asyncio\base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 98, in aload
raise ParsingException(
megaparse.exceptions.base.ParsingException: Error while parsing file MegaParse/InterpolationMissingFrequencies.pdf, file_extension: FileExtension.PDF: Error code: 404 - {'error': {'message': 'The model gpt-4o does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

(And yes, I do have gpt-4o access, since the code runs with megaparse==0.0.49.)
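
Both tracebacks in this thread end in the same `ParsingException` wrapper, while the underlying errors differ: an `HTTPError` 403 raised during the NLTK download in the first, and an `openai.NotFoundError` 404 in the second. When reporting, it may help to print the whole exception chain rather than only the wrapper. A minimal sketch follows; `exception_chain` is a hypothetical helper, and built-in exception types stand in for the real ones.

```python
def exception_chain(exc):
    """Return exceptions from `exc` back to the original one (hypothetical helper)."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        chain.append(exc)
        seen.add(id(exc))
        if exc.__cause__ is not None:
            # Explicit chaining: `raise ... from original`.
            exc = exc.__cause__
        elif not exc.__suppress_context__:
            # Implicit chaining: raised while handling another exception.
            exc = exc.__context__
        else:
            # `raise ... from None` hides the earlier exception.
            exc = None
    return chain


if __name__ == "__main__":
    try:
        try:
            raise ConnectionError("HTTP Error 403: Forbidden")  # stand-in inner error
        except ConnectionError:
            raise RuntimeError("Error while parsing file")  # stand-in wrapper
    except RuntimeError as err:
        for item in exception_chain(err):
            print(type(item).__name__, "-", item)
```

The last entry in the returned list is the original error, which is the part worth quoting in a bug report.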
