Commit

chore(deps): Bump unstructured[local-inference] from 0.10.14 to 0.10.15 in /requirements (#242)

Bumps
[unstructured[local-inference]](https://github.com/Unstructured-IO/unstructured)
from 0.10.14 to 0.10.15.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/Unstructured-IO/unstructured/releases">unstructured[local-inference]'s
releases</a>.</em></p>
<blockquote>
<h2>0.10.15</h2>
<h3>Enhancements</h3>
<ul>
<li><strong>Support for better element categories from the
next-generation image-to-text model (&quot;chipper&quot;).</strong>
Previously, not all of the classifications from Chipper were being
mapped to proper <code>unstructured</code> element categories so the
consumer of the library would see many <code>UncategorizedText</code>
elements. This fixes the issue, improving the granularity of the element
categories outputs for better downstream processing and chunking. The
mapping update is:
<ul>
<li>&quot;Threading&quot;: <code>NarrativeText</code></li>
<li>&quot;Form&quot;: <code>NarrativeText</code></li>
<li>&quot;Field-Name&quot;: <code>Title</code></li>
<li>&quot;Value&quot;: <code>NarrativeText</code></li>
<li>&quot;Link&quot;: <code>NarrativeText</code></li>
<li>&quot;Headline&quot;: <code>Title</code> (with
<code>category_depth=1</code>)</li>
<li>&quot;Subheadline&quot;: <code>Title</code> (with
<code>category_depth=2</code>)</li>
<li>&quot;Abstract&quot;: <code>NarrativeText</code></li>
</ul>
</li>
<li><strong>Better ListItem grouping for PDFs (fast strategy).</strong>
<code>partition_pdf</code> with the <code>fast</code> strategy
previously split some numbered list item lines into separate
elements. This enhancement uses the x,y coordinates and bbox sizes
to decide whether the following chunk of text is a continuation of
the immediately preceding ListItem element, in which case it is no longer
detected as its own non-ListItem element.</li>
<li><strong>Fall back to text-based classification for uncategorized
Layout elements for Images and PDFs</strong>. Improves element
classification by running existing text-based rules on elements that
were previously UncategorizedText.</li>
<li><strong>Adds table partitioning for many doc types
including: .html, .epub, .md, .rst, .odt, and .msg.</strong> At the
core of this change is the .html partition functionality, which is
leveraged by the other affected doc types. This impacts many scenarios
where <code>Table</code> Elements are now properly extracted.</li>
<li><strong>Create and add <code>add_chunking_strategy</code> decorator
to partition functions.</strong> Previously, users were responsible for
their own chunking after partitioning elements, often required for
downstream applications. Now, individual elements may be combined into
right-sized chunks where min and max character size may be specified if
<code>chunking_strategy=by_title</code>. Relevant elements are grouped
together for better downstream results. This enables users to immediately
use partitioned results effectively in downstream applications (e.g. RAG
architecture apps) without any additional post-processing.</li>
<li><strong>Adds <code>languages</code> as an input parameter and marks
<code>ocr_languages</code> kwarg for deprecation in pdf, image, and auto
partitioning functions.</strong> Previously, language information was
only being used for Tesseract OCR for image-based documents and was in a
Tesseract specific string format, but by refactoring into a list of
standard language codes independent of Tesseract, the
<code>unstructured</code> library will better support
<code>languages</code> for other non-image pipelines and/or support for
other OCR engines.</li>
<li><strong>Removes <code>UNSTRUCTURED_LANGUAGE</code> env var usage and
replaces <code>language</code> with <code>languages</code> as an input
parameter to unstructured-partition-text_type functions.</strong> The
previous parameter/input setup was not user-friendly or scalable to the
variety of elements being processed. By refactoring the inputted
language information into a list of standard language codes, we can
support future applications of the element language such as detection,
metadata, and multi-language elements. Now, to skip English specific
checks, set the <code>languages</code> parameter to any non-English
language(s).</li>
<li><strong>Adds <code>xlsx</code> and <code>xls</code> filetype
extensions to the <code>skip_infer_table_types</code> default list in
<code>partition</code>.</strong> By adding these file types to the input
parameter these files should not go through table extraction. Users can
still specify if they would like to extract tables from these filetypes,
but will have to set the <code>skip_infer_table_types</code> to exclude
the desired filetype extension. This avoids mis-representing complex
spreadsheets where there may be multiple sub-tables and other
content.</li>
<li><strong>Better debug output related to sentence counting
internals</strong>. Clarify message when sentence is not counted toward
sentence count because there aren't enough words, relevant for
developers focused on <code>unstructured</code>'s NLP internals.</li>
<li><strong>Faster ocr_only speed for partitioning PDF and
images.</strong> Use
<code>unstructured_pytesseract.run_and_get_multiple_output</code>
function to reduce the number of calls to <code>tesseract</code> by half
when partitioning pdf or image with <code>tesseract</code></li>
<li><strong>Adds data source properties to fsspec connectors</strong>
These properties (date_created, date_modified, version, source_url,
record_locator) are written to element metadata during ingest, mapping
elements to information about the document source from which they
derive. This functionality enables downstream applications to reveal
source document applications, e.g. a link to a GDrive doc, Salesforce
record, etc.</li>
<li><strong>Add delta table destination connector</strong> New delta
table destination connector added to ingest CLI. Users may now use
<code>unstructured-ingest</code> to write partitioned data from over 20
data sources (so far) to a Delta Table.</li>
<li><strong>Rename to Source and Destination Connectors in the
Documentation.</strong> Maintains naming consistency between the Connectors
codebase and documentation with the first addition of a destination
connector.</li>
<li><strong>Non-HTML text files now return unstructured-elements as
opposed to HTML-elements.</strong> Previously the text based files that
went through <code>partition_html</code> would return HTML-elements but
now we preserve the format from the input using
<code>source_format</code> argument in the partition call.</li>
<li><strong>Adds <code>PaddleOCR</code> as an optional alternative to
<code>Tesseract</code></strong> for OCR when processing PDF or image
files. It is installable via the <code>makefile</code> command
<code>install-paddleocr</code>. For experimental purposes only.</li>
<li><strong>Bump unstructured-inference</strong> to 0.5.28. This version
bump markedly improves the output of table data, rendered as
<code>metadata.text_as_html</code> in an element. These changes include:
<ul>
<li>add env variable <code>ENTIRE_PAGE_OCR</code> to specify using
paddle or tesseract on entire page OCR</li>
<li>table structure detection now pads the input image by 25 pixels in
all 4 directions to improve its recall (0.5.27)</li>
<li>support paddle on both CPU and GPU, assuming it is pre-installed
(0.5.26)</li>
<li>fix a bug where <code>cells_to_html</code> doesn't handle cells
spanning multiple rows properly (0.5.25)</li>
<li>remove <code>cv2</code> preprocessing step before OCR step in table
transformer (0.5.24)</li>
</ul>
</li>
</ul>
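
The Chipper label mapping listed in the first enhancement above can be written down as a small lookup table. This is only an illustrative sketch: the `MappedCategory` type, the `CHIPPER_LABEL_MAP` dict, and the `map_chipper_label` helper are hypothetical names, not the library's internals, and the fallback to `UncategorizedText` for unknown labels is an assumption based on the behavior the notes describe.

```python
from typing import NamedTuple, Optional


class MappedCategory(NamedTuple):
    """Target element category plus optional heading depth (hypothetical type)."""
    category: str
    category_depth: Optional[int] = None


# Label -> category pairs copied from the release notes. The dict itself is
# illustrative and is not the library's internal data structure.
CHIPPER_LABEL_MAP = {
    "Threading": MappedCategory("NarrativeText"),
    "Form": MappedCategory("NarrativeText"),
    "Field-Name": MappedCategory("Title"),
    "Value": MappedCategory("NarrativeText"),
    "Link": MappedCategory("NarrativeText"),
    "Headline": MappedCategory("Title", 1),
    "Subheadline": MappedCategory("Title", 2),
    "Abstract": MappedCategory("NarrativeText"),
}


def map_chipper_label(label: str) -> MappedCategory:
    """Look up a label; unknown labels fall back to UncategorizedText (assumption)."""
    return CHIPPER_LABEL_MAP.get(label, MappedCategory("UncategorizedText"))


print(map_chipper_label("Subheadline"))
```

Keeping the depth next to the category makes it visible that only the two headline labels carry a `category_depth`.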
<h3>Features</h3>
<ul>
<li><strong>Adds element metadata via <code>category_depth</code> with
default value None</strong>.
<ul>
<li>This additional metadata is useful for vectordb/LLM, chunking
strategies, and retrieval applications.</li>
</ul>
</li>
<li><strong>Adds a naive hierarchy for elements via a
<code>parent_id</code> on the element's metadata</strong>
<ul>
<li>Users will now have more metadata for implementing vectordb/LLM
chunking strategies. For example, text elements could be queried by
their preceding title element.</li>
<li>Title elements created from HTML headings will properly nest</li>
</ul>
</li>
</ul>
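
As a rough illustration of querying text elements by their parent title, the sketch below groups child texts under the text of the element that `parent_id` points to. It assumes elements have been serialized to dicts with `element_id`, `text`, and a `metadata` dict; the sample data and the `group_text_by_parent` helper are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Element = Dict[str, Any]


def group_text_by_parent(elements: List[Element]) -> Dict[str, List[str]]:
    """Map each parent element's text to the texts of its direct children."""
    text_by_id = {el["element_id"]: el["text"] for el in elements}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for el in elements:
        parent_id = el.get("metadata", {}).get("parent_id")
        # Elements without a parent_id, or with an unknown one, are skipped.
        if parent_id in text_by_id:
            grouped[text_by_id[parent_id]].append(el["text"])
    return dict(grouped)


# Hypothetical sample data shaped like serialized elements.
sample = [
    {"element_id": "a1", "type": "Title", "text": "Setup", "metadata": {}},
    {"element_id": "b2", "type": "NarrativeText",
     "text": "Install the package.", "metadata": {"parent_id": "a1"}},
    {"element_id": "c3", "type": "Title", "text": "Usage", "metadata": {}},
    {"element_id": "d4", "type": "NarrativeText",
     "text": "Call partition().", "metadata": {"parent_id": "c3"}},
    {"element_id": "e5", "type": "NarrativeText",
     "text": "Inspect the elements.", "metadata": {"parent_id": "c3"}},
]

print(group_text_by_parent(sample))
```

Two parents with identical text would be merged under one key here; keying on `element_id` instead avoids that if it matters.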
<h3>Fixes</h3>
<ul>
<li><strong><code>add_pytesseract_bboxes_to_elements</code> no longer
returns <code>nan</code> values</strong>. The function logic is now
broken into new methods
<code>_get_element_box</code> and
<code>convert_multiple_coordinates_to_new_system</code></li>
<li><strong>Selecting a different model wasn't being respected when
calling <code>partition_image</code>.</strong> Problem:
<code>partition_pdf</code> allows for passing a <code>model_name</code>
parameter. Given the similarity between the image and PDF pipelines, the
expected behavior is that <code>partition_image</code> should support
the same parameter, but <code>partition_image</code> was unintentionally
not passing along its <code>kwargs</code>. This was corrected by adding
the kwargs to the downstream call.</li>
<li><strong>Fixes a chunking issue via dropping the field
&quot;coordinates&quot;.</strong> Problem: the chunk_by_title function was
placing each element in its own chunk when it should have grouped
elements into fewer chunks. This happened because of the metadata
matching logic in chunk_by_title: elements with different metadata
cannot be put into the same chunk, and any element with
&quot;coordinates&quot; effectively had different metadata from every
other element, since each element sits at a different location. Fix: the
key &quot;coordinates&quot; is now included in a list of excluded
metadata keys for the &quot;metadata_matches&quot; comparison.
Importance: this change is crucial for chunking by title in documents
whose elements include &quot;coordinates&quot; metadata.</li>
</ul>
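
The chunking fix above comes down to comparing metadata while ignoring certain keys. A minimal sketch of that idea follows. The name `metadata_matches` appears in the notes, but the signature, the `EXCLUDED_METADATA_KEYS` constant, and the sample metadata are assumptions made for illustration.

```python
from typing import Any, Dict, FrozenSet

# Assumption: "coordinates" is the only excluded key named in the notes.
EXCLUDED_METADATA_KEYS: FrozenSet[str] = frozenset({"coordinates"})


def metadata_matches(
    first: Dict[str, Any],
    second: Dict[str, Any],
    excluded_keys: FrozenSet[str] = EXCLUDED_METADATA_KEYS,
) -> bool:
    """Return True when two metadata dicts agree on every non-excluded key."""

    def comparable(metadata: Dict[str, Any]) -> Dict[str, Any]:
        return {k: v for k, v in metadata.items() if k not in excluded_keys}

    return comparable(first) == comparable(second)


# Two hypothetical elements on the same page at different positions.
a = {"filename": "report.pdf", "page_number": 1, "coordinates": ((0, 0), (10, 10))}
b = {"filename": "report.pdf", "page_number": 1, "coordinates": ((0, 20), (10, 30))}

print(metadata_matches(a, b))
print(metadata_matches(a, b, excluded_keys=frozenset()))
```

With the default exclusion the two elements match and can share a chunk. With an empty exclusion set they differ on `coordinates`, which reproduces the one-element-per-chunk behavior the fix describes.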
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/b534b2a6cdd91f2a6380b8e1097de28e141a0d8e"><code>b534b2a</code></a>
Chore: bump inference package version to 0.5.28 and new release (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1355">#1355</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/09a0958f900a6748c217c9f022ca90b4ab01b3a5"><code>09a0958</code></a>
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch,
supporting aa...</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/36d026cb1bb009a275c3afb6a79bd7237c762027"><code>36d026c</code></a>
chore: update CHANGELOG.md bullets (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1436">#1436</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/6187dc09768df825920dca0e323005712aad05d2"><code>6187dc0</code></a>
update links in integrations.rst (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1418">#1418</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/333558494e6695717b73421cf9fea1a3285925ef"><code>3335584</code></a>
roman/delta lake dest connector (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1385">#1385</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/98d3541909f64290b5efb65a226fc3ee8a7cc5ee"><code>98d3541</code></a>
Update CHANGELOG.md (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1435">#1435</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/de4d496fcf64cfadfcdc4ab065c106287eb48637"><code>de4d496</code></a>
Fix bbox coordinates for ocr_only strategy (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1325">#1325</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/0d61c9848170b8db090b121b97e3822dbeff4eab"><code>0d61c98</code></a>
fix: Pass partition_image kwargs downstream (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1426">#1426</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/fe11ab4235ad2b2bc8328a036b4da33b7392f8fb"><code>fe11ab4</code></a>
feat: improved mapping for missing chipper elements (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1431">#1431</a>)</li>
<li><a
href="https://github.com/Unstructured-IO/unstructured/commit/50db2abd9f6f0eadd456a4b5026b4ff0dbdc5d75"><code>50db2ab</code></a>
fix: updating element types (<a
href="https://redirect.github.com/Unstructured-IO/unstructured/issues/1394">#1394</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/Unstructured-IO/unstructured/compare/0.10.14...0.10.15">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=unstructured[local-inference]&package-manager=pip&previous-version=0.10.14&new-version=0.10.15)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com>
Co-authored-by: Austin Walker <[email protected]>
3 people authored Sep 18, 2023
1 parent 6923a24 commit af24905
Showing 4 changed files with 54 additions and 43 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,7 @@
-## 0.0.45-dev0
+## 0.0.45

* Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
+* Bump unstructured to 0.10.15

## 0.0.44

35 changes: 19 additions & 16 deletions requirements/base.txt
@@ -31,7 +31,7 @@ click==8.1.3
# uvicorn
coloredlogs==15.0.1
# via onnxruntime
-contourpy==1.1.0
+contourpy==1.1.1
# via matplotlib
cryptography==41.0.3
# via pdfminer-six
@@ -51,7 +51,7 @@ exceptiongroup==1.1.3
# via anyio
fastapi==0.103.1
# via -r requirements/base.in
-filelock==3.12.3
+filelock==3.12.4
# via
# huggingface-hub
# torch
@@ -62,11 +62,11 @@ flatbuffers==23.5.26
# via onnxruntime
fonttools==4.42.1
# via matplotlib
-fsspec==2023.9.0
+fsspec==2023.9.1
# via huggingface-hub
h11==0.14.0
# via uvicorn
-huggingface-hub==0.17.1
+huggingface-hub==0.17.2
# via
# timm
# transformers
@@ -99,7 +99,7 @@ markupsafe==2.1.3
# via jinja2
marshmallow==3.20.1
# via dataclasses-json
-matplotlib==3.7.3
+matplotlib==3.8.0
# via pycocotools
mpmath==1.3.0
# via sympy
@@ -111,7 +111,7 @@ networkx==3.1
# via torch
nltk==3.8.1
# via unstructured
-numpy==1.25.2
+numpy==1.26.0
# via
# contourpy
# layoutparser
@@ -146,6 +146,7 @@ packaging==23.1
# onnxruntime
# pytesseract
# transformers
+# unstructured-pytesseract
pandas==2.1.0
# via
# layoutparser
@@ -160,7 +161,7 @@ pdfminer-six==20221105
# unstructured
pdfplumber==0.10.2
# via layoutparser
-pillow==10.0.0
+pillow==10.0.1
# via
# layoutparser
# matplotlib
@@ -169,7 +170,8 @@ pillow==10.0.0
# pytesseract
# python-pptx
# torchvision
-portalocker==2.7.0
+# unstructured-pytesseract
+portalocker==2.8.2
# via iopath
protobuf==4.24.3
# via
@@ -181,7 +183,7 @@ pycocotools==2.0.7
# via effdet
pycparser==2.21
# via cffi
-pycryptodome==3.18.0
+pycryptodome==3.19.0
# via -r requirements/base.in
pydantic==1.10.12
# via
@@ -191,7 +193,7 @@ pypandoc==1.11
# via unstructured
pyparsing==3.1.1
# via matplotlib
-pypdf==3.16.0
+pypdf==3.16.1
# via -r requirements/base.in
pypdfium2==4.20.0
# via pdfplumber
@@ -275,12 +277,11 @@ tqdm==4.66.1
# iopath
# nltk
# transformers
-transformers==4.33.1
+transformers==4.33.2
# via unstructured-inference
-typing-extensions==4.7.1
+typing-extensions==4.8.0
# via
# fastapi
-# filelock
# huggingface-hub
# iopath
# onnx
@@ -293,15 +294,17 @@ typing-inspect==0.9.0
# via dataclasses-json
tzdata==2023.3
# via pandas
-unstructured[local-inference]==0.10.14
+unstructured[local-inference]==0.10.15
# via -r requirements/base.in
-unstructured-inference==0.5.25
+unstructured-inference==0.5.28
# via unstructured
+unstructured-pytesseract==0.3.12
+# via unstructured
urllib3==2.0.4
# via requests
uvicorn==0.23.2
# via -r requirements/base.in
xlrd==2.0.1
# via unstructured
-xlsxwriter==3.1.3
+xlsxwriter==3.1.4
# via python-pptx
53 changes: 29 additions & 24 deletions requirements/test.txt
@@ -87,7 +87,7 @@ comm==0.1.4
# via
# ipykernel
# ipywidgets
-contourpy==1.1.0
+contourpy==1.1.1
# via
# -r requirements/base.txt
# matplotlib
@@ -105,7 +105,7 @@ dataclasses-json==0.6.0
# via
# -r requirements/base.txt
# unstructured
-debugpy==1.7.0
+debugpy==1.8.0
# via ipykernel
decorator==5.1.1
# via ipython
@@ -146,7 +146,7 @@ fastcore==1.5.29
# nbdev
fastjsonschema==2.18.0
# via nbformat
-filelock==3.12.3
+filelock==3.12.4
# via
# -r requirements/base.txt
# huggingface-hub
@@ -168,7 +168,7 @@ fonttools==4.42.1
# matplotlib
fqdn==1.5.1
# via jsonschema
-fsspec==2023.9.0
+fsspec==2023.9.1
# via
# -r requirements/base.txt
# huggingface-hub
@@ -183,7 +183,7 @@ httpcore==0.18.0
# via httpx
httpx==0.25.0
# via -r requirements/test.in
-huggingface-hub==0.17.1
+huggingface-hub==0.17.2
# via
# -r requirements/base.txt
# timm
@@ -220,7 +220,7 @@ ipython==8.15.0
# jupyter-console
ipython-genutils==0.2.0
# via qtconsole
-ipywidgets==8.1.0
+ipywidgets==8.1.1
# via jupyter
isoduration==20.11.0
# via jsonschema
@@ -284,15 +284,15 @@ jupyter-server==2.7.3
# notebook-shim
jupyter-server-terminals==0.4.4
# via jupyter-server
-jupyterlab==4.0.5
+jupyterlab==4.0.6
# via notebook
jupyterlab-pygments==0.2.2
# via nbconvert
jupyterlab-server==2.25.0
# via
# jupyterlab
# notebook
-jupyterlab-widgets==3.0.8
+jupyterlab-widgets==3.0.9
# via ipywidgets
kiwisolver==1.4.5
# via
@@ -322,7 +322,7 @@ marshmallow==3.20.1
# via
# -r requirements/base.txt
# dataclasses-json
-matplotlib==3.7.3
+matplotlib==3.8.0
# via
# -r requirements/base.txt
# pycocotools
@@ -363,7 +363,7 @@ nbformat==5.9.2
# jupyter-server
# nbclient
# nbconvert
-nest-asyncio==1.5.7
+nest-asyncio==1.5.8
# via ipykernel
networkx==3.1
# via
@@ -379,7 +379,7 @@ notebook-shim==0.2.3
# via
# jupyterlab
# notebook
-numpy==1.25.2
+numpy==1.26.0
# via
# -r requirements/base.txt
# contourpy
@@ -440,6 +440,7 @@ packaging==23.1
# qtconsole
# qtpy
# transformers
+# unstructured-pytesseract
pandas==2.1.0
# via
# -r requirements/base.txt
@@ -469,7 +470,7 @@ pexpect==4.8.0
# via ipython
pickleshare==0.7.5
# via ipython
-pillow==10.0.0
+pillow==10.0.1
# via
# -r requirements/base.txt
# layoutparser
@@ -479,13 +480,14 @@ pillow==10.0.0
# pytesseract
# python-pptx
# torchvision
+# unstructured-pytesseract
platformdirs==3.10.0
# via
# black
# jupyter-core
pluggy==1.3.0
# via pytest
-portalocker==2.7.0
+portalocker==2.8.2
# via
# -r requirements/base.txt
# iopath
@@ -520,7 +522,7 @@ pycparser==2.21
# via
# -r requirements/base.txt
# cffi
-pycryptodome==3.18.0
+pycryptodome==3.19.0
# via -r requirements/base.txt
pydantic==1.10.12
# via
@@ -542,7 +544,7 @@ pyparsing==3.1.1
# via
# -r requirements/base.txt
# matplotlib
-pypdf==3.16.0
+pypdf==3.16.1
# via -r requirements/base.txt
pypdfium2==4.20.0
# via
@@ -638,7 +640,7 @@ rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
-rpds-py==0.10.2
+rpds-py==0.10.3
# via
# jsonschema
# referencing
@@ -737,7 +739,7 @@ tqdm==4.66.1
# iopath
# nltk
# transformers
-traitlets==5.9.0
+traitlets==5.10.0
# via
# comm
# ipykernel
@@ -754,17 +756,16 @@ traitlets==5.9.0
# nbconvert
# nbformat
# qtconsole
-transformers==4.33.1
+transformers==4.33.2
# via
# -r requirements/base.txt
# unstructured-inference
-typing-extensions==4.7.1
+typing-extensions==4.8.0
# via
# -r requirements/base.txt
# async-lru
# black
# fastapi
-# filelock
# huggingface-hub
# iopath
# mypy
@@ -781,9 +782,13 @@ tzdata==2023.3
# via
# -r requirements/base.txt
# pandas
-unstructured[local-inference]==0.10.14
+unstructured[local-inference]==0.10.15
# via -r requirements/base.txt
-unstructured-inference==0.5.25
+unstructured-inference==0.5.28
# via
# -r requirements/base.txt
# unstructured
+unstructured-pytesseract==0.3.12
+# via
+# -r requirements/base.txt
+# unstructured
@@ -809,13 +814,13 @@ websocket-client==1.6.3
# via jupyter-server
wheel==0.41.2
# via astunparse
-widgetsnbextension==4.0.8
+widgetsnbextension==4.0.9
# via ipywidgets
xlrd==2.0.1
# via
# -r requirements/base.txt
# unstructured
-xlsxwriter==3.1.3
+xlsxwriter==3.1.4
# via
# -r requirements/base.txt
# python-pptx
6 changes: 4 additions & 2 deletions scripts/parallel-mode-test.sh
@@ -27,7 +27,9 @@ do
echo Testing: "$curl_command"

# Run in single mode
-$curl_command 2> /dev/null | jq -S > output.json
+# Note(austin): Parallel mode screws up hierarchy! While we deal with that,
+# let's ignore parent_id fields in the results
+$curl_command 2> /dev/null | jq -S 'del(..|.parent_id?)' > output.json

# Stop if curl didn't work
if [ ! -s output.json ]; then
@@ -38,7 +40,7 @@ do

# Run in parallel mode
curl_command="curl $base_url_2/general/v0/general $params"
-$curl_command 2> /dev/null | jq -S > parallel_output.json
+$curl_command 2> /dev/null | jq -S 'del(..|.parent_id?)' > parallel_output.json

# Stop if curl didn't work
if [ ! -s parallel_output.json ]; then
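
The `jq -S 'del(..|.parent_id?)'` filter in the script changes above walks every value in the response (`..`), deletes any `parent_id` field it finds on an object, and prints the result with sorted keys (`-S`). For readers less familiar with jq, the same normalization can be sketched in Python. The `drop_key` and `normalize` helpers and the sample payloads are hypothetical and are not part of the repository.

```python
import json
from typing import Any


def drop_key(value: Any, key: str = "parent_id") -> Any:
    """Return a copy of value with `key` removed from every nested dict."""
    if isinstance(value, dict):
        return {k: drop_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [drop_key(item, key) for item in value]
    return value


def normalize(raw_json: str) -> str:
    """Drop parent_id fields and serialize with sorted keys, in the spirit of jq -S."""
    return json.dumps(drop_key(json.loads(raw_json)), sort_keys=True, indent=2)


# Hypothetical responses that differ only in parent_id and key order.
single = '[{"text": "a", "metadata": {"parent_id": "x", "page_number": 1}}]'
parallel = '[{"metadata": {"page_number": 1, "parent_id": "y"}, "text": "a"}]'

print(normalize(single) == normalize(parallel))
```

Comparing the normalized strings mirrors what the test script does when it compares `output.json` with `parallel_output.json` after stripping `parent_id`.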