For the latest approach, go to: v2
The unstructured library includes a CLI to batch ingest documents from various sources, storing structured outputs locally on the filesystem.
For example, the following command processes all the documents in S3 in the utic-dev-tech-fixtures bucket with a prefix of small-pdf-set/:
```bash
unstructured-ingest \
s3 \
--remote-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
--anonymous \
--output-dir s3-small-batch-output \
--num-processes 2
```
Naturally, --num-processes may be adjusted for better instance utilization with multiprocessing.
Installation note: make sure to install the extras needed for the above command when installing unstructured:

```bash
pip install "unstructured[s3,local-inference]"
```
See the Quick Start, which documents how to pip install detectron2 and other OS dependencies necessary for parsing PDF files.
When testing from a local checkout rather than a pip-installed version of unstructured, just execute unstructured_ingest/main.py, e.g.:
```bash
PYTHONPATH=. ./unstructured_ingest/main.py \
s3 \
--remote-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
--anonymous \
--output-dir s3-small-batch-output \
--num-processes 2
```
To add a connector, refer to unstructured_ingest/connector/github.py as an example that implements the three relevant abstract base classes. If the connector has an available fsspec implementation, refer to unstructured_ingest/connector/s3.py instead.
Then, update unstructured_ingest/cli to add a subcommand associated with the connector, and hook it up to the parent group.
Add an implementation of BaseRunner in the runner directory to connect the invocation of the CLI with the underlying connector created.
Create at least one folder under examples/ingest with an easily reproducible script that shows the new connector in action.
Finally, to ensure the connector remains stable, add a new script test_unstructured_ingest/test-ingest-<the-new-data-source>.sh similar to test_unstructured_ingest/test-ingest-s3.sh, and append a line invoking the new script in test_unstructured_ingest/test-ingest.sh.
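For orientation, the sketch below shows the rough shape such a connector module can take. Only the class and method names already referenced in this documentation are real; the module name, the field names, and the import path are illustrative assumptions rather than the library's exact API.

```python
# Hypothetical unstructured_ingest/connector/my_source.py -- an illustrative sketch only.
from dataclasses import dataclass

# Import path assumed from the interfaces file referenced in this documentation.
from unstructured_ingest.interfaces import BaseConnectorConfig, BaseIngestDoc


@dataclass
class SimpleMySourceConfig(BaseConnectorConfig):
    """Connection details the CLI collects for this source (hypothetical fields)."""

    remote_url: str = ""
    api_key: str = ""


@dataclass
class MySourceIngestDoc(BaseIngestDoc):
    """One instance per remote document; knows how to fetch and clean up its file."""

    connector_config: SimpleMySourceConfig = None
    remote_path: str = ""

    def get_file(self):
        # Download the remote document into the configured download directory.
        ...

    def cleanup_file(self):
        # Remove the downloaded file unless downloads are being preserved.
        ...


# A third class -- the connector itself, whose base class is shown in github.py --
# enumerates documents on the remote source and yields MySourceIngestDoc instances.
```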
You'll notice that the unstructured outputs for the new documents are expected to be checked into CI under test_unstructured_ingest/expected-structured-output/<folder-name-relevant-to-your-dataset>. So, you'll need to git add those JSON outputs so that test-ingest.sh passes in CI.
The main.py flags of --re-download/--no-re-download, --download-dir, --preserve-downloads, --structured-output-dir, and --reprocess are honored by the connector.
To add a destination connector, refer to unstructured_ingest/connector/delta-table.py as an example, which extends the BaseDestinationConnector and the WriteConfig. It also shows how an existing data provider can be used for both a source and destination connector.
Similar to the runner used to connect source connectors with the CLI, destination connectors require an entry in the writer map defined in unstructured_ingest/runner/writers.py. This allows any source connector to use any destination connector.
Regarding the entry in the CLI, destination connectors are exposed as a subcommand that gets added to each source connector parent command. Special care needs to be taken here to not break the code being run by the source connector. Take a look at how the base runner class is dynamically pulled using the name of the parent CLI command in unstructured_ingest/cli/cmds/delta_table.py.
Tests and examples should also be added to demonstrate and validate the use of the destination connector, similar to the steps laid out for a source connector.
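As described above, the new destination also needs an entry in the writer map so that any source connector can route its output there. The sketch below illustrates that registration; the exact shape of the map and the writer callables are assumptions based on the description above, not the actual contents of writers.py.

```python
# Illustrative sketch of registering a writer in unstructured_ingest/runner/writers.py.
from typing import Callable, Dict


def delta_table_writer(**kwargs) -> Callable:
    ...  # existing writer, shown here only as a stub


def my_destination_writer(**kwargs) -> Callable:
    ...  # hypothetical writer for the new destination connector


# Source connectors look up the requested destination by name in this map.
writer_map: Dict[str, Callable] = {
    "delta_table": delta_table_writer,
    "my_destination": my_destination_writer,
}
```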
In checklist form, the above steps are summarized as:
- Create a new module under unstructured_ingest/connector/ implementing the 3 abstract base classes, similar to unstructured_ingest/connector/github.py.
  - The subclass of BaseIngestDoc overrides process_file() if extra processing logic is needed other than what is provided by auto.partition().
  - If the IngestDoc relies on a connection or session that could be reused, the subclass of BaseConnectorConfig implements a session handle to manage connections. The ConnectorConfig subclass should also inherit from ConfigSessionHandleMixin and the IngestDoc subclass should also inherit from IngestDocSessionHandleMixin. Check here for a detailed example.
  - The subclass of BaseIngestDoc implements relevant data source properties to include metadata. Check this PR for detailed examples.
    - The record_locator property should include all of the information required to be able to reach the document in the source platform (see the sketch after this checklist).
  - Add the relevant decorators from unstructured.ingest.error on top of relevant methods to handle errors such as a source connection error, destination connection error, or a partition error. For examples, check here.
- Update unstructured_ingest/cli with support for the new connector.
- Create a folder under examples/ingest that includes at least one well documented script.
- Add a script test_unstructured_ingest/test-ingest-<the-new-data-source>.sh. Its JSON output files should have a total size of no more than 100K.
- Git add the expected outputs under test_unstructured_ingest/expected-structured-output/<folder-name-relevant-to-your-dataset> so the above test passes in CI.
- Add a line to test_unstructured_ingest/test-ingest.sh invoking the new test script.
- Make sure the tests for the connector are running and not skipped by reviewing the logs in CI.
- If additional python dependencies are needed for the new connector:
  - Add them as an extra to setup.py.
  - Update the Makefile, adding a target for install-ingest-<name> and adding another pip-compile line to the pip-compile make target. See this commit for a reference.
  - The added dependencies should be imported at runtime when the new connector is invoked, rather than as top-level imports.
  - Add the decorator unstructured.utils.requires_dependencies on top of each class or function that uses those connector-specific dependencies, e.g. for GitHubConnector it should look like @requires_dependencies(dependencies=["github"], extras="github") (see the sketch after this checklist).
- Run make tidy and make check to ensure linting checks pass.
- Update ingest documentation here
- For team members that are developing in the original repository:
  - If there are secret variables created for the connector tests, make sure to:
    - add the secrets into GitHub (contact someone with access)
    - include the secret variables in ci.yml and ingest-test-fixtures-update-pr.yml
    - add a make install line in the workflow configurations so that the workflow machine has the dependencies the connector requires while testing
  - Whenever necessary, use the ingest update test fixtures workflow to update the test fixtures.
- Honors the conventions of BaseConnectorConfig defined in unstructured_ingest/interfaces.py, which is passed through the CLI:
  - If running with an .output_dir where structured outputs already exist for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to MyIngestDoc.has_output(), which is invoked in MainProcess._filter_docs_with_outputs.
  - If .reprocess is True, documents are always reprocessed.
  - If .preserve_download is True, documents downloaded to .download_dir are not removed after processing.
  - Else if .preserve_download is False, documents downloaded to .download_dir are removed after they are successfully processed during the invocation of MyIngestDoc.cleanup_file() in process_document.
  - Does not re-download documents to .download_dir if .re_download is False, enforced in MyIngestDoc.get_file().
  - Prints more details if --verbose is set in the ingest CLI, similar to unstructured_ingest/connector/github.py logging messages.
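As referenced in the checklist above, a record_locator should carry everything needed to reach the document again on the source platform. The example below is purely illustrative; the keys are connector-specific and these are assumptions, not a fixed schema.

```python
# Illustrative only: a record_locator for a hypothetical S3-backed IngestDoc.
from dataclasses import dataclass


@dataclass
class ExampleS3IngestDoc:
    remote_file_path: str

    @property
    def record_locator(self) -> dict:
        # Everything required to reach this document again on the source platform.
        return {
            "protocol": "s3",
            "remote_file_path": self.remote_file_path,
        }
```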
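Also referenced in the checklist above, the requires_dependencies usage for GitHubConnector would look roughly as follows; only the decorator line and the import path come from this documentation, and the class body is elided.

```python
from unstructured.utils import requires_dependencies


# The decorator defers the dependency check until the connector is actually used,
# so the "github" extra is only required when this connector is invoked.
@requires_dependencies(dependencies=["github"], extras="github")
class GitHubConnector:
    ...  # connector body elided; see unstructured_ingest/connector/github.py
```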
unstructured_ingest/main.py is the entrypoint for the unstructured-ingest CLI. It calls the cli Command fetched from cli.py via get_cmd().
The ingest directory is broken up in such a way that most of the code can be used with or without invoking the CLI itself:
- Connector: This houses the main code that is responsible for reaching out to external data providers and pulling down the data (e.g. S3, Azure, etc.).
- Runner: This serves as the interface between the CLI-specific commands and running the connector code. A base runner class exists that defines much of the common functionality across all connectors and allows typed methods to be defined to explicitly connect the CLI command to the specific connector.
- CLI: This is where the Click python library is introduced to create the cli bindings that a user interacts with when invoking the CLI directly. Many of the common options across commands are abstracted away and added dynamically to Click commands.
The ingest flow is similar to an ETL pipeline that gets defined at runtime based on user input:
Each step in the pipeline caches its results in a default location if one is not provided. This allows the pipeline to pick up where it left off if an error occurred before it finished, without having to recompute everything that ran successfully. Each step uses a hash of its own parameters combined with the hash of the previous step to decide whether the results it already has are still valid or should be recomputed even though they exist. This allows you to change parameters associated with a step at the tail end of the pipeline and have only the steps from that point on recomputed.
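To make the caching behavior concrete, the snippet below is a conceptual illustration of the idea described above, not the library's implementation: each step's cache key chains in the previous step's key, so a change late in the pipeline only invalidates the steps after it.

```python
# Conceptual illustration only -- not the library's code.
import hashlib
import json


def step_cache_key(step_params: dict, previous_key: str = "") -> str:
    """Hash a step's parameters together with the previous step's key."""
    payload = json.dumps(step_params, sort_keys=True) + previous_key
    return hashlib.sha256(payload.encode()).hexdigest()


# Changing the chunking parameters changes only the chunking key (and the keys of
# any later steps), so the cached partitioning results above it remain valid.
partition_key = step_cache_key({"strategy": "fast"})
chunking_key = step_cache_key({"chunk_size": 500}, previous_key=partition_key)
```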
Multiprocessing: One of the options for the pipeline is how many processes to use. Not all steps support multiprocessing, but if they do, a multiprocessing Pool is used to speed up the process. For debugging purposes, if a single process is set, the multiprocessing Pool isn't used at all.
While all the configurations are added as options to a single Click command when the CLI is invoked, many of them are bundled together based on a particular step in the pipeline. A BaseConfig is extended in the root interfaces file, and that can be extended once again in the cli-specific interfaces file, which adds a function defining how the fields in the base config should be mapped to Click options (a conceptual sketch of this pattern follows the list below).
- PartitionConfig: Data associated with running the partitioning over the files pulled down via the source connector.
- ProcessorConfig: Data around the process as a whole, such as the number of processes to use when running, where to store the final result of the pipeline, and whether an error should be raised if a single doc fails. By default, the pipeline will continue with what it can, so if one doc out of many fails, an error is logged and the rest continue.
- ReadConfig: Data associated with pulling the data from the source data provider, such as whether it should be re-downloaded regardless of the files already existing.
- EmbeddingConfig: Data associated with running an optional embedder on the data, which adds a new field to the output JSON for each element with its associated embeddings vector.
- ChunkingConfig: Data associated with running an optional chunker over the partitioned data.
- PermissionsConfig: Data associated with pulling down permissions data (i.e. RBAC). This is an optional feature and, if enabled, the information pulled down is appended to the metadata associated with an element.
- WriteConfig: Any specific data needed to write to a destination connector. This does not have to be used if not needed.
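The snippet below is a conceptual sketch of the config-to-Click-options pattern described before the list above; the class and method names here are hypothetical and not the library's actual interfaces.

```python
# Conceptual sketch only: a plain config holds the values, and a CLI-side subclass
# knows how to expose those fields as Click options that a command can register.
from dataclasses import dataclass

import click


@dataclass
class ExampleReadConfig:
    re_download: bool = False
    download_dir: str = ""


class CliExampleReadConfig(ExampleReadConfig):
    @staticmethod
    def get_cli_options() -> list:
        # Map each config field to a Click option.
        return [
            click.Option(["--re-download"], is_flag=True, default=False),
            click.Option(["--download-dir"], type=str, default=""),
        ]
```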
For the flow of the pipeline, the only required steps are:
- Doc Factory: This creates instances of BaseIngestDoc, which provide references to a file on the source data provider without downloading anything yet.
- Source Node: This is responsible for downloading the content and producing a representation of that content suitable for partitioning.
- Partitioner: Responsible for running partition over the content produced by the previous source node.
Optional Steps:
- Reformat Nodes: Any number of reformat nodes can be set to modify the partitioned content. Currently chunking and embedding are supported.
- Write Node: If set, write the results to a destination via a destination connector.
Because there can be any number of reformat nodes, the final destination is not deterministic, so an extra step is added at the end of all reformat nodes to copy the final result to the location the user expects it to be when the pipeline ends.