feat: Add filter step (Unstructured-IO#7)
* Add filter step

* Fix cli

* Fix elasticsearch

* Bump version of docker compose

* install docker compose manually

* Install as root

* apt-get update before docker install command

* update to use self hosted CI image

* Add script to install docker compose

* Update docker compose install script

* Install docker as sudo

* Set missing variables

* Print version of docker compose

* Add docker compose installation as part of update fixture step

* Specify docker compose version

* Fix docker compose command

* Update ingest test fixtures (Unstructured-IO#10)

Co-authored-by: rbiseck3 <[email protected]>

* remove file used for testing

* Update e2e tests with new docker compose install script

* lint shell

* fix sqlite issue

* Generate filter step from cli inputs

* bugfix and add ingest test

* update s3 example file

* Remove glob in original connectors

* Add file size in local indexer

* Update ingest test fixtures (Unstructured-IO#11)

Co-authored-by: rbiseck3 <[email protected]>

---------

Co-authored-by: Unstructured-DevOps <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
3 people authored Aug 1, 2024
1 parent 0510b4f commit 889a4d5
Showing 117 changed files with 10,043 additions and 2,962 deletions.
14 changes: 6 additions & 8 deletions .github/workflows/e2e.yml
@@ -47,10 +47,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Test (end-to-end)
       env:
         AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -108,6 +104,8 @@ jobs:
         sudo apt-get install -y tesseract-ocr-kor
         sudo apt-get install diffstat
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-src.sh
@@ -132,10 +130,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Test (end-to-end)
       env:
         AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
@@ -164,6 +158,8 @@ jobs:
         DATABRICKS_USERNAME: ${{secrets.DATABRICKS_USERNAME}}
         DATABRICKS_PASSWORD: ${{secrets.DATABRICKS_PASSWORD}}
         DATABRICKS_CATALOG: ${{secrets.DATABRICKS_CATALOG}}
+        SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
+        SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
         TABLE_OCR: "tesseract"
         OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
         CI: "true"
@@ -178,4 +174,6 @@ jobs:
         sudo apt-get install -y tesseract-ocr-kor
         sudo apt-get install diffstat
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-dest.sh
6 changes: 2 additions & 4 deletions .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -38,10 +38,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ env.PYTHON_VERSION }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Update test fixtures
       env:
         AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -95,6 +91,8 @@ jobs:
         sudo apt-get install -y tesseract-ocr
         sudo apt-get install -y tesseract-ocr-kor
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-src.sh
     - name: Save branch name to environment file
5 changes: 2 additions & 3 deletions CHANGELOG.md
@@ -1,9 +1,10 @@
-## 0.0.2-dev1
+## 0.0.2-dev2

 ### Enhancements

 * **Use uuid for s3 identifiers** Update unique id to use uuid derived from file path rather than the filepath itself.
 * **V2 connectors precheck support** All steps in the v2 pipeline support an optional precheck call, which encompasses the previous check connection functionality.
+* **Filter Step** Support dedicated step as part of the pipeline to filter documents.

 ## 0.0.1

@@ -19,8 +20,6 @@

 ## 0.0.0

-### Enhancements
-
 ### Features

 * **Initial Migration** Create the structure of this repo from the original code in the [Unstructured](https://github.com/Unstructured-IO/unstructured) project.
4 changes: 4 additions & 0 deletions Makefile
@@ -42,6 +42,10 @@ install-all-deps:
 install-pandoc:
 	ARCH=${ARCH} ./scripts/install-pandoc.sh

+.PHONY: install-docker-compose
+install-docker-compose:
+	ARCH=${ARCH} ./scripts/install-docker-compose.sh
+
 .PHONY: install-ci
 install-ci: install-all-connectors install-all-embedders
 	pip install -r requirements/local_partition/pdf.txt
23 changes: 23 additions & 0 deletions scripts/install-docker-compose.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+DOCKER_ARCH=${ARCH}
+if [ "${ARCH}" = "x86_64" ]; then
+  TARGETARCH="amd64"
+elif [ "${ARCH}" = "arm64" ] || [ "${ARCH}" = "aarch64" ]; then
+  TARGETARCH="arm64"
+fi
+TARGETOS=linux
+DOCKER_VERSION=26.1.3
+BUILDX_VERSION=0.16.0
+DOCKER_COMPOSE_VERSION=2.28.1
+
+curl -fLo docker.tgz https://download.docker.com/${TARGETOS}/static/stable/"${DOCKER_ARCH}"/docker-${DOCKER_VERSION}.tgz
+tar zxvf docker.tgz
+rm -rf docker.tgz
+mkdir -p /usr/local/lib/docker/cli-plugins
+curl -fLo /usr/local/lib/docker/cli-plugins/docker-buildx "https://github.com/docker/buildx/releases/download/v${BUILDX_VERSION}/buildx-v${BUILDX_VERSION}.linux-${TARGETARCH}"
+chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx
+curl -SL https://github.com/docker/compose/releases/download/v${DOCKER_COMPOSE_VERSION}/docker-compose-${TARGETOS}-"${DOCKER_ARCH}" -o /usr/local/lib/docker/cli-plugins/docker-compose
+chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
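
The architecture handling in `scripts/install-docker-compose.sh` could be sketched in a more defensive form, as below. This is an illustration only and not part of the commit: `normalize_arch` is a hypothetical helper, and the sketch assumes that the Docker static archives and the Compose release assets are named with `x86_64`/`aarch64`, while the Buildx release assets use `amd64`/`arm64`. It rejects unrecognized values up front rather than leaving `TARGETARCH` unset.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not in the commit): print "<docker_arch> <plugin_arch>"
# for a uname-style architecture name, or fail for anything unrecognized.
normalize_arch() {
  case "$1" in
    x86_64 | amd64) echo "x86_64 amd64" ;;
    arm64 | aarch64) echo "aarch64 arm64" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}

# Example: split the pair into the two variable names the install script uses.
arch_pair=$(normalize_arch "aarch64")
DOCKER_ARCH=${arch_pair% *}
TARGETARCH=${arch_pair#* }
echo "DOCKER_ARCH=${DOCKER_ARCH} TARGETARCH=${TARGETARCH}"
```

In the install script, the argument would presumably come from `${ARCH}` as passed in by the Makefile target.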
2 changes: 1 addition & 1 deletion test_e2e/dest/elasticsearch.sh
@@ -19,7 +19,7 @@ source "$SCRIPT_DIR"/env_setup/elasticsearch/common/es-dest-ingest-test-creds.en
 function cleanup {
   # Index cleanup
   echo "Stopping Elasticsearch Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/elasticsearch/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/elasticsearch/common/docker-compose.yaml down --remove-orphans -v

   # Local file cleanup
   cleanup_dir "$WORK_DIR"
2 changes: 1 addition & 1 deletion test_e2e/dest/kafka-local.sh
@@ -25,7 +25,7 @@ function cleanup {
   cleanup_dir "$OUTPUT_DIR"

   echo "Stopping local Kafka instance"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/kafka/docker-compose.yml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/kafka/docker-compose.yml down --remove-orphans -v
 }

 trap cleanup EXIT
2 changes: 1 addition & 1 deletion test_e2e/dest/opensearch.sh
@@ -17,7 +17,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   # Index cleanup
   echo "Stopping OpenSearch Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/opensearch/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/opensearch/common/docker-compose.yaml down --remove-orphans -v

   # Local file cleanup
   cleanup_dir "$WORK_DIR"
2 changes: 1 addition & 1 deletion test_e2e/dest/pgvector.sh
@@ -17,7 +17,7 @@ DATABASE_TYPE="pgvector"
 source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   echo "Stopping SQL DB Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/sql/docker-compose-"$DATABASE_TYPE".yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/sql/docker-compose-"$DATABASE_TYPE".yaml down --remove-orphans -v
   # Local file cleanup
   cleanup_dir "$WORK_DIR"
   cleanup_dir "$OUTPUT_DIR"
2 changes: 1 addition & 1 deletion test_e2e/dest/weaviate.sh
@@ -17,7 +17,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   # Index cleanup
   echo "Stopping Weaviate Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/weaviate/docker-compose.yml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/weaviate/docker-compose.yml down --remove-orphans -v

   # Local file cleanup
   cleanup_dir "$WORK_DIR"
6 changes: 3 additions & 3 deletions test_e2e/env_setup/kafka/create-kafka-instance.sh
@@ -5,8 +5,8 @@ set -e
 SCRIPT_DIR=$(dirname "$(realpath "$0")")

 # Create the Weaviate instance
-docker-compose version
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml ps
+docker compose version
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml ps

 echo "Instance is live."
@@ -5,8 +5,8 @@ set -e
 SCRIPT_DIR=$(dirname "$(realpath "$0")")

 # Create the Weaviate instance
-docker-compose version
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml ps
+docker compose version
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml ps

 echo "Instance is live."
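
The switch from `docker-compose` to `docker compose` throughout these scripts assumes the Compose v2 CLI plugin is installed, which the new `install-docker-compose` target provides in CI. For machines where that may not hold, a fallback could be sketched roughly as follows. `compose_cmd` is a hypothetical helper that does not appear in this commit; it relies only on `docker compose version` and `command -v`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not in the commit): print the Compose invocation
# available on this machine, preferring the v2 plugin form.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "docker compose is not installed" >&2
    return 1
  fi
}

# Example: report which invocation would be used, without aborting the script.
if cmd=$(compose_cmd 2>/dev/null); then
  echo "using: ${cmd}"
else
  echo "no compose command found"
fi
```

The plugin form is checked first because it is the form this commit standardizes on; the standalone binary is treated only as a fallback.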