From 5abc6b367ec489f6fc8020cff615f7419b779ede Mon Sep 17 00:00:00 2001
From: benoit74
Date: Fri, 19 Jul 2024 14:02:24 +0000
Subject: [PATCH] Fix README and Dockerfile for imprecisions (#314)

---
 CHANGELOG.md |  4 ++++
 Dockerfile   |  1 +
 README.md    | 23 ++++++++---------------
 3 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3a6e1a4..f2fd66d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+
+- Fix README imprecisions + add back warc2zim availability in docker image (#314)
+
 ## [2.0.4] - 2024-07-15
 
 ### Changed
diff --git a/Dockerfile b/Dockerfile
index e0fa38e..d5a08e9 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -41,6 +41,7 @@ COPY *.md /src/
 # Install + cleanup
 RUN . /app/zimit/bin/activate && python -m pip install --no-cache-dir /src \
  && ln -s /app/zimit/bin/zimit /usr/bin/zimit \
+ && ln -s /app/zimit/bin/warc2zim /usr/bin/warc2zim \
  && chmod +x /usr/bin/zimit \
  && rm -rf /src
 
diff --git a/README.md b/README.md
index 7688793..ef08104 100644
--- a/README.md
+++ b/README.md
@@ -22,10 +22,9 @@ The system:
 
 The `zimit.py` is the entrypoint for the system.
 
-After the crawl is done, warc2zim is used to write a zim to the
-`/output` directory, which can be mounted as a volume.
+After the crawl is done, warc2zim is used to write a zim to the `/output` directory, which should be mounted as a volume to not loose the ZIM created when container stops.
 
-Using the `--keep` flag, the crawled WARCs will also be kept in a temp directory inside `/output`
+Using the `--keep` flag, the crawled WARCs and few other artifacts will also be kept in a temp directory inside `/output`
 
 Usage
 -----
@@ -40,30 +39,24 @@ docker build -t ghcr.io/openzim/zimit .
 
 The image accepts the following parameters, **as well as any of the [warc2zim](https://github.com/openzim/warc2zim) ones**; useful for setting metadata, for instance:
 
-- `--url URL` - the url to be crawled (required)
-- `--workers N` - number of crawl workers to be run in parallel
-- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
-- `--name` - Name of ZIM file (defaults to the hostname of the URL)
+- Required: `--url URL` - the url to be crawled
+- Required: `--name` - Name of ZIM file
 - `--output` - output directory (defaults to `/output`)
 - `--limit U` - Limit capture to at most U URLs
+- `--behaviors` - Control which browsertrix behaviors are ran (defaults to `autoplay,autofetch,siteSpecific`, adding `autoscroll` to the list is possible to automatically scroll the pages and fetch resources which are lazy loaded)
 - `--exclude ` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--exclude="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
-- `--scroll [N]` - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds
+- `--workers N` - number of crawl workers to be run in parallel
+- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
 - `--keep` - if set, keep the WARC files in a temp directory inside the output directory
 
-The following is an example usage. The `--shm-size` flags is [needed to run Chrome in Docker](https://github.com/puppeteer/puppeteer/blob/v1.0.0/docs/troubleshooting.md#tips).
-
 Example command:
 
 ```bash
 docker run ghcr.io/openzim/zimit zimit --help
 docker run ghcr.io/openzim/zimit warc2zim --help
-docker run -v /output:/output \
- --shm-size=1gb ghcr.io/openzim/zimit zimit --url URL --name myzimfile --workers 2 --waitUntil domcontentloaded
+docker run -v /output:/output ghcr.io/openzim/zimit zimit --url URL --name myzimfile
 ```
 
-The puppeteer-cluster provides monitoring output which is enabled by
-default and prints the crawl status to the Docker log.
-
 **Note**: Image automatically filters out a large number of ads by using the 3 blocklists from [anudeepND](https://github.com/anudeepND/blacklist). If you don't want this filtering, disable the image's entrypoint in your container (`docker run --entrypoint="" ghcr.io/openzim/zimit ...`).
 
 Nota bene