Docker #360

Open
Querela opened this issue Jan 8, 2021 · 9 comments

Comments


Querela commented Jan 8, 2021

I wrote a Dockerfile for the current version(s). Maybe you want to look into it and integrate it here.
It works for me, but I only have some simple use cases (like API tests with python3), so I do not know how it performs under stress or whether users require more configuration options. (They could theoretically bind-mount other files if required.)

See Docker-Hub: https://hub.docker.com/r/ekoerner/heritrix

My Dockerfile (currently in private repository, so I can't provide any link, just the content here)

ARG java=11-jre

FROM openjdk:${java}

ARG version="3.4.0-20210923"
ARG contrib=0
ARG user="heritrix"
ARG userid=1000

LABEL version=${version}
LABEL contrib=${contrib}
LABEL user=${user}/$userid

# create user
RUN \
    groupadd -g $userid $user && \
    useradd -r -u $userid -g $user $user

# install other requirements (for contrib)
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            youtube-dl && \
        rm -rf /var/lib/apt/lists/* ; \
    fi

WORKDIR /opt

# download latest version according to:
#   https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20210923
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        wget -O heritrix-contrib-${version}-dist.tar.gz https://repo1.maven.org/maven2/org/archive/heritrix/heritrix-contrib/${version}/heritrix-contrib-${version}-dist.tar.gz && \
        tar xvfz heritrix-contrib-${version}-dist.tar.gz && \
        rm heritrix-contrib-${version}-dist.tar.gz && \
        mv heritrix-contrib-${version} heritrix ; \
    else \
        wget -O heritrix-${version}-dist.zip https://repo1.maven.org/maven2/org/archive/heritrix/heritrix/${version}/heritrix-${version}-dist.zip && \
        unzip heritrix-${version}-dist.zip && \
        rm heritrix-${version}-dist.zip && \
        mv heritrix-${version} heritrix ; \
    fi && \
    chmod u+x heritrix/bin/heritrix && \
    chown -R $user:$user /opt/heritrix

# create a run script to allow dynamic configuration of credentials at container start
RUN printf '%s\n' \
    '#!/bin/bash' \
    '' \
    '_JOBARGS="-b /"' \
    '' \
    '# set credentials (require both USERNAME and PASSWORD)' \
    '# -a "${USERNAME}:${PASSWORD}"' \
    'if [[ ! -z "$USERNAME" ]] && [[ ! -z "$PASSWORD" ]]; then' \
    '    echo "${USERNAME}:${PASSWORD}" > ${HERITRIX_HOME}/credentials.txt' \
    '    _JOBARGS="$_JOBARGS -a @${HERITRIX_HOME}/credentials.txt"' \
    'elif [[ ! -z "$CREDSFILE" ]]; then' \
    '    _JOBARGS="$_JOBARGS -a @${CREDSFILE}"' \
    'else' \
    '    >&2 echo "No USERNAME and/or PASSWORD environment var set!"' \
    'fi' \
    '' \
    '# check if -r mode' \
    'if [[ ! -z "$JOBNAME" ]]; then' \
    '    >&2 echo "Found JOBNAME envvar, just running job: $JOBNAME"' \
    '    _JOBARGS="$_JOBARGS -r $JOBNAME"' \
    '    if [ ! -f "/opt/heritrix/jobs/$JOBNAME/crawler-beans.cxml" ]; then' \
    '        >&2 echo "Did not find any '"'"'crawler-beans.cxml'"'"' for job '"'"'$JOBNAME'"'"'!"' \
    '    fi' \
    'fi' \
    '' \
    '# run' \
    'exec ${HERITRIX_HOME}/bin/heritrix $_JOBARGS' \
    '' \
    > heritrix.sh && \
    chmod +x heritrix.sh && \
    chown $user:$user heritrix.sh

WORKDIR /opt/heritrix

USER $user

ENV HERITRIX_HOME=/opt/heritrix
# let it run in the foreground, required for docker
ENV FOREGROUND=true

# standard webport
# NOTE: the web UI is only available via HTTPS!
EXPOSE 8443

CMD ["/opt/heritrix.sh"]

Build it:

docker build --build-arg version=3.4.0-20210923 -t heritrix .

Build heritrix-contrib (requires Java 8; with Java 11 (JRE or JDK) it fails with a JNI error, maybe related to #265?):

docker build --build-arg version=3.4.0-20210923 --build-arg contrib=1 --build-arg java=8-jre -t heritrix-contrib .

Example docker-compose.yml (also on DockerHub currently)

version: "3.7"
services:

  heritrix:
    build: .
    container_name: "heritrix"
    # TEST: keeps the container running without doing anything (for inspections)
    # entrypoint: bash -c 'while :; do :; done & kill -STOP $$! && wait $$!'
    # env_file: .env
    environment:
      - USERNAME=admin
      - PASSWORD=admin
      # optional jobname to run (will only run this single job and exit!)
      # - JOBNAME=myjob
      # - JAVA_OPTS=-Xmx1024M
    init: true
    ports:
      # if you want to use a .env file with `PORT=8443` for example
      # - ${PORT}:8443
      - 8443:8443
    restart: unless-stopped
    volumes:
      # where jobs will be stored
      - job-files:/opt/heritrix/jobs
      # or if JOBNAME envvar is used (mount just the single job folder)
      # jobfolder in the container needs to have the same name as in JOBNAME
      # - ./host_myjob:/opt/heritrix/jobs/myjob

volumes:
  job-files:

UPDATE: I added the -r <jobname> option to my image on Docker Hub. Simply set the JOBNAME=jobname environment variable to run the job jobname. Take care to mount the (preconfigured) job folder into the container, see above. This only works from version 3.4.0-20210803 onwards, see pull request #406.
UPDATE2: I added a contrib image that uses heritrix-contrib. For now it only includes youtube-dl as an extra dependency and it only works with the Java 8 JRE. The contrib image is only available from version 3.4.0-20210923.
UPDATE3: Added a custom user to make it a bit more secure (e.g., no package installs possible anymore). Note that -b / is required to make the web UI reachable from outside the container.
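
To check that the web UI is actually reachable after starting the container, something like the following could be used. It is only a sketch: it assumes the admin/admin credentials from the compose example, a self-signed certificate on the Heritrix side (hence curl's -k), and the /engine endpoint of the REST API on port 8443. The function name wait_for_heritrix is made up for illustration.

```shell
#!/bin/sh
# Poll the Heritrix REST endpoint until it answers or the attempts run out.
# Arguments (all optional): URL, number of attempts, delay in seconds.
wait_for_heritrix() {
    url="${1:-https://localhost:8443/engine}"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -k: accept the self-signed certificate (assumption, see above)
        # --anyauth: let curl negotiate the authentication scheme
        # -f: treat HTTP error responses as failure
        if curl -k -s -f -o /dev/null --max-time 5 --anyauth \
                -u "${USERNAME:-admin}:${PASSWORD:-admin}" "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Possible usage: `docker-compose up -d && wait_for_heritrix && echo "Heritrix is up"`.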


818S commented May 20, 2021

+1

Collaborator

ato commented Jul 7, 2021

Just noting that if anyone would like to see a Dockerfile merged, please submit it as a pull request and include the documentation/examples you feel appropriate. I'm willing to merge it and connect it to Docker Hub under the IIPC group, but I don't use Docker much myself, so you'll need to do the legwork and testing. :-)

Author

Querela commented Jul 7, 2021

I find myself unable to really stress-test my own Docker image. It works for some toy samples, but I'm not sure about more involved scenarios and how Docker handles them. Mine was more for short-term crawls with a low URL count. 😃
I also think the configuration handling can be improved a lot. In my use case I just needed the most basic things, but I saw use cases on the internet that did much more. So, I'm not sure whether my image would be a good "official" image.
(But I will still update my Docker Hub images with each new release here. And the code above is my most current version.)

Author

Querela commented Aug 13, 2021

I added the -r <jobname> flag to my image. This option is really nice and makes automation easier.
I updated the first comment of the issue.

Author

Querela commented Dec 10, 2021

So, after a request I added a heritrix-contrib Docker image (same Docker Hub URL, just the :contrib tag). But I had difficulties finding any documentation about the contrib stuff. I found the javadocs, but nowhere is it mentioned how to set it up, what other requirements there are (e.g. for the various extractors), and so on. I also found that it only worked with Java 8 and not with Java 11.

Now my Dockerfile has reached the point where it might make sense to create a pull request. What exactly would be required? I'm especially puzzled about tests: I can do some manual tests, but how would I do automated ones?

Collaborator

ato commented Dec 10, 2021

All I had in mind was a pull request that adds the Dockerfile itself and maybe a section named something like 'Running Heritrix under Docker' with some brief usage instructions to docs/operating.rst. By testing I just meant manually verifying that the instructions work, not automated tests. :-)

@Querela Querela mentioned this issue Dec 10, 2021
Author

Querela commented Dec 10, 2021

Ok. I'm working on it.
I did extract the entrypoint script outside, so it is a bit easier to edit. And a separate Dockerfile for the heritrix-contrib image.
And I added a Makefile to create the images.
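
Extracted into its own file, the run script could look roughly like the sketch below. It mirrors the logic of the inline script from the first comment (USERNAME/PASSWORD, CREDSFILE, JOBNAME, -b /) and is not necessarily identical to the file in the pull request. The function names heritrix_args and main are made up for illustration, and like the inline version it assumes that paths and job names contain no whitespace.

```shell
#!/bin/sh
# Print the heritrix command-line arguments, one per line,
# derived from the environment variables used by the image.
heritrix_args() {
    home="${HERITRIX_HOME:-/opt/heritrix}"
    # bind to all interfaces so the web UI is reachable from outside
    set -- -b /
    if [ -n "${USERNAME:-}" ] && [ -n "${PASSWORD:-}" ]; then
        set -- "$@" -a "@${home}/credentials.txt"
    elif [ -n "${CREDSFILE:-}" ]; then
        set -- "$@" -a "@${CREDSFILE}"
    else
        >&2 echo "No USERNAME and/or PASSWORD environment var set!"
    fi
    # optional: run a single job and exit (-r mode)
    if [ -n "${JOBNAME:-}" ]; then
        set -- "$@" -r "${JOBNAME}"
    fi
    printf '%s\n' "$@"
}

main() {
    home="${HERITRIX_HOME:-/opt/heritrix}"
    if [ -n "${USERNAME:-}" ] && [ -n "${PASSWORD:-}" ]; then
        # write the credentials file readable by the current user only
        ( umask 077; printf '%s:%s\n' "$USERNAME" "$PASSWORD" > "${home}/credentials.txt" )
    fi
    # word splitting is intended here (no whitespace in the values, see above)
    exec "${home}/bin/heritrix" $(heritrix_args)
}

# The last line of the actual entrypoint file would simply be: main
```

Keeping the argument assembly in a separate function makes it possible to check the resulting command line without starting Heritrix.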

I did not yet add a description on how to build the docker image. Would a README.md be enough in the docker folder or a wiki page (currently in my fork only)?
I would suggest running docker with the official images, so the image build process uses the maven releases and does not build from the sources again.

I found the following Docker Hub users:

Which should then also be used in the documentation. (instead of just heritrix)

Collaborator

ato commented Dec 12, 2021

Thanks. That looks great.

I've merged it and pushed the main and contrib images to iipc/heritrix. I had intended to automate this with the autobuilder but it seems the free tier of that has been discontinued. I'll look into alternative options but I guess it's not too difficult to build them manually after each release.

I used the IIPC Docker org because the Heritrix "interim" releases are currently maintained by some members of the IIPC community and several of us (including someone from IA) have access to that org.

Author

Querela commented Dec 13, 2021

I can take a look at using GH Actions. It seems to me that the tags correspond to the releases.
So, build the Docker image after a new tag is pushed or when a new release (tag) has been added. I think it should be possible to extract the current or latest tag to supply the build arg. Alternatively, manually update the default release number in the Dockerfile for each release.
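
Extracting the version from the pushed tag could be sketched like this. It assumes the workflow runs on tag pushes, where GitHub Actions sets GITHUB_REF to something like refs/tags/3.4.0-20210923, and that tag names equal the release versions. The function names and the iipc/heritrix:<version> image tag are only illustrative; the actual tagging scheme may differ.

```shell
#!/bin/sh
# Derive the Heritrix version from a Git ref such as "refs/tags/3.4.0-20210923".
# Fails (non-zero exit) for refs that are not tags.
version_from_ref() {
    case "$1" in
        refs/tags/*) printf '%s\n' "${1#refs/tags/}" ;;
        *) return 1 ;;
    esac
}

# Print (rather than run) the build command for the given ref.
build_command() {
    version="$(version_from_ref "$1")" || return 1
    printf 'docker build --build-arg version=%s -t iipc/heritrix:%s .\n' \
        "$version" "$version"
}
```

In a workflow step this might be used as `eval "$(build_command "$GITHUB_REF")"`, or the printed command could just be logged for a dry run.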

Then, we can probably also transfer all the old images from my hub account to the iipc one, if necessary?
I will later clear out my hub repo to remove confusion. But no concrete time plan yet.

And thanks for the IIPC explanation. :-)

As for the tags, I used the -jre suffix in case a -jdk base image is added later on, so that users can base their custom images on either one, depending on their requirements and the software they want to install.

Then, I also added the Docker wiki page. If anyone plans to rename it, please update the link in docker/README.md.
I updated wiki: HOWTO Ship a Heritrix Release.
