# Repository for the ECIR21 Reproducibility Paper "An Empirical Comparison of Web Page Segmentation Algorithms"

Paper by Johannes Kiesel, Lars Meyer, Florian Kneist, Benno Stein, and Martin Potthast.
This repository enables you to reproduce the experiments from the paper, and also to run the segmentation algorithms on new data.
Outline:
- Common Preparations: necessary setup steps
- Algorithms: running each segmentation algorithm on the Webis-WebSeg-20: Baseline, VIPS, HEPS, Cormier et al., MMDetection, Meier et al., Ensemble
- Evaluation: evaluating the segmentations
- Plotting Segmentations: visually checking segmentations
## Common Preparations

- Check out this repository.
- If not done already, get the source code of the evaluation framework paper, extract it next to this README, and rename the extracted directory (`cikm20-web-page-...`) to `cikm20`.
- Make sure your system fulfills all the requirements of the evaluation framework.
- If it does not exist yet, create the directory `segmentations` next to this README.
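For reference, a minimal shell sketch of the last two steps (the framework's directory name is truncated here just as in the step above, so substitute the actual name):

```shell
# Rename the extracted evaluation framework directory (substitute the full name).
mv cikm20-web-page-<rest-of-name> cikm20
# Create the directory that will hold all produced segmentations.
mkdir -p segmentations
```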
## Algorithms

We describe here how to get the code and how to run each algorithm on one page, so that it produces a segmentation in the common format (a JSON file in the `segmentations` directory) which can then be used in the evaluation.

The instructions use the page with ID 000000 so that they work both with the sample ZIP archives, `webis-webseg-20-000000.zip` and `webis-web-archive-17-000000.zip`, and with the full datasets of segmentations and archives. For the sample ZIP archives, download them from the respective full dataset pages, extract them next to this README, and rename the extracted directories by removing the `-000000` suffix. If you download and extract the full datasets, they already have the correct names. Then follow the instructions below.
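For orientation, a segmentation file in the common format looks roughly like the sketch below: one or more named segmentations per page, where each segment is a multipolygon (a list of polygons, each a list of [x, y] pixel coordinates on the screenshot). The nesting and the values shown here are illustrative only; consult a `ground-truth.json` from Webis-WebSeg-20 for the authoritative structure.

```json
{
  "id": "000000",
  "width": 1366,
  "height": 4096,
  "segmentations": {
    "baseline": [
      [[[0, 0], [1366, 0], [1366, 4096], [0, 4096]]]
    ]
  }
}
```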
### Baseline

The baseline creates a single segment that contains the entire web page.
- In a shell, go to the directory that contains this README.
```shell
Rscript algorithms/baseline/src/main/r/segmentation-baseline.R \
  --input webis-webseg-20/000000/screenshot.png \
  --output segmentations/baseline.json
```
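If you have the full dataset, a simple loop runs the baseline on every page; the per-page output file name below is our own convention, not prescribed by the repository:

```shell
# Hypothetical convenience loop: segment every page of the dataset.
for page in webis-webseg-20/*/; do
  id=$(basename "$page")
  Rscript algorithms/baseline/src/main/r/segmentation-baseline.R \
    --input "${page}screenshot.png" \
    --output "segmentations/baseline-${id}.json"
done
```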
### VIPS

We use a TypeScript port of Tomáš Popela's vips_java, transpiled to JavaScript. We thank the original author for providing his implementation.

The implementation is in `vips.js`. This file is loaded into the webis-web-archiver to run on web pages that are reproduced from web archives. If needed, you can use `compile.sh` to re-compile the Java part that controls the browser and executes the VIPS JavaScript (re-compilation requires a Java 8 JDK or above).
- Install Docker.
- In a shell, go to the directory that contains this README.
You can find the URL corresponding to an archive of the webis-web-archive-17 in `sites-and-pages.txt`. Note that the Docker image may take quite some time to download the first time you run it.
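For example, to look up the URL for page 000000 (assuming the page ID and its URL share a line, and that the file sits at the top level of the extracted archive directory; the exact layout may differ):

```shell
grep 000000 webis-web-archive-17/sites-and-pages.txt
```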
```shell
# Execute VIPS while reproducing the web page from the archive
./algorithms/vips/vips.sh \
  --archive webis-web-archive-17/pages/000000/ \
  --pdoc 5 \
  --url "http://008345152.blog.fc2.com/blog-date-201305.html" \
  --id 000000 \
  --output segmentations

# Convert the hierarchical segmentation to a flat one
Rscript cikm20/src/main/r/flatten-segmentations.R \
  --input segmentations/vips.json \
  --output segmentations/vips-flattened.json
```
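In the command above, `--pdoc` sets VIPS' predefined degree of coherence; roughly speaking, a higher value demands more visually coherent blocks and thus yields a finer-grained segmentation.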
### HEPS

We use a slightly modified version of Manabe et al.'s HEPS implementation that outputs bounding box coordinates instead of text segments. We thank the original authors for providing their implementation.

The implementation is in `heps.js`. This file is loaded into the webis-web-archiver to run on web pages that are reproduced from web archives. If needed, you can use `compile.sh` to re-compile the Java part that controls the browser and executes the HEPS JavaScript (re-compilation requires a Java 8 JDK or above).
- Install Docker.
- In a shell, go to the directory that contains this README.
You can find the URL corresponding to an archive of the webis-web-archive-17 in `sites-and-pages.txt` (see the `grep` example in the VIPS section). Note that the Docker image may take quite some time to download the first time you run it.
```shell
# Execute HEPS while reproducing the web page from the archive
./algorithms/heps/heps.sh \
  --archive webis-web-archive-17/pages/000000/ \
  --url "http://008345152.blog.fc2.com/blog-date-201305.html" \
  --id 000000 \
  --output segmentations
```
### Cormier et al.

We use a Python implementation graciously provided by Michael Cormier and Zhuofu Tao, to whom we express our gratitude. You may adjust the `min_l` and `line_length` parameters in `cormier.py`.
- Install Python 3 (e.g., for Debian/Ubuntu: `sudo apt install python3`).
- Install `pip3` (e.g., for Debian/Ubuntu: `sudo apt install python3-pip`).
- In a shell, go to the directory that contains this README.
- Install the required Python packages: `pip3 install -r algorithms/cormier/requirements.txt`.
```shell
python3 algorithms/cormier/cormier.py \
  --image webis-webseg-20/000000/screenshot.png \
  --id 000000 \
  --output segmentations
```
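The resulting file should appear as `segmentations/cormier.json`, which is the name the Ensemble step below expects.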
### MMDetection

We use the original implementation provided by the authors. The provided inference scripts are suitable for use with Nvidia GPUs. By default, the container uses the first available GPU.
- Install Docker.
- Install the Nvidia Container Toolkit.
```shell
# Infer segments for the page with ID 000000 (use 'infer.py' to segment all)
nvidia-docker run -it \
  -v ${PWD}/webis-webseg-20/:/pages \
  -v ${PWD}/segmentations/mmdetection:/out \
  ghcr.io/webis-de/mmdetection19-web-page-segmentation:1.0.0 \
  python infer_single.py 000000

# Fit segments
Rscript cikm20/src/main/r/fit-segmentations-to-dom-nodes.R \
  --input segmentations/mmdetection/000000.json \
  --segmentations mmdetection_segms \
  --nodes webis-webseg-20/000000/nodes.csv \
  --output segmentations/mmdetection.json

# Rename segmentation
sed -i 's/mmdetection_segms.fitted/mmdetection/' segmentations/mmdetection.json

# Convert hierarchical segmentation to a flat one
Rscript cikm20/src/main/r/flatten-segmentations.R \
  --input segmentations/mmdetection.json \
  --output segmentations/mmdetection-flattened.json
```
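The `sed` call renames the segmentation key inside the JSON file from `mmdetection_segms.fitted` (as produced by the fitting script) to plain `mmdetection`.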
### Meier et al.

The neural network is implemented in Keras using the TensorFlow backend. We provide a Docker container that can be used to train the model and perform inference with Nvidia GPUs. By default, the container uses the first available GPU.
- Install Docker.
- Install the Nvidia Container Toolkit.
- In a shell, go to the directory that contains this README.
The algorithm expects a specific directory structure for training and testing, which is set up as follows:

- Download `webis-webseg-20-folds.txt` and `webis-webseg-20-4096px.zip` from Zenodo and extract the ZIP file.
- Use `./algorithms/meier/setup-directories.sh <path/to/extracted/webis-webseg-20> <path/to/webis-webseg-20-folds.txt>`. The directory structure will be created in `./webis-webseg-20-meier`.
- Download `webis-webseg-20-meier-models.zip` from Zenodo and extract it into the created `./webis-webseg-20-meier` directory.
Instructions to create the input files and to train the models are provided in our README for the algorithm.
```shell
# Run the algorithm on all screenshots of a fold, resize the output to the
# original size, and extract the segments from the masks.
gpu=0  # The ID of the GPU to use
fold=0 # Do this for each integer from 0 to 9
sudo nvidia-docker run \
  -it --rm -u $(id -u):$(id -g) \
  --env NVIDIA_VISIBLE_DEVICES=$gpu \
  --env KERAS_BACKEND=tensorflow \
  -v ${PWD}/webis-webseg-20-meier/:/src/workspace/data \
  ghcr.io/webis-de/meier17-web-page-segmentation:1.0.4 \
  ./test.sh \
  ../data/input/test/ \
  $fold \
  ../data/webis-webseg-20-meier-models/model-fold$fold-weights.h5 \
  ../data/segmentations-fold$fold
```
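Since `./webis-webseg-20-meier` is mounted at `/src/workspace/data` inside the container, the resulting segmentations end up in `./webis-webseg-20-meier/segmentations-fold$fold` on the host.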
### Ensemble

The ensemble combines the segmentations of VIPS, HEPS, Cormier et al., and MMDetection using the segmentation fusion algorithm of the evaluation framework.
- If one of these files is missing in `segmentations`, run the corresponding algorithm as described above: `vips.json`, `heps.json`, `cormier.json`, and `mmdetection.json`.
- In a shell, go to the directory that contains this README.
```shell
# Create one segmentation file that contains the VIPS, HEPS, Cormier et al.,
# and MMDetection segmentations
Rscript cikm20/src/main/r/combine-segmentation-files.R \
  segmentations/vips.json \
  segmentations/heps.json \
  segmentations/cormier.json \
  segmentations/mmdetection.json \
  segmentations/all.json

# Create the ensemble segmentation for Min-vote@2
# Note that the implementation uses a *dis*agreement threshold, so pass
# 1 - theta_s from the paper (here, theta_s = 0.375 gives 0.625)!
Rscript cikm20/src/main/r/fuse-segmentations.R \
  --input segmentations/all.json \
  --segments-min-annotators 2 \
  --size-function pixels \
  --disagreement-threshold 0.625 \
  --output segmentations/ensemble.json
```
## Evaluation

The evaluation is exemplified here for the `baseline` algorithm and for `pixels` as atomic elements (the other options are `edges-fine`, `edges-coarse`, `nodes`, and `chars`).

- The segmentation of the algorithm should be contained in a JSON file `segmentations/baseline.json`. If it is not, run the algorithm as described above.
- If it does not exist yet, create the directory `results` next to this README.
- In a shell, go to the directory that contains this README.
```shell
# Get BCubed precision, recall, and F-measure
Rscript cikm20/src/main/r/evaluate-segmentation.R \
  --algorithm segmentations/baseline.json \
  --ground-truth webis-webseg-20/000000/ground-truth.json \
  --size-function pixels \
  --output results/baseline-pixels.csv
```
The agreement of two algorithms is calculated the same way, but with the segmentation of the second algorithm as the "ground-truth".
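For example, to compute the agreement of VIPS and HEPS on the sample page (the output file name is our own choice):

```shell
# Evaluate one algorithm's segmentation against the other's as "ground truth"
Rscript cikm20/src/main/r/evaluate-segmentation.R \
  --algorithm segmentations/vips-flattened.json \
  --ground-truth segmentations/heps.json \
  --size-function pixels \
  --output results/vips-heps-agreement-pixels.csv
```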
## Plotting Segmentations

```shell
Rscript cikm20/src/main/r/plot-segmentations.R \
  --input <path/to/segmentation>.json \
  --color-per-segment \
  --output <path/to/output-image>.png
```
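For example, to plot the baseline segmentation produced above (the output path is our own choice):

```shell
Rscript cikm20/src/main/r/plot-segmentations.R \
  --input segmentations/baseline.json \
  --color-per-segment \
  --output results/baseline.png
```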