This repository implements the following linguistic processing steps:
- POS tagging
- NER tagging
- improved lemmatization
We do this for the following languages:
- fr
- de
- lb (only POS tagging)
- en
The Luxembourgish language model is taken from https://github.com/PeterGilles/Luxembourgish-language-resources/tree/master/lux-tagger-July2023/model-best und unknown licencing status. We set the version and name in the model's meta data to lux-tagger-July2023.
The build process has been tested on modern Linux and macOS systems and requires Python 3.11. Under Ubuntu/Debian , make sure to have the following packages installed:
# install python3.11 according to your OS
sudo apt update
sudo apt upgrade -y
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.11 -y
sudo apt install python3.11-distutils -y
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
sudo apt install git git-lfs make moreutils coreutils parallel # needed for building
sudo apt jq # needed for computing statistics
This repository uses pipenv
.
git clone https://github.com/impresso/impresso-linguistic-processing.git
cd impresso-linguistic-processing
python3.11 -mpip install pipenv
python3.11 -mpipenv install
python3.11 -mpipenv shell
For s3-based file processing, the following environment variables need to be set:
SE_ACCESS_KEY=
SE_SECRET_KEY=
SE_HOST_URL=
If your global environment does not contain these variables, you can set them in a local
.env
file. The python-dotenv
package is used to read these variables.
cp env.sample .env
edit .env
Adapt the local paths for the input and output directories in the
config.local.mk
(see config.local.mk.sample
for default settings.)
cp config.local.mk.sample config.local.mk
edit config.local.mk
The build process is controlled by the Makefile
.
make help # show available targets
make newspaper -j N # process specific newspaper/year pairs in parallel typically for testing
make collection MAKE_PARALLEL_OPTION=16 # process all newspapers using parallel processing within newspaper/year pairs
The spacy_linguistic_processing.py
script supports several command-line options:
--lid
: Path to the language identification file.--language
: Specify a language code to use for all items.-o
,--output-path
: Path to the output file (default:out.jsonl
).--min-doc-length
: Minimum document length to process (default: 50).--validate
: Validate the final language identification JSON against the schema.--text-property
: Specify the JSON property that contains the full text (default:ft
).--git-version
: Set the git version to include in the output. If not set, theGIT_VERSION
environment variable is used.--quit-if-s3-output-exists
: Quit if the output file already exists in the specified S3 bucket.--s3-output-path
: S3 path to upload the output file after processing or check if it already exists.--keep-timestamp-only
: After uploading to S3, keep only the timestamp of the local output file for data efficiency.
Ensure that the environment variables SE_ACCESS_KEY
and SE_SECRET_KEY
for access to the
S3 impresso infrastructure are set, e.g., by setting them in a local .env
file.
The build process uploads the processed data to the impresso S3 bucket.
Release notes:
- 2024-11-30: v1-0-4
- note: no change to spaCy pipelines and output content
- fix: upload to s3 was not compressed. This has been fixed.
- feat: separate s3 compression script to carefully compress uncompressed files on s3
- chore: small improvements
- 2024-11-27: v1-0-3
- chore: improve logging and add length limit for input text
- 2024-11-25: v1-0-1
- fix: POS tagging of lb was buggy (all tags set to X). This has been fixed.
- feat: Generate log files for each newspaper/year pair and upload it to s3.
- feat: Support agreed nameing convention for output files.
- feat: Process directly from s3 input data, on-the-fly mirroring per newspaper for slim builds
- note: no change to spaCy pipelines apart from lb POS tag mapping
- 2024-04-24: v1-0-0
- First public release of the impresso linguistic processing pipeline.