This is the engine for the TüR (Tübingen Retrieval) project, built with Python, Flask, DuckDB, and lots of motivation.
- Python 3
- pip
- virtualenv
- Install Python 3: download and install the latest version of Python 3 from the official website.
- Install virtualenv:

  ```shell
  pip install virtualenv
  ```

- Create and activate a virtual environment:

  ```shell
  virtualenv --python=3.11 .venv
  source .venv/bin/activate
  ```

- Install the requirements:

  ```shell
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
  ```
To start the online pipeline, run:

```shell
python main.py --online
```
Important:
- The online pipeline will run until you stop it manually or it reaches the maximum number of sites.
- You can adapt the configuration in `main.py`. The crawler has a lot of options to configure.
- The online pipeline starts many threads, so it can be quite resource-intensive. You can limit the number of threads in the `main.py` file.
- You need a lot of RAM (~20 GB) for the offline pipeline.
- Have fun crawling the web!
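Limiting crawler threads usually comes down to bounding a thread pool. A minimal sketch of that pattern — the names `MAX_THREADS`, `MAX_SITES`, and `crawl_site` are illustrative assumptions, not the actual `main.py` API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration knobs; the real names in main.py may differ.
MAX_THREADS = 8    # upper bound on concurrent crawler threads
MAX_SITES = 100    # stop scheduling new sites after this many

def crawl_site(url):
    # Placeholder for the real fetching/indexing logic.
    return f"crawled {url}"

urls = [f"https://example.com/{i}" for i in range(MAX_SITES)]

# The executor never runs more than MAX_THREADS workers at once,
# which caps the pipeline's resource usage.
with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    results = list(pool.map(crawl_site, urls))

print(len(results))  # 100
```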
The server is built with Flask and runs on port 8000 by default. To start it, run:

```shell
python server.py
```

Open your browser and navigate to http://localhost:8000/. You can see a list of all available routes at http://localhost:8000/site-map.
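A `/site-map` route like that can be built by iterating Flask's `app.url_map`. A minimal sketch, assuming nothing about the real `server.py` beyond Flask and port 8000:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "TüR search"

@app.route("/site-map")
def site_map():
    # app.url_map holds every registered rule, including /site-map itself
    # and Flask's built-in /static/<path:filename> rule.
    routes = sorted(str(rule) for rule in app.url_map.iter_rules())
    return {"routes": routes}  # Flask serializes dicts to JSON

if __name__ == "__main__":
    app.run(port=8000)
```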
Important:
- The server will only work if you have crawled some pages before.
- Summarization needs a strong CPU and a lot of RAM, as it runs on the fly and can be quite resource-intensive.

The pipeline will not stop by itself, even if it has reached the maximum number of sites. You have to stop it manually by pressing Ctrl + C in the terminal, and it will resume from where it left off when you restart it.
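Resuming after Ctrl + C generally means persisting progress as you go, so a restart can skip work that is already done. A stdlib-only sketch of that pattern — the file name and loop are illustrative; TüR itself keeps its state in DuckDB:

```python
import json
import os

STATE_FILE = "crawl_state.json"  # illustrative; TüR stores state in DuckDB

def load_done():
    # Restore the set of already-crawled URLs from a previous run, if any.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def save_done(done):
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)

def crawl(urls):
    done = load_done()
    try:
        for url in urls:
            if url in done:
                continue  # already crawled on a previous run
            # ... fetch and index the page here ...
            done.add(url)
            save_done(done)  # persist after every page so Ctrl+C is safe
    except KeyboardInterrupt:
        save_done(done)  # flush state, then exit cleanly
    return done
```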
When the offline pipeline runs, it tries to finish completely before stopping. If you force-stop it, the pipeline will not save its state, because the state is only held in `crawlies.db.wal`.