update docs #437

Merged · 3 commits · Nov 8, 2023
2 changes: 1 addition & 1 deletion docs/crawls.rst
@@ -114,7 +114,7 @@ On the CLI the crawler automatically works its way through a website, stopping a

$ trafilatura --crawl "https://www.example.org" > links.txt

- It can also crawl websites in parallel by reading a list of target sites from a file (``-i``/``--inputfile`` option).
+ It can also crawl websites in parallel by reading a list of target sites from a file (``-i``/``--input-file`` option).
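For instance, a parallel crawl fed from a link list might look like this (a sketch assuming ``--crawl`` combines with ``--input-file`` as the sentence above describes; ``list.txt`` is a hypothetical file with one homepage per line):

    $ trafilatura --crawl --input-file list.txt > links.txt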

.. note::
    The ``--list`` option does not apply here: unlike with the ``--sitemap`` or ``--feed`` options, the URLs are simply returned as a list instead of being retrieved and processed, which gives you a chance to examine the collected URLs before triggering further downloads.
3 changes: 3 additions & 0 deletions docs/installation.rst
@@ -61,6 +61,9 @@ This project is under active development, please make sure you keep it up-to-dat

On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors <https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error/42334357>`_.
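A minimal sketch of this fix on the command line, assuming the standard python.org installer layout (``3.X`` is kept as a placeholder for your actual Python version):

    $ pip install certifi
    $ open "/Applications/Python 3.X/Install Certificates.command"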

+ .. hint::
+     Installation on macOS is generally easier with `brew <https://formulae.brew.sh/formula/trafilatura>`_.
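A one-line sketch of the brew route, with the formula name taken from the link above:

    $ brew install trafilatura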


Older Python versions
~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=7.2.6
- pydata-sphinx-theme>=0.14.1
+ pydata-sphinx-theme>=0.14.3
docutils>=0.20.1
# without version specifier
trafilatura
2 changes: 1 addition & 1 deletion docs/tutorial-dwds.rst
@@ -114,7 +114,7 @@ Diese Linkliste kann zunächst gefiltert werden, um deutschsprachige, inhaltsrei

*Trafilatura* produces output in two ways: the extracted texts (TXT format) in the ``ausgabe`` directory, and a copy of the downloaded web pages under ``html-quellen`` (for archiving and, if needed, reprocessing):

- ``trafilatura --inputfile linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
+ ``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``

This outputs TXT files without metadata. If you add ``--csv``, ``--json``, ``--xml`` or ``--xmltei``, metadata are included and the corresponding output format is selected. Additional options are available; see the relevant documentation pages.
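For instance, a hypothetical variant of the command above that also includes metadata and writes XML output:

``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/ --xml``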

2 changes: 1 addition & 1 deletion docs/tutorial-epsilla.rst
@@ -19,7 +19,7 @@ Text embedding involves converting text into numerical vectors, and is commonly
- Anomaly detection (identify outliers)

In this tutorial, we will show you how to perform text embedding on results from Trafilatura. We will use
- `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular relational databases for vector operations.
+ `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular vector databases for vector operations.

.. note::
For a hands-on version of this tutorial, try out the `Colab Notebook <https://colab.research.google.com/drive/1eFHO0dHyPhEF9Sm_HXcMFmJZnvP9a-aX?usp=sharing>`_.
6 changes: 3 additions & 3 deletions docs/tutorial0.rst
@@ -171,8 +171,8 @@ Seamless download and processing

Two major command line arguments are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results
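Put together, a minimal invocation could look like this (a sketch with hypothetical file and directory names):

    $ trafilatura --input-file mylist.txt --output-dir txtfiles/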

An additional argument can be useful in this context:

@@ -213,6 +213,6 @@ Alternatively, you can download a series of web documents with generic command-l
# download if necessary
$ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
# process a directory with archived HTML files
- $ trafilatura --inputdir download/ --outputdir corpus/ --xmltei --nocomments
+ $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments


4 changes: 2 additions & 2 deletions docs/tutorial1.rst
@@ -26,8 +26,8 @@ For the collection and filtering of links see `this tutorial <tutorial0.html>`_

Two major options are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results

The input list is read sequentially: only lines beginning with a valid URL are taken into account, and any other information contained in the file is discarded.
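As an illustration, a hypothetical input file could look like this; the second line does not begin with a valid URL and would therefore be discarded:

    https://www.example.org/page1
    this line is ignored
    https://www.example.org/page2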
