prepare version 2.0.0 (#759)

* prepare version 2.0.0
* update setup and wording
* docs: readme and structure
* update dependabot and funding
* update contributing and history files
adbar · Dec 3, 2024 · commit c6e8340 · 1 parent b7bfcc3

Showing 9 changed files with 85 additions and 111 deletions.
diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml
@@ -1,6 +1,6 @@
 # These are supported funding model platforms
 
-github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
+github: [adbar]
 patreon: # Replace with a single Patreon username
 open_collective: # Replace with a single Open Collective username
 ko_fi: adbarbaresi

diff --git a/.github/dependabot.yml b/.github/dependabot.yml
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,39 +1,41 @@
 ## How to contribute
 
-Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better.
+Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
 
 
 There are many ways to contribute, you could:
 
   * Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
-  * Find bugs and submit bug reports: Help making Trafilatura a robust and versatile tool.
+  * Find bugs and submit bug reports: Help make Trafilatura an even more robust tool.
   * Submit feature requests: Share your feedback and suggestions.
   * Write code: Fix bugs or add new features.
 
 
 Here are some important resources:
 
   * [List of currently open issues](https://github.com/adbar/trafilatura/issues) (no pretention to exhaustivity!)
-  * [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones)
-  * [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/)
+  * [How to contribute to open source](https://opensource.guide/how-to-contribute/)
 
 
-## Submitting changes
+## Testing and evaluating the code
 
-Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
+Here is how you can run the tests and code quality checks:
 
-**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
+- Install the necessary packages with `pip install trafilatura[dev]`
+- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
+- Run `mypy` on the directory: `mypy trafilatura/`
+- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark
 
+Pull requests will only be accepted if there are no errors in pytest and mypy.
 
-A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
+If you work on text extraction it is useful to check if performance is equal or better on the benchmark.
 
 
-## Testing and evaluating the code
+## Submitting changes
 
-Here is how you can run the tests if you wish to correct the errors and further improve the code:
+Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
 
-- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
-- See also the [tests Readme](tests/README.rst) for information on the evaluation
+**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
 
 
 

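The updated CONTRIBUTING.md asks that `pytest` and `mypy` both pass before a pull request is sent. One way to run the two checks together is a small helper script. The following is only a minimal sketch and not part of this commit: the names `CHECKS`, `run_checks` and `ready_for_pull_request` are hypothetical, and it assumes the development extras (`pip install trafilatura[dev]`) are installed and that it is run from the repository root.

```python
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple

# Commands mirror the ones named in CONTRIBUTING.md (pytest, mypy trafilatura/).
# The labels and this structure are illustrative assumptions.
CHECKS: Sequence[Tuple[str, List[str]]] = (
    ("pytest", [sys.executable, "-m", "pytest"]),
    ("mypy", [sys.executable, "-m", "mypy", "trafilatura/"]),
)


def run_checks(checks: Sequence[Tuple[str, List[str]]] = CHECKS) -> Dict[str, bool]:
    """Run each command and map its label to True if it exited with code 0."""
    results: Dict[str, bool] = {}
    for label, command in checks:
        completed = subprocess.run(command, check=False)
        results[label] = completed.returncode == 0
    return results


def ready_for_pull_request(results: Dict[str, bool]) -> bool:
    """A change is considered ready only if every check passed."""
    return all(results.values())


# Example use from the repository root:
#     outcome = run_checks()
#     sys.exit(0 if ready_for_pull_request(outcome) else 1)
```

Passing the command list as a parameter keeps the helper usable for a single test suite as well, for example `run_checks([("realworld", [sys.executable, "-m", "pytest", "realworld_tests.py"])])`.
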
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,7 +1,7 @@
 ## History / Changelog
 
 
-## future v2.0.0
+## 2.0.0
 
 Breaking changes:
 - Python 3.6 and 3.7 deprecated (#709)
@@ -12,6 +12,7 @@ Breaking changes:
 - downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
 - deprecated graphical user interface now removed (#713)
 - extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
+- use type hinting (#721, #723, #748)
 - see [Python](https://trafilatura.readthedocs.io/en/latest/usage-python.html#deprecations) and [CLI](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#deprecations) deprecations in the docs
 
 Fixes:
@@ -20,11 +21,16 @@ Fixes:
 - more robust mapping for conversion to HTML (#721)
 - CLI downloads: use all information in settings file (#734)
 - downloads: cleaner urllib3 code (#736)
-- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
+- refine table markdown output by @unsleepy22 (#752)
+- extraction fix: images in text nodes by @unsleepy22 (#757)
 
 Metadata:
 - more robust URL extraction (#710)
 
+Command-line interface:
+- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
+- CLI: add 126 exit code for high error ratio (#747)
+
 Maintenance:
 - remove already deprecated functions and args (#716)
 - add type hints (#723, #728)
@@ -33,10 +39,12 @@ Maintenance:
 - better debug messages in `main_extractor` (#714)
 - evaluation: review data, update packages, add magic_html (#731)
 - setup: explicit exports through `__all__` (#740)
+- tests: extend coverage (#753)
 
 Documentation:
 - fix link in `docs/index.html` by @nzw0301 (#711)
 - remove docs from published packages (#743)
+- update docs (#745)
 
 
 ## 1.12.2

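The changelog entry "CLI: add 126 exit code for high error ratio (#747)" describes a process-level signal: the command exits with 126 when too large a share of the inputs failed. The sketch below illustrates the idea only. The function name `exit_code_for` and the threshold value are assumptions made for illustration; the diff does not state the actual cut-off or how Trafilatura computes the ratio.

```python
HIGH_ERROR_EXIT_CODE = 126  # value named in the changelog entry (#747)
ERROR_RATIO_THRESHOLD = 0.99  # hypothetical cut-off, not taken from the source


def exit_code_for(
    errors: int, total: int, threshold: float = ERROR_RATIO_THRESHOLD
) -> int:
    """Return 126 when the share of failed inputs exceeds the threshold, else 0."""
    if total <= 0:
        # Nothing was processed, so there is no error ratio to report.
        return 0
    ratio = errors / total
    return HIGH_ERROR_EXIT_CODE if ratio > threshold else 0
```

A calling script could pass the result to `sys.exit()`, which lets shell pipelines distinguish a run where nearly everything failed from one with a few isolated errors.
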
diff --git a/README.md b/README.md
@@ -32,15 +32,16 @@ required, the output can be converted to commonly used formats.
 
 Going from HTML bulk to essential parts can alleviate many problems
 related to text quality, by **focusing on the actual content**,
-**avoiding the noise** caused by recurring elements (headers, footers
-etc.), and **making sense of the data** with selected information. The
-extractor is designed to be **robust and reasonably fast**, it runs in
-production on millions of documents.
+**avoiding the noise** caused by recurring elements like headers and footers
+and by **making sense of the data and metadata** with selected information.
+The extractor strikes a balance between limiting noise (precision) and
+including all valid parts (recall). It is **robust and reasonably fast**.
 
-The tool's versatility makes it **useful for quantitative and
-data-driven approaches**. It is used in the academic domain and beyond
-(e.g. in natural language processing, computational social science,
-search engine optimization, and information security).
+Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html)
+and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents)
+by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like
+the Allen Institute, Stanford, the Tokyo Institute of Technology, and
+the University of Munich.
 
 
 ### Features
@@ -85,22 +86,6 @@ For more information see the [benchmark section](https://trafilatura.readthedocs
 and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
 to run the evaluation with the latest data and packages.
 
-**750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8**
-
-| Python Package | Precision | Recall | Accuracy | F-Score | Diff. |
-|----------------|-----------|--------|----------|---------|-------|
-| html_text 0.5.2 | 0.529 | **0.958** | 0.554 | 0.682 | 2.2x |
-| inscriptis 2.2.0 (html to txt) | 0.534 | **0.959** | 0.563 | 0.686 | 3.5x |
-| newspaper3k 0.2.8 | 0.895 | 0.593 | 0.762 | 0.713 | 12x |
-| justext 3.0.0 (custom) | 0.865 | 0.650 | 0.775 | 0.742 | 5.2x |
-| boilerpy3 1.0.6 (article mode) | 0.814 | 0.744 | 0.787 | 0.777 | 4.1x |
-| *baseline (text markup)* | 0.757 | 0.827 | 0.781 | 0.790 | **1x** |
-| goose3 3.1.9 | **0.934** | 0.690 | 0.821 | 0.793 | 22x |
-| readability-lxml 0.8.1 | 0.891 | 0.729 | 0.820 | 0.801 | 5.8x |
-| news-please 1.5.22 | 0.898 | 0.734 | 0.826 | 0.808 | 61x |
-| readabilipy 0.2.0 | 0.877 | 0.870 | 0.874 | 0.874 | 248x |
-| trafilatura 1.2.2 (standard) | 0.914 | 0.904 | **0.910** | **0.909** | 7.1x |
-
 
 #### Other evaluations:
 
@@ -138,7 +123,7 @@ This package is distributed under the [Apache 2.0 license](https://www.apache.or
 Versions prior to v1.8.0 are under GPLv3+ license.
 
 
-## Contributing
+### Contributing
 
 Contributions of all kinds are welcome. Visit the [Contributing
 page](https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md)
@@ -152,13 +137,17 @@ who extended the docs or submitted bug reports, features and bugfixes!
 
 ## Context
 
-Developed with practical applications of academic research in mind, this
-software is part of a broader effort to derive information from web
-documents. Extracting and pre-processing web texts to the exacting
-standards of scientific research presents a substantial challenge. This
-software package simplifies text data collection and enhances corpus
-quality, it is currently used to build [text databases for linguistic
-research](https://www.dwds.de/d/k-web).
+This work started as a PhD project at the crossroads of linguistics and
+NLP, this expertise has been instrumental in shaping Trafilatura over
+the years. Initially launched to create text databases for research purposes
+at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
+this package continues to be maintained but its future development
+depends on community support.
+
+**If you value this software or depend on it for your product, consider
+sponsoring it and contributing to its codebase**. Your support will
+help maintain and enhance this popular package, ensuring its growth,
+robustness, and accessibility for developers and users around the world.
 
 *Trafilatura* is an Italian word for [wire
 drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the
@@ -171,11 +160,6 @@ Reach out via ia the software repository or the [contact
 page](https://adrien.barbaresi.eu/) for inquiries, collaborations, or
 feedback. See also social networks for the latest updates.
 
-This work started as a PhD project at the crossroads of linguistics and
-NLP, this expertise has been instrumental in shaping Trafilatura over
-the years. It has first been released under its current form in 2019,
-its development is referenced in the following publications:
-
 -   Barbaresi, A. [Trafilatura: A Web Scraping Library and Command-Line
     Tool for Text Discovery and
     Extraction](https://aclanthology.org/2021.acl-demo.15/), Proceedings
@@ -212,18 +196,13 @@ acquisition. Here is how to cite it:
 
 ### Software ecosystem
 
-Case studies and publications are listed on the [Used By documentation
-page](https://trafilatura.readthedocs.io/en/latest/used-by.html).
-
 Jointly developed plugins and additional packages also contribute to the
 field of web data extraction and analysis:
 
 <img alt="Software ecosystem" src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/software-ecosystem.png" align="center" width="65%"/>
 
 Corresponding posts can be found on [Bits of
-Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html). The
-blog covers a range of topics from technical how-tos, updates on new
-features, to discussions on text mining challenges and solutions.
+Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html).
 
 Impressive, you have reached the end of the page: Thank you for your
 interest!
diff --git a/docs/index.rst b/docs/index.rst
@@ -40,9 +40,9 @@ Description
 
 Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.
 
-Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**, it runs in production on millions of documents.
+Going from raw HTML to essential parts can alleviate many problems related to text quality, by avoiding the **noise caused by recurring elements** like headers and footers and by **making sense of the data and metadata** with selected information. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.
 
-This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
+Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.
 
 
 Features
@@ -120,25 +120,27 @@ Versions prior to v1.8.0 are under GPLv3+ license.
 
 
 Contributing
-------------
+~~~~~~~~~~~~
 
 Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.
 
 Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!
 
 
-Changes
--------
-
-For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.
-
-
 Context
 -------
 
-Originally released to collect data for linguistic research and lexicography at the `Berlin-Brandenburg Academy of Sciences <https://www.dwds.de/d/k-web>`_, Trafilatura is now `widely used <used-by.html>`_.
+This work started as a PhD project at the crossroads of linguistics and NLP,
+this expertise has been instrumental in shaping Trafilatura over the years. 
+Initially launched to create text databases for research purposes
+at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
+this package continues to be maintained but its future development
+depends on community support.
 
-Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as `tutorials <tutorials.html>`_ on how to gather web texts.
+**If you value this software or depend on it for your product, consider
+sponsoring it and contributing to its codebase**. Your support will
+help maintain and enhance this popular package, ensuring its growth,
+robustness, and accessibility for developers and users around the world.
 
 *Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.
 
@@ -148,9 +150,6 @@ Author
 
 Reach out via the software repository or the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also social networks for the latest updates.
 
-This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. It has first been released under its current form in 2019, its development is referenced in the following publications:
-
-
 - Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
 -  Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
 -  Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
@@ -186,16 +185,17 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
 Software ecosystem
 ~~~~~~~~~~~~~~~~~~
 
-Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.
-
 Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:
 
 .. image:: software-ecosystem.png
     :alt: Software ecosystem 
     :align: center
     :width: 65%
 
-Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).
+Corresponding posts can be found on
+`Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_.
+The blog covers a range of topics from technical how-tos, updates on new
+features, to discussions on text mining challenges and solutions.
 
 
 Building the docs
@@ -208,6 +208,13 @@ Starting from the ``docs/`` folder of the repository:
 
 
 
+Changes
+-------
+
+For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.
+
+
+
 Further documentation
 =====================
 
@@ -222,4 +229,4 @@ Further documentation
    used-by
    background
 
-* :ref:`genindex`
+:ref:`genindex`