Commit

docs: readme and structure
adbar committed Dec 2, 2024
1 parent 458700c commit 857d6fd
Showing 3 changed files with 16 additions and 20 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -39,8 +39,8 @@ including all valid parts (recall). It is **robust and reasonably fast**.

Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html)
and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents>)
by companies like HuggingFace and IBM as well as research centers at
the Allen Institute, Stanford, the Tokyo Institute of Technology, or
by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like
the Allen Institute, Stanford, the Tokyo Institute of Technology, and
the University of Munich.


@@ -123,7 +123,7 @@ This package is distributed under the [Apache 2.0 license](https://www.apache.or
Versions prior to v1.8.0 are under GPLv3+ license.


## Contributing
### Contributing

Contributions of all kinds are welcome. Visit the [Contributing
page](https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md)
19 changes: 10 additions & 9 deletions docs/index.rst
@@ -42,7 +42,7 @@ Trafilatura is a **Python package and command-line tool** designed to gather tex

Going from raw HTML to essential parts can alleviate many problems related to text quality, by avoiding the **noise caused by recurring elements** like headers and footers and by **making sense of the data and metadata** with selected information. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.

Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace and IBM as well as research centers at the Allen Institute, Stanford, the Tokyo Institute of Technology, or the University of Munich.
Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.


Features
@@ -120,19 +120,13 @@ Versions prior to v1.8.0 are under GPLv3+ license.


Contributing
------------
~~~~~~~~~~~~

Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!


Changes
-------

For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.


Context
-------

@@ -214,6 +208,13 @@ Starting from the ``docs/`` folder of the repository:



Changes
-------

For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.



Further documentation
=====================

@@ -228,4 +229,4 @@ Further documentation
used-by
background

* :ref:`genindex`
:ref:`genindex`
11 changes: 3 additions & 8 deletions docs/used-by.rst
@@ -6,7 +6,7 @@ Uses & citations
Trafilatura now widely used, integrated into other software packages and cited in research publications. Notable projects and institutional users are listed on this page.


Originally released to collect data for linguistic research and lexicography at the `Berlin-Brandenburg Academy of Sciences <https://www.dwds.de/d/k-web>`_, Trafilatura is used by numerous institutions, integrated into other software packages and cited in research publications across fields such as linguistics, natural language processing, computational social science, search engine optimization, information security, and artificial intelligence (large language models).
Initially released to collect data for linguistic research and lexicography at the Berlin-Brandenburg Academy of Sciences, Trafilatura is used by numerous institutions, integrated into other software packages and cited in research publications across fields such as linguistics, natural language processing, computational social science, search engine optimization, information security, and artificial intelligence (large language models).

The tool earns accolades as the most efficient open-source library in benchmarks and academic evaluations. It supports language modeling by providing high-quality text data, aids data mining with efficient web data retrieval, and streamlines information extraction from unstructured content. In SEO and business analytics it gathers online data for insights and in information security, it monitors websites for threat detection.

@@ -34,6 +34,8 @@ Companies and research centers
- Turku University, NLP department with `FinGPT <https://turkunlp.org/gpt3-finnish>`_ models
- University of Munich (LMU), Center for Language and Information Processing, `GlotWeb project <https://github.com/cisnlp/GlotWeb>`_

The Go port `go-trafilatura <https://github.com/markusmobius/go-trafilatura>`_ is used at Microsoft Research.


Various software repositories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -200,10 +202,3 @@ Publications citing Htmldate

See `citation page of htmldate's documentation <https://htmldate.readthedocs.io/en/latest/used-by.html>`_.



Ports
-----

Go port
`go-trafilatura <https://github.com/markusmobius/go-trafilatura>`_
