update docs: theme, text embeddings, used-by #447

Merged: 9 commits, Nov 28, 2023
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=7.2.6
pydata-sphinx-theme>=0.14.3
pydata-sphinx-theme>=0.14.4
docutils>=0.20.1
# without version specifier
trafilatura
14 changes: 11 additions & 3 deletions docs/tutorial-epsilla.rst
@@ -1,5 +1,5 @@
Text embedding
===============
Tutorial: Text embedding
========================

.. meta::
    :description lang=en:
@@ -28,7 +28,11 @@ In this tutorial, we will show you how to perform text embedding on results from
Setup Epsilla
------------------------------------------------

In this tutorial, we will run an Epsilla databse server. You can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.
In this tutorial, we will need an Epsilla database server. There are two ways to get one: use the free cloud version or start one locally.

Epsilla has a `cloud version <https://cloud.epsilla.com//?ref=trafilatura>`_ with a free tier. You can sign up and get a server running in a few steps.

Alternatively, you can start one locally with a `Docker <https://docs.docker.com/get-started/>`_ image.

.. code-block:: bash

@@ -37,6 +41,8 @@ In this tutorial, we will run an Epsilla databse server. You can start one local

See `Epsilla documentation <https://epsilla-inc.gitbook.io/epsilladb/quick-start>`_ for a full quick start guide.

The rest of this guide assumes you are running a local Epsilla server on port 8888. If you are using the cloud version, replace the host and port with the cloud server address.

We need to install the database client. You can do this with pip:

.. code-block:: bash
@@ -145,5 +151,7 @@ We can now perform a vector search to find the most relevant project based on a

You will see the returned response is React! That is the correct answer. React is a modern frontend library, but PyTorch and TensorFlow are not.

.. image:: https://static.scarf.sh/a.png?x-pxid=51f549d1-aabf-473c-b971-f8d9c3ac8ac5
    :alt:
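
The tutorial text added in this diff assumes a local Epsilla server on port 8888 and asks cloud users to substitute their own host and port. One way to keep that substitution in a single place is sketched below. The environment variable names `EPSILLA_HOST` and `EPSILLA_PORT` and the helper `epsilla_address` are hypothetical, chosen for illustration only; they are not part of the tutorial or of the Epsilla client.

```python
import os


def epsilla_address(env=None):
    """Return (host, port) for the Epsilla server.

    Falls back to the local defaults the tutorial assumes
    (localhost, port 8888) when nothing is configured.
    EPSILLA_HOST and EPSILLA_PORT are hypothetical variable names.
    """
    env = os.environ if env is None else env
    host = env.get("EPSILLA_HOST", "localhost")
    # The port is kept as a string, exactly as read from the environment.
    port = env.get("EPSILLA_PORT", "8888")
    return host, port


if __name__ == "__main__":
    host, port = epsilla_address()
    print(f"Connecting to Epsilla at {host}:{port}")
```

The returned pair is what you would hand to the database client when connecting, so switching between the local and cloud servers only requires setting the two variables.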


13 changes: 9 additions & 4 deletions docs/used-by.rst
@@ -18,11 +18,12 @@ Notable projects using this software
Known institutional users
^^^^^^^^^^^^^^^^^^^^^^^^^

- `Data against Feminicide <https://datoscontrafeminicidio.net/>`_
- `Kagi search engine <https://kagi.com/>`_ (notably Teclis component)
- `Media Cloud platform <https://mediacloud.org>`_ for media analysis
- The Internet Archive's `sandcrawler <https://github.com/internetarchive/sandcrawler>`_ which crawls and processes the scholarly web for the `Fatcat catalog <https://fatcat.wiki/>`_ of research publications
- Falcon LLM (TII UAE) and its underlying `RefinedWeb Dataset <https://arxiv.org/abs/2306.01116>`_
- `FinGPT <https://arxiv.org/abs/2311.05640>`_ (Finland)
- `Media Cloud platform <https://mediacloud.org>`_ for media analysis, e.g. `Data against Feminicide <https://datoscontrafeminicidio.net/>`_
- `SciencesPo médialab <https://medialab.sciencespo.fr>`_ through its `Minet <https://github.com/medialab/minet>`_ webmining package
- `Teclis component <https://teclis.com/>`_ of the Kagi search engine
- The Internet Archive's `sandcrawler <https://github.com/internetarchive/sandcrawler>`_ which crawls and processes the scholarly web for the `Fatcat catalog <https://fatcat.wiki/>`_ of research publications


Various software repositories
@@ -32,6 +33,7 @@ Various software repositories
- `CommonCrawl downloader <https://github.com/leogao2/commoncrawl_downloader>`_, to derive massive amounts of language data
- `GLAM Workbench <https://glam-workbench.github.io/web-archives/>`_ for cultural heritage (web archives section)
- `llama-hub <https://github.com/emptycrown/llama-hub>`_, a library of data loaders for large language models
- `LlamaIndex <https://github.com/run-llama/llama_index>`_, a data framework for LLM applications
- `Obsei <https://obsei.com/>`_, a text collection and analysis tool
- `Vulristics <https://github.com/leonov-av/vulristics>`_, a framework for analyzing publicly available information about vulnerabilities
- `Website-to-Chatbot <https://github.com/Anil-matcha/Website-to-Chatbot>`_, a personalized chatbot
@@ -114,6 +116,7 @@ Publications citing Trafilatura
- Brandon, C., Doherty, A. J., Kelly, D., Leddin, D., & Margaria, T. (2023). HIPPP: Health Information Portal for Patients and Public. Applied Sciences, 13(16), 9453.
- Braun, D. (2021). "Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts", PhD Thesis, Technische Universität München.
- Chen, X., Zeynali, A., Camargo, C., Flöck, F., Gaffney, D., Grabowicz, P., ... & Samory, M. (2022). SemEval-2022 Task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1094-1106).
- De Cesare, A. M. (2023). Assessing the quality of ChatGPT’s generated output in light of human-written texts: A corpus study based on textual parameters. CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 10, 179-210.
- Di Giovanni, M., Tasca, T., & Brambilla, M. (2022). DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1229-1234).
- Dumitru, V., Iorga, D., Ruseti, S., & Dascalu, M. (2023). Garbage in, garbage out: An analysis of HTML text extractors and their impact on NLP performance. In 2023 24th International Conference on Control Systems and Computer Science (CSCS) (pp. 403-410). IEEE.
- Fröbe, M., Hagen, M., Bevendorff, J., Völske, M., Stein, B., Schröder, C., ... & Potthast, M. (2021). "The Impact of Main Content Extraction on Near-Duplicate Detection". arXiv preprint arXiv:2111.10864.
@@ -131,6 +134,7 @@ Publications citing Trafilatura
- Kuehn, P., Schmidt, M., & Reuter, C. (2023). ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain. arXiv preprint arXiv:2304.11960.
- Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., ... & Pyysalo, S. (2020). "From Web Crawl to Clean Register-Annotated Corpora", Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).
- Laippala, V., Salmela, A., Rönnqvist, S., Aji, A. F., Chang, L. H., Dhifallah, A., ... & Pyysalo, S. (2022). Towards better structured and less noisy Web data: Oscar with Register annotations. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022) (pp. 215-221).
- Luukkonen, R., Komulainen, V., Luoma, J., Eskelinen, A., Kanerva, J., Kupari, H. M., ... & Pyysalo, S. (2023). FinGPT: Large Generative Models for a Small Language. arXiv preprint arXiv:2311.05640.
- Madrid-Morales, D. (2021). "Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries", Global Media and China, 6(2), 129-151.
- Meier-Vieracker, S. (2022). "Fußballwortschatz digital–Korpuslinguistische Ressourcen für den Sprachunterricht." Korpora Deutsch als Fremdsprache (KorDaF), 2022/01 (pre-print).
- Meng, K. (2021). "An End-to-End Computational System for Monitoring and Verifying Factual Claims" (pre-print).
@@ -140,6 +144,7 @@ Publications citing Trafilatura
- Öhman, J., Verlinden, S., Ekgren, A., Gyllensten, A. C., Isbister, T., Gogoulou, E., ... & Sahlgren, M. (2023). The Nordic Pile: A 1.2 TB Nordic Dataset for Language Modeling. arXiv preprint arXiv:2303.17183.
- Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Pannier, B., ... & Launay, J. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.
- Piskorski, J., Stefanovitch, N., Da San Martino, G., & Nakov, P. (2023). Semeval-2023 task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) (pp. 2343-2361).
- Pohlmann, J., Barbaresi, A., & Leinen, P. (2023). Platform regulation and “overblocking”–The NetzDG discourse in Germany. Communications, 48(3), 395-419.
- Robertson, F., Lagus, J., & Kajava, K. (2021). "A COVID-19 news coverage mood map of Europe", Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).
- Salmela, A. (2022). "Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks", Master's thesis, University of Turku.
- Sawczyn, A., Binkowski, J., Janiak, D., Augustyniak, Ł., & Kajdanowicz, T. (2021). "Fact-checking: relevance assessment of references in the Polish political domain", Procedia Computer Science, 192, 1285-1293.
1 change: 1 addition & 0 deletions tests/metadata_tests.py
@@ -192,6 +192,7 @@ def test_dates():
    mystring = '<html><body><p>Veröffentlicht am 1.9.17</p></body></html>'
    metadata = extract_metadata(mystring, fastmode=False)
    assert metadata.date == '2017-09-01'
    # behavior for fastmode=True changed in htmldate==1.6.0. On 1.5.2 and earlier, result was None
    metadata = extract_metadata(mystring, fastmode=True)
    assert metadata.date == '2017-09-01'
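
The comment added in this hunk notes that the fast-mode result depends on the installed htmldate version: `None` on 1.5.2 and earlier, `'2017-09-01'` from 1.6.0 on. If the test ever needs to pass against both, the expectation could be derived from the version string. The sketch below is only an illustration; `version_tuple` and `expected_fast_date` are hypothetical helpers and are not part of Trafilatura or htmldate.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '1.6.0' into (1, 6, 0).

    Only the first three numeric components are used, and missing
    components are padded with zeros, so '1.6' becomes (1, 6, 0).
    """
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def expected_fast_date(htmldate_version: str):
    """Expected metadata.date for the 'Veröffentlicht am 1.9.17' sample
    with fastmode=True, following the comment in the test above."""
    if version_tuple(htmldate_version) >= (1, 6, 0):
        return "2017-09-01"
    return None
```

In a real test the version string could be read with `importlib.metadata.version("htmldate")` from the standard library. The merged test instead asserts the 1.6.0 behavior directly, which is simpler if older htmldate releases no longer need to be supported.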
