
Commit

prepare v1.6.3 (setup and docs) (#448)
adbar authored Nov 29, 2023
1 parent e7b5723 commit aabfdec
Showing 6 changed files with 31 additions and 5 deletions.
20 changes: 20 additions & 0 deletions HISTORY.md
@@ -1,6 +1,26 @@
## History / Changelog


### 1.6.3

Extraction:
- preserve space in certain elements with @idoshamun (#429)
- optional list of XPath expressions to prune by @HeLehm (#414)

Metadata:
- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
- new `htmldate` extensive search parameter in config (#434)
- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))

Navigation:
- reviewed code for feeds (#443)
- new config option: external URLs for feeds/sitemaps (#441)

Documentation:
- update, add page on text embeddings with @tonyyanga (#428, #435, #447)
- fix quickstart by @sashkab (#419)


### 1.6.2

Extraction:
6 changes: 5 additions & 1 deletion docs/settings.rst
@@ -35,10 +35,14 @@ The default file included in the package is `settings.cfg <https://github.com/ad
* ``MIN_EXTRACTED_SIZE = 250`` acceptable size in characters (used to trigger fallbacks)
* ``MIN_OUTPUT_SIZE = 1`` absolute acceptable minimum for main text output
* ``MIN_EXTRACTED_COMM_SIZE`` and ``MIN_OUTPUT_COMM_SIZE`` work the same for comment extraction
* ``EXTRACTION_TIMEOUT = 30`` drop extraction after 30 seconds to prevent malicious HTML bombs, set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
* ``EXTRACTION_TIMEOUT = 30`` now only affects processing on the command-line: drop extraction after 30 seconds to prevent malicious HTML bombs. Set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
- Deduplication (not active by default)
* ``MIN_DUPLCHECK_SIZE = 100`` minimum size in characters to run deduplication on
* ``MAX_REPETITIONS = 2`` maximum number of duplicates allowed
- Metadata
* ``EXTENSIVE_DATE_SEARCH = on`` set to ``off`` to deactivate ``htmldate``'s opportunistic search (lower recall, higher precision)
- Navigation
* ``EXTERNAL_URLS = off`` do not take URLs from other websites in feeds and sitemaps (CLI mode)
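
As a minimal sketch of how the options listed above fit together in a custom settings file, the snippet below parses a small configuration with Python's standard `configparser` module. The `[DEFAULT]` section name follows the bundled `settings.cfg`; the sample values and the variable names are illustrative assumptions, and this parsing code is not Trafilatura's own loader.

```python
from configparser import ConfigParser

# Illustrative settings text: option names come from the list above,
# the values are examples only.
CUSTOM_SETTINGS = """
[DEFAULT]
MIN_EXTRACTED_SIZE = 250
MIN_OUTPUT_SIZE = 1
EXTRACTION_TIMEOUT = 0
MIN_DUPLCHECK_SIZE = 100
MAX_REPETITIONS = 2
EXTENSIVE_DATE_SEARCH = off
EXTERNAL_URLS = off
"""

config = ConfigParser()
config.read_string(CUSTOM_SETTINGS)
defaults = config["DEFAULT"]

# Numeric options are read as integers, on/off switches as booleans.
timeout = defaults.getint("EXTRACTION_TIMEOUT")
extensive_dates = defaults.getboolean("EXTENSIVE_DATE_SEARCH")
external_urls = defaults.getboolean("EXTERNAL_URLS")

print(timeout, extensive_dates, external_urls)
```

`configparser` reads `on` and `off` as booleans, which is why the switches above need no extra conversion.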


Using a custom file on the command-line
2 changes: 1 addition & 1 deletion docs/usage-python.rst
@@ -248,7 +248,7 @@ Among metadata extraction, dates are handled by an external module: `htmldate <h

`Custom parameters <https://htmldate.readthedocs.io/en/latest/corefunctions.html#handling-date-extraction>`_ can be passed through the extraction function or through the ``extract_metadata`` function in ``trafilatura.metadata``, most notably:

- ``extensive_search`` (boolean), to activate pattern-based opportunistic text search,
- ``extensive_search`` (boolean), to activate further heuristics (higher recall, lower precision)
- ``original_date`` (boolean) to look for the original publication date,
- ``outputformat`` (string), to provide a custom datetime format,
- ``max_date`` (string), to set the latest acceptable date manually (YYYY-MM-DD format).
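
The parameters listed above can be collected in a dictionary before being handed to the date extractor. The sketch below calls `htmldate.find_date` directly, which accepts these keyword arguments; the sample HTML, the chosen values, and the `date_params` name are assumptions for illustration, and the import is guarded so that the snippet still runs when `htmldate` is not installed.

```python
# Keyword arguments named in the list above; the values are examples only.
date_params = {
    "extensive_search": False,   # skip the further heuristics
    "original_date": True,       # look for the original publication date
    "outputformat": "%Y-%m-%d",  # custom datetime format
    "max_date": "2023-11-29",    # latest acceptable date (YYYY-MM-DD)
}

# Hypothetical sample document carrying an explicit publication date.
SAMPLE_HTML = (
    '<html><head><meta property="article:published_time" '
    'content="2023-11-01"/></head><body><p>Text</p></body></html>'
)

try:
    from htmldate import find_date
except ImportError:  # htmldate may be missing where this sketch is run
    find_date = None

if find_date is not None:
    print(find_date(SAMPLE_HTML, **date_params))
```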
4 changes: 3 additions & 1 deletion docs/used-by.rst
@@ -112,6 +112,7 @@ Publications citing Trafilatura
- Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). "DistilBERT-based Argumentation Retrieval for Answering Comparative Questions", Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.
- Bender, M., Bubenhofer, N., Dreesen, P., Georgi, C., Rüdiger, J. O., & Vogel, F. (2022). Techniken und Praktiken der Verdatung. Diskurse–digital, 135-158.
- Bevendorff, J., Gupta, S., Kiesel, J., & Stein, B. (2023). An Empirical Comparison of Web Content Extraction Algorithms.
- Book, L. (2023). Evaluating and comparing different key phrase-based web scraping methods for training domain-specific fasttext models, Master's thesis, KTH Royal Institute of Technology.
- Bozarth, L., & Budak, C. (2021). "An Analysis of the Partnership between Retailers and Low-credibility News Publishers", Journal of Quantitative Description: Digital Media, 1.
- Brandon, C., Doherty, A. J., Kelly, D., Leddin, D., & Margaria, T. (2023). HIPPP: Health Information Portal for Patients and Public. Applied Sciences, 13(16), 9453.
- Braun, D. (2021). "Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts", PhD Thesis, Technische Universität München.
@@ -127,7 +128,8 @@ Publications citing Trafilatura
- Hunter, B., Mathews, F., & Weeds, J. (2023). Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation. Ecological Informatics, 102076.
- Indig, B., Sárközi-Lindner, Z., & Nagy, M. (2022). Use the Metadata, Luke!–An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (pp. 47-52).
- Johannsen, B. (2023). Fußball und safety: Eine framesemantische Perspektive auf Diskurse über trans Sportler* innen. Queere Vielfalt im Fußball, 176.
- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access.
- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access, 10, 129958-129969.
- Jung, G., & Cha, J. (2023). New Visual Features for HTML Main Content Extraction. Journal of Digital Contents Society.
- Karabulut, M., & Mayda, İ. (2020). "Development of Browser Extension for HTML Web Page Content Extraction", In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
- Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. "First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems", In CMCL (pp. 90-101).
- Kuehn, P., Relke, D. N., & Reuter, C. (2023). Common Vulnerability Scoring System Prediction based on Open Source Intelligence Information Sources. Computers & Security, 103286.
2 changes: 1 addition & 1 deletion trafilatura/__init__.py
@@ -9,7 +9,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = 'GNU GPL v3+'
__copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
__version__ = '1.6.2'
__version__ = '1.6.3'


import logging
2 changes: 1 addition & 1 deletion trafilatura/settings.cfg
@@ -1,4 +1,4 @@
# Defines settings for trafilatura (https://github.com/adbar/trafilatura)
# Defines settings for trafilatura (https://trafilatura.readthedocs.io/en/latest/settings.html)

[DEFAULT]
