prepare v1.6.3 (setup and docs) (#448)

adbar · Nov 29, 2023 · aabfdec
1 parent e7b5723
commit aabfdec
Showing 6 changed files with 31 additions and 5 deletions.
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,6 +1,26 @@
 ## History / Changelog
 
 
+### 1.6.3
+
+Extraction:
+- preserve space in certain elements with @idoshamun (#429)
+- optional list of xPaths to prune by @HeLehm (#414)
+
+Metadata:
+- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
+- new `htmldate` extensive search parameter in config (#434)
+- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))
+
+Navigation:
+- reviewed code for feeds (#443)
+- new config option: external URLs for feeds/sitemaps (#441)
+
+Documentation:
+- update, add page on text embeddings with @tonyyanga (#428, #435, #447)
+- fix quickstart by @sashkab (#419)
+
+
 ### 1.6.2
 
 Extraction:

diff --git a/docs/settings.rst b/docs/settings.rst
@@ -35,10 +35,14 @@ The default file included in the package is `settings.cfg <https://github.com/ad
    * ``MIN_EXTRACTED_SIZE = 250`` acceptable size in characters (used to trigger fallbacks)
    * ``MIN_OUTPUT_SIZE = 1`` absolute acceptable minimum for main text output
    * ``MIN_EXTRACTED_COMM_SIZE`` and ``MIN_OUTPUT_COMM_SIZE`` work the same for comment extraction
-   * ``EXTRACTION_TIMEOUT = 30`` drop extraction after 30 seconds to prevent malicious HTML bombs, set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
+   * ``EXTRACTION_TIMEOUT = 30`` now only affects processing on the command-line: drop extraction after 30 seconds to prevent malicious HTML bombs. Set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
 - Deduplication (not active by default)
    * ``MIN_DUPLCHECK_SIZE = 100`` minimum size in characters to run deduplication on
    * ``MAX_REPETITIONS = 2`` maximum number of duplicates allowed
+- Metadata
+   * ``EXTENSIVE_DATE_SEARCH = on`` set to ``off`` to deactivate ``htmldate``'s opportunistic search (lower recall, higher precision)
+- Navigation
+   * ``EXTERNAL_URLS = off`` do not take URLs from other websites in feeds and sitemaps (CLI mode)
 
 
 Using a custom file on the command-line
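
The settings documented in this hunk use `on`/`off` and integer values in an INI-style file. As a rough sketch, assuming the file is read with Python's standard `configparser` (an assumption based on the `[DEFAULT]` section header, not something this diff confirms), a custom file that switches off the extensive date search and disables the timeout could be read as shown here. The values and the inline file content are illustrative only.

```python
from configparser import ConfigParser

# Hypothetical custom settings, mirroring the option names documented above.
CUSTOM_SETTINGS = """
[DEFAULT]
EXTRACTION_TIMEOUT = 0
EXTENSIVE_DATE_SEARCH = off
EXTERNAL_URLS = off
"""

config = ConfigParser()
config.read_string(CUSTOM_SETTINGS)

# getboolean() understands on/off, yes/no, true/false and 1/0.
extensive_search = config.getboolean("DEFAULT", "EXTENSIVE_DATE_SEARCH")
external_urls = config.getboolean("DEFAULT", "EXTERNAL_URLS")
# getint() converts the timeout (in seconds) to an integer.
timeout = config.getint("DEFAULT", "EXTRACTION_TIMEOUT")

print(extensive_search, external_urls, timeout)
```

Option names are matched case-insensitively by `configparser`, so the upper-case spelling used in the documentation works as written.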

diff --git a/docs/usage-python.rst b/docs/usage-python.rst
@@ -248,7 +248,7 @@ Among metadata extraction, dates are handled by an external module: `htmldate <h
 
 `Custom parameters <https://htmldate.readthedocs.io/en/latest/corefunctions.html#handling-date-extraction>`_ can be passed through the extraction function or through the ``extract_metadata`` function in ``trafilatura.metadata``, most notably:
 
--  ``extensive_search`` (boolean), to activate pattern-based opportunistic text search,
+-  ``extensive_search`` (boolean), to activate further heuristics (higher recall, lower precision)
 -  ``original_date`` (boolean) to look for the original publication date,
 -  ``outputformat`` (string), to provide a custom datetime format,
 -  ``max_date`` (string), to set the latest acceptable date manually (YYYY-MM-DD format).

diff --git a/docs/used-by.rst b/docs/used-by.rst
@@ -112,6 +112,7 @@ Publications citing Trafilatura
 - Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). "DistilBERT-based Argumentation Retrieval for Answering Comparative Questions", Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.
 - Bender, M., Bubenhofer, N., Dreesen, P., Georgi, C., Rüdiger, J. O., & Vogel, F. (2022). Techniken und Praktiken der Verdatung. Diskurse–digital, 135-158.
 - Bevendorff, J., Gupta, S., Kiesel, J., & Stein, B. (2023). An Empirical Comparison of Web Content Extraction Algorithms.
+- Book, L. (2023). Evaluating and comparing different key phrase-based web scraping methods for training domain-specific fasttext models, Master's thesis, KTH Royal Institute of Technology.
 - Bozarth, L., & Budak, C. (2021). "An Analysis of the Partnership between Retailers and Low-credibility News Publishers", Journal of Quantitative Description: Digital Media, 1.
 - Brandon, C., Doherty, A. J., Kelly, D., Leddin, D., & Margaria, T. (2023). HIPPP: Health Information Portal for Patients and Public. Applied Sciences, 13(16), 9453.
 - Braun, D. (2021). "Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts", PhD Thesis, Technische Universität München.
@@ -127,7 +128,8 @@ Publications citing Trafilatura
 - Hunter, B., Mathews, F., & Weeds, J. (2023). Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation. Ecological Informatics, 102076.
 - Indig, B., Sárközi-Lindner, Z., & Nagy, M. (2022). Use the Metadata, Luke!–An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (pp. 47-52).
 - Johannsen, B. (2023). Fußball und safety: Eine framesemantische Perspektive auf Diskurse über trans Sportler* innen. Queere Vielfalt im Fußball, 176.
-- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access.
+- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access, 10, 129958-129969
+- Jung, G., Cha, J. (2023). New Visual Features for HTML Main Content Extraction. Journal of Digital Contents Society.
 - Karabulut, M., & Mayda, İ. (2020). "Development of Browser Extension for HTML Web Page Content Extraction", In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
 - Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. "First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems", In CMCL (pp. 90-101).
 - Küehn, P., Relke, D. N., & Reuter, C. (2023). Common Vulnerability Scoring System Prediction based on Open Source Intelligence Information Sources. Computers & Security, 103286.

diff --git a/trafilatura/__init__.py b/trafilatura/__init__.py
@@ -9,7 +9,7 @@
 __author__ = 'Adrien Barbaresi and contributors'
 __license__ = 'GNU GPL v3+'
 __copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
-__version__ = '1.6.2'
+__version__ = '1.6.3'
 
 
 import logging

diff --git a/trafilatura/settings.cfg b/trafilatura/settings.cfg
@@ -1,4 +1,4 @@
-# Defines settings for trafilatura (https://github.com/adbar/trafilatura)
+# Defines settings for trafilatura (https://trafilatura.readthedocs.io/en/latest/settings.html)
 
 [DEFAULT]