Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prepare v1.6.3 (setup and docs) #448

Merged
merged 1 commit into from
Nov 29, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,26 @@
## History / Changelog


### 1.6.3

Extraction:
- preserve space in certain elements with @idoshamun (#429)
- optional list of xPaths to prune by @HeLehm (#414)

Metadata:
- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
- new `htmldate` extensive search parameter in config (#434)
- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))

Navigation:
- reviewed code for feeds (#443)
- new config option: external URLs for feeds/sitemaps (#441)

Documentation:
- update, add page on text embeddings with @tonyyanga (#428, #435, #447)
- fix quickstart by @sashkab (#419)


### 1.6.2

Extraction:
Expand Down
6 changes: 5 additions & 1 deletion docs/settings.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,14 @@ The default file included in the package is `settings.cfg <https://github.com/ad
* ``MIN_EXTRACTED_SIZE = 250`` acceptable size in characters (used to trigger fallbacks)
* ``MIN_OUTPUT_SIZE = 1`` absolute acceptable minimum for main text output
* ``MIN_EXTRACTED_COMM_SIZE`` and ``MIN_OUTPUT_COMM_SIZE`` work the same for comment extraction
* ``EXTRACTION_TIMEOUT = 30`` drop extraction after 30 seconds to prevent malicious HTML bombs, set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
* ``EXTRACTION_TIMEOUT = 30`` now only affects processing on the command-line: drop extraction after 30 seconds to prevent malicious HTML bombs. Set to 0 if you see errors related to the ``signal`` module and/or use a module such as `defusedxml <https://github.com/tiran/defusedxml>`_
- Deduplication (not active by default)
* ``MIN_DUPLCHECK_SIZE = 100`` minimum size in characters to run deduplication on
* ``MAX_REPETITIONS = 2`` maximum number of duplicates allowed
- Metadata
* ``EXTENSIVE_DATE_SEARCH = on`` set to ``off`` to deactivate ``htmldate``'s opportunistic search (lower recall, higher precision)
- Navigation
* ``EXTERNAL_URLS = off`` do not take URLs from other websites in feeds and sitemaps (CLI mode)


Using a custom file on the command-line
Expand Down
2 changes: 1 addition & 1 deletion docs/usage-python.rst
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,7 @@ Among metadata extraction, dates are handled by an external module: `htmldate <h

`Custom parameters <https://htmldate.readthedocs.io/en/latest/corefunctions.html#handling-date-extraction>`_ can be passed through the extraction function or through the ``extract_metadata`` function in ``trafilatura.metadata``, most notably:

- ``extensive_search`` (boolean), to activate pattern-based opportunistic text search,
- ``extensive_search`` (boolean), to activate further heuristics (higher recall, lower precision)
- ``original_date`` (boolean) to look for the original publication date,
- ``outputformat`` (string), to provide a custom datetime format,
- ``max_date`` (string), to set the latest acceptable date manually (YYYY-MM-DD format).
Expand Down
4 changes: 3 additions & 1 deletion docs/used-by.rst
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,7 @@ Publications citing Trafilatura
- Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). "DistilBERT-based Argumentation Retrieval for Answering Comparative Questions", Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.
- Bender, M., Bubenhofer, N., Dreesen, P., Georgi, C., Rüdiger, J. O., & Vogel, F. (2022). Techniken und Praktiken der Verdatung. Diskurse–digital, 135-158.
- Bevendorff, J., Gupta, S., Kiesel, J., & Stein, B. (2023). An Empirical Comparison of Web Content Extraction Algorithms.
- Book, L. (2023). Evaluating and comparing different key phrase-based web scraping methods for training domain-specific fasttext models, Master's thesis, KTH Royal Institute of Technology.
- Bozarth, L., & Budak, C. (2021). "An Analysis of the Partnership between Retailers and Low-credibility News Publishers", Journal of Quantitative Description: Digital Media, 1.
- Brandon, C., Doherty, A. J., Kelly, D., Leddin, D., & Margaria, T. (2023). HIPPP: Health Information Portal for Patients and Public. Applied Sciences, 13(16), 9453.
- Braun, D. (2021). "Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts", PhD Thesis, Technische Universität München.
Expand All @@ -127,7 +128,8 @@ Publications citing Trafilatura
- Hunter, B., Mathews, F., & Weeds, J. (2023). Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation. Ecological Informatics, 102076.
- Indig, B., Sárközi-Lindner, Z., & Nagy, M. (2022). Use the Metadata, Luke!–An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (pp. 47-52).
- Johannsen, B. (2023). Fußball und safety: Eine framesemantische Perspektive auf Diskurse über trans Sportler* innen. Queere Vielfalt im Fußball, 176.
- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access.
- Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access, 10, 129958-129969
- Jung, G., Cha, J. (2023). New Visual Features for HTML Main Content Extraction. Journal of Digital Contents Society.
- Karabulut, M., & Mayda, İ. (2020). "Development of Browser Extension for HTML Web Page Content Extraction", In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
- Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. "First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems", In CMCL (pp. 90-101).
- Küehn, P., Relke, D. N., & Reuter, C. (2023). Common Vulnerability Scoring System Prediction based on Open Source Intelligence Information Sources. Computers & Security, 103286.
Expand Down
2 changes: 1 addition & 1 deletion trafilatura/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = 'GNU GPL v3+'
__copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
__version__ = '1.6.2'
__version__ = '1.6.3'


import logging
Expand Down
2 changes: 1 addition & 1 deletion trafilatura/settings.cfg
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Defines settings for trafilatura (https://github.com/adbar/trafilatura)
# Defines settings for trafilatura (https://trafilatura.readthedocs.io/en/latest/settings.html)

[DEFAULT]

Expand Down
Loading