Releases: adbar/trafilatura

trafilatura-2.0.0

03 Dec 15:23
c6e8340

Breaking changes:

  • Python 3.6 and 3.7 deprecated (#709)
  • bare_extraction():
    • now returns an instance of the Document class by default
    • as_dict deprecation warning → use .as_dict() method on return value (#730)
  • bare_extraction() and extract(): no_fallback deprecation warning → use fast instead (#730)
  • downloads: remove decode argument in fetch_url() → use fetch_response instead (#724)
  • deprecated graphical user interface now removed (#713)
  • extraction: move max_tree_size parameter to settings.cfg (#742)
  • use type hinting (#721, #723, #748)
  • see Python and CLI deprecations in the docs
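
The bare_extraction() changes listed above could be handled roughly as in this minimal sketch. It assumes trafilatura 2.0 or later is installed. The names extract_as_dict and extractor are hypothetical and introduced only for illustration; the extractor parameter exists so that the helper can be exercised with a stand-in instead of the real library.

```python
from typing import Any, Callable, Dict, Optional


def extract_as_dict(
    html: str, extractor: Optional[Callable[..., Any]] = None
) -> Optional[Dict[str, Any]]:
    """Return the extraction result as a plain dict (hypothetical helper)."""
    if extractor is None:
        # Real call path: needs trafilatura >= 2.0.
        from trafilatura import bare_extraction as extractor

    # fast=True replaces the deprecated no_fallback=True.
    document = extractor(html, fast=True)
    if document is None:  # nothing could be extracted
        return None
    # as_dict=True is deprecated: call .as_dict() on the returned Document.
    return document.as_dict()
```

Code that previously indexed the return value of bare_extraction() like a dict would call a helper of this kind, or use .as_dict() directly, to get the old behaviour back.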

Fixes:

  • set options.source before raising error on empty doc tree by @dmoklaf (#707)
  • robust encoding in options.source (#717)
  • more robust mapping for conversion to HTML (#721)
  • CLI downloads: use all information in settings file (#734)
  • downloads: cleaner urllib3 code (#736)
  • refine table markdown output by @unsleepy22 (#752)
  • extraction fix: images in text nodes by @unsleepy22 (#757)

Metadata:

  • more robust URL extraction (#710)

Command-line interface:

  • CLI: print URLs early for feeds and sitemaps with --list with @gremid (#744)
  • CLI: add 126 exit code for high error ratio (#747)
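
A calling script might react to the new exit code along these lines. This is only a sketch: it assumes the trafilatura executable is on the PATH and that --input-file and --output-dir are the options in use. The helper names and the labels they return are hypothetical.

```python
import subprocess
from typing import List

# Exit code named in the release notes for a high error ratio (#747).
HIGH_ERROR_RATIO = 126


def classify_exit(returncode: int) -> str:
    """Map a CLI exit status to a coarse label (labels are illustrative)."""
    if returncode == 0:
        return "ok"
    if returncode == HIGH_ERROR_RATIO:
        return "high-error-ratio"
    return "failed"


def run_batch(input_file: str, output_dir: str) -> str:
    """Run the CLI over a file of URLs and classify how it exited."""
    command: List[str] = [
        "trafilatura",
        "--input-file", input_file,
        "--output-dir", output_dir,
    ]
    # check=False: inspect the status ourselves instead of raising.
    completed = subprocess.run(command, check=False)
    return classify_exit(completed.returncode)
```

A wrapper like this lets a pipeline tell a batch with many failed downloads apart from a clean run without parsing log output.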

Maintenance:

  • remove already deprecated functions and args (#716)
  • add type hints (#723, #728)
  • setup: use pyproject.toml file (#715)
  • simplify code (#708, #709, #727)
  • better debug messages in main_extractor (#714)
  • evaluation: review data, update packages, add magic_html (#731)
  • setup: explicit exports through __all__ (#740)
  • tests: extend coverage (#753)

Documentation:

  • fix link in docs/index.html by @nzw0301 (#711)
  • remove docs from published packages (#743)
  • update docs (#745)

trafilatura-1.12.2

10 Sep 12:43
f57ef0b
  • downloads: add support for SOCKS proxies with @gremid (#682)
  • extraction fix: ValueError in table spans (#685)
  • spider: prune_xpath parameter added by @felipehertzer (#684)
  • spider: relax strict parameter for link extraction (#687)
  • sitemaps: max_sitemaps parameter added by @felipehertzer (#690)
  • maintenance: make compression libraries optional (#691)
  • metadata: review and lint code (#694)

trafilatura-1.12.1

20 Aug 10:58
14c79c0

Navigation:

  • spider: restrict search to sections containing URL path (#673)
  • crawler: add parameter class and types, breaking change for undocumented functions (#675)
  • maintenance: simplify link discovery and extend tests (#674)
  • CLI: review code, add types and tests (#677)

Bugfixes:

  • fix AttributeError in element deletion (#668)
  • fix MemoryError in table header columns (#665)

Docs:

  • docs: fix variable name for extract_metadata in quickstart by @jpigla (#678)

trafilatura-1.12.0

30 Jul 14:56
c60395c

Breaking change:

  • enforce fixed list of output formats, deprecate -out on the CLI (#647)

Faster, more accurate extraction:

  • review link and structure checks (#653)
  • improve justext fallback (#652)
  • baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
  • review XPaths for undesirable content (#645)

Bugfixes and maintenance:

  • CLI fix: markdown format should trigger include_formatting (#649)
  • images fix: use a length threshold on src attribute (#654)
  • XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
  • formatting & markdown fix: add newlines (#656)
  • table fix: prevent MemoryError & ValueError during conversion to text (#658)

Documentation:

  • update crawls.rst: known is an unexpected argument, by @tommytyc (#638)

trafilatura-1.11.0

27 Jun 14:04
60647e5

Breaking change:

  • metadata is now skipped by default (#613); to include it, in any output format, use:
    • with_metadata=True (Python)
    • --with-metadata (CLI)
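
On the Python side, opting back in to metadata might look like the following sketch. It assumes trafilatura 1.11.0 or later. extract_json and its extractor parameter are hypothetical names; the parameter only allows the function to be tried with a stand-in.

```python
from typing import Any, Callable, Optional


def extract_json(
    html: str,
    include_metadata: bool = True,
    extractor: Optional[Callable[..., Any]] = None,
) -> Optional[str]:
    """Extract a page as JSON, requesting metadata explicitly."""
    if extractor is None:
        # Real call path: needs trafilatura >= 1.11.0.
        from trafilatura import extract as extractor

    # Metadata is left out unless with_metadata=True is passed.
    return extractor(html, output_format="json", with_metadata=include_metadata)
```

Callers that relied on title, author or date fields appearing automatically would need to pass the flag after upgrading.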

Extraction:

  • add HTML as output format (#614)
  • better and faster baseline extraction (#619)
  • better handling of HTML/XML elements (#628)
  • XPath rules added with @felipehertzer (#540)
  • fix: avoid faulty readability_lxml content (#635)

Maintenance:

  • docs extended and updated, added page on deduplication (#618)
  • review code, add tests and types in part of the submodules (#620, #623, #624, #625)

trafilatura-1.10.0

30 May 15:45
b36b6fa

Breaking changes:

  • raise errors on deprecated CLI and function arguments (#581)
  • regroup classes and functions linked to deduplication (#582)
    trafilatura.hashing → trafilatura.deduplication
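
Code that has to run against versions on both sides of this module move could fall back from the new module name to the old one. The sketch below is an assumption about how one might do that, not something the project prescribes. first_importable and load_deduplication are hypothetical helpers.

```python
import importlib
from types import ModuleType
from typing import Iterable

# New location first, pre-1.10.0 location second.
DEDUP_MODULES = ("trafilatura.deduplication", "trafilatura.hashing")


def first_importable(names: Iterable[str]) -> ModuleType:
    """Return the first module in names that can be imported."""
    tried = []
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            tried.append(name)
    raise ImportError(f"none of these modules could be imported: {tried}")


def load_deduplication() -> ModuleType:
    """Load the deduplication helpers from whichever module exists."""
    return first_importable(DEDUP_MODULES)
```

The deduplication functions are then reached as attributes of the returned module, which keeps the version check in a single place.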

Extraction:

  • port of is_probably_readerable from readability.js by @zirkelc (#587)
  • Markdown table fixes by @naktinis (#601)
  • fix list spacing in TXT output (#598)
  • CLI fixes: file processing options, mtime, and tests (#605)
  • CLI fix: read standard input as binary (#607)

Downloads:

  • fix deflate and add optional zstd to accepted encodings (#594)
  • spider fix: use internal download utilities for robots.txt (#590)

Maintenance:

  • add author XPaths (#567)
  • update justext and lxml dependencies (#593)
  • simplify code: unique function for length tests (#591)

trafilatura-1.9.0

02 May 10:18
11255bd

Extraction:

  • add markdown as explicit output (#550)
  • improve recall preset (#571)
  • speedup for readability-lxml (#547)
  • add global options object for extraction and use it in CLI (#552)
  • fix: better encoding detection (#548)
  • recall: fix for lists inside tables with @mikhainin (#534)
  • add symbol to preserve vertical spacing in Markdown (#499)
  • fix: table cell separators in non-XML output (#563)
  • slightly better accuracy and execution speed overall
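
With markdown available as an explicit output, the same page can be rendered in more than one format by varying output_format. The sketch below assumes trafilatura 1.9.0 or later. render_formats and its extractor parameter are hypothetical names used for illustration.

```python
from typing import Any, Callable, Dict, Optional, Sequence


def render_formats(
    html: str,
    formats: Sequence[str] = ("txt", "markdown"),
    extractor: Optional[Callable[..., Any]] = None,
) -> Dict[str, Optional[str]]:
    """Extract the same page once per requested output format."""
    if extractor is None:
        # Real call path: needs trafilatura >= 1.9.0 for "markdown".
        from trafilatura import extract as extractor

    # One extraction per format; each result may be None on failure.
    return {fmt: extractor(html, output_format=fmt) for fmt in formats}
```

Each call repeats the full extraction, so for large batches it may be preferable to request only the format that is actually needed.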

Metadata:

  • add file creation date (date extraction, JSON & XML-TEI) (#561)
  • fix: empty content in meta tag by @felipehertzer (#545)

Maintenance:

  • restructure and simplify code (#543, #556)
  • CLI & downloads: revamp and use global options (#565)
  • eval: review code, add guidelines and small benchmark (#542)
  • fix: raise error if config file does not exist (#554)
  • deprecate process_record() (#549)
  • docs: convert readme to markdown and update info (#564, #578)

trafilatura-1.8.1

03 Apr 11:47
d9d47a7

Maintenance:

  • Pin LXML to prevent broken dependency (#535)

Extraction:

  • Improve extraction accuracy for major news outlets (#530)
  • Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
  • Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

trafilatura-1.8.0

20 Mar 15:24
ff38644

Extraction:

  • Better precision by @felipehertzer (#509, #520)
  • Code formatting in TXT/Markdown output added (#498)
  • Improved CSV output (#496)
  • LXML: compile XPath expressions (#504)
  • Overall speedup of about 5%

Downloads and Navigation:

  • More robust scans with is_live_page() (#501)
  • Better sitemap start and safeguards (#503, #506)
  • Fix for headers in response object (#513)

Maintenance:

  • License changed to Apache 2.0
  • Response class: convenience functions added (#497)
  • lxml.html.Cleaner removed (#491)
  • CLI fixes: parallel cores and processing (#524)

trafilatura-1.7.0

25 Jan 13:05
97dc088

Extraction:

  • improved html2txt() function (#483)

Downloads:

  • add advanced fetch_response() function
    → pending deprecation for fetch_url(decode=False)
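
A download step built on fetch_response() might look like this sketch. It assumes trafilatura 1.7.0 or later, and that the returned response object exposes status and html attributes as I understand the Response class; this is worth checking against the installed version. download_html and its fetcher parameter are hypothetical names.

```python
from typing import Any, Callable, Optional


def download_html(
    url: str, fetcher: Optional[Callable[..., Any]] = None
) -> Optional[str]:
    """Fetch a page and return its decoded HTML, or None on failure."""
    if fetcher is None:
        # Real call path: needs trafilatura >= 1.7.0.
        from trafilatura.downloads import fetch_response as fetcher

    # decode=True asks for the decoded document as well as the raw bytes.
    response = fetcher(url, decode=True)
    # Assumption: a failed download yields None, and status is an int.
    if response is None or response.status != 200:
        return None
    return response.html
```

Code that used fetch_url(decode=False) to obtain a response object would switch to a call of this shape and read the attributes it needs from the result.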

Maintenance: