Releases: adbar/trafilatura
trafilatura-2.0.0
Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
- bare_extraction():
  - now returns an instance of the Document class by default
  - as_dict deprecation warning → use the .as_dict() method on the return value (#730)
- bare_extraction() and extract(): no_fallback deprecation warning → use fast instead (#730)
- downloads: remove decode argument in fetch_url() → use fetch_response instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction: move max_tree_size parameter to settings.cfg (#742)
- use type hinting (#721, #723, #748)
- see Python and CLI deprecations in the docs
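The bare_extraction() and extract() changes listed above could translate into a migration along the following lines. This is a minimal sketch, assuming trafilatura 2.0 or later is installed; the sample HTML and the helper name extract_fields are made up for illustration and are not part of the library.

```python
from typing import Any, Dict, Optional

# Illustrative input only; real pages are much longer.
SAMPLE_HTML = (
    "<html><body><article>"
    "<h1>Release notes</h1>"
    "<p>This paragraph stands in for the main text of a web page.</p>"
    "</article></body></html>"
)


def extract_fields(html: str) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the extracted fields as a dict, or None."""
    # Imported inside the function so that the sketch only needs
    # trafilatura at call time.
    from trafilatura import bare_extraction

    # fast=True replaces the deprecated no_fallback=True.
    document = bare_extraction(html, fast=True)
    if document is None:
        # Nothing usable could be extracted.
        return None
    # 2.0 returns a Document instance; .as_dict() replaces the
    # deprecated as_dict argument.
    return document.as_dict()
```

Calling extract_fields(SAMPLE_HTML) yields either a dictionary of fields or None when the input is too short or empty, so callers should check for None before reading any key.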
Fixes:
- set options.source before raising error on empty doc tree by @dmoklaf (#707)
- robust encoding in options.source (#717)
- more robust mapping for conversion to HTML (#721)
- CLI downloads: use all information in settings file (#734)
- downloads: cleaner urllib3 code (#736)
- refine table markdown output by @unsleepy22 (#752)
- extraction fix: images in text nodes by @unsleepy22 (#757)
Metadata:
- more robust URL extraction (#710)
Command-line interface:
- CLI: print URLs early for feeds and sitemaps with --list with @gremid (#744)
- CLI: add 126 exit code for high error ratio (#747)
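A calling script could react to the new exit code roughly as sketched below. The sketch assumes only what the entry above states, namely that the CLI exits with code 126 when the error ratio is high; the helper name run_and_check and the stand-in command are illustrative, and the ratio that triggers the code is not specified here.

```python
import subprocess
import sys
from typing import List

HIGH_ERROR_RATIO_EXIT_CODE = 126  # exit code named in the release note (#747)


def run_and_check(command: List[str]) -> str:
    """Run a command and classify its exit code (hypothetical helper)."""
    completed = subprocess.run(command, check=False)
    if completed.returncode == 0:
        return "ok"
    if completed.returncode == HIGH_ERROR_RATIO_EXIT_CODE:
        return "high error ratio"
    return f"failed with exit code {completed.returncode}"


# Stand-in command that simply exits with 126; replace it with a real
# trafilatura invocation in practice.
stand_in = [sys.executable, "-c", "import sys; sys.exit(126)"]
print(run_and_check(stand_in))
```

POSIX shells also report 126 when a command is found but cannot be executed, so the code is best read together with the log output.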
Maintenance:
- remove already deprecated functions and args (#716)
- add type hints (#723, #728)
- setup: use pyproject.toml file (#715)
- simplify code (#708, #709, #727)
- better debug messages in main_extractor (#714)
- evaluation: review data, update packages, add magic_html (#731)
- setup: explicit exports through __all__ (#740)
- tests: extend coverage (#753)
Documentation:
trafilatura-1.12.2
- downloads: add support for SOCKS proxies with @gremid (#682)
- extraction fix: ValueError in table spans (#685)
- spider: prune_xpath parameter added by @felipehertzer (#684)
- spider: relax strict parameter for link extraction (#687)
- sitemaps: max_sitemaps parameter added by @felipehertzer (#690)
- maintenance: make compression libraries optional (#691)
- metadata: review and lint code (#694)
trafilatura-1.12.1
trafilatura-1.12.0
Breaking change:
- enforce fixed list of output formats, deprecate -out on the CLI (#647)
Faster, more accurate extraction:
- review link and structure checks (#653)
- improve justext fallback (#652)
- baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
- review XPaths for undesirable content (#645)
Bugfixes and maintenance:
- CLI fix: markdown format should trigger include_formatting (#649)
- images fix: use a length threshold on src attribute (#654)
- XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
- formatting & markdown fix: add newlines (#656)
- table fix: prevent MemoryError & ValueError during conversion to text (#658)
Documentation:
trafilatura-1.11.0
Breaking change:
- metadata now skipped by default (#613), to trigger inclusion in all output formats:
  - with_metadata=True (Python)
  - --with-metadata (CLI)
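In Python, opting back into metadata might look like the sketch below. It assumes trafilatura 1.11.0 or later is installed; the helper name extract_json_with_metadata is hypothetical, and the choice of JSON output is only for illustration.

```python
from typing import Optional


def extract_json_with_metadata(html: str) -> Optional[str]:
    """Hypothetical helper: JSON output that includes metadata fields."""
    # Imported inside the function so that the sketch only needs
    # trafilatura at call time.
    from trafilatura import extract

    # Metadata is skipped by default since 1.11.0; with_metadata=True
    # switches it back on for the chosen output format.
    return extract(html, output_format="json", with_metadata=True)
```

The function returns None when nothing can be extracted. On the command line, the --with-metadata switch named above plays the same role.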
Extraction:
- add HTML as output format (#614)
- better and faster baseline extraction (#619)
- better handling of HTML/XML elements (#628)
- XPath rules added with @felipehertzer (#540)
- fix: avoid faulty readability_lxml content (#635)
Evaluation:
- new scripts and data with @LydiaKoerber (#606, #615)
- additional data with @swetepete (#197)
Maintenance:
trafilatura-1.10.0
Breaking changes:
- raise errors on deprecated CLI and function arguments (#581)
- regroup classes and functions linked to deduplication (#582): trafilatura.hashing → trafilatura.deduplication
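Code that has to run against releases on both sides of this rename could resolve the module at runtime. The sketch below is one possible approach, not a pattern prescribed by the project; the helper name load_dedup_module is hypothetical.

```python
import importlib
from types import ModuleType


def load_dedup_module() -> ModuleType:
    """Return trafilatura's deduplication module under either name."""
    try:
        # Module path from 1.10.0 onwards.
        return importlib.import_module("trafilatura.deduplication")
    except ImportError:
        # Earlier releases expose the same functionality here.
        return importlib.import_module("trafilatura.hashing")
```

Projects that only target 1.10.0 or later can simply update their import statements to the new module path.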
Extraction:
- port of is_probably_readerable from readability.js by @zirkelc in #587
- Markdown table fixes by @naktinis in #601
- fix list spacing in TXT output (#598)
- CLI fixes: file processing options, mtime, and tests (#605)
- CLI fix: read standard input as binary (#607)
Downloads:
- fix deflate and add optional zstd to accepted encodings (#594)
- spider fix: use internal download utilities for robots.txt (#590)
Maintenance:
- add author XPaths (#567)
- update justext and lxml dependencies (#593)
- simplify code: unique function for length tests (#591)
Docs:
trafilatura-1.9.0
Extraction:
- add markdown as explicit output (#550)
- improve recall preset (#571)
- speedup for readability-lxml (#547)
- add global options object for extraction and use it in CLI (#552)
- fix: better encoding detection (#548)
- recall: fix for lists inside tables with @mikhainin (#534)
- add symbol to preserve vertical spacing in Markdown (#499)
- fix: table cell separators in non-XML output (#563)
- slightly better accuracy and execution speed overall
Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (#561)
- fix: empty content in meta tag by @felipehertzer (#545)
Maintenance:
- restructure and simplify code (#543, #556)
- CLI & downloads: revamp and use global options (#565)
- eval: review code, add guidelines and small benchmark (#542)
- fix: raise error if config file does not exist (#554)
- deprecate process_record() (#549)
- docs: convert readme to markdown and update info (#564, #578)
trafilatura-1.8.1
trafilatura-1.8.0
Extraction:
- Better precision by @felipehertzer (#509, #520)
- Code formatting in TXT/Markdown output added (#498)
- Improved CSV output (#496)
- LXML: compile XPath expressions (#504)
- Overall speedup of about 5%
Downloads and Navigation:
- More robust scans with is_live_page() (#501)
- Better sitemap start and safeguards (#503, #506)
- Fix for headers in response object (#513)
Maintenance: