Releases · adbar/trafilatura

03 Dec 15:23

adbar

v2.0.0

c6e8340

trafilatura-2.0.0 Latest

Breaking changes:

Python 3.6 and 3.7 deprecated (#709)
bare_extraction():
- now returns an instance of the Document class by default
- as_dict deprecation warning → use .as_dict() method on return value (#730)
bare_extraction() and extract(): no_fallback deprecation warning → use fast instead (#730)
downloads: remove decode argument in fetch_url() → use fetch_response instead (#724)
deprecated graphical user interface now removed (#713)
extraction: move max_tree_size parameter to settings.cfg (#742)
use type hinting (#721, #723, #748)
see Python and CLI deprecations in the docs
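
Callers that still pass the deprecated arguments could absorb the renames above in a small shim before the call reaches the library. The sketch below is illustrative only: `migrate_kwargs` and `to_dict` are hypothetical helpers, not part of trafilatura, and it assumes that `fast` carries the same boolean meaning `no_fallback` had.

```python
import warnings


def migrate_kwargs(kwargs):
    """Return a copy of kwargs with arguments deprecated in 2.0.0 translated.

    Hypothetical helper, not part of trafilatura.
    """
    new = dict(kwargs)
    if "no_fallback" in new:
        warnings.warn(
            "no_fallback is deprecated, use fast instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: fast has the same boolean meaning as no_fallback.
        # An explicitly passed fast wins over the legacy name.
        value = new.pop("no_fallback")
        new.setdefault("fast", value)
    # as_dict is replaced by calling .as_dict() on the returned object.
    new.pop("as_dict", None)
    return new


def to_dict(result):
    """Normalize an extraction result to a plain dict (None passes through)."""
    if result is None or isinstance(result, dict):
        return result
    # Objects such as the Document class expose an .as_dict() method (#730).
    return result.as_dict()
```

With the library installed, this could be used as `to_dict(bare_extraction(html, **migrate_kwargs(old_kwargs)))`, where `html` and `old_kwargs` stand for your own input and legacy options.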

Fixes:

set options.source before raising error on empty doc tree by @dmoklaf (#707)
robust encoding in options.source (#717)
more robust mapping for conversion to HTML (#721)
CLI downloads: use all information in settings file (#734)
downloads: cleaner urllib3 code (#736)
refine table markdown output by @unsleepy22 (#752)
extraction fix: images in text nodes by @unsleepy22 (#757)

Metadata:

more robust URL extraction (#710)

Command-line interface:

CLI: with --list, print URLs early for feeds and sitemaps, with @gremid (#744)
CLI: add 126 exit code for high error ratio (#747)
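
A wrapper script could react to the new exit code roughly as sketched below. This is an illustration under assumptions: `classify_exit` and `run_and_classify` are hypothetical helpers, the labels are made up for this example, and the child process is a stand-in for whatever trafilatura command you already run. Only the meaning of 126 (high error ratio) comes from the notes above.

```python
import subprocess
import sys

HIGH_ERROR_RATIO = 126  # exit code added in 2.0.0 (#747)


def classify_exit(code):
    """Map a process exit code to a coarse label (labels are illustrative)."""
    if code == 0:
        return "ok"
    if code == HIGH_ERROR_RATIO:
        return "high-error-ratio"
    return "failed"


def run_and_classify(cmd):
    """Run cmd (a list of arguments) and return (exit_code, label)."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, classify_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with 126; replace with a real CLI call.
    stand_in = [sys.executable, "-c", "import sys; sys.exit(126)"]
    code, label = run_and_classify(stand_in)
    print(code, label)
```

Keeping the classification separate from the process call makes it easy to decide per label whether a batch job should retry, alert, or continue.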

Maintenance:

remove already deprecated functions and args (#716)
add type hints (#723, #728)
setup: use pyproject.toml file (#715)
simplify code (#708, #709, #727)
better debug messages in main_extractor (#714)
evaluation: review data, update packages, add magic_html (#731)
setup: explicit exports through __all__ (#740)
tests: extend coverage (#753)

Documentation:

fix link in docs/index.html by @nzw0301 (#711)
remove docs from published packages (#743)
update docs (#745)

Contributors

unsleepy22, gremid, and nzw0301

10 Sep 12:43

adbar

v1.12.2

f57ef0b

trafilatura-1.12.2

downloads: add support for SOCKS proxies with @gremid (#682)
extraction fix: ValueError in table spans (#685)
spider: prune_xpath parameter added by @felipehertzer (#684)
spider: relax strict parameter for link extraction (#687)
sitemaps: max_sitemaps parameter added by @felipehertzer (#690)
maintenance: make compression libraries optional (#691)
metadata: review and lint code (#694)

Contributors

gremid and felipehertzer

20 Aug 10:58

adbar

v1.12.1

14c79c0

trafilatura-1.12.1

Navigation:

spider: restrict search to sections containing URL path (#673)
crawler: add parameter class and types, breaking change for undocumented functions (#675)
maintenance: simplify link discovery and extend tests (#674)
CLI: review code, add types and tests (#677)

Bugfixes:

fix AttributeError in element deletion (#668)
fix MemoryError in table header columns (#665)

Docs:

docs: fix variable name for extract_metadata in quickstart by @jpigla in #678

Contributors

jpigla

30 Jul 14:56

adbar

v1.12.0

c60395c

trafilatura-1.12.0

Breaking change:

enforce fixed list of output formats, deprecate -out on the CLI (#647)

Faster, more accurate extraction:

review link and structure checks (#653)
improve justext fallback (#652)
baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
review XPaths for undesirable content (#645)

Bugfixes and maintenance:

CLI fix: markdown format should trigger include_formatting (#649)
images fix: use a length threshold on src attribute (#654)
XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
formatting & markdown fix: add newlines (#656)
table fix: prevent MemoryError & ValueError during conversion to text (#658)

Documentation:

update crawls.rst: known is an unexpected argument, by @tommytyc in #638

Contributors

tommytyc

27 Jun 14:04

adbar

v1.11.0

60647e5

trafilatura-1.11.0

Breaking change:

metadata is now skipped by default (#613); to include it, in any output format, use:
- with_metadata=True (Python)
- --with-metadata (CLI)

Extraction:

add HTML as output format (#614)
better and faster baseline extraction (#619)
better handling of HTML/XML elements (#628)
XPath rules added with @felipehertzer (#540)
fix: avoid faulty readability_lxml content (#635)

Evaluation:

new scripts and data with @LydiaKoerber (#606, #615)
additional data with @swetepete (#197)

Maintenance:

docs extended and updated, added page on deduplication (#618)
review code, add tests and types in part of the submodules (#620, #623, #624, #625)

Contributors

felipehertzer, LydiaKoerber, and swetepete

30 May 15:45

adbar

v1.10.0

b36b6fa

trafilatura-1.10.0

Breaking changes:

raise errors on deprecated CLI and function arguments (#581)
regroup classes and functions linked to deduplication (#582)
- trafilatura.hashing → trafilatura.deduplication
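
Code that has to run against versions on both sides of this rename could try the new module path first and fall back to the old one. The sketch below is a generic pattern, not something the library ships: `import_first` is a hypothetical helper, and the demo prints a notice when trafilatura is not installed.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Also covers ModuleNotFoundError, which subclasses ImportError.
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


if __name__ == "__main__":
    try:
        # New location first, pre-1.10.0 location as a fallback.
        dedup = import_first("trafilatura.deduplication", "trafilatura.hashing")
        print(dedup.__name__)
    except ImportError:
        print("trafilatura is not installed")
```

Once the oldest supported version is 1.10.0 or later, a plain `import trafilatura.deduplication` is simpler and the helper can be dropped.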

Extraction:

port of is_probably_readerable from readability.js by @zirkelc in #587
Markdown table fixes by @naktinis in #601
fix list spacing in TXT output (#598)
CLI fixes: file processing options, mtime, and tests (#605)
CLI fix: read standard input as binary (#607)

Downloads:

fix deflate and add optional zstd to accepted encodings (#594)
spider fix: use internal download utilities for robots.txt (#590)

Maintenance:

add author XPaths (#567)
update justext and lxml dependencies (#593)
simplify code: unique function for length tests (#591)

Docs:

fix typos by @RainRat in #603

Contributors

naktinis, zirkelc, and RainRat

02 May 10:18

adbar

v1.9.0

11255bd

trafilatura-1.9.0

Extraction:

add markdown as explicit output (#550)
improve recall preset (#571)
speedup for readability-lxml (#547)
add global options object for extraction and use it in CLI (#552)
fix: better encoding detection (#548)
recall: fix for lists inside tables with @mikhainin (#534)
add symbol to preserve vertical spacing in Markdown (#499)
fix: table cell separators in non-XML output (#563)
slightly better accuracy and execution speed overall

Metadata:

add file creation date (date extraction, JSON & XML-TEI) (#561)
fix: empty content in meta tag by @felipehertzer (#545)

Maintenance:

restructure and simplify code (#543, #556)
CLI & downloads: revamp and use global options (#565)
eval: review code, add guidelines and small benchmark (#542)
fix: raise error if config file does not exist (#554)
deprecate process_record() (#549)
docs: convert readme to markdown and update info (#564, #578)

Contributors

mikhainin and felipehertzer

03 Apr 11:47

adbar

v1.8.1

d9d47a7

trafilatura-1.8.1

Maintenance:

Pin LXML to prevent broken dependency (#535)

Extraction:

Improve extraction accuracy for major news outlets (#530)
Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

Contributors

dlwh and knit-bee

20 Mar 15:24

adbar

v1.8.0

ff38644

trafilatura-1.8.0

Extraction:

Better precision by @felipehertzer (#509, #520)
Code formatting in TXT/Markdown output added (#498)
Improved CSV output (#496)
LXML: compile XPath expressions (#504)
Overall speedup of about 5%

Downloads and Navigation:

More robust scans with is_live_page() (#501)
Better sitemap start and safeguards (#503, #506)
Fix for headers in response object (#513)

Maintenance:

License changed to Apache 2.0
Response class: convenience functions added (#497)
lxml.html.Cleaner removed (#491)
CLI fixes: parallel cores and processing (#524)

Contributors

felipehertzer

25 Jan 13:05

adbar

v1.7.0

97dc088

trafilatura-1.7.0

Extraction:

improved html2txt() function (#483)

Downloads:

add advanced fetch_response() function
→ pending deprecation for fetch_url(decode=False)

Maintenance:

support for LXML v5+ (#484 by @knit-bee, #485)
update htmldate

Contributors

knit-bee
