I can't parse comments on Reddit pages (old.reddit.com). The issue is likely due to the comments being encapsulated within form elements, which the parser may not be handling correctly.
Steps to reproduce
Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"
Expected Behavior
trafilatura should successfully parse and extract all visible Reddit comments.
Actual Behavior
Only user names, points, and number of children are extracted:
You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement.
As for Reddit, the extractor is not made for social networks; you could use Reddit datasets directly.
You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement.
Yes, that could be a great improvement. I see other issues with unusual elements that have desirable content (#573). An option like this could be better than hard-coding edge cases.
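Until such an option exists, one possible workaround is to pull the wanted elements out of the HTML yourself. The following is a minimal sketch using only the Python standard library, not trafilatura's API. It assumes that comment bodies on old.reddit.com sit in `<div class="md">` blocks inside the `form` elements; that class name is an assumption about the page markup, and `CommentTextParser` and `extract_comments` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class CommentTextParser(HTMLParser):
    """Collect the text of every <div> carrying the target class."""

    def __init__(self, target_class="md"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class  # assumption: class of comment bodies
        self.depth = 0       # nesting depth inside the current target element
        self.buffer = []     # text chunks of the comment being read
        self.comments = []   # finished comment texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            # Already inside a target element: track nesting.
            self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Target element closed: normalize whitespace and store the text.
            text = " ".join(" ".join(self.buffer).split())
            if text:
                self.comments.append(text)
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_comments(html, target_class="md"):
    """Return the text of each element with the given class, in page order."""
    parser = CommentTextParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.comments
```

The HTML string could come from any downloader. The depth counter relies on every non-void tag being closed, so markup with unclosed `<p>` or `<li>` tags would need a more tolerant parser.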
adbar changed the title from "Can't parse Reddit comments on old.reddit.com through CLI" to "Add option to provide XPaths for content extraction" on May 21, 2024
Adding a --recall flag doesn't change anything. Is it possible to manually specify which elements should be parsed?