Expand shortened urls #42 (Draft)

abaumg wants to merge 5 commits into main

Conversation

abaumg commented Nov 13, 2022

Starting from the code uploaded by Atari-Frosch in #9, I added functionality to expand shortened URLs from known shorteners. This PR would fix #36 and #38 as well.

@rbairwell

I would actually prefer this to be a separate script - I've got Tweets going back to 2009 and I can imagine if I hammer some of those shorteners with requests, it's going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it'll allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.

parser.py Outdated
@@ -86,6 +92,32 @@ def tweet_json_to_markdown(tweet, username, archive_media_folder, output_media_f
body += f'\n\n(Originally on Twitter: [{timestamp_str}](https://twitter.com/{username}/status/{tweet_id_str}))'
return timestamp, body

def is_short_url(url):
    hostname = urlparse(url).hostname
    shorteners = ['t.co', '7ax.de', 'bit.ly', 'buff.ly', 'cnn.it', 'ct.de', 'flic.kr', 'go.shr.lc', 'ift.tt', 'instagr.am', 'is.gd', 'j.mp', 'ku-rz.de', 'p.dw.com', 'pl0p.de', 'spon.de', 'sz.de', 'tiny.cc', 'tinyurl.com', 'trib.al', 'wp.me', 'www.sz.de', 'yfrog.com']
    return hostname in shorteners


v.gd (a "partner" of is.gd) and https://goo.gl/ (Google's own) are two major ones I feel are missing.

timhutton (Owner) commented Nov 13, 2022

I agree with @rbairwell. Let's keep parser.py as a tool that parses the archive using only local data. Extended functionality like download_better_images.py and URL expansion can be handled by separate tools.

@abaumg How do you feel about changing the PR to:

  • move the code into a new script expand_urls.py that updates *.md in its local folder
  • add a comment to the end of parser.py to tell the user about the possibility of expanding the URLs

Do we want to add a small sleep into the loop to decrease the likelihood of being blocked? We do that for the image downloads.

[Edited to update suggestion]
[Edited because I now want everything in parser.py to make things easier for the users]

abaumg (Author) commented Nov 14, 2022

Sounds reasonable to me. I'll update the PR accordingly.

@Atari-Frosch

> I would actually prefer this to be a separate script - I've got Tweets going back to 2009 and I can imagine if I hammer some of those shorteners with requests, it's going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it'll allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.

It can be, and originally is, separate. :-) In that state it should be run over your archived tweets before using the parser. It is also easier to expand the list of shorteners, as I had stored them in a separate config file. I'm pretty sure there are more shorteners than I had listed.

Sjors commented Nov 14, 2022

> In that state it should be run over your archived tweets before using the parser.

This also makes more sense to me, because I'd rather not have to run the expansion again in order to tweak the markdown layout. Even better would be to store the mapping in a separate file, so the archive remains untouched (but I can just make a backup).

abaumg (Author) commented Nov 14, 2022

Oops, closed this PR by accident while syncing my fork with upstream and switching branches. Will reopen.

abaumg reopened this Nov 14, 2022

abaumg (Author) commented Nov 14, 2022

As suggested, I moved everything to a separate file. A map of the expanded links is saved as an *.ini file, although I'm not sure whether ConfigParser is the best approach here.

In addition, the script also tries to expand links in really old tweets, where there is no meta information, only plain text.

abaumg (Author) commented Nov 14, 2022

TODOs:

  • fix: the mapping file is regenerated on each run, as I struggled to find existing records with ConfigParser (a possible approach is sketched below)
  • use the mapping file in parser.py when generating the markdown
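
For what it's worth, reloading and querying an existing *.ini mapping with ConfigParser before adding new records might look roughly like the minimal sketch below. The file name url_mapping.ini, the section name urls, and the expand() callback are assumptions made for illustration, not code from this PR:

```python
# Minimal sketch, not code from this PR: reuse an existing .ini mapping so it
# is not regenerated on every run. File name and section name are assumptions.
import configparser

MAPPING_FILE = 'url_mapping.ini'  # hypothetical file name
SECTION = 'urls'                  # hypothetical section name

# Disable interpolation so '%' in URLs is not treated specially, restrict the
# key/value delimiter to '=' so the ':' in URL keys survives a round trip, and
# keep keys case-sensitive (ConfigParser lowercases option names by default).
config = configparser.ConfigParser(interpolation=None, delimiters=('=',))
config.optionxform = str
config.read(MAPPING_FILE)
if not config.has_section(SECTION):
    config.add_section(SECTION)

def lookup_or_expand(short_url, expand):
    """Return the cached expansion if present, otherwise expand and cache it."""
    if config.has_option(SECTION, short_url):
        return config.get(SECTION, short_url)
    expanded = expand(short_url)  # expand() does the actual online lookup
    config.set(SECTION, short_url, expanded)
    with open(MAPPING_FILE, 'w') as f:
        config.write(f)
    return expanded
```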

@timhutton (Owner)

@abaumg Do you need to read tweet*.js in this script? We already expand the URLs in parser.py using the JSON. I had imagined expand_urls.py would work by reading the *.md files and expanding the URLs found there using requests.
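
For reference, the kind of lookup meant here might look roughly like the following. This is a hedged sketch rather than code from the PR; the use of a HEAD request (some shorteners only answer GET) and the timeout value are assumptions:

```python
# Rough sketch only: follow redirects with requests and return the final URL.
import requests

def expand_url(short_url):
    """Resolve a shortened URL; fall back to the original if the lookup fails."""
    try:
        response = requests.head(short_url, allow_redirects=True, timeout=10)
        return response.url
    except requests.RequestException:
        return short_url
```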

@jwildeboer

If we add other output formats besides markdown as an option, it makes sense to go over tweets*.js; otherwise we need to create new code to insert the expanded URLs for each output format. Working with mapping files for both media and URLs makes life easier down the road ;)

timhutton (Owner) commented Nov 15, 2022

@jwildeboer I agree that we may want mapping files for media and URLs at some point. For now I think we should adopt the simplest possible solution: the script searches the output files for expandable URLs and replaces them.

[Edit: We now output html too, so updated comment to be more general.]

abaumg marked this pull request as draft November 19, 2022 16:59

abaumg (Author) commented Nov 19, 2022

@timhutton IMO it's easier to parse the structured tweet*.js JSON than to extract links from Markdown files, let alone to parse HTML. As @jwildeboer pointed out, mapping files make life easier. But you're the maintainer, you decide. So now that we already have two output formats, do we stick with modifying the MD output files, or should we go for mapping files?

ixs (Contributor) commented Nov 19, 2022

I opened PR #85 because the link handling for old tweets, from before the introduction of the t.co shortener, is pretty much absent.
Links are just plain text in the full_text key, and the entities.urls and entities.media keys are simply empty.

The PR adds handling for these links by extracting them from the tweet text and storing them in the in-memory tweets structure. With the link expander as an external script, those in-memory links are of course not visible, because the in-memory structure is gone by then...

So this might be a good thing to think about here: how do we want to handle such a situation? Should we just write out a JSON list of links in the main parser.py, then iterate over that list in expand_urls.py and expand the links?
A second run of parser.py could then check for expanded links in the JSON file and export the tweets correctly.
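
For illustration, extracting such plain-text links from full_text might look roughly like this; the regex and the dictionary access are assumptions, not the code from #85:

```python
# Illustration only, not the code from #85: pull plain-text links out of a
# tweet's full_text when the entities-based URL data is empty.
import re

URL_PATTERN = re.compile(r'https?://\S+')

def extract_plaintext_urls(tweet):
    """Return URLs found in full_text for tweets that have no URL entities."""
    if tweet.get('entities', {}).get('urls'):
        return []  # newer tweets are already handled via entities.urls
    return URL_PATTERN.findall(tweet.get('full_text', ''))
```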

yoshimo commented Nov 20, 2022

I've also seen zpr.io

@flauschzelle (Collaborator)

I've just read through this discussion and for me, it looks like a good solution would be like this:

  1. the parser asks 'do you want to expand shortened urls (using online lookup)?' before all the other parsing happens.
  2. If the mapping file (maybe just a JSON file if the config/ini format is difficult to handle?) doesn't exist yet, it is created (empty) and filled by local lookup in the archive files (the parser already does this lookup when parsing tweets, so it would only have to be moved to an earlier point in time, and the 'save to separate file' part would be new).
  3. if the user says yes to 1., the js files are searched for shortened URLs that can't be resolved locally (i.e. they don't already appear in the mapping file); those are then resolved by online lookup and added to the mapping file.
  4. whenever a shortened URL is encountered while parsing the archive, the parser can replace it with its expanded version by looking it up in the mapping file.

If you run the parser again later, it would not do new lookups for any URLs that are already saved in the mapping file, so the traffic load is kept to a minimum (a minimal sketch of such a mapping file follows below).
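
A minimal sketch of what the suggested JSON mapping file could look like in code; the file name url_mapping.json and the flat {short_url: expanded_url} structure are assumptions, not settled decisions:

```python
# Minimal sketch, assuming a flat JSON file mapping short URLs to expanded ones.
import json
import os

MAPPING_FILE = 'url_mapping.json'  # hypothetical file name

def load_mapping():
    """Load the short-to-expanded URL mapping, or start with an empty one."""
    if os.path.exists(MAPPING_FILE):
        with open(MAPPING_FILE, encoding='utf-8') as f:
            return json.load(f)
    return {}

def save_mapping(mapping):
    """Write the mapping back to disk so later runs can skip online lookups."""
    with open(MAPPING_FILE, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2)
```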

@timhutton (Owner)

Running these remote lookups is super slow (mostly because we sleep for 0.75s between each for fear of being limited). And we might have a lot of them to do. Is there any possibility of a bulk lookup feature being available somewhere?

timhutton (Owner) commented Nov 23, 2022

Someone on Mastodon said they hammer t.co to get the redirects in parallel and have never been rate-limited. If that's true of all the shorteners then maybe we can make this workable.

On my archive of 1371 tweets I have 267 URLs to expand. With the sleep turned off this takes 67 seconds. Someone with 10x more tweets might be looking at 10 minutes to retrieve the URLs. If we code it in such a way that it's not crucial that it finishes (as with the media downloads), and if we can parallelize it a bit, then I think we can run this at the end of parser.py.

(I'm getting the feeling that people don't mind leaving things running if they see the benefit. Someone today told me their media download took 14 hours and retrieved 12439 of 12459 files, with the missing 20 being 404s. They were delighted it had worked.)

@abaumg So to answer your question, maybe let's do the following:

  • If there's an existing cache urls_unshortened.txt that maps shortened URLs to their expanded versions, then we load and use that.
  • On the first pass we call parse_tweets() and parse_direct_messages(), passing the cache dict. It writes *.md and *.html with the existing URLs (or the ones found in the cache) and collects the ones that need unshortening.
  • We then ask the user if they want to try un-shortening N URLs (estimated time and KB). (Maybe we ask about all lengthy downloads before starting any of them.)
  • We run the retrieval, updating the urls_unshortened.txt cache file (a rough sketch of this step follows below).
  • We then call parse_tweets() and parse_direct_messages() again, writing out *.md and *.html again, with the updated URLs.
  • All the code is in parser.py.

So this way there's only one script to run, we don't duplicate the tweet-parsing code, and we don't need to parse mds or html. If they run the script again then it will take less time (because of the cache). If the retrievals crash or the computer gets turned off then no big deal, we can carry on from where we left off.
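
To make the retrieval step above concrete, here is a hedged sketch of how it could be parallelized while appending to the urls_unshortened.txt cache. The thread count, the one-mapping-per-line file format, and the unshorten() helper (same idea as the expand_url sketch earlier) are assumptions, not the final design:

```python
# Sketch of the retrieval step, assuming a "short_url expanded_url" per-line
# cache format, 10 worker threads, and requests-based redirect following.
from concurrent.futures import ThreadPoolExecutor
import requests

CACHE_FILE = 'urls_unshortened.txt'

def load_cache():
    """Read previously expanded URLs so interrupted runs can resume."""
    cache = {}
    try:
        with open(CACHE_FILE, encoding='utf-8') as f:
            for line in f:
                short, _, expanded = line.strip().partition(' ')
                if expanded:
                    cache[short] = expanded
    except FileNotFoundError:
        pass
    return cache

def unshorten(url):
    try:
        return url, requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url, url  # keep the original if the lookup fails

def unshorten_all(urls, cache):
    """Expand every uncached URL in parallel, appending results as they arrive."""
    todo = [u for u in urls if u not in cache]
    with ThreadPoolExecutor(max_workers=10) as pool, \
            open(CACHE_FILE, 'a', encoding='utf-8') as f:
        for short, expanded in pool.map(unshorten, todo):
            cache[short] = expanded
            f.write(f'{short} {expanded}\n')  # progress survives a crash
    return cache
```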

Thoughts? Sorry for the churn on the thinking in the discussion above. We've come a long way in 10 days.

timhutton (Owner) commented Nov 24, 2022

Ways to break up the work into smaller PRs:

  • just the function to attempt to unshorten a URL
  • just the code to make use of an unshortening cache dict in parse_tweets
  • just the code to return from parse_tweets/DMs a list of URLs that can be unshortened
  • just the code to load and save the unshortening cache

[Edit: I've just seen that @ixs's #83 also caches user handles, and would be easily extended to URLs.]

slorquet commented Feb 2, 2023

Hi, I also found URLs shortened via wp.me in my archive. Is there a repository of shortening services to expand?


Successfully merging this pull request may close these issues.

Expand is.gd