Expand shortened URLs #42
base: main
Conversation
I would actually prefer this to be a separate script - I've got Tweets going back to 2009 and I can imagine if I hammer some of those shorteners with requests, it's going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it'll allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.
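For illustration only, a rough sketch of what such a standalone pass over the generated .md files could look like; the output path, the regex, and the `expand_url` helper are my assumptions, not code from this PR:

```python
import glob
import re
import time

import requests

# Hypothetical pattern: extend to the other shorteners as needed.
SHORT_URL_RE = re.compile(r'https?://t\.co/\w+')

def expand_url(url):
    """Follow redirects and return the final URL; keep the original on any error."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url

def expand_file(path):
    """Rewrite one markdown file with its short links expanded."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    new_text = SHORT_URL_RE.sub(lambda m: expand_url(m.group(0)), text)
    if new_text != text:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(new_text)

for md_path in glob.glob('output/*.md'):   # assumed output location
    expand_file(md_path)
    time.sleep(0.75)  # be gentle with the shortener services
```

Because each file is rewritten in place as soon as its links are resolved, aborting the script and restarting it later (even from a different IP address) only redoes the files that still contain short links.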
parser.py (outdated)

```
@@ -86,6 +92,32 @@ def tweet_json_to_markdown(tweet, username, archive_media_folder, output_media_f
    body += f'\n\n(Originally on Twitter: [{timestamp_str}](https://twitter.com/{username}/status/{tweet_id_str}))'
    return timestamp, body


def is_short_url(url):
    hostname = urlparse(url).hostname
    shorteners = ['t.co', '7ax.de', 'bit.ly', 'buff.ly', 'cnn.it', 'ct.de', 'flic.kr', 'go.shr.lc', 'ift.tt', 'instagr.am', 'is.gd', 'j.mp', 'ku-rz.de', 'p.dw.com', 'pl0p.de', 'spon.de', 'sz.de', 'tiny.cc', 'tinyurl.com', 'trib.al', 'wp.me', 'www.sz.de', 'yfrog.com']
```
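The hunk is cut off here; presumably the function just checks the hostname against that list. A minimal completed version (the return line is my guess, and the list is abbreviated) would look like:

```python
from urllib.parse import urlparse

def is_short_url(url):
    """True if the URL's hostname is one of the known link shorteners."""
    hostname = urlparse(url).hostname
    shorteners = ['t.co', 'bit.ly', 'tinyurl.com']  # abbreviated; full list as in the diff above
    return hostname in shorteners
```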
v.gd (a "partner" of is.gd) and https://goo.gl/ (Google's own shortener) are two major ones I feel are missing.
@abaumg How do you feel about changing the PR to:
Do we want to add a small sleep into the loop to decrease the likelihood of being blocked? We do that for the image downloads. [Edited to update suggestion]
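For reference, the image downloader's pattern applied here would just be something like the following; the 0.75 s value is a placeholder, `short_urls` is a stand-in name, and `expand_url` is the hypothetical resolver sketched earlier:

```python
import time

expanded = {}
for url in short_urls:                 # short_urls: the collected shortened links (placeholder name)
    expanded[url] = expand_url(url)    # expand_url as in the earlier sketch
    time.sleep(0.75)                   # small pause between lookups to reduce the risk of being blocked
```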
Sounds reasonable to me. I'll update the PR accordingly.
It can be, and originally was, separate. :-) In that state it should be run over your archived tweets before using the parser. It is also easier to expand the list of shorteners that way, as I had stored them in a separate config file. I'm pretty sure there are more shorteners than the ones I had listed.
This also makes more sense to me, because I'd rather not have to run the expansion again in order to tweak the markdown layout. Even better would be to store the mapping in a separate file, so the archive remains untouched (but I can just make a backup).
Oops, closed this PR by accident while syncing my fork with upstream and switching branches. Will reopen.
As suggested, I moved everything to a separate script. A map of the expanded links is saved to its own file. In addition, the script also tries to expand links in really old tweets, where there is no meta information, only plain text.
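To illustrate the plain-text case (very old tweets carry no entity metadata, so candidate URLs have to be pulled out of the raw text) and the saved map, here's a sketch; the regex, the trailing-punctuation handling, and the `url_mapping.json` file name are my assumptions rather than what the PR necessarily does, and `is_short_url` is the function from the diff above:

```python
import json
import re

URL_RE = re.compile(r'https?://\S+')

def urls_from_plain_text(tweet_text):
    """Old tweets have no URL entities, so scan the raw text for shortened links."""
    candidates = (u.rstrip('.,;)') for u in URL_RE.findall(tweet_text))
    return [u for u in candidates if is_short_url(u)]

def save_mapping(mapping, path='url_mapping.json'):
    """Persist the short-URL -> expanded-URL map so the archive files stay untouched."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
```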
TODOs:
@abaumg Do you need to read
Should we add other output formats besides markdown as an option, it makes sense to go over tweets*.js; otherwise we need to create new code to add the expanded URLs for each output format. Working with mapping files for both media and URLs makes life easier down the road ;)
@jwildeboer I agree that we may want mapping files for media and URLs at some point. For now, I think we should adopt the simplest possible solution: search the output files for expandable URLs and replace them. [Edit: We now output html too, so I've updated this comment to be more general.]
@timhutton IMO it's easier to parse structured
I opened up PR #85, as the link handling for old tweets from before the introduction of the t.co shortener is pretty much absent. That PR adds handling for these links by extracting them from the tweet text and storing them in the in-memory tweets structure. With the link-expander as an external script, these in-memory links are of course not visible, because the in-memory structure is gone by then... So this might be a good thing to think about here: how do we want to handle that situation? Should we just write out a JSON list of links in the main
I've also seen
I've just read through this discussion and for me, it looks like a good solution would be like this:
If you run the parser again later, it would not do new lookups for any URLs that are already saved in the mapping file, so the traffic load is kept to a minimum.
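A sketch of that skip-if-already-mapped behaviour, reusing the hypothetical `url_mapping.json`, `save_mapping`, and `expand_url` names from the earlier sketches (all placeholders, not settled API):

```python
import json
import os

def load_mapping(path='url_mapping.json'):
    """Load the existing short-URL map, or start fresh if there is none yet."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    return {}

def resolve_new_urls(short_urls, path='url_mapping.json'):
    """Look up only the URLs that are not already in the mapping file."""
    mapping = load_mapping(path)
    for url in short_urls:
        if url in mapping:
            continue                    # already resolved on an earlier run: no new lookup
        mapping[url] = expand_url(url)  # expand_url as sketched earlier in the thread
        save_mapping(mapping, path)     # write after each lookup so an abort loses nothing
    return mapping
```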
Running these remote lookups is super slow (mostly because we sleep for 0.75 s between each one for fear of being rate-limited), and we might have a lot of them to do. Is there any possibility of a bulk lookup feature being available somewhere?
Someone on Mastodon said they hammer t.co to get the redirects in parallel and have never been rate-limited. If that's true of all the shorteners then maybe we can make this workable. On my archive of 1371 tweets I have 267 URLs to expand. With the sleep turned off this takes 67 seconds, so someone with 10x more tweets might be looking at around 10 minutes to retrieve the URLs. If we code it in such a way that it's not crucial that it finishes (as with the media downloads), and if we can parallelize it a bit, then I think we can run this at the end of parser.py. (I'm getting the feeling that people don't mind leaving things running if they see the benefit. Someone today told me their media download took 14 hours and retrieved 12439 of 12459 files, with the missing 20 being 404s. They were delighted it had worked.) @abaumg So to answer your question, maybe let's do the following:
So this way there's only one script to run, we don't duplicate the tweet-parsing code, and we don't need to parse the generated .md or .html files. If the script is run again it will take less time (because of the cache). If the retrievals crash or the computer gets turned off then it's no big deal; we can carry on from where we left off. Thoughts? Sorry for the churn in my thinking in the discussion above. We've come a long way in 10 days.
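To make the "cached, parallel, abort-safe retrieval at the end of parser.py" idea concrete, here's a hedged sketch; the worker count, timeout, and function names are all assumptions rather than the final design:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def expand_url(url, timeout=10):
    """Follow redirects to the final URL; on failure, return the short URL unchanged."""
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).url
    except requests.RequestException:
        return url

def resolve_in_parallel(short_urls, mapping, workers=8):
    """Resolve only the URLs missing from the mapping, a few at a time."""
    todo = [u for u in set(short_urls) if u not in mapping]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, expanded in zip(todo, pool.map(expand_url, todo)):
            mapping[url] = expanded
    return mapping
```

If the mapping is persisted to disk between runs, a crash or shutdown only costs the URLs that were still unresolved, and a second run picks up from the cache.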
Ways to break up the work into smaller PRs:
[Edit: I've just seen that @ixs's #83 also caches user handles, and would be easily extended to URLs.]
Hi, I also found URLs shortened via wp.me in my archive. Is there a repository of shortening services to expand?
Starting from the code uploaded by Atari-Frosch in #9, I added functionality to expand shortened URLs from known shorteners. This PR would fix #36 and #38 as well.