Expand shortened urls #42 (Draft)

abaumg wants to merge 5 commits into main

Conversation

abaumg commented Nov 13, 2022

Starting from the code uploaded by Atari-Frosch in #9, I added functionality to expand shortened URLs from known shorteners. This PR would fix #36 and #38 as well.

@rbairwell

I would actually prefer this to be a separate script - I've got Tweets going back to 2009 and I can imagine if I hammer some of those shorteners with requests, it's going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it'll allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.

parser.py Outdated
@@ -86,6 +92,32 @@ def tweet_json_to_markdown(tweet, username, archive_media_folder, output_media_f
body += f'\n\n(Originally on Twitter: [{timestamp_str}](https://twitter.com/{username}/status/{tweet_id_str}))'
return timestamp, body

def is_short_url(url):
    hostname = urlparse(url).hostname
    shorteners = ['t.co', '7ax.de', 'bit.ly', 'buff.ly', 'cnn.it', 'ct.de', 'flic.kr', 'go.shr.lc', 'ift.tt', 'instagr.am', 'is.gd', 'j.mp', 'ku-rz.de', 'p.dw.com', 'pl0p.de', 'spon.de', 'sz.de', 'tiny.cc', 'tinyurl.com', 'trib.al', 'wp.me', 'www.sz.de', 'yfrog.com']
    return hostname in shorteners


v.gd (a "partner" of is.gd) and https://goo.gl/ (Google's own) are two major ones I feel are missing.

timhutton (Owner) commented Nov 13, 2022

I agree with @rbairwell. Let's keep parser.py as a tool that parses the archive using only local data. Extended functionality like download_better_images.py and URL expansion can be handled by separate tools.

@abaumg How do you feel about changing the PR to:

  • move the code into a new script expand_urls.py that updates *.md in its local folder
  • add a comment to the end of parser.py to tell the user about the possibility of expanding the URLs

Do we want to add a small sleep into the loop to decrease the likelihood of being blocked? We do that for the image downloads.

[Edited to update suggestion]
[Edited because I now want everything in parser.py to make things easier for the users]

abaumg (Author) commented Nov 14, 2022

Sounds reasonable to me. I'll update the PR accordingly.

@Atari-Frosch

> I would actually prefer this to be a separate script - I've got Tweets going back to 2009 and I can imagine if I hammer some of those shorteners with requests, it's going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it'll allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.

It can be, and originally is, separate. :-) In that state it should be run over your archived tweets before using the parser. It is also easier to expand the list of shorteners, as I had stored them in a separate config file. I'm pretty sure there are more shorteners than I had listed.

Sjors commented Nov 14, 2022

> In that state it should be run over your archived tweets before using the parser.

This also makes more sense to me, because I'd rather not have to run the expansion again in order to tweak the markdown layout. Even better would be to store the mapping in a separate file, so the archive remains untouched (but I can just make a backup).

abaumg (Author) commented Nov 14, 2022

Oops, closed this PR by accident while syncing my fork with upstream and switching branches. Will reopen.

abaumg reopened this Nov 14, 2022

abaumg (Author) commented Nov 14, 2022

As suggested, I moved everything to a separate file. A map of the expanded links is saved as an *.ini file, although I'm not sure whether ConfigParser is the best approach here.

In addition, the script also tries to expand links in really old tweets, where there is no meta information, only plain text.

abaumg (Author) commented Nov 14, 2022

TODOs:

  • fix: the mapping file is regenerated on each run, as I struggled to find existing records with ConfigParser (a possible approach is sketched below)
  • use the mapping file in parser.py when generating the markdown
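
For what it's worth, reloading and querying an existing *.ini mapping with ConfigParser before adding new records might look roughly like the minimal sketch below. The file name url_mapping.ini, the section name urls, and the expand() callback are assumptions made for illustration, not code from this PR:

```python
# Minimal sketch, not code from this PR: reuse an existing .ini mapping so it
# is not regenerated on every run. File name and section name are assumptions.
import configparser

MAPPING_FILE = 'url_mapping.ini'  # hypothetical file name
SECTION = 'urls'                  # hypothetical section name

# Disable interpolation so '%' in URLs is not treated specially, restrict the
# key/value delimiter to '=' so the ':' in URL keys survives a round trip, and
# keep keys case-sensitive (ConfigParser lowercases option names by default).
config = configparser.ConfigParser(interpolation=None, delimiters=('=',))
config.optionxform = str
config.read(MAPPING_FILE)
if not config.has_section(SECTION):
    config.add_section(SECTION)

def lookup_or_expand(short_url, expand):
    """Return the cached expansion if present, otherwise expand and cache it."""
    if config.has_option(SECTION, short_url):
        return config.get(SECTION, short_url)
    expanded = expand(short_url)  # expand() does the actual online lookup
    config.set(SECTION, short_url, expanded)
    with open(MAPPING_FILE, 'w') as f:
        config.write(f)
    return expanded
```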

@timhutton (Owner)

@abaumg Do you need to read tweet*.js in this script? We already expand the URLs in parser.py using the JSON. I had imagined expand_urls.py would work by reading the *.md files and expanding the URLs found there using requests.
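
For reference, the kind of lookup meant here might look roughly like the following. This is a hedged sketch rather than code from the PR; the use of a HEAD request (some shorteners only answer GET) and the timeout value are assumptions:

```python
# Rough sketch only: follow redirects with requests and return the final URL.
import requests

def expand_url(short_url):
    """Resolve a shortened URL; fall back to the original if the lookup fails."""
    try:
        response = requests.head(short_url, allow_redirects=True, timeout=10)
        return response.url
    except requests.RequestException:
        return short_url
```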

@jwildeboer

If we add other output formats besides markdown as an option, it makes sense to go over tweets*.js; otherwise we need to create new code to insert the expanded URLs for each output format. Working with mapping files for both media and URLs makes life easier down the road ;)

timhutton (Owner) commented Nov 15, 2022

@jwildeboer I agree that we may want mapping files for media and URLs at some point. For now I think we should adopt the simplest possible solution: the script searches the output files for expandable URLs and replaces them.

[Edit: We now output html too, so updated comment to be more general.]

abaumg marked this pull request as draft November 19, 2022 16:59

abaumg (Author) commented Nov 19, 2022

@timhutton IMO it's easier to parse the structured tweet*.js JSON than to extract links from Markdown files, let alone to parse HTML. As @jwildeboer pointed out, mapping files make life easier. But you're the maintainer, you decide. So now that we already have two output formats, do we stick with modifying the MD output files, or should we go for mapping files?

ixs (Contributor) commented Nov 19, 2022

I opened PR #85 because the link handling for old tweets, from before the introduction of the t.co shortener, is pretty much absent.
Links are just plain text in the full_text key, and the entities.urls and entities.media keys are simply empty.

The PR adds handling for these links by extracting them from the tweet text and storing them in the in-memory tweets structure. With the link expander as an external script, those in-memory links are of course not visible, because the in-memory structure is gone by then...

So this might be a good thing to think about here: how do we want to handle such a situation? Should we just write out a JSON list of links in the main parser.py, then iterate over that list in expand_urls.py and expand the links?
A second run of parser.py could then check for expanded links in the JSON file and export the tweets correctly.
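
For illustration, extracting such plain-text links from full_text might look roughly like this; the regex and the dictionary access are assumptions, not the code from #85:

```python
# Illustration only, not the code from #85: pull plain-text links out of a
# tweet's full_text when the entities-based URL data is empty.
import re

URL_PATTERN = re.compile(r'https?://\S+')

def extract_plaintext_urls(tweet):
    """Return URLs found in full_text for tweets that have no URL entities."""
    if tweet.get('entities', {}).get('urls'):
        return []  # newer tweets are already handled via entities.urls
    return URL_PATTERN.findall(tweet.get('full_text', ''))
```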

yoshimo commented Nov 20, 2022

I've also seen zpr.io

@flauschzelle (Collaborator)

I've just read through this discussion and for me, it looks like a good solution would be like this:

  1. the parser asks 'do you want to expand shortened urls (using online lookup)?' before all the other parsing happens.
  2. If the mapping file (maybe just a JSON file if the config/ini format is difficult to handle?) doesn't exist yet, it is created (empty) and filled by local lookup in the archive files (the parser already does this lookup when parsing tweets, so it would only have to be moved to an earlier point in time, and the 'save to separate file' part would be new).
  3. if the user says yes to 1., the js files are searched for shortened URLs that can't be resolved locally (i.e. they don't already appear in the mapping file); those are then resolved by online lookup and added to the mapping file.
  4. whenever a shortened URL is encountered while parsing the archive, the parser can replace it with its expanded version by looking it up in the mapping file.

If you run the parser again later, it would not do new lookups for any URLs that are already saved in the mapping file, so the traffic load is kept to a minimum (a minimal sketch of such a mapping file follows below).
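
A minimal sketch of what the suggested JSON mapping file could look like in code; the file name url_mapping.json and the flat {short_url: expanded_url} structure are assumptions, not settled decisions:

```python
# Minimal sketch, assuming a flat JSON file mapping short URLs to expanded ones.
import json
import os

MAPPING_FILE = 'url_mapping.json'  # hypothetical file name

def load_mapping():
    """Load the short-to-expanded URL mapping, or start with an empty one."""
    if os.path.exists(MAPPING_FILE):
        with open(MAPPING_FILE, encoding='utf-8') as f:
            return json.load(f)
    return {}

def save_mapping(mapping):
    """Write the mapping back to disk so later runs can skip online lookups."""
    with open(MAPPING_FILE, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2)
```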

@timhutton (Owner)

Running these remote lookups is super slow (mostly because we sleep for 0.75s between each for fear of being limited). And we might have a lot of them to do. Is there any possibility of a bulk lookup feature being available somewhere?

timhutton (Owner) commented Nov 23, 2022

Someone on Mastodon said they hammer t.co to get the redirects in parallel and have never been rate-limited. If that's true of all the shorteners then maybe we can make this workable.

On my archive of 1371 tweets I have 267 URLs to expand. With the sleep turned off this takes 67 seconds. Someone with 10x more tweets might be looking at 10 minutes to retrieve the URLs. If we code it in such a way that it's not crucial that it finishes (as with the media downloads), and if we can parallelize it a bit, then I think we can run this at the end of parser.py.

(I'm getting the feeling that people don't mind leaving things running if they see the benefit. Someone today told me their media download took 14 hours and retrieved 12439 of 12459 files, with the missing 20 being 404s. They were delighted it had worked.)

@abaumg So to answer your question, maybe let's do the following:

  • If there's an existing cache urls_unshortened.txt that maps shortened URLs to their expanded versions, then we load and use that.
  • On the first pass we call parse_tweets() and parse_direct_messages(), passing the cache dict. It writes *.md and *.html with the existing URLs (or the ones found in the cache) and collects the ones that need unshortening.
  • We then ask the user if they want to try un-shortening N URLs (estimated time and KB). (Maybe we ask about all lengthy downloads before starting any of them.)
  • We run the retrieval, updating the urls_unshortened.txt cache file (a rough sketch of this step follows below).
  • We then call parse_tweets() and parse_direct_messages() again, writing out *.md and *.html again, with the updated URLs.
  • All the code is in parser.py.

So this way there's only one script to run, we don't duplicate the tweet-parsing code, and we don't need to parse mds or html. If they run the script again then it will take less time (because of the cache). If the retrievals crash or the computer gets turned off then no big deal, we can carry on from where we left off.
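
To make the retrieval step above concrete, here is a hedged sketch of how it could be parallelized while appending to the urls_unshortened.txt cache. The thread count, the one-mapping-per-line file format, and the unshorten() helper (same idea as the expand_url sketch earlier) are assumptions, not the final design:

```python
# Sketch of the retrieval step, assuming a "short_url expanded_url" per-line
# cache format, 10 worker threads, and requests-based redirect following.
from concurrent.futures import ThreadPoolExecutor
import requests

CACHE_FILE = 'urls_unshortened.txt'

def load_cache():
    """Read previously expanded URLs so interrupted runs can resume."""
    cache = {}
    try:
        with open(CACHE_FILE, encoding='utf-8') as f:
            for line in f:
                short, _, expanded = line.strip().partition(' ')
                if expanded:
                    cache[short] = expanded
    except FileNotFoundError:
        pass
    return cache

def unshorten(url):
    try:
        return url, requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url, url  # keep the original if the lookup fails

def unshorten_all(urls, cache):
    """Expand every uncached URL in parallel, appending results as they arrive."""
    todo = [u for u in urls if u not in cache]
    with ThreadPoolExecutor(max_workers=10) as pool, \
            open(CACHE_FILE, 'a', encoding='utf-8') as f:
        for short, expanded in pool.map(unshorten, todo):
            cache[short] = expanded
            f.write(f'{short} {expanded}\n')  # progress survives a crash
    return cache
```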

Thoughts? Sorry for the churn on the thinking in the discussion above. We've come a long way in 10 days.

timhutton (Owner) commented Nov 24, 2022

Ways to break up the work into smaller PRs:

  • just the function to attempt to unshorten a URL
  • just the code to make use of an unshortening cache dict in parse_tweets
  • just the code to return from parse_tweets/DMs a list of URLs that can be unshortened
  • just the code to load and save the unshortening cache

[Edit: I've just seen that @ixs's #83 also caches user handles, and would be easily extended to URLs.]

slorquet commented Feb 2, 2023

Hi, I also found URLs shortened via wp.me in my archive. Is there a repository of shortening services to expand?


Successfully merging this pull request may close these issues.

Expand is.gd