Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More work on downloadtweets #122

Open
wants to merge 81 commits into
base: downloadtweets
Choose a base branch
from

Conversation

lenaschimmel
Copy link
Collaborator

@lenaschimmel lenaschimmel commented Nov 25, 2022

Here's some work which I already did 3 days ago, but had not made into a PR before.

Yesterday I incorporated the newest stuff from timhutton/twitter-archive-parser/main into both timhutton/twitter-archive-parser/downloadtweets and lenaschimmel/twitter-archive-parser/downloadtweets because I thought that would make it easier to keep everything up to date and focus on the actual (non-merge) commits.

What's actually included

This is just more WIP on the download tweets feature:

  • Error handling in get_tweets - if it fails, return the tweets we already downloaded, plus the ids of tweets that are still missing
  • Add merge method, which can basically merge all kinds of python values, but contains some special treatments for dicts representing tweets. This should make sure that if a tweet is contained in the archive, and we download that same tweet via the API, we can merge them without losing any information.
  • add helper method has_path which simplifies those chained checks that we have all over the code, like: if 'entities' in tweet and 'user_mentions' in tweet['entities'] and tweet['entities']['user_mentions'] is not None
  • save downloaded tweets into known_tweets.json, and load them on next script execution, so that we don't reload the same tweets over and over again.
  • annotate downloaded tweets with attributes like from_api, download_with_user, download_with_alt_text so that collect_tweet_references can make better decisions on what do re-download, and whether to follow references or not
  • improved merging and comparing of tweet JSON objects. Those are some really nasty data structures!

lenaschimmel and others added 24 commits November 22, 2022 22:08
This is mainly archieved by adding `create_path_for_file_` methods to `PathConfig` and using them, and using `rel_url` to link to media files.
@lenaschimmel lenaschimmel changed the title More work on *downloadtweets* (Probably broken PR diff) More work on downloadtweets (Probably broken PR diff) Nov 25, 2022
@lenaschimmel lenaschimmel changed the title More work on downloadtweets (Probably broken PR diff) More work on downloadtweets Nov 27, 2022
@lenaschimmel
Copy link
Collaborator Author

The online diff view is no longer broken 🎉

@slorquet
Copy link

slorquet commented Feb 2, 2023

Hi, any chance all this nice stuff gets integrated? Thanks!

@Sjors
Copy link

Sjors commented Feb 18, 2023

@slorquet I think this is still work in progress so can't be merged yet?

@lenaschimmel are you still working on this? Can you rebase it, so it's easier to review? I'd like to get as much out of the sinking blue ship as possible :-)

@lenaschimmel
Copy link
Collaborator Author

@Sjors I'm not really working on that branch, or twitter-archive-parser, any more since mid December 22. I've since deleted my own twitter* account, so I don't have much motivation to continue, and it's harder to test. And even if I really wanted to, I currently don't have any time to continue working on it.

Until then, I focused on the weirdly-named branch lenaschimmel:archivemode, which already includes all commits from downloadtweets. I'd say the archivemode branch is in rather good shape now, and can be used and/or merged. It is also up-to-date with timhutton:main so it could even be fast-forwarded. But I don't know if @timhutton wants to go into that direction.


* To be precise, I deleted my main twitter account. I still have some secondary ones lying around, but with much less content, thus less edge cases that are interesting for testing new code

@almereyda
Copy link

@lenaschimmel Would you like to document the fact that the other branch is more worthy of being merged by closing/drafting this one, and opening a new PR?

This little step could also make it easier for third-parties, like me, to discover your work and rely on it instead, in case @timhutton would not be available for further deliberation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants