Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep download state and do not attempt to redownload images and user handles #83

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ixs
Copy link
Contributor

@ixs ixs commented Nov 19, 2022

Add a new json file (media/media_state.json) that keeps track of
download status for media files.

When re-running the archive-parser this file is consulted and if
a media file has previously been downloaded, skip any attempt to
check file size etc.

This is great for testing code changes etc. because we're not
hammering the twitter servers anymore. No need to hasten their
demise.

Added: Persistent state file for media downloads.
Added: Skip downloads if media state indicates, we previously
had a successful download.
Drive-By: Capitalize the N choice in prompts (y/N) indicating
the default choice.

@ixs ixs force-pushed the media_state branch 2 times, most recently from 2949633 to 7aad06d Compare November 21, 2022 11:52
Add a new json file (media/media_state.json) that keeps track of
download status for media files.

When re-running the archive-parser this file is consulted and if
a media file has previously been downloaded, skip any attempt to
check file size etc.

This is great for testing code changes etc. because we're not
hammering the twitter servers anymore. No need to hasten their
demise.

Added: Persistent state file for media downloads.
Added: Skip downloads if media state indicates, we previously
       had a successful download.
Drive-By: Capitalize the N choice in prompts (y/N) indicating
          the default choice.
@timhutton
Copy link
Owner

Hi @ixs. Sorry, I've completely ignored this PR. Thanks for sending it.

@ixs
Copy link
Contributor Author

ixs commented Nov 23, 2022

No worries. Reworking this right now to also keep the user data...
New PR coming in a few mins.

@ixs
Copy link
Contributor Author

ixs commented Nov 23, 2022

Reworked the download state cache a bit to also work for user lookups.
I'm not super happy with the way we're now passing down the state dictionary down three functions to the actual get_twitter_users() function but the alternative would be a global variable which is also not that nice...

Maybe should rework everything into a class and then have a class level state... 🤣

I'd appreciate a look at the get_twitter_users() logic, I believe I am trimming correctly and caching correctly but there are a bunch of accounts that parser.py is trying to download over and over again. It looks like these accounts are deleted accounts that don't exist on the platform anymore, but I'd like to have a second pair of eyes on that.

@timhutton timhutton changed the title Keep download state and do not attempt to redownload images Keep download state and do not attempt to redownload images and user handles Nov 24, 2022
@timhutton timhutton mentioned this pull request Nov 24, 2022
@@ -623,6 +632,13 @@ def main():

users = {}
Copy link
Owner

@timhutton timhutton Nov 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state["users"] seems to duplicate users?

I like the idea of have a single cache mechanism for users, media, unshortened URLs, so let's delete users. Unless I've missed something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not explicitly duplicate users but in effect it does.
Good point removing that.

# Use our state store to prevent duplicate downloads
try:
with open(state_path, 'r') as state_file:
state = json.load(state_file)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state is a very general word. Would cache be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it started out, it was only about keeping download state as the downloaded images were already on disk.
So we were not really caching any data.

But now that we're actually handling user data, sure. We can relabel to cache.

@@ -110,7 +112,7 @@ def lookup_users(user_ids, users):
with requests.Session() as session:
Copy link
Owner

@timhutton timhutton Nov 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print(f'{len(filtered_user_ids)} users are unknown.')

This line is now misleading because many of these are in state and will get filtered out in get_twitter_users(). Can we do the filtering earlier, so that we can tell the users how many handles we need to download?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we're removing the duplication of users = {}, sure. We can do the filtering earlier... Also saves us from passing the dict pointer three layers deep...

@@ -598,6 +606,7 @@ def main():
data_folder = os.path.join(input_folder, 'data')
account_js_filename = os.path.join(data_folder, 'account.js')
log_path = os.path.join(output_media_folder_name, 'download_log.txt')
state_path = 'download_state.json'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to PathConfig and use cache if we agree on that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: the class PathConfig has been merged into main today via PR #115, but it does not have a path for cache yet.

I have just now published PR draft #120 which shall solve #99, and it introduces PathConfig.dir_output_cache, which is only used for a single file there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #120 is merged now, so you can use state_path = os.path.join(paths.dir_output_cache, 'download_state.json')

@timhutton
Copy link
Owner

@ixs Passing a value down through several layers of functions is completely fine. Large classes are bad because they just become a repository of almost-globals. Even small classes with both data and member functions are often bad because they're stateful. Classes with just data or just functions are fine.

lenaschimmel added a commit to lenaschimmel/twitter-archive-parser that referenced this pull request Dec 16, 2022
…n PR timhutton#83 by @ixs on upstream.

(State for user handles is not needed since we already handle that separately.)
@cooljeanius
Copy link

There are some merge conflicts now; try rebasing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants