Keep download state and do not attempt to redownload images and user handles #83
Add a new JSON file (media/media_state.json) that keeps track of download status for media files. When re-running the archive-parser, this file is consulted, and if a media file has previously been downloaded, any attempt to check file size etc. is skipped.

This is great for testing code changes because we're not hammering the Twitter servers anymore. No need to hasten their demise.

Added: Persistent state file for media downloads.
Added: Skip downloads if the media state indicates a previous successful download.
Drive-By: Capitalize the N choice in prompts (y/N) to indicate the default choice.
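For illustration, the mechanism described above could look roughly like the sketch below. It is not the code from this PR: the helper names `load_state`, `save_state` and `download_if_needed`, the `fetch` callback, and the `{'success': ...}` record layout are all assumptions made for the example.

```python
import json


def load_state(state_path):
    """Return the saved download state, or an empty dict if none is usable."""
    try:
        with open(state_path, 'r') as state_file:
            return json.load(state_file)
    except (FileNotFoundError, json.JSONDecodeError):
        # First run, or a damaged file: start with an empty state.
        return {}


def save_state(state_path, state):
    """Write the state dict back to disk as JSON."""
    with open(state_path, 'w') as state_file:
        json.dump(state, state_file, indent=2)


def download_if_needed(url, state, fetch):
    """Call fetch(url) unless state already records a successful download.

    `fetch` is a hypothetical callable that returns a truthy value on success.
    Returns True if a download was attempted, False if it was skipped.
    """
    if state.get(url, {}).get('success'):
        return False
    state[url] = {'success': bool(fetch(url))}
    return True
```

Loading once at startup and saving once at the end keeps the file handling out of the download loop; a failed download is recorded with `success` set to false, so it is retried on the next run.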
Hi @ixs. Sorry, I've completely ignored this PR. Thanks for sending it.
No worries. Reworking this right now to also keep the user data...
…ry around. also try to cache user lookups
Reworked the download state cache a bit to also work for user lookups. Maybe should rework everything into a class and then have a class level state... 🤣 I'd appreciate a look at the
@@ -623,6 +632,13 @@ def main():

    users = {}
`state["users"]` seems to duplicate `users`?
I like the idea of having a single cache mechanism for users, media, and unshortened URLs, so let's delete `users`. Unless I've missed something?
It does not explicitly duplicate users but in effect it does.
Good point removing that.
    # Use our state store to prevent duplicate downloads
    try:
        with open(state_path, 'r') as state_file:
            state = json.load(state_file)
`state` is a very general word. Would `cache` be better?
When it started out, it was only about keeping download state as the downloaded images were already on disk.
So we were not really caching any data.
But now that we're actually handling user data, sure. We can relabel to cache.
@@ -110,7 +112,7 @@ def lookup_users(user_ids, users):
    with requests.Session() as session:
print(f'{len(filtered_user_ids)} users are unknown.')
This line is now misleading because many of these are in state and will get filtered out in get_twitter_users(). Can we do the filtering earlier, so that we can tell the users how many handles we need to download?
As we're removing the duplication of `users = {}`, sure. We can do the filtering earlier... Also saves us from passing the dict pointer three layers deep...
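A rough sketch of what filtering earlier could mean, so that the printed count only covers handles that still need a lookup. The helper `filter_unknown_user_ids` and the layout of the cache dict (a `'users'` key mapping user IDs to cached entries) are assumptions for illustration, not the project's actual code.

```python
def filter_unknown_user_ids(user_ids, cache):
    """Return the IDs that have no cached entry yet, preserving order."""
    known_users = cache.get('users', {})
    return [user_id for user_id in user_ids if user_id not in known_users]


# Hypothetical cache content: one user was resolved on a previous run.
cache = {'users': {'1001': {'handle': 'example_handle'}}}

unknown_user_ids = filter_unknown_user_ids(['1001', '1002', '1003'], cache)
print(f'{len(unknown_user_ids)} users are unknown.')  # prints "2 users are unknown."
```

With the filtering done before the message is printed, only the remaining IDs need to be handed to the lookup code, so the cache does not have to be passed further down.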
@@ -598,6 +606,7 @@ def main():
    data_folder = os.path.join(input_folder, 'data')
    account_js_filename = os.path.join(data_folder, 'account.js')
    log_path = os.path.join(output_media_folder_name, 'download_log.txt')
    state_path = 'download_state.json'
Move to `PathConfig` and use `cache` if we agree on that.
PR #120 is merged now, so you can use state_path = os.path.join(paths.dir_output_cache, 'download_state.json')
@ixs Passing a value down through several layers of functions is completely fine. Large classes are bad because they just become a repository of almost-globals. Even small classes with both data and member functions are often bad because they're stateful. Classes with just data or just functions are fine.
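One way to read that advice, as a hedged sketch: keep the cache in a class that holds only data, and pass it explicitly to plain functions. The names `DownloadCache`, `is_downloaded`, `mark_downloaded` and `pending_media` are invented for this example and do not come from the project.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadCache:
    """Data only: two dicts and no methods."""
    media: dict = field(default_factory=dict)
    users: dict = field(default_factory=dict)


def is_downloaded(cache, url):
    """Look up a URL; the cache is an explicit argument, not a global."""
    return bool(cache.media.get(url))


def mark_downloaded(cache, url):
    """Record a successful download for this URL."""
    cache.media[url] = True


def pending_media(urls, cache):
    """Return the URLs that still need fetching, passing the cache down."""
    return [url for url in urls if not is_downloaded(cache, url)]
```

Every function that reads or changes the cache names it in its signature, so the data flow stays visible even when the value travels through several layers.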
…n PR timhutton#83 by @ixs on upstream. (State for user handles is not needed since we already handle that separately.)
There are some merge conflicts now; try rebasing?