Keep download state and do not attempt to redownload images and user handles #83

ixs · 2022-11-19T10:56:01Z

Add a new json file (media/media_state.json) that keeps track of
download status for media files.

When re-running the archive-parser this file is consulted and if
a media file has previously been downloaded, skip any attempt to
check file size etc.

This is great for testing code changes etc. because we're not
hammering the twitter servers anymore. No need to hasten their
demise.

Added: Persistent state file for media downloads.
Added: Skip downloads if the media state indicates that we previously
had a successful download.
Drive-By: Capitalize the N choice in prompts (y/N) indicating
the default choice.
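
For illustration, a minimal sketch of the idea described above, not the code from this PR. The function names, the per-URL layout of the state dictionary, and the fetch callable are all assumptions made for the example:

```python
import json
import os


def load_state(state_path):
    """Return the saved download state, or an empty dict on the first run."""
    try:
        with open(state_path, 'r', encoding='utf-8') as state_file:
            return json.load(state_file)
    except FileNotFoundError:
        return {}


def save_state(state_path, state):
    """Write the download state back to disk, creating the folder if needed."""
    os.makedirs(os.path.dirname(state_path) or '.', exist_ok=True)
    with open(state_path, 'w', encoding='utf-8') as state_file:
        json.dump(state, state_file, indent=2)


def download_all(urls, state, fetch):
    """Call fetch(url) only for URLs not yet marked successful in state.

    fetch is a hypothetical callable that returns True on success.
    Returns the list of URLs that were actually requested.
    """
    requested = []
    for url in urls:
        if state.get(url, {}).get('success'):
            continue  # downloaded on an earlier run, so no request at all
        requested.append(url)
        state[url] = {'success': bool(fetch(url))}
    return requested
```

On a second run with the saved state, only the URLs that failed before would be requested again.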

timhutton · 2022-11-23T19:12:19Z

Hi @ixs. Sorry, I've completely ignored this PR. Thanks for sending it.

ixs · 2022-11-23T19:24:19Z

No worries. Reworking this right now to also keep the user data...
New PR coming in a few mins.

ixs · 2022-11-23T20:09:42Z

Reworked the download state cache a bit to also work for user lookups.
I'm not super happy with the way we're now passing the state dictionary down through three functions to the actual get_twitter_users() function, but the alternative would be a global variable, which is also not that nice...

Maybe we should rework everything into a class and then have class-level state... 🤣

I'd appreciate a look at the get_twitter_users() logic. I believe I am trimming and caching correctly, but there are a bunch of accounts that parser.py tries to download over and over again. It looks like these are deleted accounts that don't exist on the platform anymore, but I'd like a second pair of eyes on that.
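
If the repeated lookups really are deleted accounts, one possible explanation is that only successful lookups end up in the cache, so missing ids look unknown on every run. The sketch below shows that idea under this assumption. The function names and the cache layout are hypothetical and this is not the logic of get_twitter_users():

```python
def ids_to_fetch(user_ids, cache):
    """Return the ids that have never been looked up before.

    cache maps user id -> handle, or None for ids that an earlier
    lookup did not return (for example deleted accounts).
    """
    return [uid for uid in user_ids if uid not in cache]


def record_lookup(requested_ids, found, cache):
    """Store found handles and remember the ids the lookup did not return."""
    for uid in requested_ids:
        cache[uid] = found.get(uid)  # None marks "looked up, not found"
```

With the None marker stored, an id that came back empty once is no longer requested on the next run.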

timhutton · 2022-11-24T15:41:50Z

parser.py

@@ -623,6 +632,13 @@ def main():

    users = {}


state["users"] seems to duplicate users?

I like the idea of having a single cache mechanism for users, media, unshortened URLs, so let's delete users. Unless I've missed something?

ixs · 2022-11-26

It does not explicitly duplicate users, but in effect it does.
Good point, removing that.

timhutton · 2022-11-24T15:42:40Z

parser.py

+    # Use our state store to prevent duplicate downloads
+    try:
+        with open(state_path, 'r') as state_file:
+            state = json.load(state_file)


state is a very general word. Would cache be better?

ixs · 2022-11-26

When it started out, it was only about keeping download state, as the downloaded images were already on disk.
So we were not really caching any data.

But now that we're actually handling user data, sure. We can relabel to cache.

timhutton · 2022-11-24T18:02:54Z

parser.py

@@ -110,7 +112,7 @@ def lookup_users(user_ids, users):
        with requests.Session() as session:


print(f'{len(filtered_user_ids)} users are unknown.')

This line is now misleading because many of these are in state and will get filtered out in get_twitter_users(). Can we do the filtering earlier, so that we can tell the users how many handles we need to download?

ixs · 2022-11-26

As we're removing the duplication of users = {}, sure. We can do the filtering earlier... It also saves us from passing the dict pointer three layers deep...
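
A minimal sketch of filtering before the message is printed, so the count only covers handles that still need a download. The helper name unknown_user_ids and the sample data are made up for the example:

```python
def unknown_user_ids(mentioned_ids, cached_users):
    """Return ids that still need a lookup, deduplicated, in input order."""
    seen = set()
    unknown = []
    for uid in mentioned_ids:
        if uid in cached_users or uid in seen:
            continue  # already cached, or already queued in this run
        seen.add(uid)
        unknown.append(uid)
    return unknown


cached_users = {'12': 'jack'}  # sample cache content
filtered_user_ids = unknown_user_ids(['12', '34', '34', '56'], cached_users)
print(f'{len(filtered_user_ids)} users are unknown.')
```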

timhutton · 2022-11-24T22:52:40Z

parser.py

@@ -598,6 +606,7 @@ def main():
    data_folder = os.path.join(input_folder, 'data')
    account_js_filename = os.path.join(data_folder, 'account.js')
    log_path = os.path.join(output_media_folder_name, 'download_log.txt')
+    state_path = 'download_state.json'


Move to PathConfig and use cache if we agree on that.

lenaschimmel · 2022-11-24

FYI: the class PathConfig has been merged into main today via PR #115, but it does not have a path for cache yet.

I have just now published PR draft #120, which should solve #99, and it introduces PathConfig.dir_output_cache, which is only used for a single file there.

lenaschimmel · 2022-11-27

PR #120 is merged now, so you can use state_path = os.path.join(paths.dir_output_cache, 'download_state.json')

timhutton · 2022-11-24T23:35:17Z

@ixs Passing a value down through several layers of functions is completely fine. Large classes are bad because they just become a repository of almost-globals. Even small classes with both data and member functions are often bad because they're stateful. Classes with just data or just functions are fine.
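
As a rough illustration of that point, here is a data-only class handed down explicitly through three layers of plain functions. All names here are invented for the example and do not come from parser.py:

```python
from dataclasses import dataclass, field


@dataclass
class DownloadCache:
    """Data only: no member functions that change state out of sight."""
    media: dict = field(default_factory=dict)
    users: dict = field(default_factory=dict)


def get_handle(user_id, cache):
    """Innermost layer: reads the cache it was handed."""
    return cache.users.get(user_id)


def describe_mention(user_id, cache):
    """Middle layer: passes the cache through unchanged."""
    handle = get_handle(user_id, cache)
    return handle if handle is not None else f'user {user_id}'


def describe_tweet(mention_ids, cache):
    """Outer layer: the cache is an explicit argument, not a global."""
    return [describe_mention(uid, cache) for uid in mention_ids]
```

Each function's dependency on the cache is visible in its signature, which also makes it easy to call with a small hand-built cache in a test.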

@ixs

…n PR timhutton#83 by @ixs on upstream. (State for user handles is not needed since we already handle that separately.)

cooljeanius · 2024-06-17T10:26:03Z

There are some merge conflicts now; try rebasing?

ixs force-pushed the media_state branch 2 times, most recently from 2949633 to 7aad06d Compare November 21, 2022 11:52

rework logic, load the state in the main() part and pass the dictiona…

652e044

…ry around. also try to cache user lookups

ixs force-pushed the media_state branch from 7aad06d to 652e044 Compare November 23, 2022 20:06

timhutton changed the title ~~Keep download state and do not attempt to redownload images~~ Keep download state and do not attempt to redownload images and user handles Nov 24, 2022

timhutton mentioned this pull request Nov 24, 2022

Expand shortened urls #42

Draft

timhutton reviewed Nov 24, 2022

timhutton mentioned this pull request Nov 30, 2022

Export downloaded user handles and other user data to a JSON file #145

Closed
