Optimize scraping #8 #31

Draft
wants to merge 1 commit into
base: main


luttje
Owner

@luttje luttje commented Jan 1, 2024

In #8 I stated that it would be perfectly achievable to scrape only the changed pages instead of the entire wiki. This PR attempts to implement that, but I ran into an issue:

I seem to have been mistaken in thinking the gmod wiki has a list of all changes, which we could use to scrape only the updated pages. The only list I can find is https://wiki.facepunch.com/gmod/~recentchanges, which shows only recent changes (last 30 days?) and doesn't allow pagination to discover older ones.

If anyone has got any ideas around this I'm open to suggestions.

I'll leave this PR as a draft until a solution is found. I won't actively look into a solution myself, so help is greatly appreciated. In any case this is marked low-priority, since the scraping of the entire wiki works fine (besides being a bit wasteful).

@luttje added the enhancement, help wanted, and low-priority labels Jan 1, 2024
@aske02
Contributor

aske02 commented Jan 1, 2024

The updateCount field in wiki.facepunch.com/gmod/~pagelist?format=json could work. I haven't checked whether the counts actually change when a page is edited, but I would assume so.
It would be as simple as saving the count when scraping and comparing the count next time.
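
A minimal sketch of that save-and-compare idea, in Python purely for illustration. It assumes the JSON is an array of objects that each carry an `address` next to `updateCount`; the `address` key and both function names are assumptions and have not been checked against the wiki's actual response.

```python
import json


def snapshot_counts(pagelist_json: str) -> dict[str, int]:
    """Map each page address to its updateCount.

    Assumes a JSON array of objects with `address` and `updateCount` keys.
    The `address` key name is a guess and may need adjusting.
    """
    return {page["address"]: page["updateCount"] for page in json.loads(pagelist_json)}


def pages_to_rescrape(saved: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return addresses whose count differs from the saved one, or that have no saved count."""
    return sorted(addr for addr, count in current.items() if saved.get(addr) != count)
```

Only the addresses returned by `pages_to_rescrape` would need to be fetched again; everything else could be reused from the previous run.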

@luttje
Owner Author

luttje commented Jan 1, 2024

@aske02 Wow, I can't believe I missed that. I even looked at this data and somehow concluded "this is not useful".

You're right, we could put the update count into the __metadata.json and use that to figure out what's entirely new, updated, or deleted.
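
To make the new/updated/deleted split concrete, here is a rough sketch (Python, hypothetical names throughout). It assumes the previous counts live under an `updateCounts` key in `__metadata.json`; that key and the file layout are assumptions, since the real metadata format isn't shown in this thread.

```python
import json
from pathlib import Path


def classify_pages(previous: dict[str, int], current: dict[str, int]) -> dict[str, list[str]]:
    """Split page addresses into new, updated and deleted lists."""
    return {
        "new": sorted(current.keys() - previous.keys()),
        "updated": sorted(
            addr for addr in current.keys() & previous.keys()
            if current[addr] != previous[addr]
        ),
        "deleted": sorted(previous.keys() - current.keys()),
    }


def load_counts(metadata_path: Path) -> dict[str, int]:
    """Read saved counts; a missing file or key means every page counts as new."""
    if not metadata_path.exists():
        return {}
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return metadata.get("updateCounts", {})


def save_counts(metadata_path: Path, counts: dict[str, int]) -> None:
    """Store counts under the hypothetical `updateCounts` key, keeping other keys intact."""
    metadata = {}
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    metadata["updateCounts"] = counts
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
```

A first run with no saved counts would classify every page as new, so it would fall back to a full scrape.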

I probably won't implement this until I have some more time available; I've got a ton of other projects that I want to focus my attention on now.

Nevertheless, thanks so much for helping out with this, and so quickly as well! Much appreciated!
