This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Get doi by scraping actual biorxiv page? #6

Open
vincerubinetti opened this issue Jun 9, 2021 · 3 comments

@vincerubinetti
Contributor

vincerubinetti commented Jun 9, 2021

Occasionally, Disqus returns a bioRxiv link that doesn't contain the DOI. For example, in this run, https://www.biorxiv.org/content/early/2018/11/09/459529 is returned, which redirects to the expected link, https://www.biorxiv.org/content/10.1101/459529v1, which contains the complete DOI.

To simplify the bot code, I made it read the DOI from the URL, assuming (and hoping) it would always contain it. If we ever want this to be more robust, we could have the bot actually fetch the HTML contents at the link and find the DOI in the document:

[image]

In the upcoming PR, this at least won't crash the bot; it will just skip the comment with the non-DOI link.
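For reference, the more robust approach described above (fetch the page, then find the DOI in the document) could look roughly like the sketch below. This is not the bot's actual code. It assumes the page exposes the DOI in a `<meta name="citation_doi">` tag, which should be verified against a live bioRxiv page; the names `doi_from_html` and `fetch_doi` are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser
from typing import Optional


class _DoiMetaParser(HTMLParser):
    """Records the content of the first <meta name="citation_doi"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.doi: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a DOI has been found.
        if tag != "meta" or self.doi is not None:
            return
        attr_map = {key: (value or "") for key, value in attrs}
        # Assumption: the DOI lives in a "citation_doi" meta tag.
        if attr_map.get("name", "").lower() == "citation_doi":
            self.doi = attr_map.get("content", "").strip() or None


def doi_from_html(html: str) -> Optional[str]:
    """Return the DOI found in the HTML, or None if there is none."""
    parser = _DoiMetaParser()
    parser.feed(html)
    return parser.doi


def fetch_doi(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the page (urlopen follows redirects) and extract its DOI."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return doi_from_html(html)
```

Returning `None` rather than raising keeps the "skip the comment" behavior for pages where no DOI is found. A real deployment would also need to handle network errors and the possibility that the site rejects automated requests.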

@cgreene
Member

cgreene commented Jun 9, 2021

Ahh! This is because bioRxiv changed its URL scheme 2-3 years ago to include the DOI. Before that it was just a preprint identifier and date. I think you can convert the old scheme to the new one by using the 10.1101 prefix along with the thing that comes after the date.
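A minimal sketch of that conversion, assuming the old scheme always looks like `/content/early/YYYY/MM/DD/<id>` (as in the one example URL in this thread) and that the DOI is the `10.1101` prefix plus that trailing identifier. The function name `doi_from_url` and both patterns are illustrative, not taken from the bot.

```python
import re
from typing import Optional

# New scheme: .../content/10.1101/<suffix>v<version>; the DOI is in the URL.
# The suffix is matched as dot-separated digit groups, so a trailing
# version ("v1") or page suffix (".full") is left out.
NEW_SCHEME = re.compile(r"biorxiv\.org/content/(10\.1101/\d+(?:\.\d+)*)")

# Old scheme (assumed shape): .../content/early/YYYY/MM/DD/<id>
OLD_SCHEME = re.compile(r"biorxiv\.org/content/early/\d{4}/\d{2}/\d{2}/(\d+)")


def doi_from_url(url: str) -> Optional[str]:
    """Return the DOI for a bioRxiv URL, or None if the format is unknown."""
    match = NEW_SCHEME.search(url)
    if match:
        return match.group(1)
    match = OLD_SCHEME.search(url)
    if match:
        # Assumption: old identifier + "10.1101/" prefix gives the DOI.
        return "10.1101/" + match.group(1)
    # Unrecognized format: the caller can skip the comment.
    return None


print(doi_from_url("https://www.biorxiv.org/content/early/2018/11/09/459529"))
print(doi_from_url("https://www.biorxiv.org/content/10.1101/459529v1"))
```

Both example URLs from this issue yield `10.1101/459529`. Any other URL format returns `None`, so an unexpected scheme results in a skipped comment rather than a wrong DOI.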

@vincerubinetti
Contributor Author

vincerubinetti commented Jun 9, 2021

> I think you can convert the old scheme to the new one by using the 10.1101 prefix along with the thing that comes after the date.

If you're sure about this, it would be easy to add, though I could see it causing other errors if there's ever some other URL format. How worth it do you think it is? If those are old URLs, people will probably barely ever comment on them. Also, looking at the runs, it seems to happen in only about 1 out of 100 runs (note that fewer than half of the errors you see on that list are actually related to this bug).

@cgreene
Member

cgreene commented Jun 9, 2021

I am sure about the past. I am not sure about the future URLs. I agree that since this only occurs on old preprints this will almost never occur - so I am fine with ignoring it for now 👍

vincerubinetti added a commit that referenced this issue Jun 10, 2021
- remove verbose logging (maybe will fix the random errors like [this one](https://github.com/greenelab/preprint-bot/runs/2755283036?check_suite_focus=true) where the process exits for apparently no reason)
- fix bug with comments that return biorxiv links with no dois in them. doesn't throw error anymore, just skips the comment. see #6 for more robust solution
- clean up key logging
- rename gh-actions job names