Get doi by scraping actual biorxiv page? #6

vincerubinetti · 2021-06-09T01:45:32Z

Occasionally, Disqus returns a bioRxiv link that doesn't have the DOI in it. For example, in this run, https://www.biorxiv.org/content/early/2018/11/09/459529 is returned, but that redirects to the expected link https://www.biorxiv.org/content/10.1101/459529v1, which contains the complete DOI.

To simplify the bot code, I made it read the DOI from the URL, assuming (and hoping) that the URL would always contain it. If we ever want this to be more robust, we could have the bot actually fetch the HTML contents at the link and find the DOI in the document.
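A minimal sketch of that idea, in Python for illustration only (this thread doesn't show the bot's own language or code). It assumes the preprint page exposes its DOI in a `<meta name="citation_doi">` tag, which is not confirmed anywhere in this thread. The names `find_doi_in_html` and `fetch_doi` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.request import Request, urlopen


class _DoiMetaParser(HTMLParser):
    """Records the content of the first <meta name="citation_doi"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.doi: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.doi is not None:
            return
        attributes = dict(attrs)
        # Assumption: the page carries the DOI in a "citation_doi" meta tag.
        if (attributes.get("name") or "").lower() == "citation_doi":
            self.doi = (attributes.get("content") or "").strip() or None


def find_doi_in_html(html: str) -> Optional[str]:
    """Return the DOI found in the HTML, or None if no such meta tag exists."""
    parser = _DoiMetaParser()
    parser.feed(html)
    return parser.doi


def fetch_doi(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the page (following redirects) and look for the DOI in it."""
    request = Request(url, headers={"User-Agent": "preprint-bot-example"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return find_doi_in_html(html)
```

Returning `None` rather than raising lets the caller skip the comment when no DOI is found. The network call adds a request per comment and a new way to fail, which is the cost of the extra robustness.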

In the upcoming PR, this at least won't crash the bot; it will just skip the comment with the non-DOI link.

cgreene · 2021-06-09T10:22:26Z

Ahh! This is because bioRxiv changed its URL scheme 2-3 years ago to include the DOI. Before that it was just a preprint identifier and date. I think you can convert the old scheme to the new one by using the 10.1101 prefix along with the thing that comes after the date.
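The conversion described above could be sketched roughly as follows, in Python for illustration only. The old-scheme pattern is inferred from the single example URL in this issue (`/content/early/YYYY/MM/DD/ID`), so other historical URL shapes may exist that it does not cover. The function name `doi_from_biorxiv_url` is hypothetical.

```python
import re
from typing import Optional

BIORXIV_PREFIX = "10.1101"

# New scheme: /content/10.1101/<id>, optionally followed by a version like "v1".
# The capture stops at the last digit, so a trailing "v1" is left out.
_NEW_SCHEME = re.compile(r"/content/(10\.1101/[\d.]*\d)")

# Old scheme (assumed from the one example in this thread):
# /content/early/<year>/<month>/<day>/<id>
_OLD_SCHEME = re.compile(r"/content/early/\d{4}/\d{1,2}/\d{1,2}/(\d+)")


def doi_from_biorxiv_url(url: str) -> Optional[str]:
    """Return the DOI implied by a bioRxiv URL, or None if the format is unknown."""
    new = _NEW_SCHEME.search(url)
    if new:
        return new.group(1)
    old = _OLD_SCHEME.search(url)
    if old:
        # Old URLs carry only the identifier, so prepend the bioRxiv prefix.
        return f"{BIORXIV_PREFIX}/{old.group(1)}"
    return None


print(doi_from_biorxiv_url("https://www.biorxiv.org/content/early/2018/11/09/459529"))
print(doi_from_biorxiv_url("https://www.biorxiv.org/content/10.1101/459529v1"))
```

Both example URLs from this issue resolve to `10.1101/459529`. Any URL that matches neither pattern returns `None`, so an unexpected format is skipped instead of producing a wrong DOI.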

vincerubinetti · 2021-06-09T17:05:24Z

> I think you can convert the old scheme to the new one by using the 10.1101 prefix along with the thing that comes after the date.

If you're sure about this, it would be easy to add, though I could see it causing other errors if there's ever some other URL format. How worthwhile do you think it is? If those are old URLs, people will probably rarely comment on them. Also, looking at the runs, it seems to happen in only about 1 out of 100 runs (note that fewer than half of the errors you see on that list are actually related to this bug).

cgreene · 2021-06-09T20:32:27Z

I am sure about the past URLs. I am not sure about future URLs. I agree that, since this only affects old preprints, it will almost never occur, so I am fine with ignoring it for now 👍

- remove verbose logging (maybe will fix the random errors like [this one](https://github.com/greenelab/preprint-bot/runs/2755283036?check_suite_focus=true) where the process exits for apparently no reason)
- fix bug with comments that return biorxiv links with no dois in them. doesn't throw error anymore, just skips the comment. see #6 for more robust solution
- clean up key logging
- rename gh-actions job names

vincerubinetti mentioned this issue Jun 9, 2021

fix error when comment returns biorxiv link without doi #7

Merged
