Handle constantly changing link URLs #80

Open
wo opened this issue Jun 12, 2016 · 1 comment

Comments

wo commented Jun 12, 2016

On http://www.kantstudiesonline.net/index.php/articles/ the links to papers change on each visit, and they can't be factored into a session id and a non-trivial remainder. As a consequence, all papers get checked every day. Worse, if a paper has been parsed incorrectly and then manually corrected, the next incorrect parse is stored as a new paper, because it is not recognized as a duplicate of the corrected one.

For Kant Studies Online, it would help to store the link text in addition to the link URL and only process links whose link text is new. But that would not work for other sites, where every link text may be something like 'PDF'.

A better solution is probably to store a hash of the PDF file in the Doc table and skip processing of documents whose hash is already in the table. (That would not work if a journal modified the PDF on each retrieval, which fortunately Kant Studies Online doesn't.)
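
A minimal sketch of the hash idea, not the project's actual code. It assumes SHA-256 over the raw file bytes and uses a stand-in SQLite table; the names `docs`, `filehash`, `open_store` and `should_process` are hypothetical, and the real Doc table may look quite different.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the Doc table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (filehash TEXT PRIMARY KEY)")
    return conn


def content_hash(pdf_bytes: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def should_process(conn: sqlite3.Connection, pdf_bytes: bytes) -> bool:
    """Return True (and record the hash) only for bytes not seen before."""
    digest = content_hash(pdf_bytes)
    row = conn.execute(
        "SELECT 1 FROM docs WHERE filehash = ?", (digest,)
    ).fetchone()
    if row is not None:
        # Same bytes were fetched earlier, possibly under a different URL.
        return False
    conn.execute("INSERT INTO docs (filehash) VALUES (?)", (digest,))
    conn.commit()
    return True
```

With this, a file retrieved under a fresh URL is skipped as long as its bytes are unchanged. The file still has to be downloaded before it can be hashed, so this saves the parsing step, not the fetch.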

Another (perhaps complementary) solution would be to improve the duplicate detection in post-processing: if two papers have almost exactly the same content, they should be recognized as duplicates, even if they have different authors or titles.
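
One possible way to compare content regardless of extracted metadata is Jaccard similarity over word shingles. This is only an illustration under assumptions: the function names, the shingle size of 5 and the 0.9 threshold are made up here and would need tuning against real papers.

```python
import re


def shingles(text: str, size: int = 5) -> set:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Return |a & b| / |a | b|, treating two empty sets as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """Return True if the two texts share most of their shingles."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Because only the body text is compared, a wrongly extracted title or author changes a handful of shingles at the start and leaves the score close to 1. Comparing every new document against all stored ones this way is quadratic, so a larger collection would likely need some form of indexing on top.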

wo commented Sep 10, 2016

I've improved the duplicate detection so that it at least doesn't return None whenever no author has been extracted.
