Crawls of mkdc only return DNS record in WARC #458

Open
machawk1 opened this issue Mar 21, 2020 · 2 comments

Comments

@machawk1
Owner

machawk1 commented Mar 21, 2020

Tested in both the basic and advanced interfaces. Crawling https://matkelly.com and the default https://matkelly.com/wail both produce WARCs that contain only the DNS record.

Other URIs seem to produce the correct results.
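
For anyone who wants to confirm the symptom on their own WARC, here is a rough sketch (not part of WAIL) that tallies the scheme of every `WARC-Target-URI` header in a file. It assumes the DNS lookups are written with a `dns:` target URI, and it scans header lines rather than parsing record boundaries, so treat the result as a heuristic. The function names `target_uri_schemes` and `is_dns_only` are made up for illustration.

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def target_uri_schemes(warc_path):
    """Count WARC-Target-URI schemes (e.g. dns, http, https) in a WARC file."""
    # Use gzip for .gz files; otherwise read the file as plain bytes.
    opener = gzip.open if warc_path.endswith(".gz") else open
    counts = Counter()
    with opener(warc_path, "rb") as fh:
        for raw in fh:
            # Heuristic: match the header line wherever it appears instead of
            # parsing record boundaries, so payload text could be miscounted.
            if raw.lower().startswith(b"warc-target-uri:"):
                value = raw.split(b":", 1)[1].strip().decode("utf-8", "replace")
                uri = value.strip("<>")  # some writers wrap the URI in <>
                counts[urlsplit(uri).scheme.lower()] += 1
    return counts


def is_dns_only(warc_path):
    """True if the file has target URIs and all of them use the dns: scheme."""
    counts = target_uri_schemes(warc_path)
    return bool(counts) and set(counts) == {"dns"}
```

Calling `is_dns_only("path/to/crawl.warc.gz")` (placeholder path) on an affected crawl should return `True` if the WARC really holds nothing beyond DNS records.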

@machawk1
Owner Author

machawk1 commented May 14, 2020

Promoting this issue via pinning to give it priority.

Received a report from Wyeth Lynch, who was trying to capture https://www.sdstate.edu/covid-19 with WAIL 2019.05.21. I replicated this on the latest master and only saw a DNS record captured.

Need to recheck the generated Heritrix configuration to see why this is occurring.

Also, the UI/UX needs to be refined so that users do not get the impression that the crawl completes immediately, e.g., by giving direct access to the crawl status via a link or a button.

@machawk1
Owner Author

This might be attributable to whether the startup script includes the correct Heritrix libraries; see http://web.archive.org/web/20110928012834/http://tech.groups.yahoo.com/group/archive-crawler/message/772 for background.

The newer releases of Heritrix, when installed in WAIL, do not seem to exhibit the problem. A next step might be to diff the startup scripts.
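
A minimal sketch of that diff using Python's standard `difflib`, in case a plain `diff` is not handy. The helper name `diff_scripts` is hypothetical and the paths in the usage note are placeholders, not the actual locations inside the WAIL bundle.

```python
import difflib
from pathlib import Path


def diff_scripts(old_path, new_path):
    """Return a unified diff of two text files as a single string."""
    # errors="replace" keeps the diff going if a script has odd bytes in it.
    old_lines = Path(old_path).read_text(errors="replace").splitlines(keepends=True)
    new_lines = Path(new_path).read_text(errors="replace").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=str(old_path), tofile=str(new_path)
        )
    )
```

Something like `print(diff_scripts("old-heritrix/bin/heritrix", "new-heritrix/bin/heritrix"))` would then show any classpath or library changes between the two startup scripts; an empty string means the files are identical.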
