Replace Java ARC/WARC record processing library #494
FWIW, Common Crawl seems to use the ClueWeb WARC readers: https://github.com/commoncrawl/example-warc-java/tree/master/src/main/java. These are also the ones used in Anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb09Collection.java. My impression is that these readers are much more impoverished in terms of features... but may be much faster?
This was referenced May 17, 2022
Closed
ruebot added a commit that referenced this issue May 18, 2022
* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18
Is your feature request related to a problem? Please describe.
A number of issues have cropped up over the years in how we process ARC and WARC records before handing them off to Spark, namely #317, #492, and #493.
Describe the solution you'd like
Write a new Scala library to handle ARC and WARC processing. This can be part of aut or a standalone library, or we can use/build upon @helgeho's sparkling.
Describe alternatives you've considered
Fixing and patching what we have now, and potentially adopting jwarc (#411).
Additional context
Implementing this as a data source could also lead to addressing #371 completely. From the Spark dev list, I believe this is an example of implementing Cassandra as a data source that we can potentially build off of.
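To make the "new Scala library" option a bit more concrete, here is a rough sketch of what the core of a minimal WARC reader could look like. It is only an illustration under several assumptions: the input is an uncompressed WARC stream, every record carries a `Content-Length` header spelled exactly that way, and each payload fits in memory. The names `WarcRecord` and `MinimalWarcReader` are hypothetical and are not taken from aut, sparkling, or jwarc.

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Hypothetical record type for illustration only.
final case class WarcRecord(
    version: String,
    headers: Map[String, String],
    payload: Array[Byte])

object MinimalWarcReader {
  private val LF = 10
  private val CR = 13

  // Read bytes up to the next LF and decode them as UTF-8, dropping CR bytes.
  // Returns None if the stream is already at its end.
  private def readLine(in: InputStream): Option[String] = {
    val first = in.read()
    if (first == -1) None
    else {
      val buf = new ByteArrayOutputStream()
      var b = first
      while (b != -1 && b != LF) {
        if (b != CR) buf.write(b)
        b = in.read()
      }
      Some(new String(buf.toByteArray, StandardCharsets.UTF_8))
    }
  }

  // Parse one record: version line, header block, then Content-Length bytes.
  def nextRecord(in: InputStream): Option[WarcRecord] = {
    // Skip the blank lines that separate consecutive records.
    var line = readLine(in)
    while (line.exists(_.isEmpty)) line = readLine(in)

    line.map { version =>
      require(version.startsWith("WARC/"), s"not a WARC record: $version")

      // "Name: value" lines until the first empty line.
      // Assumption: header names are matched case-sensitively and
      // lines without a colon (e.g. folded headers) are ignored.
      val headers = Iterator
        .continually(readLine(in))
        .takeWhile(_.exists(_.nonEmpty))
        .map(_.get.split(":", 2))
        .collect { case Array(k, v) => k.trim -> v.trim }
        .toMap

      // Assumption: a missing Content-Length is treated as an empty payload.
      val length = headers.getOrElse("Content-Length", "0").toInt
      val payload = new Array[Byte](length)
      // readFully throws EOFException if the record is truncated.
      new DataInputStream(in).readFully(payload)

      WarcRecord(version, headers, payload)
    }
  }

  // Lazily iterate over all records in the stream.
  def records(in: InputStream): Iterator[WarcRecord] =
    Iterator.continually(nextRecord(in)).takeWhile(_.isDefined).map(_.get)
}
```

Calling `MinimalWarcReader.records(stream)` on an already-decompressed stream yields records one at a time, which is roughly the shape a Spark data source would need to wrap. A real implementation would also have to cover ARC, gzip-compressed files, and the malformed records behind the issues listed above, none of which this sketch attempts.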