Replace Java ARC/WARC record processing library #494

ruebot · 2020-08-10T14:42:41Z

Is your feature request related to a problem? Please describe.

We have a number of issues that have crept up over the years with how we process ARC and WARC records to hand off to Spark for processing, namely #317, #492, and #493.

Describe the solution you'd like

Write a new Scala library to handle processing ARC and WARC. This can be part of aut or a standalone library, or we can use/build upon @helgeho's sparkling.

Describe alternatives you've considered

Fixing and patching what we have now, and potentially jwarc (#411).

Additional context

Implementing this as a data source could also lead to addressing #371 completely. From the Spark dev list, I believe this is an example of implementing Cassandra as a data source that we can potentially build off of.

lintool · 2020-09-02T15:17:29Z

FWIW, Common Crawl seems to use the ClueWeb WARC readers: https://github.com/commoncrawl/example-warc-java/tree/master/src/main/java

These are also the ones used in Anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb09Collection.java

My impression is that these readers are much more impoverished in terms of features... but may be much faster?

* Partially address #494

* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18

ruebot added enhancement Java labels Aug 10, 2020

adamyy mentioned this issue Sep 2, 2020

Heap space!! java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3332) #317

Open

adamyy mentioned this issue Sep 3, 2020

Set the upper limit of WARC content length to half of Integer.MAX_VALUE #496

Closed

ruebot added a commit that referenced this issue Sep 30, 2021

Rip out Java code.

b33f27a

* Partially address #494

This was referenced May 17, 2022

Upgrade to Hadoop 3.x #329

Closed

Convert RecordLoader.loadArchives to a Spark Data Source #371

Closed

ruebot mentioned this issue May 18, 2022

Remove Java w/arc processing, and replace it with Sparkling. #533

Merged

ruebot closed this as completed in c8fa256 May 24, 2022
