Convert RecordLoader.loadArchives to a Spark Data Source #371

ruebot · 2019-11-05T23:12:18Z

Since we're pivoting to full DataFrame support (#223, #190), we should migrate RecordLoader.loadArchives, and any related functions, to a Spark Data Source. That way we could do things like:

spark.read.format("webArchive")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .option("path", "/path/to/files")
  .schema(someSchema)
  .load()

Then we could (since it's an open issue, #147) write WARCs that way too? 🤷‍♂️

// df is any DataFrame of archive records; writers hang off a DataFrame, not the session
df.write.format("webArchive")
  .mode("overwrite")
  .option("path", "/path/to/files")
  .save()
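
For anyone picking this up, here is a rough sketch of what the read side could look like with Spark's Data Source V1 interfaces (RelationProvider, BaseRelation, TableScan). It is not an implementation from this project: the class names DefaultSource and WebArchiveRelation and the single "file" column are made up for illustration, and the scan just lists files with SparkContext.binaryFiles as a stand-in for the records RecordLoader.loadArchives would produce.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation: one row per file found under `path`.
// A real version would presumably map archive records to rows instead.
class WebArchiveRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Placeholder schema (assumption): only the file location.
  override def schema: StructType =
    StructType(Seq(StructField("file", StringType, nullable = false)))

  // Full scan with no column pruning or filter pushdown.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .binaryFiles(path)
      .map { case (file, _) => Row(file) }
}

// Entry point that Spark instantiates for format(...).
class DefaultSource extends RelationProvider with DataSourceRegister {

  // Short name usable in format(...) once the class is registered.
  override def shortName(): String = "webArchive"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // Spark passes the "path" option, or the argument to load(path), under this key.
    val path = parameters.getOrElse(
      "path",
      throw new IllegalArgumentException("webArchive: a 'path' option is required"))
    new WebArchiveRelation(path)(sqlContext)
  }
}
```

For format("webArchive") to resolve, the class would need to be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister; otherwise the package containing DefaultSource can be passed to format(...) instead. Options like mode and inferSchema are ignored in this sketch.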

These are the Spark core data sources:

CSV
JSON
Parquet
ORC
JDBC/ODBC
Plain-text
Avro

Community implemented data sources:

Cassandra
HBase
MongoDB
AWS Redshift
XML


ruebot · 2020-03-30T19:44:27Z

Some helpful links:

sepastian · 2020-07-20T07:34:50Z

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html
https://github.com/spirom/spark-data-sources

ruebot · 2020-10-01T21:12:29Z

Cassandra example

ruebot · 2022-05-17T20:09:30Z

I'm thinking this is out of scope for this project given the work being done on #494 now. So, I'm going to close it as won't fix.

ruebot added Python Java Scala feature DataFrames labels Nov 5, 2019

lintool mentioned this issue Nov 15, 2019

Rename DF functions to be consistent with Python DF functions. #379

Merged

sepastian mentioned this issue Jul 20, 2020

Output filtered data to WARC Format #147

Open

ruebot mentioned this issue Aug 10, 2020

Replace Java ARC/WARC record processing library #494

Closed

ruebot added wontfix and removed Python Java Scala feature DataFrames labels May 17, 2022

ruebot closed this as completed May 17, 2022
