Update to Spark 2.4.3 and update Tika to 1.20. #321
@@ -178,9 +178,13 @@ class CommandLineApp(conf: CmdAppConf) {

   def save(d: Dataset[Row]): Unit = {
     if (!configuration.partition.isEmpty) {
-      d.coalesce(configuration.partition()).write.csv(saveTarget)
+      d.coalesce(configuration.partition()).write
+        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
FYI: These were added for the newer version of Tika (underlying libraries). Timestamp format is required.
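For anyone who wants to try the option outside of `CommandLineApp`, here is a minimal sketch of the same writer call. It assumes Spark 2.4.x is on the classpath; the object name, the one-row dataset, and the output directory are all hypothetical and not part of this PR.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object TimestampCsvSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("timestamp-csv-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical one-row dataset with a timestamp column.
    val df = Seq(
      ("http://www.archive.org/", Timestamp.valueOf("2019-07-04 11:48:54"))
    ).toDF("url", "crawled")

    df.coalesce(1).write
      .mode("overwrite") // so the sketch can be re-run
      // Same explicit pattern as in the diff above.
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .csv("timestamp-csv-sketch-out") // hypothetical output directory

    spark.stop()
  }
}
```

Setting the pattern explicitly means the CSV writer does not fall back to its default timestamp pattern.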
Codecov Report
@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   74.84%   74.97%   +0.13%
==========================================
  Files          39       39
  Lines        1117     1123       +6
  Branches      197      197
==========================================
+ Hits          836      842       +6
  Misses        215      215
  Partials       66       66
I put this through its paces: things work well with the example data (including DataFrames), except for the language filter.
The following script:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.archive.org"))
  .keepLanguages(Set("fr"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text-fr/")
fails with:
19/07/04 11:48:54 ERROR Utils: Aborting task
java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.langdetect.OptimaizeLangDetector
at io.archivesunleashed.matchbox.DetectLanguage$.apply(DetectLanguage.scala:35)
at io.archivesunleashed.package$WARecordRDD$$anonfun$keepLanguages$1.apply(package.scala:245)
at io.archivesunleashed.package$WARecordRDD$$anonfun$keepLanguages$1.apply(package.scala:245)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
et al. [full log here]
Given the Tika issues around the language detector, probably somewhat to be expected?
@ianmilligan1 That's what I was afraid of.
Looks like I'm getting a slightly different error with:
ERROR:
I'll dig into it more. I think it might be a guava issue.
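One way to check a suspected dependency clash like this is to ask the JVM which jar a class is being loaded from. The sketch below is only an illustration: `WhichJar` and `jarOf` are hypothetical names, not part of aut. It loads classes without initializing them, so a class whose static initializer fails can still be located.

```scala
import scala.util.Try

object WhichJar {
  /** Location a class is loaded from, if it is found and the JVM reports a code source. */
  def jarOf(className: String): Option[String] = {
    val loader = Thread.currentThread.getContextClassLoader
    for {
      // initialize = false: locate the class without running its static initializers
      cls    <- Try(Class.forName(className, false, loader)).toOption
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource) // null for classes from the bootstrap loader
      loc    <- Option(source.getLocation)
    } yield loc.toString
  }

  def main(args: Array[String]): Unit = {
    // A Guava class, and the Tika class named in the stack trace above.
    val names = Seq(
      "com.google.common.base.Optional",
      "org.apache.tika.langdetect.OptimaizeLangDetector"
    )
    names.foreach { n =>
      println(s"$n -> ${jarOf(n).getOrElse("not found, or no code source reported")}")
    }
  }
}
```

Calling `WhichJar.jarOf(...)` from inside the same `spark-shell` session that shows the error should reveal whether an unexpected Guava version is winning on the classpath.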
Ah cool. We have 16 different occurrences of
I think I got it!
318-test-lang-scala
Tested and confirmed on multi-lingual data – fantastic work, @ruebot, congrats on getting this resolved!
GitHub issue(s):
What does this Pull Request do?
Updates Spark to 2.4.3 and Tika to 1.20, and pulls in unfinished work by @jrwiebe and @borislin.
I had to tweak the language tests. I believe this is because of the updated language detection in Tika, and that's why the test fixtures changed. I'm not 100% certain, though, so this definitely needs a sanity check.
How should this be tested?
TravisCI should take care of the first bit.
@ianmilligan1 would you mind putting this through a bunch of examples?
rm -rf ~/.m2/repository/* && mvn clean install
rm -rf ~/.ivy2/* && spark-shell --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT"
Additional Notes:
Once we get this in, and #318, might as well cut a 0.18.0 release. There's a fair bit done since the last release. It'd be nice to get #317 sorted, but that isn't a blocker.
@jrwiebe if you have some time, can you give this a sanity check too, since I built on a bit of what you were previously digging into?