Improve Schema Diff Error Message #160

zeotuan · 2024-09-28T05:26:07Z

What changed:

Improve the schema diff error message by coloring the elements that differ
Fix the Dataset content color diff so it displays properly for Dataset[Array[_]] cases
Display the diff for each element in Dataset[Iterable[_]] cases

Schema Diff before:

Schema Diff now:

Display for DataFrame diff

Display for Dataset diff

alfonsorr · 2024-10-01T09:32:01Z

core/src/main/scala/com/github/mrpowers/spark/fast/tests/ProductUtil.scala

        else if (expectedSeq.isEmpty)
-          List(Red(actualSeq.mkString("[", ",", "]")), Green("[]"))
+          List(Red(prodToString(actualSeq)), Green(emptyProd))
        else {
          val withEquals = actualSeq


I would suggest skipping this part if the type is not Row. If it's T, the schema is already checked at compile time, and if the type T does not match the Dataset, it will fail at runtime when we execute .collect().

I thought that by this point we only work with in-memory data (Seq[T])?
Or do you mean the case of showProductDiff[Dataset[...]]?

alfonsorr · 2024-10-01T10:01:42Z

core/src/main/scala/com/github/mrpowers/spark/fast/tests/DatasetComparer.scala

    val a = actualDS.collect().toSeq
    val e = expectedDS.collect().toSeq
    if (!a.approximateSameElements(e, equals)) {
      val arr = ("Actual Content", "Expected Content")


I don't see the improvement. Any Dataset[T] can be transformed into a DataFrame; the only change here is that it works with the original class T that was provided, but the result is the same, only with the class name prefixing each line. If the only goal was to avoid the .asRows method, we could have done the following, and no other change would be needed.

Suggested change

-    val a = actualDS.collect().toSeq
-    val e = expectedDS.collect().toSeq
-    if (!a.approximateSameElements(e, equals)) {
-      val arr = ("Actual Content", "Expected Content")
+    val a = actualDS.toDF.collect() // now it's a Row
+    val e = expectedDS.toDF.collect()
+    if (!a.approximateSameElements(e, equals)) {
+      val arr = ("Actual Content", "Expected Content")
+      val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a, e, truncate)

I see this as more unified, and we won't have to check whether it's a Product or not.

showProductDiff is an update/extension of the original showDataframeDiff, so that we get a nicely formatted table display for any case class and, additionally, for Row. This was done so that we have a nicely formatted table for StructField.

And since showProductDiff can now also display differences for any case class, I figured we no longer need to convert a Dataset[T] into a DataFrame.

All these changes could be simplified by adding a single parameter to the signature, because in the end the processing doesn't care whether it's a Row or a case class; it only requires the list of elements:

    private[mrpowers] def showDataframeDiff(
        header: (String, String),
        actual: Seq[Row],
        expected: Seq[Row],
        truncate: Int = 20,
        minColWidth: Int = 3,
        className: Option[String]
    ): String = {
      val (prefix, lBracket, rBracket) = className match {
        case None     => ("", "[", "]")
        case Some(cn) => (cn, "(", ")")
      }
      // ...
    }

Use these 3 elements to make the representation, and when you call this function, pass the class name if it's not a Row.
I don't see any other benefit of getting the element as T because the comparison is not done here.
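
A minimal sketch of the bracket selection described above, in plain Scala without Spark. `BracketSketch` and `render` are hypothetical names used only for illustration and are not part of spark-fast-tests; the only idea taken from the comment is that `None` yields Row-style square brackets and `Some(name)` yields a class-name prefix with parentheses.

```scala
object BracketSketch {
  // Hypothetical helper (not the library's API): renders the field values of one record.
  // className = None mimics a Row, Some(name) mimics a case class.
  def render(fields: Seq[Any], className: Option[String]): String = {
    val (prefix, lBracket, rBracket) = className match {
      case None     => ("", "[", "]")
      case Some(cn) => (cn, "(", ")")
    }
    // mkString(start, sep, end) wraps the comma-separated values in the chosen brackets.
    fields.mkString(prefix + lBracket, ",", rBracket)
  }
}
```

With this sketch, `BracketSketch.render(Seq(1, "a"), None)` gives `[1,a]`, and passing `Some("Person")` gives `Person(1,a)`.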

Since we also use it with Seq[StructField] to display schema fields, won't we have to make actual and expected Seq[Any]?

    private[mrpowers] def showDataframeDiff(
        header: (String, String),
        actual: Seq[Any],
        expected: Seq[Any],
        truncate: Int = 20,
        minColWidth: Int = 3,
        className: Option[String]
    ): String = {
      val (prefix, lBracket, rBracket) = className match {
        case None     => ("", "[", "]")
        case Some(cn) => (cn, "(", ")")
      }
      // ...
    }

    def assertSmallDatasetContentEquality[T: ClassTag](
        actualDS: Dataset[T],
        expectedDS: Dataset[T],
        truncate: Int,
        equals: (T, T) => Boolean
    ): Unit = {
      val a = actualDS.collect().toSeq
      val e = expectedDS.collect().toSeq
      if (!a.approximateSameElements(e, equals)) {
        val arr = ("Actual Content", "Expected Content")
        val runTimeClass = implicitly[ClassTag[T]].runtimeClass
        // Rows keep the plain "[...]" form; any other type is prefixed with its class name.
        val prefix = if (runTimeClass == classOf[Row]) None else Some(runTimeClass.getSimpleName)
        val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a.asRows, e.asRows, truncate, className = prefix)
        throw DatasetContentMismatch(msg)
      }
    }

If you don't want to do this here, move it inside showDataframeDiff; you can add the type parameter, but you are not forced to use it. Still, I find it weird to have a function signature like:

    private[mrpowers] def showDataframeDiff[T: ClassTag](
        header: (String, String),
        actual: Seq[Any],
        expected: Seq[Any],
        truncate: Int = 20,
        minColWidth: Int = 3
    ): String
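
For illustration, the ClassTag-based prefix detection discussed above can be sketched without Spark. `FakeRow` is a stand-in for `org.apache.spark.sql.Row`, and `PrefixSketch` and `classPrefix` are hypothetical names; only the `implicitly[ClassTag[T]].runtimeClass` and `getSimpleName` calls come from the snippet in the conversation.

```scala
import scala.reflect.ClassTag

// Stand-in for org.apache.spark.sql.Row so the sketch runs without Spark (assumption).
final class FakeRow

object PrefixSketch {
  // Returns None for the row type and Some(simple class name) for any other T.
  def classPrefix[T: ClassTag]: Option[String] = {
    val runtimeClass = implicitly[ClassTag[T]].runtimeClass
    if (runtimeClass == classOf[FakeRow]) None else Some(runtimeClass.getSimpleName)
  }
}
```

Here `PrefixSketch.classPrefix[FakeRow]` is `None` and `PrefixSketch.classPrefix[String]` is `Some("String")`. The type parameter only selects the prefix; the elements themselves can still be handled as `Seq[Any]`.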

I prefer putting it in showProductDiff, since it will also be needed in betterSchemaMismatchMessage. Do you think we should still use Seq[T]? I think that would at least make the input type safer, unless of course someone decides to use Any.

OK to change the name; it's called from a Dataset method. As for keeping the [T]: this is not a public method, it carries the type, and the compiler will validate the schema of the actual and expected DataFrames. The only problem is if the conversion to Dataset[T] is not correct, e.g.:

assertSmallDataset(spark.table("aTable").as[MyType], expected)

If the table doesn't match, the error will be raised as an exception in the .as[T] method at runtime.
That's why I say that if we detect a non-Row Dataset, we should skip the schema validation, because it's already done.

I find it simpler if you know the exact type of the data is Row only, so you have only one way to work with it. If you still think this could be good, I would suggest first adding more tests, not only with products but also with single-type datasets like Dataset[String], Dataset[Int], Dataset[Timestamp], and Dataset[Array[Int]]. These types are not products, and in some cases I don't know how it will work.

Ah, I see. I forgot that people can totally use Dataset[Array[Int]]. I did some tests, and it supports single-value types correctly, but not Array[Int].
Actually, our main branch currently also has an issue with Dataset[Seq[Int]]. Let me try fixing that.

@alfonsorr I fixed the issue with Dataset[Array[Int]] and also added test cases for the String, Int, Array, and Seq cases.
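
A rough sketch of why non-Product element types need separate handling when rendering a diff. In Scala, an `Array` has neither a readable `toString` nor structural `==`, so it has to be expanded explicitly. `ElementSketch` and `render` are hypothetical names, and the rules below are an assumption for illustration, not the implementation merged in this PR.

```scala
object ElementSketch {
  // Hypothetical rendering rules (not the library's implementation):
  // arrays and other iterables are expanded element by element,
  // case classes and tuples go through productIterator,
  // and everything else falls back to its string form.
  def render(value: Any): String = value match {
    case arr: Array[_] =>
      arr.iterator.map(render).mkString("[", ",", "]")
    // Iterable must be matched before Product: a non-empty List is both.
    case it: Iterable[_] =>
      it.iterator.map(render).mkString("[", ",", "]")
    case p: Product =>
      p.productIterator.map(render).mkString(p.productPrefix + "(", ",", ")")
    case other =>
      String.valueOf(other)
  }
}
```

For example, `ElementSketch.render(Array(1, 2, 3))` gives `[1,2,3]`, and a single value such as `42` is rendered as `42`.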

zeotuan added 2 commits September 28, 2024 15:25

Add Table support for StructField Diff

9db07d8

Determine bracket based on type

7e9491e

zeotuan commented Sep 28, 2024

Add Test for schema diff

c8c15e5

zeotuan mentioned this pull request Sep 28, 2024

Improve Schema Comparer Error Message #159

Closed

3 tasks

zeotuan changed the title ~~Add Table support for StructField Diff~~ Improve Schema Diff Error Message Sep 28, 2024

Handle single valued case

cbc01e7

zeotuan marked this pull request as ready for review October 1, 2024 08:45

zeotuan requested a review from SemyonSinchenko as a code owner October 1, 2024 08:45

zeotuan requested a review from alfonsorr October 1, 2024 08:45

alfonsorr requested changes Oct 1, 2024

Make Row Diff not display class name

857f59e

zeotuan requested a review from alfonsorr October 1, 2024 22:11

zeotuan self-assigned this Oct 1, 2024

alfonsorr reviewed Oct 2, 2024

zeotuan added 5 commits October 3, 2024 08:41

Disallow input default val

081fd5b

formatting

36b8dfa

handle Iterable cases

bcfb255

remove space

606355b

Fix String test

90b0d77

zeotuan requested a review from alfonsorr October 6, 2024 11:01

zeotuan merged commit 641fe4e into mrpowers-io:main Oct 12, 2024
6 checks passed

zeotuan mentioned this pull request Oct 12, 2024

Revert "Improve Schema Diff Error Message" #165

Merged
