CustomTransform RequiredColumns & AddedColumns are case sensitive #153

Open
labbedaine opened this issue Nov 18, 2022 · 1 comment

labbedaine commented Nov 18, 2022

Hi.

Is there a way to turn off case sensitivity for requiredColumns and addedColumns? Even with spark.sql.caseSensitive set to false, my unit test still fails.

sparkSession.conf.set("spark.sql.caseSensitive", false)

test("CustomTransform RequiredColumns & AddedColumns are case sensitive") {
  val lowercaseDF = spark.createDF(List(("Hello, world")), List(("lowercase", StringType, false)))

  lowercaseDF
    .trans(
      CustomTransform(
        requiredColumns = Seq("LOWERCASE"),
        transform = withTest(),
        addedColumns = Seq("test"),
      )
    )

  def withTest()(df: DataFrame): DataFrame = {
    df.withColumn("test", lit("A simple test."))
  }
}

The [LOWERCASE] columns are not included in the DataFrame with the following columns [lowercase]
com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [LOWERCASE] columns are not included in the DataFrame with the following columns [lowercase]
at com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker.validatePresenceOfColumns(DataFrameColumnsChecker.scala:19)

Thank you!

@brayanjuls

Hi,

This looks like a problem with how the library validates the columns. I can go ahead and fix it with the following change, if @MrPowers agrees.

In class com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker, I would change val missingColumns = requiredColNames.diff(df.columns.toSeq) to

    val givenColumns = df.columns.toSeq.map(_.toLowerCase)
    val requiredColumnsLower = requiredColNames.map(_.toLowerCase)
    val missingColumns = requiredColumnsLower.diff(givenColumns)

That way the check keeps its O(n) time complexity and the problem is solved.
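
For illustration, the comparison described above can be sketched on plain column-name sequences, without Spark. This is only a sketch under stated assumptions, not the library's implementation: the object ColumnCheckSketch, the method missingColumns, and the caseSensitive parameter are hypothetical names introduced here. The flag stands in for whatever value the caller reads from spark.sql.caseSensitive.

```scala
import java.util.Locale

// Hypothetical helper, not part of spark-daria.
object ColumnCheckSketch {

  // Returns the required names that have no match among the actual column
  // names. When caseSensitive is false, names are compared after lowercasing.
  def missingColumns(
      required: Seq[String],
      actual: Seq[String],
      caseSensitive: Boolean = false
  ): Seq[String] = {
    // Locale.ROOT keeps the lowercasing independent of the JVM default locale.
    def normalize(name: String): String =
      if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

    // A Set gives constant-time lookups, so the whole check stays linear.
    val present = actual.map(normalize).toSet

    // Filter the original names so an error message can report them
    // with the casing the caller supplied.
    required.filterNot(name => present.contains(normalize(name)))
  }
}
```

With the names from the failing test, ColumnCheckSketch.missingColumns(Seq("LOWERCASE"), Seq("lowercase")) is empty, while passing caseSensitive = true reports LOWERCASE as missing. Unlike the diff-based change, this variant returns the required names in their original casing rather than lowercased.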
