
Empty array filtering issue #1268

Open
vokuk-walkme opened this issue Jul 23, 2024 · 2 comments
Labels: waiting for information (Waiting for additional information from the issue opener)
vokuk-walkme commented Jul 23, 2024

I have a piece of code that reads the same data from GCS or BigQuery. A simplified schema for the data looks like this:

root
 |-- time: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- ids: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to read only the rows where the ids array is not empty, and it seems to work fine with the GCS connector:

spark.read.format("parquet")
  .load("gs://some-path")
  .filter($"ids" =!= Array.empty[String])
  ... // other filters

However, when I try to do the same with the BigQuery connector, I get an error:

spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .load("internal-table-name")
  .filter($"ids" =!= Array.empty[String])
  ... // other filters

24/07/23 11:00:30 INFO ReadSessionCreator: |creation a read session for table null, parameters: |selectedFields=[name,ids],|filter=[(NOT (`ids` = ArraySeq())) AND (`ids` IS NOT NULL) AND (`name` = 'John') AND (`name` IS NOT NULL) AND (`time` <= TIMESTAMP '2024-07-18T23:59:59.999999Z') AND (`time` >= TIMESTAMP '2024-07-18T00:00:00Z') AND (`time` IS NOT NULL)]
com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InvalidArgumentException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Query error: Function not found: ArraySeq at [1:189]
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
...

It seems predicate pushdown doesn't work correctly for this case.
Changing the filter condition to .filter(array_size($"ids") > 0) works, but that condition is most likely not going to be pushed down.
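For what it's worth, the `ArraySeq()` token in the generated filter looks like Scala 2.13's `toString` on the empty sequence that the array literal gets wrapped in before being rendered into the pushed-down SQL. This is a plausible explanation of where the token comes from, not a confirmed trace through the connector's filter translation:

```scala
// Sketch: on Scala 2.13, an empty Array wrapped as a Seq prints as
// "ArraySeq()" -- the exact token that appears in the pushed-down filter,
// and which BigQuery then rejects as an unknown function.
object EmptyArrayLiteralDemo extends App {
  val empty = Array.empty[String]
  println(empty.toSeq.toString) // "ArraySeq()" on Scala 2.13
}
```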

Dependencies used:

"org.apache.spark" %% "spark-sql" % "3.5.0",
"com.google.cloud.bigdataoss" % "gcs-connector" % "3.0.0",
"com.google.cloud.spark" % "[spark-3.5-bigquery / spark-bigquery-with-dependencies]" % "0.39.1"
vishalkarve15 self-assigned this Jul 29, 2024
vokuk-walkme (Author) commented:

Any updates on this?

vishalkarve15 (Contributor) commented:

The filter is not getting translated and pushed down to BigQuery correctly. However, even if it did, BigQuery does not support equality comparisons on arrays.

Please try the workaround:

spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("filter", "ARRAY_LENGTH(ids) != 0")
  .load("internal-table-name")
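
As a hedged sketch (assumptions: a live SparkSession named `spark` with the connector on the classpath; the table name is the placeholder from this thread), the manual `filter` option can be combined with the remaining Spark-side predicates from the original query, since those still translate correctly:

```scala
// Sketch: the "filter" option is passed to BigQuery verbatim as a
// Standard SQL predicate, while ordinary column filters are still
// translated and pushed down by the connector as before.
import spark.implicits._

val df = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("filter", "ARRAY_LENGTH(ids) != 0") // sent to BigQuery as-is
  .load("internal-table-name")
  .filter($"name" === "John") // translated and pushed down normally
```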

vishalkarve15 added the "waiting for information" label Jan 6, 2025