JSON to BigQuery Type Inference Problems #10451
hamishcraze asked this question in Q&A · Unanswered
Hi everybody. I've been tasked with syncing tables between BigQuery and our customers' infrastructure. I create JSON representations of the customer data and then run a custom Python script with Ibis to ingest it into BigQuery.
I'm having issues with type inference. A table gets created in BigQuery through Ibis, and it receives a schema inferred from the content of the JSON I send it. The problem is that the customer data is sparse, and it is not uncommon for a particular column to have NULL for every single value.
This creates two challenges: (1) correctly inferring the column type when the table is created, and (2) correctly matching the incoming data types to the already existing table when new data arrives. Since you never know when an entire column could be NULL, both are difficult.
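The core of the problem is easy to reproduce: an all-NULL column gives pandas (and anything downstream) nothing to infer a type from:

```python
import pandas as pd

# A sparse column where every value is NULL.
df = pd.DataFrame({"DeliveryCity": [None, None, None]})

# infer_objects() has nothing to work with: the column stays `object`,
# so no real type information ever reaches BigQuery.
print(df.infer_objects().dtypes)  # DeliveryCity    object
```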
My current workaround is to send a dummy row with every JSON request. This dummy row contains fake data, but with the correct types, to make sure the types are inferred correctly when the data is synced to BigQuery. So if there's a key called "Name", I'll populate its value with the word "STRING" in big capital letters, ensuring there's at least one row with a non-NULL value to help type inference down the line.
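For illustration, a payload carrying such a dummy row might look like this (the keys are made up, except DeliveryCity, which appears in the error message below):

```json
[
  {"Name": "STRING", "DeliveryCity": "STRING"},
  {"Name": "Alice", "DeliveryCity": null},
  {"Name": "Bob", "DeliveryCity": null}
]
```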
At this point it is useful to show you some of my code:
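(What follows is a minimal sketch of the flow, with placeholder names such as `payload.json` and `customer_table` standing in for the real ones; the exact insert call may differ between Ibis versions.)

```python
import ibis
import pandas as pd

# Connect to BigQuery through Ibis.
con = ibis.bigquery.connect(
    project_id="example_project",
    dataset_id="example_dataset",  # placeholder dataset
)

# Read the JSON payload into pandas and let pandas guess the column types.
df = pd.read_json("payload.json")
df = df.infer_objects()

# Flag the dummy row (fake values, correct types) so it can be deleted
# after ingestion.
df["DeletionFlag"] = df["Name"].eq("STRING")

table_name = "customer_table"
if table_name in con.list_tables():
    # Table already exists: coerce the DataFrame columns to the existing
    # schema before inserting. Schema.to_pandas() yields (name, dtype) pairs.
    target = con.table(table_name).schema()
    df = df.astype(dict(target.to_pandas()))
    con.insert(table_name, df)
else:
    # First sync: create the table and let Ibis infer the schema.
    con.create_table(table_name, df)
```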
As you can see, I connect to BigQuery using Ibis, read the JSON data into a pandas DataFrame, and ask pandas to infer the types with df.infer_objects(). I mark the dummy row for deletion with a "Deletion Flag", check whether the table exists, and if it does, I explicitly try to convert the pandas DataFrame types to the correct ones. Despite all of this, I still receive this error from BigQuery:
google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/example_project/queries/job-id?maxResults=0&location=US&prettyPrint=false: Query column 16 has type INT64 which cannot be inserted into column DeliveryCity, which has type STRING at [1:853]
In other words, the code above did not actually coerce the types.
So I am at a loss. How can I cast every column explicitly and still end up with an INT64 where a STRING should be? Something must not be doing what it says it's doing, and perhaps the problem lies with Ibis. Any ideas?
Replies: 1 comment

Do you know the schemas of the tables ahead of time? Either way, I'm not sure that inferring schemas using pandas here is helping, and it's probably hurting. I would create the tables in BigQuery with the correct schemas, then use the built-in …
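A minimal sketch of that suggestion, assuming a hypothetical table name and the column names from the error message above; declaring the schema up front removes the dependence on type inference entirely:

```python
import ibis

con = ibis.bigquery.connect(
    project_id="example_project",
    dataset_id="example_dataset",  # placeholder dataset
)

# Declare the schema explicitly instead of inferring it from sparse data.
# Column names and types here are illustrative.
schema = ibis.schema(
    {
        "Name": "string",
        "DeliveryCity": "string",
        "DeletionFlag": "boolean",
    }
)
con.create_table("customer_table", schema=schema)
```

With the table created this way, subsequent inserts only need to match the declared types; an all-NULL column no longer has to carry its own type information.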