You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I think the source of this problem is the way Bigquery Transfer Service handles backfills - ie when doing backfills Match tables are not updated. Source
For example we started using the transfers service on the 20th August and ALL the tables started syncing. Since then, we have a row (snapshot) for every single table for every single day. That is the _DATA_DATE field based on _PARTITIONTIME.
Since we wanted to get historic data we initiated a backfill and it populated all BUT the Match tables with a row for each day in the past. So for example the MIN(_DATA_DATE) in the campaigns table is still 20.8.2018, whereas in the campaign_stats table that MIN(_DATA_DATE) is currently 1.3.2018.
Now in the Looker block in the explore all the joins between the stats and Match tables happen on the id AND _DATA_DATE like this: sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${master_stats._data_raw} = ${campaign._data_raw}
But for all the historic (backfill) data that join will never work since the Match tables are not updated during backfill.
My understanding would be that this particular join condition is there to make sure you use the values of the Match table that were valid at the time the stats were taken. Which is perfectly fine and it works as well if you have the data for match tables for the same period (in our case that is everything after 20.8.2018. But since we don't have the data going that far back we are getting null values when joining with campaign.
A possible fix is to change the condition to use the latest campaign value like so: sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${campaign._data_raw} = ${campaign._latest}
This would mean you are always looking at the latest values of the campaign but at least you would be getting some values. I guess another option would be to try to figure out if the join can happen before the data exists and if it doesn't, use the latest version (or first for that matter).
It this something you have considered, does that make sense or am i missing something and there is a reason why it was done that way?
Thanks,
The text was updated successfully, but these errors were encountered:
I think the source of this problem is the way Bigquery Transfer Service handles backfills - ie when doing backfills Match tables are not updated. Source
For example we started using the transfers service on the 20th August and ALL the tables started syncing. Since then, we have a row (snapshot) for every single table for every single day. That is the
_DATA_DATE
field based on_PARTITIONTIME
.Since we wanted to get historic data we initiated a backfill and it populated all BUT the Match tables with a row for each day in the past. So for example the
MIN(_DATA_DATE)
in the campaigns table is still 20.8.2018, whereas in the campaign_stats table thatMIN(_DATA_DATE)
is currently 1.3.2018.Now in the Looker block in the explore all the joins between the stats and Match tables happen on the
id
AND_DATA_DATE
like this:sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${master_stats._data_raw} = ${campaign._data_raw}
But for all the historic (backfill) data that join will never work since the Match tables are not updated during backfill.
My understanding would be that this particular join condition is there to make sure you use the values of the Match table that were valid at the time the stats were taken. Which is perfectly fine and it works as well if you have the data for match tables for the same period (in our case that is everything after 20.8.2018. But since we don't have the data going that far back we are getting
null
values when joining with campaign.A possible fix is to change the condition to use the latest campaign value like so:
sql_on: ${master_stats.campaign_id} = ${campaign.campaign_id} AND ${campaign._data_raw} = ${campaign._latest}
This would mean you are always looking at the latest values of the campaign but at least you would be getting some values. I guess another option would be to try to figure out if the join can happen before the data exists and if it doesn't, use the latest version (or first for that matter).
It this something you have considered, does that make sense or am i missing something and there is a reason why it was done that way?
Thanks,
The text was updated successfully, but these errors were encountered: