Evaluate readiness to union data relay data and historical clearinghouse data #423
Comments
We'll consider this issue done when Summer has a writeup on findings and appropriate follow-ups (potentially a QC model and specific fixes). Issues will then be created for the follow-on tasks.
After comparing the data relay server data to the clearinghouse data, my conclusion is that we are not ready to union these two datasets. A significant amount of data appears to be missing from the data relay server, and we will want to resolve that before cutting off the clearinghouse pipeline.

My recommendations:

Investigation details:

Snowflake Notebook of this analysis: link.

Using a representative sample of station IDs (including ~300 station IDs per district), I counted the number of observations per day in the clearinghouse dataset (unique by station ID and timestamp) and compared that to the same data in the data relay server. In the graph below, the dark blue line represents the number of observations per day in the clearinghouse dataset, while the light blue line is the number of observations in the data relay server.

As you can see, the totals have never matched exactly between the two datasets, and the data relay server consistently has fewer observations than the clearinghouse, with particularly large gaps starting at the end of September and continuing to today.

Sometimes this missing data is the result of the data relay server missing some (but not all) timestamp values in a given day. For example, this station ID has data in both clearinghouse and data_relay on October 15th up until 20:00:03, but all timestamps after that on 10/15 are missing from the data_relay server. And sometimes this missing data is the result of entire days' worth of data not being uploaded to S3.

While there is a data discrepancy in all districts that should be investigated, District 4 and District 7 have a significant amount of missing data (this graph shows the number of observations in each district from Oct 1st to Oct 26th). We noticed that District 7 has not been collecting any data in S3 since October 7th.
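For reference, here is a rough sketch of the per-day comparison described above, run from Python against Snowflake. The table names (clearinghouse_obs, data_relay_obs), the station_sample table of sampled station IDs, and the connection details are illustrative assumptions, not the project's actual models.

```python
# Sketch only: compare daily observation counts between the two sources.
# Table/column names and connection details are placeholders.
import snowflake.connector

COMPARISON_SQL = """
with clearinghouse as (
    select date_trunc('day', sample_timestamp) as obs_day,
           count(distinct station_id, sample_timestamp) as n_obs
    from clearinghouse_obs   -- hypothetical clearinghouse observations table
    where station_id in (select station_id from station_sample)  -- ~300 IDs per district
    group by 1
),
data_relay as (
    select date_trunc('day', sample_timestamp) as obs_day,
           count(distinct station_id, sample_timestamp) as n_obs
    from data_relay_obs      -- hypothetical data relay observations table
    where station_id in (select station_id from station_sample)
    group by 1
)
select c.obs_day,
       c.n_obs as clearinghouse_obs,
       coalesce(d.n_obs, 0) as data_relay_obs,
       c.n_obs - coalesce(d.n_obs, 0) as missing_in_relay
from clearinghouse c
left join data_relay d using (obs_day)
order by c.obs_day
"""

conn = snowflake.connector.connect(
    account="<account>", user="<user>", authenticator="externalbrowser"
)
df = conn.cursor().execute(COMPARISON_SQL).fetch_pandas_all()
print(df.tail(30))  # large missing_in_relay values flag the gaps discussed above
```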
@summer-mothwood @ian-r-rose @ZhenyuZhu-Caltrans @kengodleskidot @jkarpen Updates: since Oct 7, District 4's json -> parquet conversion has been failing due to a shortage of memory, and the same is happening for District 7. The intermediate json files are exceeding the internal Linux machine's memory capacity. The mitigation is straightforward, but it does not fully address the station IDs missing before Oct 7, so more investigation is needed.
I recommend not using JSON as an intermediate representation; it has very poor memory characteristics. Notice that, in terms of file sizes, parquet is about 100x smaller than JSON. Also, it looks like the size of the JSON files increased by about 50% on average. Do you know why that might have happened?
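To get a feel for the size gap being described, a quick comparison can be run on synthetic data; the columns, row count, and resulting ratio below are made up for illustration and will differ from the real station files.

```python
# Rough sanity check of the JSON-vs-parquet size gap on synthetic data.
# Column names and sizes are placeholders, not the real relay schema.
import os
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "station_id": np.random.randint(400_000, 420_000, size=n),
    "sample_timestamp": pd.Timestamp("2024-10-15") + pd.to_timedelta(np.arange(n), unit="s"),
    "flow": np.random.randint(0, 50, size=n),
    "occupancy": np.random.random(size=n).round(4),
})
df.to_json("sample.json", orient="records", lines=True)
df.to_parquet("sample.parquet", compression="snappy")
print("json bytes:   ", os.path.getsize("sample.json"))
print("parquet bytes:", os.path.getsize("sample.parquet"))
```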
Fully understood. Parquet's efficient compression makes it ideal for transferring large volumes of data over the network from Caltrans to Snowflake. For this particular out-of-memory issue, I can apply a quick fix by splitting the large json into smaller chunks, converting them to parquet separately, and merging them into a bigger file. With this new function, I'll re-upload the parquet files that failed in the past. It is worth noting that our Dev / Prod environments are currently still mixed; if this is not addressed, it will be a constant risk to system stability. We are working hard with our IT department on getting new Prod machines provisioned for PeMS.
RE: "Also, it looks like on average, the size of the JSON files increased by about 50%. Do you know why that might have happened?"
I also noticed that. After I re-upload all the missing data and check the remaining gap, we can dig further if this is still the case.
The solution for the data loss has been found. Currently, the following code line fails to reserve memory for the bigger json files (>2.5 GB). Splitting the big json into smaller parts and reading them separately will solve the issue. I'm going ahead to apply the fix to the data relay and will verify the counts.
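Here is a minimal sketch of the chunked conversion, assuming the intermediate files are newline-delimited JSON and that pandas and pyarrow are available on the relay machine; paths, chunk size, and file names are illustrative. If the files are instead one large JSON array, a streaming parser such as ijson would be needed in place of pandas' line-based reader.

```python
# Sketch: convert a large JSON file to parquet in chunks so the whole
# ~2.5 GB file never has to sit in memory at once.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def json_to_parquet_chunked(json_path: str, parquet_path: str, chunksize: int = 500_000) -> None:
    writer = None
    # lines=True + chunksize streams the file instead of loading it whole;
    # assumes every chunk shares the same schema as the first one.
    for chunk in pd.read_json(json_path, lines=True, chunksize=chunksize):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema, compression="snappy")
        writer.write_table(table)
    if writer is not None:
        writer.close()

# Hypothetical file names, just to show usage.
json_to_parquet_chunked("d07_station_raw_2024_10_07.json", "d07_station_raw_2024_10_07.parquet")
```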
I think a much better solution would be to write the files as parquet in the first place. All of the parquet files for D7 in S3 are 10-20 MB, and basically any machine should be able to handle those. I'm not sure what the constraints are that prevent using parquet as an intermediate storage format. Do you understand why the files are so large? Even with the poor memory footprint of JSON files, I'm a little surprised that they are 2.5 GB.
"what the constraints are that prevent using parquet as an intermediate storage format."
|
I know you've talked about this before, but I don't really understand it. Why would changing a file format require a new network drive? If you're writing JSON, couldn't you just write parquet instead?
We can, however, monitor data quality within Snowflake. I don't think that JSON+Logstash is necessarily the best tool for monitoring billions of records hosted as JSON blobs on a single machine (we have a whole scalable data warehouse for that!). One idea we discussed a while ago was to write JSON logs in your pipeline that described what was being done, rather than writing the whole dataset as JSON. Can we revisit that?
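A rough sketch of the "JSON logs instead of JSON data" idea, using only the standard library; the event and field names are placeholders, not an agreed logging schema.

```python
# Sketch: emit one small JSON log line per batch describing what was done,
# rather than serializing the whole dataset as JSON. Field names are placeholders.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pems_relay")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_batch(district: str, source_file: str, row_count: int, s3_key: str) -> None:
    logger.info(json.dumps({
        "event": "batch_uploaded",
        "district": district,
        "source_file": source_file,
        "row_count": row_count,
        "s3_key": s3_key,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }))

# Example call with made-up values.
log_batch("D07", "d07_2024_10_07.parquet", 1_234_567, "s3://bucket/prefix/d07_2024_10_07.parquet")
```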
I would only comment on point 1: currently we write the intermediate data to Kafka (in json), and Kafka is an isolated environment dedicated to PeMS. Kafka cannot accept parquet.
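For context, here is a minimal sketch of what a JSON-over-Kafka producer typically looks like; the kafka-python client, broker address, topic name, and message fields are assumptions about the setup rather than the actual relay code.

```python
# Sketch of the JSON-over-Kafka pattern described above. Kafka only sees
# bytes per message, which is why small row-oriented JSON payloads fit this
# stage more naturally than a columnar, file-oriented format like parquet.
import json
from kafka import KafkaProducer  # kafka-python; client choice is an assumption

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("pems-station-raw", {  # placeholder topic and fields
    "station_id": 401234,
    "sample_timestamp": "2024-10-15T20:00:03",
    "flow": 12,
    "occupancy": 0.0413,
})
producer.flush()
```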
Since the work/conversation for #453 has been happening in this ticket, I closed #453 and we'll continue the discussion here. Thank you for letting us know about the inability to save parquet files in Kafka, @pingpingxiu-DOT-ca-gov. Is your solution of batching the json files moving along?
In #270 we are going to union the data relay data and the historical clearinghouse data. It would be good to have a model (either ad-hoc or ongoing) to evaluate the readiness to do that. This would go a long way towards making sure it is successful when we do it.
A few thoughts of things to check (some of which may overlap in implementation):
TRANSFORMER_DEV role has, so it may not be the highest priority right now)

@summer-mothwood let's plan to chat through some of these in detail