Refactor of schema for pipeline #97
hellais commented Sep 10, 2024 (edited)
- Replace observation_id with observation_idx (closes #87, "Improve performance of the DELETE operation on observation generation")
- Use PARTITION KEY for deduplication instead of running deletes (closes #88, "Improve schema of obs_* tables in preparation for 5.0.0-rc.0")
- Refactor of CLI commands
The above will actually not do it, since the |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #97 +/- ##
==========================================
+ Coverage 82.14% 82.24% +0.09%
==========================================
Files 82 83 +1
Lines 6351 6347 -4
==========================================
+ Hits 5217 5220 +3
+ Misses 1134 1127 -7
* Add extra check on observation ids
In the end, the table migrations were run using the following manual migration script:

CREATE TABLE obs_web_new (
    measurement_uid String,
    observation_idx UInt16,
    input Nullable(String),
    report_id String,
    measurement_start_time Datetime64(3, 'UTC'),
    software_name String,
    software_version String,
    test_name String,
    test_version String,
    bucket_date String,
    probe_asn UInt32,
    probe_cc String,
    probe_as_org_name String,
    probe_as_cc String,
    probe_as_name String,
    network_type String,
    platform String,
    origin String,
    engine_name String,
    engine_version String,
    architecture String,
    resolver_ip String,
    resolver_asn UInt32,
    resolver_cc String,
    resolver_as_org_name String,
    resolver_as_cc String,
    resolver_is_scrubbed UInt8,
    resolver_asn_probe UInt32,
    resolver_as_org_name_probe String,
    created_at Nullable(Datetime('UTC')),
    target_id Nullable(String),
    hostname Nullable(String),
    transaction_id Nullable(UInt16),
    ip Nullable(String),
    port Nullable(UInt16),
    ip_asn Nullable(UInt32),
    ip_as_org_name Nullable(String),
    ip_as_cc Nullable(String),
    ip_cc Nullable(String),
    ip_is_bogon Nullable(UInt8),
    dns_query_type Nullable(String),
    dns_failure Nullable(String),
    dns_engine Nullable(String),
    dns_engine_resolver_address Nullable(String),
    dns_answer_type Nullable(String),
    dns_answer Nullable(String),
    dns_answer_asn Nullable(UInt32),
    dns_answer_as_org_name Nullable(String),
    dns_t Nullable(Float64),
    tcp_failure Nullable(String),
    tcp_success Nullable(UInt8),
    tcp_t Nullable(Float64),
    tls_failure Nullable(String),
    tls_server_name Nullable(String),
    tls_version Nullable(String),
    tls_cipher_suite Nullable(String),
    tls_is_certificate_valid Nullable(UInt8),
    tls_end_entity_certificate_fingerprint Nullable(String),
    tls_end_entity_certificate_subject Nullable(String),
    tls_end_entity_certificate_subject_common_name Nullable(String),
    tls_end_entity_certificate_issuer Nullable(String),
    tls_end_entity_certificate_issuer_common_name Nullable(String),
    tls_end_entity_certificate_san_list Array(String),
    tls_end_entity_certificate_not_valid_after Nullable(Datetime64(3, 'UTC')),
    tls_end_entity_certificate_not_valid_before Nullable(Datetime64(3, 'UTC')),
    tls_certificate_chain_length Nullable(UInt16),
    tls_certificate_chain_fingerprints Array(String),
    tls_handshake_read_count Nullable(UInt16),
    tls_handshake_write_count Nullable(UInt16),
    tls_handshake_read_bytes Nullable(UInt32),
    tls_handshake_write_bytes Nullable(UInt32),
    tls_handshake_last_operation Nullable(String),
    tls_handshake_time Nullable(Float64),
    tls_t Nullable(Float64),
    http_request_url Nullable(String),
    http_network Nullable(String),
    http_alpn Nullable(String),
    http_failure Nullable(String),
    http_request_body_length Nullable(UInt32),
    http_request_method Nullable(String),
    http_runtime Nullable(Float64),
    http_response_body_length Nullable(Int32),
    http_response_body_is_truncated Nullable(UInt8),
    http_response_body_sha1 Nullable(String),
    http_response_status_code Nullable(UInt16),
    http_response_header_location Nullable(String),
    http_response_header_server Nullable(String),
    http_request_redirect_from Nullable(String),
    http_request_body_is_truncated Nullable(UInt8),
    http_t Nullable(Float64),
    probe_analysis Nullable(String)
)
ENGINE = ReplacingMergeTree
-- Partition by the month of the bucket
PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
PRIMARY KEY (measurement_uid, observation_idx)
ORDER BY (measurement_uid, observation_idx, measurement_start_time, probe_cc, probe_asn)
SETTINGS index_granularity = 8192;

INSERT INTO obs_web_new (
    measurement_uid, observation_idx, input, report_id, measurement_start_time,
    software_name, software_version, test_name, test_version,
    probe_asn, probe_cc, probe_as_org_name, probe_as_cc, probe_as_name,
    network_type, platform, origin, engine_name, engine_version, architecture,
    resolver_ip, resolver_asn, resolver_cc, resolver_as_org_name, resolver_as_cc,
    resolver_is_scrubbed, resolver_asn_probe, resolver_as_org_name_probe,
    bucket_date, created_at, target_id, hostname, transaction_id,
    ip, port, ip_asn, ip_as_org_name, ip_as_cc, ip_cc, ip_is_bogon,
    dns_query_type, dns_failure, dns_engine, dns_engine_resolver_address,
    dns_answer_type, dns_answer, dns_answer_asn, dns_answer_as_org_name, dns_t,
    tcp_failure, tcp_success, tcp_t,
    tls_failure, tls_server_name, tls_version, tls_cipher_suite, tls_is_certificate_valid,
    tls_end_entity_certificate_fingerprint, tls_end_entity_certificate_subject,
    tls_end_entity_certificate_subject_common_name, tls_end_entity_certificate_issuer,
    tls_end_entity_certificate_issuer_common_name, tls_end_entity_certificate_san_list,
    tls_end_entity_certificate_not_valid_after, tls_end_entity_certificate_not_valid_before,
    tls_certificate_chain_length, tls_certificate_chain_fingerprints,
    tls_handshake_read_count, tls_handshake_write_count,
    tls_handshake_read_bytes, tls_handshake_write_bytes,
    tls_handshake_last_operation, tls_handshake_time, tls_t,
    http_request_url, http_network, http_alpn, http_failure,
    http_request_body_length, http_request_method, http_runtime,
    http_response_body_length, http_response_body_is_truncated, http_response_body_sha1,
    http_response_status_code, http_response_header_location, http_response_header_server,
    http_request_redirect_from, http_request_body_is_truncated, http_t,
    probe_analysis
)
SELECT
    measurement_uid,
    IF(
        observation_id = '',
        1,
        toUInt16(arraySlice(splitByChar('_', observation_id), -1, 1)[1]) + 1
    ) as observation_idx,
    input, report_id, measurement_start_time,
    software_name, software_version, test_name, test_version,
    probe_asn, probe_cc, probe_as_org_name, probe_as_cc, probe_as_name,
    network_type, platform, origin, engine_name, engine_version, architecture,
    resolver_ip, resolver_asn, resolver_cc, resolver_as_org_name, resolver_as_cc,
    resolver_is_scrubbed, resolver_asn_probe, resolver_as_org_name_probe,
    bucket_date, created_at, target_id, hostname, transaction_id,
    ip, port, ip_asn, ip_as_org_name, ip_as_cc, ip_cc, ip_is_bogon,
    dns_query_type, dns_failure, dns_engine, dns_engine_resolver_address,
    dns_answer_type, dns_answer, dns_answer_asn, dns_answer_as_org_name, dns_t,
    tcp_failure, tcp_success, tcp_t,
    tls_failure, tls_server_name, tls_version, tls_cipher_suite, tls_is_certificate_valid,
    tls_end_entity_certificate_fingerprint, tls_end_entity_certificate_subject,
    tls_end_entity_certificate_subject_common_name, tls_end_entity_certificate_issuer,
    tls_end_entity_certificate_issuer_common_name, tls_end_entity_certificate_san_list,
    tls_end_entity_certificate_not_valid_after, tls_end_entity_certificate_not_valid_before,
    tls_certificate_chain_length, tls_certificate_chain_fingerprints,
    tls_handshake_read_count, tls_handshake_write_count,
    tls_handshake_read_bytes, tls_handshake_write_bytes,
    tls_handshake_last_operation, tls_handshake_time, tls_t,
    http_request_url, http_network, http_alpn, http_failure,
    http_request_body_length, http_request_method, http_runtime,
    http_response_body_length, http_response_body_is_truncated, http_response_body_sha1,
    http_response_status_code, http_response_header_location, http_response_header_server,
    http_request_redirect_from, http_request_body_is_truncated, http_t,
    probe_analysis
FROM obs_web;
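The `observation_idx` expression in the `SELECT` of the migration script is the only column that is transformed rather than copied, so it may help to see it outside SQL. The following is a minimal Python sketch of the same rule, not code from this PR: the function name `observation_idx_from_id` and the sample IDs are hypothetical, and it assumes (as the SQL does) that a non-empty legacy `observation_id` ends in `_<number>`.

```python
def observation_idx_from_id(observation_id: str) -> int:
    """Mirror of the SQL expression used in the migration (hypothetical helper).

    IF(observation_id = '', 1,
       toUInt16(arraySlice(splitByChar('_', observation_id), -1, 1)[1]) + 1)
    """
    # An empty legacy ID maps to index 1.
    if observation_id == "":
        return 1
    # Take the last '_'-separated component, like arraySlice(..., -1, 1)[1].
    last_component = observation_id.split("_")[-1]
    # The legacy suffix is treated as 0-based, the new index as 1-based.
    # int() raises ValueError on a non-numeric suffix; no UInt16 range check is done here.
    return int(last_component) + 1


if __name__ == "__main__":
    # Made-up IDs, used only to illustrate the mapping.
    for legacy_id in ["", "example_uid_0", "example_uid_12"]:
        print(repr(legacy_id), "->", observation_idx_from_id(legacy_id))
```

Note that, under this rule, an empty `observation_id` and one ending in `_0` both map to `observation_idx = 1`.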