
ADR for Data Collection #80

Closed
HumairAK opened this issue Apr 27, 2021 · 12 comments
Labels: ADR-needed, lifecycle/rotten



HumairAK commented Apr 27, 2021

The Operate First environments will create a vast amount of operational data from platform systems and user workloads.
We want to publish this data under a license agreement similar to an open source license.
We still have to operate within the boundaries of the law and therefore cannot publish data whose publication would break it.

We need an ADR that lays out the options for satisfying this requirement.


msdisme commented May 3, 2021

Is there an issue tracking the details of what we would like to capture (I'll want something similar for Decorus), so that I can use it as a basis for discussions with university legal and the IRB?


tumido commented May 5, 2021

Data we want to collect and share:

  1. Application logs from all the applications running in the cluster; this can be practically anything. If an application logs, for example, which users are connecting to it, we will collect that too (a scrubbing sketch for such identifiers follows this list). Example (ODH JupyterHub):
[I 2021-05-05 10:02:42.206 JupyterHub pages:402] [email protected] is pending spawn
[I 2021-05-05 10:02:42.210 JupyterHub log:189] 200 GET /hub/spawn-pending/[email protected] ([email protected]@::ffff:10.131.0.1) 13.28ms
10:02:47.190 [ConfigProxy] info: 200 GET /api/routes
http://10.131.2.139:9090!=http://10.131.2.139:8080
[I 2021-05-05 10:03:00.462 JupyterHub proxy:282] Adding user [email protected] to proxy /user/[email protected]/ => http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: Adding route /user/[email protected] -> http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: Route added /user/[email protected] -> http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: 201 POST /api/routes/user/[email protected]
[I 2021-05-05 10:03:00.468 JupyterHub log:189] 200 GET /hub/api (@10.131.3.105) 1.97ms
[I 2021-05-05 10:03:00.469 JupyterHub users:671] Server [email protected] is ready
[I 2021-05-05 10:03:00.471 JupyterHub log:189] 200 GET /hub/api/users/[email protected]/server/progress ([email protected]@::ffff:10.131.0.1) 18057.13ms
[I 2021-05-05 10:03:00.528 JupyterHub log:189] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.131.3.105) 33.95ms
[I 2021-05-05 10:03:00.613 JupyterHub log:189] 302 GET /hub/spawn-pending/[email protected] -> /user/[email protected]/ ([email protected]@::ffff:10.131.0.1) 6.94ms
[I 2021-05-05 10:03:01.023 JupyterHub log:189] 302 GET /hub/api/oauth2/authorize?client_id=jupyterhub-user-tcoufal%2540redhat.com&redirect_uri=%2Fuser%2Ftcoufal%40redhat.com%2Foauth_callback&response_type=code&state=[secret] -> /user/[email protected]/oauth_callback?code=[secret]&state=[secret] ([email protected]@::ffff:10.131.0.1) 34.50ms
[I 2021-05-05 10:03:01.215 JupyterHub log:189] 200 POST /hub/api/oauth2/token ([email protected]@10.131.3.105) 53.71ms
[I 2021-05-05 10:03:01.246 JupyterHub log:189] 200 GET /hub/api/authorizations/token/[secret] ([email protected]@10.131.3.105) 24.52ms
10:03:02.662 [ConfigProxy] info: 200 GET /api/routes
  2. Application metrics, if the application exposes them; each application can define which metrics to expose. These may include PII, for example if a username is used to name a pod (labels can contain anything, really). Example (ODH JupyterHub):
# HELP jupyterhub_server_spawn_duration_seconds time taken for server spawning operation
# TYPE jupyterhub_server_spawn_duration_seconds histogram
jupyterhub_server_spawn_duration_seconds_bucket{le="0.5",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="1.0",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="2.5",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="5.0",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="10.0",status="success"} 1.0
jupyterhub_server_spawn_duration_seconds_bucket{le="15.0",status="success"} 8.0
jupyterhub_server_spawn_duration_seconds_bucket{le="30.0",status="success"} 27.0
jupyterhub_server_spawn_duration_seconds_bucket{le="60.0",status="success"} 42.0
jupyterhub_server_spawn_duration_seconds_bucket{le="120.0",status="success"} 52.0
jupyterhub_server_spawn_duration_seconds_bucket{le="+Inf",status="success"} 57.0
jupyterhub_server_spawn_duration_seconds_count{status="success"} 57.0
jupyterhub_server_spawn_duration_seconds_sum{status="success"} 3389.5434402088904
  3. Platform events: events generated by the OCP platform itself. Example (spawning a pod):
{"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:31Z","involvedObject":{"apiVersion":"v1","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536743","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:31Z","message":"Add eth0 [10.131.3.106/23]","metadata":{"creationTimestamp":"2021-05-05T10:07:31Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{}},"f:type":{}},"manager":"multus","operation":"Update","time":"2021-05-05T10:07:31Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","namespace":"opf-jupyterhub","resourceVersion":"209537358","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","uid":"40c934ef-4bd2-4f41-8686-e5c979adec62"},"reason":"AddedInterface","reportingComponent":"","reportingInstance":"","source":{"component":"multus"},"type":"Normal"}
{"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:32Z","involvedObject":{"apiVersion":"v1","fieldPath":"spec.containers{notebook}","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536741","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:32Z","message":"Container image \"quay.io/thoth-station/s2i-minimal-notebook@sha256:eacfa74842ce6330991d945408bb37c3e8f37246ff3f1b98837cf7ae4f5a78af\" already present on machine","metadata":{"creationTimestamp":"2021-05-05T10:07:32Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:fieldPath":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}},"manager":"kubelet","operation":"Update","time":"2021-05-05T10:07:32Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","namespace":"opf-jupyterhub","resourceVersion":"209537393","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","uid":"4677bec6-ee3a-4866-9d0f-b3c3e06f86f6"},"reason":"Pulled","reportingComponent":"","reportingInstance":"","source":{"component":"kubelet","host":"os-wrk-1"},"type":"Normal"}
  4. Platform logs: similar to the application logs, but generated by the OCP platform itself. Example (OAuth logs):
I0427 19:19:02.124608       1 named_certificates.go:53] loaded SNI cert [1/"sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud"]: "api.zero.massopen.cloud" [serving,client] validServingFor=[*.apps.zero.massopen.cloud,api.zero.massopen.cloud] issuer="R3" (2021-03-08 12:41:20 +0000 UTC to 2021-06-06 12:41:20 +0000 UTC (now=2021-04-27 19:19:02.124599505 +0000 UTC))
I0427 19:19:02.124830       1 named_certificates.go:53] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1619551141" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1619551141" (2021-04-27 18:19:00 +0000 UTC to 2022-04-27 18:19:00 +0000 UTC (now=2021-04-27 19:19:02.124819977 +0000 UTC))
E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:40.496897       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:40.496905       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:46.638010       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0428 14:28:36.180088       1 osinserver.go:91] internal error: system:serviceaccount:opf-monitoring:grafana-serviceaccount has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0503 19:37:02.866939       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
  5. Platform metrics: same data structure as the application metrics, but generated by OCP itself. Sample of the kube_pod_info metric:

kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-0", pod="revision-pruner-5-os-ctrl-0", pod_ip="10.130.0.7", priority_class="system-node-critical", service="kube-state-metrics", uid="c5a1bc73-f28b-4e0f-9cc6-c2c7abd5b0b8"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-apiserver", node="os-ctrl-0", pod="revision-pruner-22-os-ctrl-0", pod_ip="10.130.0.139", priority_class="system-node-critical", service="kube-state-metrics", uid="721d7288-3b7a-4460-be92-bea36e3539fa"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-controller-manager", node="os-ctrl-0", pod="revision-pruner-12-os-ctrl-0", pod_ip="10.130.0.141", priority_class="system-node-critical", service="kube-state-metrics", uid="c92a8983-6ba6-42ed-af6f-535aed848e67"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-scheduler", node="os-ctrl-0", pod="revision-pruner-11-os-ctrl-0", pod_ip="10.130.0.140", priority_class="system-node-critical", service="kube-state-metrics", uid="e799030c-703f-4654-b896-8493f3e2dd35"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.111", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-1", pod="revision-pruner-5-os-ctrl-1", pod_ip="10.128.0.121", priority_class="system-node-critical", service="kube-state-metrics", uid="e8df07cc-9ad9-4479-8922-556b2a1cc2ae"} 1
  6. We also collect data derived from the above, such as alerts computed directly from metrics, e.g. AggregatedLoggingSystemCPUHigh alerts#5609
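
Since several of the samples above carry user identifiers in log lines and metric labels, any pipeline publishing this data would need a scrubbing step. A minimal sketch in Python, assuming a naive policy of masking email-like tokens (the regex, placeholder token, and sample line are illustrative assumptions, not an agreed anonymization policy):

import re

# Illustrative only: mask email-like tokens before log lines or metric
# label values leave the cluster. The real anonymization policy is exactly
# what this ADR needs to decide.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrub(line: str) -> str:
    """Replace every email-like token with a stable placeholder."""
    return EMAIL_RE.sub("<redacted-user>", line)

print(scrub("[I 2021-05-05 10:03:00.469 JupyterHub users:671] Server [email protected] is ready"))
# -> [I 2021-05-05 10:03:00.469 JupyterHub users:671] Server <redacted-user> is ready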

Data we host for users and their applications. We do not collect this data intentionally, but users can share it via our platform:

  • We provide block storage that applications and users use to store data. Direct access to this block storage is available only within the platform, and data can be retrieved only via a proxy (the application mounting the storage itself).
  • We provide object storage that can be accessed externally: users can reach this data from outside the platform if they have credentials for their object storage bucket (see the access sketch below).
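
For context, reaching such a bucket from outside the platform typically looks like the following (a sketch using boto3 against an S3-compatible endpoint; the endpoint URL, bucket name, and credential values are placeholders, not real ones):

import boto3

# Placeholder endpoint and credentials; real values are issued per bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.cloud",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# List what the bucket holds: this is the externally reachable path to the
# user-shared data described above.
for obj in s3.list_objects_v2(Bucket="example-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])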


msdisme commented May 5, 2021

Thanks, this is great! Should I break the details in the comment above out into a separate issue, or does it make sense for them to live here?


msdisme commented May 5, 2021

A quick update: I met with the folks who handle IRB review and am scheduling a follow-up discussion with them to dive deeper into the data.


durandom commented May 7, 2021

Operational data specifically excludes users' own data sets, i.e. it is only data generated by the platform: logs, metrics, telemetry.
For logs, it excludes logs from workload pods but includes logs from platform pods (e.g. JupyterHub vs. etcd).
For metrics, it includes CPU metrics for workload pods but not metrics that the application itself exposes (e.g. JupyterHub metrics vs. pod metrics; see the query sketch below).

The same definition can be made for workload data, which should be governed by an opt-in or opt-out policy; see #87
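
To make the metrics distinction concrete, here is a sketch that pulls platform-collected CPU usage for workload pods from the Prometheus HTTP API (the Prometheus URL is a placeholder; container_cpu_usage_seconds_total is produced by the platform's cAdvisor and would be in scope, whereas an application-exposed series like jupyterhub_server_spawn_duration_seconds above would not):

import requests

# Placeholder URL; a real deployment would use the cluster's Prometheus route.
PROM_URL = "https://prometheus.example.cloud/api/v1/query"

# Platform-generated CPU usage per workload pod: in scope per the
# definition above. Metrics the application itself exposes are excluded.
query = 'sum(rate(container_cpu_usage_seconds_total{namespace="opf-jupyterhub"}[5m])) by (pod)'

resp = requests.get(PROM_URL, params={"query": query}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("pod"), result["value"][1])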

HumairAK changed the title from "Write an ADR for Data Collection" to "ADR for Data Collection" on May 12, 2021
@billburnseh

No updates from BU yet.

@billburnseh

The Data Usage Agreement (DUA) is on the table and being discussed, including access to telemetry without anonymization.


quaid commented Aug 26, 2021

We want to publish the data under a license agreement that is similar to an open source license agreement.
We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.

Let's pull together a workstream to study and advise on an approach from an open source licensing perspective:

operate-first/community#79


sesheta commented Nov 24, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta added the lifecycle/stale label on Nov 24, 2021

sesheta commented Dec 24, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 24, 2021

sesheta commented Jan 23, 2022

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close


sesheta commented Jan 23, 2022

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sesheta closed this as completed on Jan 23, 2022