
ADR for Data Collection #80

Closed
HumairAK opened this issue Apr 27, 2021 · 12 comments
Labels: ADR-needed, lifecycle/rotten



HumairAK commented Apr 27, 2021

The Operate First environments will create a vast amount of operational data from platform systems and user workloads.
We want to publish this data under a license agreement similar to an open source license.
We still have to operate within the boundaries of the law and therefore cannot publish data whose publication would break it.

We need an ADR that lays out the options for satisfying this requirement.


msdisme commented May 3, 2021

Is there an issue tracking the details of what we would like to capture (I'll want something similar for Decorus), so that I can use it as a basis for discussions with university legal and the IRB?


tumido commented May 5, 2021

Data we want to collect and share:

  1. Application logs from all the applications running in the cluster; this can be practically anything. If an application logs, for example, which users are connecting to it, we will collect that too (a scrubbing sketch for such identifiers follows this list). Example (ODH JupyterHub):
[I 2021-05-05 10:02:42.206 JupyterHub pages:402] [email protected] is pending spawn
[I 2021-05-05 10:02:42.210 JupyterHub log:189] 200 GET /hub/spawn-pending/[email protected] ([email protected]@::ffff:10.131.0.1) 13.28ms
10:02:47.190 [ConfigProxy] info: 200 GET /api/routes
http://10.131.2.139:9090!=http://10.131.2.139:8080
[I 2021-05-05 10:03:00.462 JupyterHub proxy:282] Adding user [email protected] to proxy /user/[email protected]/ => http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: Adding route /user/[email protected] -> http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: Route added /user/[email protected] -> http://10.131.3.105:8080
10:03:00.465 [ConfigProxy] info: 201 POST /api/routes/user/[email protected]
[I 2021-05-05 10:03:00.468 JupyterHub log:189] 200 GET /hub/api (@10.131.3.105) 1.97ms
[I 2021-05-05 10:03:00.469 JupyterHub users:671] Server [email protected] is ready
[I 2021-05-05 10:03:00.471 JupyterHub log:189] 200 GET /hub/api/users/[email protected]/server/progress ([email protected]@::ffff:10.131.0.1) 18057.13ms
[I 2021-05-05 10:03:00.528 JupyterHub log:189] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.131.3.105) 33.95ms
[I 2021-05-05 10:03:00.613 JupyterHub log:189] 302 GET /hub/spawn-pending/[email protected] -> /user/[email protected]/ ([email protected]@::ffff:10.131.0.1) 6.94ms
[I 2021-05-05 10:03:01.023 JupyterHub log:189] 302 GET /hub/api/oauth2/authorize?client_id=jupyterhub-user-tcoufal%2540redhat.com&redirect_uri=%2Fuser%2Ftcoufal%40redhat.com%2Foauth_callback&response_type=code&state=[secret] -> /user/[email protected]/oauth_callback?code=[secret]&state=[secret] ([email protected]@::ffff:10.131.0.1) 34.50ms
[I 2021-05-05 10:03:01.215 JupyterHub log:189] 200 POST /hub/api/oauth2/token ([email protected]@10.131.3.105) 53.71ms
[I 2021-05-05 10:03:01.246 JupyterHub log:189] 200 GET /hub/api/authorizations/token/[secret] ([email protected]@10.131.3.105) 24.52ms
10:03:02.662 [ConfigProxy] info: 200 GET /api/routes
  2. Application metrics, if the application exposes them; each application can define which metrics to expose. These may include PII, for example if a username is used to name a pod (labels can contain anything, really). Example (ODH JupyterHub):
# HELP jupyterhub_server_spawn_duration_seconds time taken for server spawning operation
# TYPE jupyterhub_server_spawn_duration_seconds histogram
jupyterhub_server_spawn_duration_seconds_bucket{le="0.5",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="1.0",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="2.5",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="5.0",status="success"} 0.0
jupyterhub_server_spawn_duration_seconds_bucket{le="10.0",status="success"} 1.0
jupyterhub_server_spawn_duration_seconds_bucket{le="15.0",status="success"} 8.0
jupyterhub_server_spawn_duration_seconds_bucket{le="30.0",status="success"} 27.0
jupyterhub_server_spawn_duration_seconds_bucket{le="60.0",status="success"} 42.0
jupyterhub_server_spawn_duration_seconds_bucket{le="120.0",status="success"} 52.0
jupyterhub_server_spawn_duration_seconds_bucket{le="+Inf",status="success"} 57.0
jupyterhub_server_spawn_duration_seconds_count{status="success"} 57.0
jupyterhub_server_spawn_duration_seconds_sum{status="success"} 3389.5434402088904
  3. Platform events: events generated by the OCP platform itself. Example (spawning a pod):
{"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:31Z","involvedObject":{"apiVersion":"v1","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536743","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:31Z","message":"Add eth0 [10.131.3.106/23]","metadata":{"creationTimestamp":"2021-05-05T10:07:31Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{}},"f:type":{}},"manager":"multus","operation":"Update","time":"2021-05-05T10:07:31Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","namespace":"opf-jupyterhub","resourceVersion":"209537358","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","uid":"40c934ef-4bd2-4f41-8686-e5c979adec62"},"reason":"AddedInterface","reportingComponent":"","reportingInstance":"","source":{"component":"multus"},"type":"Normal"}
{"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:32Z","involvedObject":{"apiVersion":"v1","fieldPath":"spec.containers{notebook}","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536741","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:32Z","message":"Container image \"quay.io/thoth-station/s2i-minimal-notebook@sha256:eacfa74842ce6330991d945408bb37c3e8f37246ff3f1b98837cf7ae4f5a78af\" already present on machine","metadata":{"creationTimestamp":"2021-05-05T10:07:32Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:fieldPath":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}},"manager":"kubelet","operation":"Update","time":"2021-05-05T10:07:32Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","namespace":"opf-jupyterhub","resourceVersion":"209537393","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","uid":"4677bec6-ee3a-4866-9d0f-b3c3e06f86f6"},"reason":"Pulled","reportingComponent":"","reportingInstance":"","source":{"component":"kubelet","host":"os-wrk-1"},"type":"Normal"}
  4. Platform logs: similar to the application logs, but generated by the OCP platform itself. Example (OAuth logs):
I0427 19:19:02.124608       1 named_certificates.go:53] loaded SNI cert [1/"sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud"]: "api.zero.massopen.cloud" [serving,client] validServingFor=[*.apps.zero.massopen.cloud,api.zero.massopen.cloud] issuer="R3" (2021-03-08 12:41:20 +0000 UTC to 2021-06-06 12:41:20 +0000 UTC (now=2021-04-27 19:19:02.124599505 +0000 UTC))
I0427 19:19:02.124830       1 named_certificates.go:53] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1619551141" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1619551141" (2021-04-27 18:19:00 +0000 UTC to 2022-04-27 18:19:00 +0000 UTC (now=2021-04-27 19:19:02.124819977 +0000 UTC))
E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:40.496897       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:40.496905       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0427 19:21:46.638010       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0428 14:28:36.180088       1 osinserver.go:91] internal error: system:serviceaccount:opf-monitoring:grafana-serviceaccount has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
E0503 19:37:02.866939       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
  5. Platform metrics: same data structure as the application metrics, but generated by OCP itself. Sample of the kube_pod_info metric:

kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-0", pod="revision-pruner-5-os-ctrl-0", pod_ip="10.130.0.7", priority_class="system-node-critical", service="kube-state-metrics", uid="c5a1bc73-f28b-4e0f-9cc6-c2c7abd5b0b8"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-apiserver", node="os-ctrl-0", pod="revision-pruner-22-os-ctrl-0", pod_ip="10.130.0.139", priority_class="system-node-critical", service="kube-state-metrics", uid="721d7288-3b7a-4460-be92-bea36e3539fa"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-controller-manager", node="os-ctrl-0", pod="revision-pruner-12-os-ctrl-0", pod_ip="10.130.0.141", priority_class="system-node-critical", service="kube-state-metrics", uid="c92a8983-6ba6-42ed-af6f-535aed848e67"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-scheduler", node="os-ctrl-0", pod="revision-pruner-11-os-ctrl-0", pod_ip="10.130.0.140", priority_class="system-node-critical", service="kube-state-metrics", uid="e799030c-703f-4654-b896-8493f3e2dd35"} 1
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="<none>", created_by_name="<none>", endpoint="https-main", host_ip="192.12.185.111", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-1", pod="revision-pruner-5-os-ctrl-1", pod_ip="10.128.0.121", priority_class="system-node-critical", service="kube-state-metrics", uid="e8df07cc-9ad9-4479-8922-556b2a1cc2ae"} 1
  6. We also collect data derived from the above, such as alerts computed directly from metrics, e.g. AggregatedLoggingSystemCPUHigh alerts#5609
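
Since several of the samples above carry user identifiers in log lines and metric labels, any pipeline publishing this data would need a scrubbing step. A minimal sketch in Python, assuming a naive policy of masking email-like tokens (the regex, placeholder token, and sample line are illustrative assumptions, not an agreed anonymization policy):

import re

# Illustrative only: mask email-like tokens before log lines or metric
# label values leave the cluster. The real anonymization policy is exactly
# what this ADR needs to decide.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrub(line: str) -> str:
    """Replace every email-like token with a stable placeholder."""
    return EMAIL_RE.sub("<redacted-user>", line)

print(scrub("[I 2021-05-05 10:03:00.469 JupyterHub users:671] Server [email protected] is ready"))
# -> [I 2021-05-05 10:03:00.469 JupyterHub users:671] Server <redacted-user> is ready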

Data we host for users and their applications. We do not collect this data intentionally, but users can share it via our platform:

  • We provide block storage that applications and users use to store data. Direct access to this block storage is available only within the platform, and data can be retrieved only via a proxy (the application mounting the storage itself).
  • We provide object storage that can be accessed externally: users can reach this data from outside the platform if they have credentials for their object storage bucket (see the access sketch below).
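
For context, reaching such a bucket from outside the platform typically looks like the following (a sketch using boto3 against an S3-compatible endpoint; the endpoint URL, bucket name, and credential values are placeholders, not real ones):

import boto3

# Placeholder endpoint and credentials; real values are issued per bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.cloud",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# List what the bucket holds: this is the externally reachable path to the
# user-shared data described above.
for obj in s3.list_objects_v2(Bucket="example-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])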


msdisme commented May 5, 2021

Thanks, this is great! Should I break the details in the comment above out into a separate issue, or does it make sense for them to live here?


msdisme commented May 5, 2021

A quick update: I met with the folks who handle IRB review and am scheduling a follow-up discussion with them to dive deeper into the data.


durandom commented May 7, 2021

Operational data specifically excludes users' own data sets, i.e. it is only data generated by the platform: logs, metrics, telemetry.
For logs, it excludes logs from workload pods but includes logs from platform pods (e.g. JupyterHub vs. etcd).
For metrics, it includes CPU metrics for workload pods but not metrics that the application itself exposes (e.g. JupyterHub metrics vs. pod metrics; see the query sketch below).

The same definition can be made for workload data, which should be governed by an opt-in or opt-out policy; see #87
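
To make the metrics distinction concrete, here is a sketch that pulls platform-collected CPU usage for workload pods from the Prometheus HTTP API (the Prometheus URL is a placeholder; container_cpu_usage_seconds_total is produced by the platform's cAdvisor and would be in scope, whereas an application-exposed series like jupyterhub_server_spawn_duration_seconds above would not):

import requests

# Placeholder URL; a real deployment would use the cluster's Prometheus route.
PROM_URL = "https://prometheus.example.cloud/api/v1/query"

# Platform-generated CPU usage per workload pod: in scope per the
# definition above. Metrics the application itself exposes are excluded.
query = 'sum(rate(container_cpu_usage_seconds_total{namespace="opf-jupyterhub"}[5m])) by (pod)'

resp = requests.get(PROM_URL, params={"query": query}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("pod"), result["value"][1])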

HumairAK changed the title from "Write an ADR for Data Collection" to "ADR for Data Collection" on May 12, 2021
@billburnseh

No updates from BU yet.

@billburnseh

The Data Usage Agreement (DUA) is on the table and being discussed, including access to telemetry without anonymization.


quaid commented Aug 26, 2021

We want to publish the data under a license agreement that is similar to an open source license agreement.
We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.

Let's pull together a workstream to study and advise on an approach from an open source licensing perspective:

operate-first/community#79


sesheta commented Nov 24, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta added the lifecycle/stale label on Nov 24, 2021

sesheta commented Dec 24, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 24, 2021

sesheta commented Jan 23, 2022

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close


sesheta commented Jan 23, 2022

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sesheta closed this as completed on Jan 23, 2022