[NEW FEATURE]: TUfast goes Data #123

Open
3 tasks
Noxdor opened this issue Apr 11, 2023 · 14 comments

@Noxdor
Member

Noxdor commented Apr 11, 2023

See TUfast-TUD/TUFastData#1 for more details.

After the mentioned API endpoints are created, they have to be used by the frontend.

  • A tracker in local storage has to record how many of the total saved clicks have already been sent to the API. The delta will be sent at a random interval of 1 to 3 days to distribute the traffic to the API/database evenly.
  • A boolean state array in local storage has to record which features the user is using and which not. This data will be sent to the API once per week to track what percentage of users use each feature.
  • An API call must be sent every time the "hiring"/participating link is clicked.
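
The click tracker from the first bullet could be sketched roughly like this. It is only an illustration: `ClickTracker`, `scheduleNextReport` and `collectDelta` are hypothetical names, not existing TUfast code, and both the storage access and the actual API call are left out on purpose.

```typescript
// Hypothetical shape of the tracker kept in local storage; none of these
// names come from the TUfast code base.
interface ClickTracker {
  reportedClicks: number; // clicks that were already sent to the API
  nextReportAt: number; // epoch milliseconds of the next scheduled report
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Pick the next report time uniformly between 1 and 3 days after `now`.
function scheduleNextReport(now: number, random: () => number = Math.random): number {
  return now + DAY_MS + Math.floor(random() * 2 * DAY_MS);
}

// Returns the delta to send (null if nothing is due or nothing changed)
// together with the tracker state to persist afterwards.
function collectDelta(
  totalClicks: number,
  tracker: ClickTracker,
  now: number,
  random: () => number = Math.random,
): { delta: number | null; tracker: ClickTracker } {
  if (now < tracker.nextReportAt) {
    return { delta: null, tracker };
  }
  // Guard against a reset counter: never report a negative delta.
  const delta = Math.max(0, totalClicks - tracker.reportedClicks);
  return {
    delta: delta > 0 ? delta : null,
    tracker: {
      reportedClicks: tracker.reportedClicks + delta,
      nextReportAt: scheduleNextReport(now, random),
    },
  };
}
```

The caller would presumably persist the returned tracker only after the API call succeeded, so that a failed request does not lose clicks.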
@Noxdor Noxdor added the enhancement New feature or request label Apr 11, 2023
@Noxdor Noxdor self-assigned this Apr 11, 2023
@OliEfr
Member

OliEfr commented Apr 12, 2023

I would find the following statistics interesting.

These are just my ideas; feel free to change them! I use the following flags for classification:
RT: send to API in real time
weekly: send to API only once per week, with a possible database reset in between (only if required)
ETI: easy to implement
HTI: harder to implement

High prio
Number of saved clicks | RT + weekly, HTI
Which features are enabled (OWAFetch, search engine forwarding, AutoLogin) | weekly, ETI
How many active users we have | weekly, ETI

Normal prio
Which degree programs (Studiengänge) are selected | weekly, ETI
Whether Opal courses have been imported | weekly, HTI
How often the popup is opened | RT, ETI
Which banner links were clicked | RT, ETI

Low prio
How often the settings page is opened, and whether it is opened at all | RT, ETI
Whether the user introduction was clicked through | RT, ETI
How often each icon in the popup is clicked | RT, ETI
Which shortcuts are used and how often (there is no localStorage entry for this so far) | RT, ETI

Relevant objects in local storage

Since @C0ntroller's rework I no longer know the corresponding variable names in local storage. I would base everything on the variable names after the MV3 update (i.e. the current main/dev branch)! Maybe @C0ntroller can help by naming the variables for the settings (or knows them by heart)?

@Noxdor

@OliEfr
Member

OliEfr commented Apr 12, 2023

I think:

AutoLogin enabled: isEnabled
Forwarding enabled: fwdEnabled
OWA mails: enabledOWAFetch
Clicks: savedClickCounter
PDFs: pdfInNewTab and pdfInInline
Degree program: studiengang

Ideally, though, this should be confirmed experimentally.

@C0ntroller
Member

C0ntroller commented Apr 13, 2023

I think:

AutoLogin enabled: isEnabled
Forwarding enabled: fwdEnabled
OWA mails: enabledOWAFetch
Clicks: savedClickCounter
PDFs: pdfInNewTab and pdfInInline
Degree program: studiengang

Ideally, though, this should be confirmed experimentally.

That should all be correct. My local storage currently contains all of the following, which should be self-explanatory:

additionalNotificationOnNewMail
enabledOWAFetch
fwdEnabled
isEnabled
pdfInInline
pdfInNewTab
savedClickCounter
studiengang
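
A rough sketch of how the boolean feature array for the weekly report could be derived from these keys. The key names are the ones listed in this thread; that their values are plain booleans is an assumption, and `FEATURE_KEYS` and `toFeatureFlags` are hypothetical names. `savedClickCounter` and `studiengang` are left out because they are presumably not on/off switches.

```typescript
// Storage keys as listed in this thread. Treating their values as booleans
// is an assumption that would need to be checked against the extension.
const FEATURE_KEYS = [
  "isEnabled",
  "fwdEnabled",
  "enabledOWAFetch",
  "pdfInNewTab",
  "pdfInInline",
  "additionalNotificationOnNewMail",
] as const;

type FeatureKey = (typeof FEATURE_KEYS)[number];

// Turn a snapshot of the stored settings into a fixed-order boolean array.
// Missing or non-boolean values count as "feature not used".
function toFeatureFlags(settings: Partial<Record<FeatureKey, unknown>>): boolean[] {
  return FEATURE_KEYS.map((key) => settings[key] === true);
}
```

In the extension, the settings snapshot would presumably be read with `chrome.storage.local.get` before being passed in. The fixed key order matters, because the API would otherwise not know which position belongs to which feature.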

I'm already really not looking forward to the privacy policy ^^

@OliEfr
Member

OliEfr commented Apr 13, 2023

@C0ntroller The privacy policy would only need to be adjusted if user-related data were transmitted, which we will not do. The transmitted data is anonymous and cannot be attributed to anyone. By the way, users don't have to be notified about such changes either.

Nevertheless, there will be a notice and an explanation for users of how and why we do this, for transparency and security.

@C0ntroller
Member

At the very least (!) the degree program is personal data.
And it is definitely debatable whether my personal settings count as personal data.

@Noxdor
Member Author

Noxdor commented Apr 13, 2023

@C0ntroller Personal data is data that could uniquely identify you as a person. Neither the degree program nor your settings setup is unique to one person, so no user can be identified from this data. Moreover, the entries are not stored individually but only as a sum in the database, from which absolutely no conclusions about individual users can be drawn.

@OliEfr
Member

OliEfr commented Apr 13, 2023

Here is my opinion:

In general, the GDPR applies. I got my info from europa.eu, gdpr-info.eu and gdpr.eu.

I have two thoughts:

The first: it is indeed possible to pinpoint an individual person knowing only the saved clicks. After all, how many people have this exact number of saved clicks? Exactly: probably only one. So this could be seen as a personal identifier, just as name plus surname is. (The same argument might hold for the combination of all 6 setting flags, because a unique combination might exist. In contrast, the Studiengang might actually be the least critical data because it doesn't pinpoint an individual but only a large group, IMO.) All of this rests on the assumption that the saved clicks are unique to every user. Do you think they are?

The second: even though the data could theoretically identify someone, we still don't have a physical identifier such as an IP address, CPU hardware information, a name, or an ethnicity. So we cannot really match the data to an individual! Also, the data is non-persistent and can be faked easily, so it is not reliable for identification. The saved clicks, for instance, might change daily.

EDIT: after reading this again, I feel like this might fall under GDPR. Options to proceed:
a) drop this feature (although it's very nice).
b) only collect selected information from TUfast users. This would need to be very carefully designed, though.
c) create an opt-in for the user and update the privacy policy.

EDIT2: After reading it a second time, I am pretty convinced that we cannot do it without a privacy policy + opt-in. Could someone please make an argument against it?

@JulianHGER opinion?

@OliEfr
Member

OliEfr commented Apr 13, 2023

I do not like option c too much. If only 20% of users opt in, we don't gain much. Maybe we could do the opt-in first, see how many users agree, and only then implement the rest?

@OliEfr
Member

OliEfr commented Apr 13, 2023

Here is my analogy:

If you take all the data we collect and create a single string from it, would this string be unique (or would there only be a really small group of people with the same string)? If the answer is yes, then it is personal data!

I don't know - is this analogy correct, or am I off?
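
One way to make the string analogy concrete is to build that string per user and look at the size of the smallest group sharing it (the idea behind k-anonymity). This is only a sketch on made-up data: `UserSnapshot`, `fingerprint` and `smallestGroupSize` are hypothetical names, and in the planned setup we would never actually hold such per-user records.

```typescript
// Hypothetical per-user record, only used to illustrate the analogy.
interface UserSnapshot {
  savedClicks: number;
  featureFlags: boolean[];
  studiengang: string;
}

// Concatenate everything that would be collected into a single string.
function fingerprint(snapshot: UserSnapshot): string {
  const flagBits = snapshot.featureFlags.map((flag) => (flag ? "1" : "0")).join("");
  return [String(snapshot.savedClicks), flagBits, snapshot.studiengang].join("|");
}

// Size of the smallest group of users sharing the same string.
// A result of 1 means at least one user is unique in the data set.
function smallestGroupSize(snapshots: UserSnapshot[]): number {
  const counts: Record<string, number> = {};
  for (const snapshot of snapshots) {
    const key = fingerprint(snapshot);
    counts[key] = (counts[key] || 0) + 1;
  }
  const sizes = Object.keys(counts).map((key) => counts[key]);
  return sizes.length === 0 ? 0 : Math.min(...sizes);
}
```

Including the exact click count makes a result of 1 very likely, which matches the worry about saved clicks above.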

@Noxdor
Member Author

Noxdor commented Apr 14, 2023

I am going to argue that neither the number of clicks nor the individual setup of which features are used will enable us to identify a single individual. This follows directly from how we will save this data in the database. The GDPR, as far as I understand, only applies to how the data is saved not how it is transferred to the saving solution (database), otherwise no anonymous collection would even be possible, because data is identifiable until it reaches the database (simply by how TCP and HTTP work, no connection without connectants).

So why will we not be able to identify a user? Simply because we will not save each request to the database as a new row/data point. We will only increment a counter with it. As soon as 2 data points are mixed into that counter, it is impossible to identify a user. The sum can't be split back into its addends without knowing what they are (which we will not know, given how the database and API are set up). So the information you are talking about (the total number of clicks of a single user, the boolean array of activated features of a single user) won't exist, and therefore causes no issue with GDPR.

For example, assume the database contains the counter "total_saved_clicks" with the value 55. Is this 55 the sum of 50 and 5, 45 and 10, or 13 and 42? We can never know, and therefore can never identify a user. Transferring the delta value makes this even harder, because we don't know the starting click count of that user, so the user's actual total number of clicks cannot be traced back.
The same argument can be made (except for the delta value, which isn't relevant here) for the boolean array of activated features. User A uses features 1 and 3, user B uses features 2 and 4. Now the database says: 1 of 2 users uses feature 1. It is impossible to know which user that is, since we don't save any identifier connecting a feature usage to a user.

That holds as long as we don't keep track of things like "most clicks by a single user", which would indeed identify a single user.
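
The counter-only storage described above could look roughly like this. It is an in-memory sketch, not the planned API or database schema: `AggregateStats`, `emptyStats`, `addClickDelta` and `addFeatureReport` are hypothetical names. The point is that no request is stored as its own row.

```typescript
// Everything the backend would keep: sums only, no per-request rows.
interface AggregateStats {
  totalSavedClicks: number;
  reportingUsers: number;
  featureUsers: number[]; // featureUsers[i] = number of reports with feature i enabled
}

function emptyStats(featureCount: number): AggregateStats {
  const featureUsers: number[] = [];
  for (let i = 0; i < featureCount; i++) {
    featureUsers.push(0);
  }
  return { totalSavedClicks: 0, reportingUsers: 0, featureUsers };
}

// A click report only increments the shared counter.
function addClickDelta(stats: AggregateStats, delta: number): void {
  stats.totalSavedClicks += delta;
}

// A feature report only increments one counter per enabled feature.
function addFeatureReport(stats: AggregateStats, flags: boolean[]): void {
  stats.reportingUsers += 1;
  flags.forEach((enabled, index) => {
    if (enabled && index < stats.featureUsers.length) {
      stats.featureUsers[index] += 1;
    }
  });
}
```

Feeding in the deltas 50 and 5 or the deltas 45 and 10 leaves exactly the same state behind, which is the "sum can't be split back into its addends" argument in code form.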

@OliEfr
Member

OliEfr commented Apr 14, 2023

Yes, I agree with you. The argument you make holds. Intuitively I also don't view the data we intend to use as personal.

However, I am not sure about the first premise:

The GDPR, as far as I understand, only applies to how the data is saved not how it is transferred to the saving solution (database), otherwise no anonymous collection would even be possible, because data is identifiable until it reaches the database (simply by how TCP and HTTP work, no connection without connectants).

My intuition was the following: it doesn't matter how the data is saved. The only thing that matters is which data is transferred and whether this data can theoretically be used to identify a user, independent of how you actually save and use it. Because in the end, the user has no control over the data once it has been transferred to a remote entity. And yes, you would be correct in that case: all HTTP traffic would be personal.

I see that this intuition makes only limited sense. I couldn't resolve this question with a quick Google search for now.

@OliEfr
Member

OliEfr commented Apr 14, 2023

@C0ntroller
Member

C0ntroller commented Apr 14, 2023

GDPR states that any personal data that is transferred to our backend and processed in any way must be disclosed.

If we just talk about the technicalities of connection, it should be safe, as the IP address is not processed by any program.
But for example, if the web server logs accesses - we must disclose that the IP address is stored, even if we never use it.

Also, GDPR clearly states:

“personal data” means any information relating to an identified or identifiable natural person (“data subject”); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person; …

(translated from the German text; I didn't find the equivalent English section, source is German Wikipedia)

The degree program is part of a social identity.
One could argue that merely using TUfast, and having that usage logged, is enough to show that this person is a student of TU Dresden (rough location data plus the social identity of student or employee at TUD).
Currently, the stats we have are collected by Google, and users have to agree to being tracked by Google.

GDPR also states that personal data is any data that could potentially make someone identifiable.
That means that if we can tell it is the same anonymous user, we have identified them (even if we don't know their name etc.).
So when tracking the saved clicks, for example, there could be one guy with >500k saved clicks.
We would know every week who this guy is, as he's the only one with that score, and we know he's a power user → so we know some attributes of him.
Because this could potentially happen, disclosure is needed.

Privacy law and making something compliant with that law is hell.
That's one of the reasons most websites have their cookie banners made by a third party.
It's also why you should always have a privacy policy (Datenschutzerklärung) even if you think it's not necessary (you never want to fight over this in court).

Unrelated: in my opinion, you should always give the user the choice anyway (opt-in).

@OliEfr
Member

OliEfr commented Apr 15, 2023

I agree - GDPR + opt-in is required.

Option A) Abort this feature
Option B) Ask for an opt-in first, then see how many agree, then implement the tracking
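
For Option B, the opt-in could act as a gate in front of every outgoing report. A minimal sketch, assuming the user's decision is stored as one of three states; `ConsentState`, `StatsPayload` and `payloadToSend` are hypothetical names, not existing TUfast code.

```typescript
// "unset" means the user has not been asked or has not answered yet.
type ConsentState = "unset" | "granted" | "denied";

// Hypothetical payload; the fields mirror the statistics discussed in this issue.
interface StatsPayload {
  clickDelta: number | null;
  featureFlags: boolean[] | null;
}

// Returns the payload that may leave the browser, or null if nothing may be sent.
// Only an explicit "granted" allows transmission; "unset" is treated like "denied".
function payloadToSend(consent: ConsentState, payload: StatsPayload): StatsPayload | null {
  return consent === "granted" ? payload : null;
}
```

Treating "unset" like "denied" is what makes this an opt-in rather than an opt-out.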


3 participants