From c4998cd07fbd3b15f2f0c25a1deaf5d9b184f796 Mon Sep 17 00:00:00 2001
From: Djura Smits
Date: Tue, 6 Aug 2024 13:18:42 +0200
Subject: [PATCH 1/5] Incorporated TODO feedback

---
 episodes/chap1_introduction_FL.md | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/episodes/chap1_introduction_FL.md b/episodes/chap1_introduction_FL.md
index e21f15f8..783de0fb 100644
--- a/episodes/chap1_introduction_FL.md
+++ b/episodes/chap1_introduction_FL.md
@@ -128,12 +128,11 @@
 branch of techniques that can be used for this is Secure Multiparty Computation (MPC).
 With MPC, computations are performed collaboratively by multiple parties. Data is encrypted
 in such a way that other parties cannot see the original values, but values of multiple
 parties can still be combined (
-e.g. added or
-multiplied).
+e.g. added or multiplied).
 A classic technique from the field of MPC is secret sharing. With this technique data is encrypted,
 after which pieces of the encryption are sent to the other parties. No single party will be able to
-reconstruct the original value. Only when all parties work together, the original value can be
-retrieved.
+reconstruct the original value. Only when a certain minimum number of parties (n-1 in many cases)
+work together can the original value be retrieved.
 When combining multiple values using secret sharing, this will result in the parties owning
 new puzzle pieces that when put together will reveal the result of the computation.
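To make the secret sharing idea in the hunk above concrete, here is a minimal sketch of *additive* secret sharing in plain Python. It is illustrative only: the helper names are invented for this sketch, and this is not the vantage6 API. Note that plain additive sharing needs all shares for reconstruction; threshold schemes such as Shamir's are what allow recovery from fewer shares (e.g. n-1 of them).

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo this prime

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n shares; any subset of fewer shares looks uniformly random."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the secret (mod PRIME).
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Only the sum of all shares reveals the secret."""
    return sum(shares) % PRIME

# Two parties secret-share their private values among three parties ...
alice_shares = share(42, 3)
bob_shares = share(100, 3)

# ... each party locally adds the two shares it holds ...
sum_shares = [(a + b) % PRIME for a, b in zip(alice_shares, bob_shares)]

# ... and reconstructing the combined shares yields 42 + 100 = 142,
# even though no party ever saw 42 or 100 in the clear.
print(reconstruct(sum_shares))  # 142
```

This is exactly the "puzzle pieces" intuition: combining each party's pieces of two puzzles yields the pieces of the puzzle for the sum.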
@@ -176,7 +175,9 @@
 weighs.

 An aggregation is differentially private when someone cannot infer whether a particular individual
 was used in the computation. A way to make a result more differentially private is to replace a
-selection of inputs with random noise. A single individual will then always be able to deny that
+selection of inputs with random noise. Provided the dataset is large enough, and the noise is distributed
+appropriately for the data, this will approximately retain the same statistical properties as the
+original dataset. Because of this noise, a single individual will always be able to deny that
 their data has contributed to the final result. An individual has _plausible deniability_ with
 regards to whether it was part of the dataset.
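As a sketch of what "replacing a selection of inputs with random noise" can look like in practice, below is randomized response, one classic input-perturbation scheme that gives exactly this kind of plausible deniability. The truth probability and the population are made up for this illustration.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a coin flip."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5  # random noise instead of the real value

def estimate_true_fraction(reports: list[bool], p_truth: float = 0.75) -> float:
    """Correct the noisy aggregate: observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# 10,000 individuals, 30% of whom truly have a sensitive attribute.
population = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(answer) for answer in population]

# The aggregate statistic is approximately preserved (close to 0.30),
# yet any single "yes" report may simply have been the coin flip.
print(estimate_true_fraction(reports))
```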
@@ -189,7 +190,9 @@
 The previously mentioned techniques are not used in isolation, but are usually stacked on top of
 eachother to mitigate the privacy risks that are relevant within the usecase.
 Typically, the process begins by anonymizing or pseudonymizing the data. With vantage6, the data
 is then placed in a federated setting. Then, the data is analyzed using federated learning,
-which may also include, for instance, MPC protocols to further safeguard data privacy.
+which may also include, for instance, MPC protocols to further safeguard data privacy. Additionally,
+noise may be added to the raw data before it is analyzed, using techniques from differential
+privacy.

 ## Data partitioning

@@ -212,7 +215,11 @@
 to be found to link identities across datasources. Vertical partitioning requires a different type
 of privacy enhancing algorithms than horizontal partitioning.

 Data can even be horizontally and vertically partitioned at the same time. In these cases, it may be
-necessary to combine multiple techniques.
+necessary to combine multiple techniques. For example, you might want to combine census data with
+medical records from hospitals. This is vertically partitioned because the census and medical features
+are stored in different places. On top of that, you might want to combine multiple hospitals that
+all store their records separately. Since the medical records are physically separated as well, it is
+horizontally partitioned.

 ![Horizontal and vertical partitioning refers to how data is separated](fig/chapter1/horizontal_vertical_partitioning.jpg)

From 8daa0cb3052b5c16dbb643e8aad26900f7cab8cd Mon Sep 17 00:00:00 2001
From: Djura Smits
Date: Tue, 6 Aug 2024 15:47:39 +0200
Subject: [PATCH 2/5] Added more suggestions

---
 episodes/chap1_introduction_FL.md | 48 ++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/episodes/chap1_introduction_FL.md b/episodes/chap1_introduction_FL.md
index 783de0fb..7a5f0c9e 100644
--- a/episodes/chap1_introduction_FL.md
+++ b/episodes/chap1_introduction_FL.md
@@ -87,21 +87,28 @@
 Internet Movie Database (IMDB).

 ## Federated data analysis

-There are different ways in which privacy risks can be mitigated. We will focus on the idea of
-federated analysis. In a federated setting, the data with the data owner, who keeps full control
-over it. In this case, it is not the data that travels, but the analysis itself. The system sends
-a query or instruction to the data and only the results will get back to the user.
+There are different ways in which privacy risks can be mitigated. For example, a well-known technique
+is to send the data to a trusted third party (TTP). In this case, the data is sent to a trusted neutral
+party. The data can then be analyzed at that location in a traditional way. There are issues with
+this technique, however. When the data is copied to the TTP, the original owner loses control over it. Another
+issue with it is that this technique results in a single point of failure. If the security at the TTP is
+breached, all the data it handled could be exposed.
+
+Instead, we will focus on the idea of federated analysis. In a federated setting, the data stays with the data owner,
+who keeps full control over it. In this case, it is not the data that travels, but the analysis itself.
+The system sends a query or instruction to the data and only the results will get back to the user.
 The results are often akin to a form of _aggregation_ of the data. This can be in the shape of
-traditional
-statistics like the mean, or it could be a combination of aggregations to form a more complicated
-analysis.
+traditional statistics like the mean, or it can be more intricate, like a machine learning model.

 Aggregating the data does not ensure complete protection of person-level information, but it
-certainly makes it less likely that this will happen.
+certainly makes it less likely that this will happen. It is usually the case that the larger
+the dataset, the lower the risk of data leakage. For instance, if you would like to take the mean of
+all records that fit a certain requirement, like age > 90, and there happens to be only one
+record that fits the requirement, the aggregation will be equal to the raw values.
+
 ![In federated analysis, an aggregation of the local data is sent to a central point. At the central point the local aggregations are combined.
 This can also be a small step in a larger analysis.](fig/chapter1/federated_sum.jpg)

-TODO: Example of data leakage in simple aggregated case
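The TODO removed above asked for an example of data leakage in the simple aggregated case. A minimal sketch of that scenario follows; the station data and the aggregation protocol are invented for illustration and are not vantage6 code.

```python
def local_aggregate(records: list[dict], min_age: int) -> tuple[int, int]:
    """Each data station returns only a (sum, count) aggregate, never raw rows."""
    ages = [r["age"] for r in records if r["age"] > min_age]
    return sum(ages), len(ages)

def federated_mean(stations: list[list[dict]], min_age: int) -> float:
    """A central party combines the local aggregates into a global mean."""
    total, count = 0, 0
    for records in stations:
        s, c = local_aggregate(records, min_age)
        total += s
        count += c
    return total / count

stations = [
    [{"age": 54}, {"age": 61}],
    [{"age": 47}, {"age": 93}],  # only one record older than 90 in total
]
print(federated_mean(stations, min_age=40))  # 63.75 -- a safe aggregate
print(federated_mean(stations, min_age=90))  # 93.0  -- equals one raw value!
```

The broad query only reveals an average over four people, but the query with `min_age=90` matches a single record, so the "aggregate" is that individual's raw age.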
 ## Federated learning
@@ ... @@
 privacy enhancing technologies, a lot of work goes into complying with regulations and building
 trust.

 Since these projects have a risk of affecting the privacy of individuals, a Data Protection Impact
-Assessment (DPIA)
-is usually required. This is a process that will help identify and minimize privacy risks of a
-project
-and is required by the GDPR.
+Assessment (DPIA) is usually required. This is a process that will help identify and minimize privacy risks of a
+project and is required by the GDPR.
 There is already a [DPIA available for vantage6](https://vantage6.ai/documents/28/DPIA_vantage6_version2023.pdf)

 Apart from procedures required by GDPR there might be other regulations in place enforced by the
-owners of the data (e.g. hospitals). The specific situation of a project can affect the way in which
+owners of the data (e.g. hospitals). For healthcare specifically, there will be the
+[European Health Data Space (EHDS)](https://health.ec.europa.eu/ehealth-digital-health-and-care/european-health-data-space_en).
+EHDS builds upon the GDPR and ensures the exchange and reuse of healthcare data in a safe and secure way.
+
+The specific situation of a project can affect the way in which
 the data is allowed to be processed. Some privacy enhancing technologies might be allowed in one
 project but prohibited in another. It is always important to stay transparent about privacy risks of
 the technologies you intend to use.

+Another factor in performing PET analysis is *data harmonization*. All parties need to make sure that
+their data is in the right shape to be combined. For instance, if you have two datasets where one
+stores the height of patients in a column "height" in centimeters, and another dataset where it is stored
+in meters in the column "patient_height", you cannot directly perform your analysis. You will need to make sure
+all datasets follow the same standards and schemas in order to process them together. There are a
+couple of data standards out there to help you with that, such as [OMOP CDM](https://www.ohdsi.org/data-standardization/)
+or [FHIR](https://hl7.org/fhir/) together with [SNOMED-CT](https://www.snomed.org/).
+
+In conclusion, PETs will help you to extract useful insights from sensitive data, but there are other factors
+to keep in mind as well.
+
 ::::::::::::::::::::::::::::::::::::: keypoints

 - TODO
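To make the height example from the data harmonization paragraph in the hunk above concrete, here is a small pandas sketch. The column names and unit conventions are the ones assumed in that paragraph; real harmonization would follow a standard such as OMOP CDM rather than ad-hoc renames.

```python
import pandas as pd

hospital_a = pd.DataFrame({"height": [172.0, 181.0]})        # centimeters
hospital_b = pd.DataFrame({"patient_height": [1.65, 1.90]})  # meters

# Map hospital B onto hospital A's schema: rename the column and
# convert meters to centimeters before combining the datasets.
hospital_b = hospital_b.rename(columns={"patient_height": "height"})
hospital_b["height"] = hospital_b["height"] * 100

combined = pd.concat([hospital_a, hospital_b], ignore_index=True)
print(combined["height"].mean())  # 177.0, in consistent units
```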
From e1f61980797d46e2c847bfab18b29860c60460ff Mon Sep 17 00:00:00 2001
From: Djura Smits
Date: Wed, 7 Aug 2024 08:59:20 +0200
Subject: [PATCH 3/5] Rephrased last sentence

---
 episodes/chap1_introduction_FL.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/episodes/chap1_introduction_FL.md b/episodes/chap1_introduction_FL.md
index 7a5f0c9e..2adcaf41 100644
--- a/episodes/chap1_introduction_FL.md
+++ b/episodes/chap1_introduction_FL.md
@@ -260,8 +260,8 @@
 all datasets follow the same standards and schemas in order to process them together. There are a
 couple of data standards out there to help you with that, such as [OMOP CDM](https://www.ohdsi.org/data-standardization/)
 or [FHIR](https://hl7.org/fhir/) together with [SNOMED-CT](https://www.snomed.org/).

-In conclusion, PETs will help you to extract useful insights from sensitive data, but there are other factors
-to keep in mind as well.
+In conclusion, PETs will help you to extract valuable insights from sensitive data, but they are only one aspect
+of the process. It is also important to consider other factors such as regulations and data preprocessing.

 ::::::::::::::::::::::::::::::::::::: keypoints

From babe03013ba118e9db93494b3bb3e7d83214c57f Mon Sep 17 00:00:00 2001
From: Djura Smits
Date: Wed, 7 Aug 2024 09:01:47 +0200
Subject: [PATCH 4/5] Removed r markdown link

---
 episodes/chap1_introduction_FL.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/episodes/chap1_introduction_FL.md b/episodes/chap1_introduction_FL.md
index 2adcaf41..62c40b09 100644
--- a/episodes/chap1_introduction_FL.md
+++ b/episodes/chap1_introduction_FL.md
@@ -268,5 +268,3 @@
 - TODO

 ::::::::::::::::::::::::::::::::::::::::::::::::
-
-[r-markdown]: https://rmarkdown.rstudio.com/

From aa7cb8ea38dcc7871d14563b627c83b2c39d7c79 Mon Sep 17 00:00:00 2001
From: Djura Smits
Date: Mon, 2 Sep 2024 16:47:56 +0200
Subject: [PATCH 5/5] Incorporated suggestions

---
 episodes/chap1_introduction_FL.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/episodes/chap1_introduction_FL.md b/episodes/chap1_introduction_FL.md
index 62c40b09..e12f3a7c 100644
--- a/episodes/chap1_introduction_FL.md
+++ b/episodes/chap1_introduction_FL.md
@@ -88,13 +88,12 @@
 Internet Movie Database (IMDB).

 ## Federated data analysis

 There are different ways in which privacy risks can be mitigated. For example, a well-known technique
-is to send the data to a trusted third party (TTP). In this case, the data is sent to a trusted neutral
-party. The data can then be analyzed at that location in a traditional way. There are issues with
-this technique, however. When the data is copied to the TTP, the original owner loses control over it. Another
-issue with it is that this technique results in a single point of failure. If the security at the TTP is
-breached, all the data it handled could be exposed.
+is to send the data to a trusted third party (TTP). The data can then be analyzed at that location in
+a traditional way. However, there are issues with this technique. When the data is copied to the TTP,
+the original owner loses control over it. Another issue with it is that this technique results in a
+single point of failure. If the security at the TTP is breached, all the data it handled could be exposed.

-Instead, we will focus on the idea of federated analysis. In a federated setting, the data stays with the data owner,
+In this course we will focus on federated analysis. In a federated setting, the data stays with the data owner,
 who keeps full control over it. In this case, it is not the data that travels, but the analysis itself.
 The system sends a query or instruction to the data and only the results will get back to the user.
 The results are often akin to a form of _aggregation_ of the data. This can be in the shape of
@@ ... @@
 Aggregating the data does not ensure complete protection of person-level information, but it
 certainly makes it less likely that this will happen. It is usually the case that the larger
 the dataset, the lower the risk of data leakage. For instance, if you would like to take the mean of
 all records that fit a certain requirement, like age > 90, and there happens to be only one
-record that fits the requirement, the aggregation will be equal to the raw values.
+record that fits the requirement, the aggregation will be equal to the one matching individual record.

 ![In federated analysis, an aggregation of the local data is sent to a central point. At the central point the local aggregations are combined.
 This can also be a small step in a larger analysis.](fig/chapter1/federated_sum.jpg)