diff --git a/docs/contents/contributors.html b/docs/contents/contributors.html index 4ce58639..f589d759 100644 --- a/docs/contents/contributors.html +++ b/docs/contents/contributors.html @@ -611,7 +611,7 @@

Contributors & Thanks

shanzehbatool
shanzehbatool

-kai4avaya
kai4avaya

+Kai Kleinbard
Kai Kleinbard

Elias Nuwara
Elias Nuwara

diff --git a/docs/contents/core/frameworks/frameworks.html b/docs/contents/core/frameworks/frameworks.html index fa1f510a..8184a659 100644 --- a/docs/contents/core/frameworks/frameworks.html +++ b/docs/contents/core/frameworks/frameworks.html @@ -1619,7 +1619,7 @@

6.9 Choosing the Right Framework

Choosing the right machine learning framework for a given application requires carefully evaluating models, hardware, and software considerations. Figure 6.13 provides a comparison of different TensorFlow frameworks, which we’ll discuss in more detail:

-
+
@@ -1639,7 +1639,7 @@

6.9.2 Software

As shown in Figure 6.14, TensorFlow Lite Micro does not have OS support, while TensorFlow and TensorFlow Lite do. This design choice helps TensorFlow Lite Micro reduce memory overhead, start up faster, and consume less energy. Instead, TensorFlow Lite Micro can be used in conjunction with real-time operating systems (RTOS) like FreeRTOS, Zephyr, and Mbed OS.

The figure also highlights an important memory management feature: TensorFlow Lite and TensorFlow Lite Micro support model memory mapping, allowing models to be directly accessed from flash storage rather than loaded into RAM. In contrast, TensorFlow does not offer this capability.

-
+
@@ -1655,7 +1655,7 @@

6.9.3 Hardware

TensorFlow Lite and TensorFlow Lite Micro have significantly smaller base binary sizes and memory footprints than TensorFlow (see Figure 6.15). For example, a typical TensorFlow Lite Micro binary is less than 200KB, whereas TensorFlow is much larger. These smaller footprints reflect the resource-constrained environments in which embedded systems operate. TensorFlow, by contrast, supports x86 CPUs, TPUs, and GPUs from vendors such as NVIDIA, AMD, and Intel.

-
+
@@ -1702,7 +1702,7 @@

6.10.1 Decomposition

Currently, the ML system stack consists of four abstractions as shown in Figure 6.16, namely (1) computational graphs, (2) tensor programs, (3) libraries and runtimes, and (4) hardware primitives.

-
+
diff --git a/docs/contents/core/privacy_security/privacy_security.html b/docs/contents/core/privacy_security/privacy_security.html index 3f97825b..34ae107f 100644 --- a/docs/contents/core/privacy_security/privacy_security.html +++ b/docs/contents/core/privacy_security/privacy_security.html @@ -1757,7 +1757,7 @@

Tradeoffs

-

Ready to unlock the power of encrypted computation? Homomorphic encryption is like a magic trick for your data! In this Colab, we’ll learn how to do calculations on secret numbers without ever revealing them. Imagine training a model on data you can’t even see – that’s the power of this mind-bending technology.

+

The power of encrypted computation is unlocked through homomorphic encryption – a transformative approach in which calculations are performed directly on encrypted data, ensuring privacy is preserved throughout the process. This Colab explores the principles of computing on encrypted numbers without exposing the underlying data. Imagine a scenario where a machine learning model is trained on data that cannot be directly accessed – such is the strength of homomorphic encryption.
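
For a concrete feel of additive homomorphism before opening the Colab, the minimal sketch below uses the open-source `python-paillier` (`phe`) package. This choice is an assumption made for illustration; the linked notebook may rely on a different scheme or library.

```python
# Minimal sketch of additively homomorphic encryption with python-paillier.
# Paillier is only *partially* homomorphic: ciphertexts support addition and
# multiplication by plaintext scalars, not ciphertext-by-ciphertext products.
from phe import paillier

# Key generation: the data owner keeps the private key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Encrypt two sensitive values.
enc_a = public_key.encrypt(12.5)
enc_b = public_key.encrypt(7.5)

# An untrusted party can compute on the ciphertexts without seeing the data.
enc_sum = enc_a + enc_b      # encrypted 12.5 + 7.5
enc_scaled = enc_a * 3       # encrypted 12.5 * 3 (plaintext scalar)

# Only the private-key holder can decrypt the results.
print(private_key.decrypt(enc_sum))     # 20.0
print(private_key.decrypt(enc_scaled))  # 37.5
```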

@@ -1768,8 +1768,8 @@

Tradeoffs

14.8.5 Secure Multiparty Communication

Core Idea

-

The overarching goal of Multi-Party Communication (MPC) is to enable different parties to jointly compute a function over their inputs while keeping those inputs private. For example, two organizations may want to collaborate on training a machine learning model by combining their respective data sets. Still, they cannot directly reveal that data due to Privacy or confidentiality constraints. MPC provides protocols and techniques that allow them to achieve the benefits of pooled data for model accuracy without compromising the privacy of each organization’s sensitive data.

-

At a high level, MPC works by carefully splitting the computation into parts that each party can execute independently using their private input. The results are then combined to reveal only the final output of the function and nothing about the intermediate values. Cryptographic techniques are used to guarantee that the partial results remain private provably.

+

Multi-Party Computation (MPC) enables multiple parties to jointly compute a function over their inputs while ensuring that each party’s inputs remain confidential. For instance, two organizations can collaborate on training a machine learning model by combining datasets without revealing sensitive information to each other. MPC protocols are essential in settings where privacy and confidentiality regulations restrict direct data sharing, such as healthcare and finance.

+

MPC divides computation into parts that each participant executes independently using their private data. These results are then combined to reveal only the final output, preserving the privacy of intermediate values. Cryptographic techniques provably guarantee that the partial results remain private.

Let’s take a simple example of an MPC protocol. One of the most basic MPC protocols is the secure addition of two numbers. Each party splits its input into random shares that are secretly distributed. They exchange the shares and locally compute the sum of the shares, which reconstructs the final sum without revealing the individual inputs. For example, if Alice has input x and Bob has input y:

  1. Alice generates random \(x_1\) and sets \(x_2 = x - x_1\)

  2. Bob generates random \(y_1\) and sets \(y_2 = y - y_1\)

@@ -1778,7 +1778,7 @@

    Core Idea

  3. Alice sends \(x_1\) to Bob, Bob sends \(y_1\) to Alice (keeping \(x_2\) and \(y_2\) secret)

  4. Alice computes \(x_2 + y_1 = s_1\), Bob computes \(x_1 + y_2 = s_2\)

  5. \(s_1 + s_2 = x + y\) is the final sum, without revealing \(x\) or \(y\).

-

Alice’s and Bob’s individual inputs (\(x\) and \(y\)) remain private, and each party only reveals one number associated with their original inputs. The random outputs ensure that no information about the original numbers disclosed.

+

Alice’s and Bob’s individual inputs (\(x\) and \(y\)) remain private, and each party only reveals one number associated with their original inputs. The random outputs ensure that no information about the original numbers is disclosed.
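
The protocol above can be written out in a few lines of plain Python. This is a toy sketch: share arithmetic is done modulo a large prime so that each share on its own is uniformly random, and the inputs and modulus below are illustrative choices.

```python
# Toy sketch of two-party secure addition via additive secret sharing,
# mirroring the Alice/Bob protocol described above.
import secrets

P = 2**61 - 1  # large prime modulus; each share alone reveals nothing

def split(value):
    """Split a value into two additive shares modulo P."""
    share1 = secrets.randbelow(P)
    share2 = (value - share1) % P
    return share1, share2

x, y = 42, 58                 # Alice's private x, Bob's private y

x1, x2 = split(x)             # Alice keeps x2 and sends x1 to Bob
y1, y2 = split(y)             # Bob keeps y2 and sends y1 to Alice

s1 = (x2 + y1) % P            # computed locally by Alice
s2 = (x1 + y2) % P            # computed locally by Bob

print((s1 + s2) % P)          # 100 == x + y, yet x and y were never exchanged
```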

Secure Comparison: Another basic operation is a secure comparison of two numbers, determining which is greater than the other. This can be done using techniques like Yao’s Garbled Circuits, where the comparison circuit is encrypted to allow joint evaluation of the inputs without leaking them.

Secure Matrix Multiplication: Matrix operations like multiplication are essential for machine learning. MPC techniques like additive secret sharing can be used to split matrices into random shares, compute products on the shares, and then reconstruct the result.
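
As an illustration of this idea, the sketch below multiplies two secret-shared matrices with the help of a Beaver-style multiplication triple supplied by an assumed trusted dealer. The small modulus and 2x2 matrices are toy choices for readability, not a hardened protocol.

```python
# Sketch: secure matrix multiplication on additive secret shares using a
# Beaver-style triple (A, B, C = A @ B) from a trusted dealer (assumed here).
import numpy as np

P = 65537  # small prime modulus, chosen so int64 arithmetic cannot overflow

def share(M):
    """Split an integer matrix into two additive shares modulo P."""
    r = np.random.randint(0, P, size=M.shape, dtype=np.int64)
    return r, (M - r) % P

X = np.array([[1, 2], [3, 4]], dtype=np.int64)   # party 0's private matrix
Y = np.array([[5, 6], [7, 8]], dtype=np.int64)   # party 1's private matrix
X0, X1 = share(X)
Y0, Y1 = share(Y)

# Dealer distributes shares of a random triple (A, B, C = A @ B).
A = np.random.randint(0, P, X.shape, dtype=np.int64)
B = np.random.randint(0, P, Y.shape, dtype=np.int64)
C = (A @ B) % P
A0, A1 = share(A); B0, B1 = share(B); C0, C1 = share(C)

# The parties open E = X - A and F = Y - B; these leak nothing because
# A and B act as uniformly random one-time pads.
E = ((X0 - A0) + (X1 - A1)) % P
F = ((Y0 - B0) + (Y1 - B1)) % P

# Each party computes its share of X @ Y locally (only party 0 adds E @ F).
Z0 = (C0 + E @ B0 + A0 @ F + E @ F) % P
Z1 = (C1 + E @ B1 + A1 @ F) % P

print((Z0 + Z1) % P)   # [[19 22] [43 50]]
print((X @ Y) % P)     # matches the reconstructed product
```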

Secure Model Training: Distributed machine learning training algorithms like federated averaging can be made secure using MPC. Model updates computed on partitioned data at each node are secretly shared between nodes and aggregated to train the global model without exposing individual updates.
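
To make the aggregation step concrete, the sketch below shows a simplified secure-aggregation scheme in which clients mask their updates with pairwise random vectors that cancel in the sum, so the server recovers only the aggregate. It is a toy version that assumes no client dropouts and uses made-up update vectors.

```python
# Toy sketch of secure aggregation for federated averaging: pairwise masks
# cancel out when the server sums the masked updates, so only the aggregate
# (never an individual update) is revealed.
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim = 3, 4

# Each client's private model update (e.g., a flattened gradient).
updates = [rng.normal(size=dim) for _ in range(num_clients)]

# Clients i < j agree (e.g., via a key exchange) on a shared random mask.
masks = {(i, j): rng.normal(size=dim)
         for i in range(num_clients) for j in range(i + 1, num_clients)}

def masked_update(i):
    m = updates[i].copy()
    for j in range(num_clients):
        if i < j:
            m += masks[(i, j)]   # add mask shared with a "later" client
        elif j < i:
            m -= masks[(j, i)]   # subtract mask shared with an "earlier" client
    return m

# The server sees only masked updates; the masks cancel in the sum.
aggregate = sum(masked_update(i) for i in range(num_clients)) / num_clients
print(np.allclose(aggregate, np.mean(updates, axis=0)))  # True
```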

@@ -1798,7 +1798,7 @@

Tradeoffs

  • In partially homomorphic encryption, each computation on ciphertexts requires costly public-key operations. Fully homomorphic encryption has even higher overheads.

  • Secret sharing divides data into multiple shares, so even basic operations require manipulating many shares.

  • Oblivious transfer and garbled circuits add masking and encryption to hide data access patterns and execution flows.

  • -
  • MPC systems require extensive communication and interaction between parties to compute on shares/ciphertexts jointly.

  • +
  • MPC systems require extensive communication and interaction between parties to jointly compute on shares/ciphertexts.

  • As a result, MPC protocols can slow down computations by 3-4 orders of magnitude compared to plain implementations. This becomes prohibitively expensive for large datasets and models. Therefore, training machine learning models on encrypted data using MPC remains infeasible today for realistic dataset sizes due to the overhead. Clever optimizations and approximations are needed to make MPC practical.

    Ongoing MPC research closes this efficiency gap through cryptographic advances, new algorithms, trusted hardware like SGX enclaves, and leveraging accelerators like GPUs/TPUs. However, in the foreseeable future, some degree of approximation and performance tradeoff is needed to scale MPC to meet the demands of real-world machine learning systems.

    @@ -1808,8 +1808,8 @@

    Tradeoffs

    14.8.6 Synthetic Data Generation

    Core Idea

    -

    Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically match the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.

    -

    The primary challenge of synthesizing data is to ensure adversaries are unable to re-identify the original dataset. A simple approach to achieving synthetic data is adding noise to the original dataset, which still risks privacy leakage. When noise is added to data in the context of differential privacy, sophisticated mechanisms based on the data’s sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential Privacy generally guarantees Privacy at some level, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.

    +

    Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically matches the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.

    +

    The primary challenge of synthesizing data is to ensure that adversaries cannot re-identify individuals in the original dataset. A simple approach is to add noise to the original dataset, but this alone still risks privacy leakage. When noise is added in the context of differential privacy, sophisticated mechanisms based on the data’s sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential privacy provides a quantifiable privacy guarantee, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data addresses multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.
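
To ground the noise-calibration idea, the toy sketch below fits a simple per-feature Gaussian to bounded "real" records, perturbs its summary statistics with Laplace noise scaled by sensitivity/epsilon, and then samples synthetic records from the perturbed model. The clipping bound, sensitivity, and epsilon values are illustrative assumptions, not a vetted differential-privacy analysis.

```python
# Toy sketch: release synthetic records from a generative model whose
# parameters were perturbed with Laplace noise calibrated to sensitivity/eps.
import numpy as np

rng = np.random.default_rng(42)

# "Real" sensitive records: 1000 individuals x 2 features, assumed clipped to [0, 1].
real = rng.uniform(0, 1, size=(1000, 2))
n = len(real)

epsilon = 1.0
sensitivity = 1.0 / n           # sensitivity of a mean of [0, 1]-bounded values
scale = sensitivity / epsilon   # Laplace scale b = sensitivity / epsilon

noisy_mean = real.mean(axis=0) + rng.laplace(0.0, scale, size=2)
noisy_std = np.abs(real.std(axis=0) + rng.laplace(0.0, scale, size=2))

# Sample artificial records that mimic the (noised) distribution rather than
# reproducing any individual real record.
synthetic = rng.normal(noisy_mean, noisy_std, size=(1000, 2))
print(synthetic[:3])
```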

    Researchers can freely share this synthetic data and collaborate on modeling without revealing private medical information. Well-constructed synthetic data protects Privacy while providing utility for developing accurate models. Key techniques to prevent reconstructing the original data include adding differential privacy noise during training, enforcing plausibility constraints, and using multiple diverse generative models. Here are some common approaches for generating synthetic data:

    • Generative Adversarial Networks (GANs): GANs are an AI algorithm used in unsupervised learning where two neural networks compete against each other in a game. Figure 14.16 is an overview of the GAN system. The generator network (big red box) is responsible for producing the synthetic data, and the discriminator network (yellow box) evaluates the authenticity of the data by distinguishing between fake data created by the generator network and the real data. The generator and discriminator networks learn and update their parameters based on the results. The discriminator acts as a metric on how similar the fake and real data are to one another. It is highly effective at generating realistic data and is a popular approach for generating synthetic data.
    @@ -1847,7 +1847,7 @@

      Benefits

      Tradeoffs

      While synthetic data tries to remove any evidence of the original dataset, privacy leakage is still a risk since the synthetic data mimics the original data. The statistical information and distribution are similar, if not the same, between the original and synthetic data. By resampling from the distribution, adversaries may still be able to recover the original training samples. Due to their inherent learning processes and complexities, neural networks might accidentally reveal sensitive information about the original training data.

      A core challenge with synthetic data is the potential gap between synthetic and real-world data distributions. Despite advancements in generative modeling techniques, synthetic data may only partially capture real data’s complexity, diversity, and nuanced patterns. This can limit the utility of synthetic data for robustly training machine learning models. Rigorously evaluating synthetic data quality through adversary methods and comparing model performance to real data benchmarks helps assess and improve fidelity. However, inherently, synthetic data remains an approximation.

      -

      Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential Privacy can help safeguard Privacy but come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.

      +

      Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable the reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential privacy can help safeguard privacy, but they come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, and this tradeoff must be balanced.

      Additional pitfalls of synthetic data include amplified biases, mislabeling, the computational overhead of training generative models, storage costs, and failure to account for out-of-distribution novel data. While these are secondary to the core synthetic-real gap and privacy risks, they remain important considerations when evaluating the suitability of synthetic data for particular machine-learning tasks. As with any technique, the advantages of synthetic data come with inherent tradeoffs and limitations that require thoughtful mitigation strategies.

    diff --git a/docs/search.json b/docs/search.json index c38dcdd8..453871bc 100644 --- a/docs/search.json +++ b/docs/search.json @@ -64,7 +64,7 @@ "href": "contents/contributors.html", "title": "Contributors & Thanks", "section": "", - "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, time, and support to improve both the content and codebase of this project. This includes not only those who have directly contributed through code and writing but also those who have helped by identifying issues, providing feedback, and offering suggestions. Below, you will find a list of all contributors. If you would like to contribute to this project, please visit our GitHub page for more information.\n\n\n\n\n\n\n\n\nVijay Janapa Reddi\n\n\nIkechukwu Uchendu\n\n\nNaeem Khoshnevis\n\n\nMarcelo Rovai\n\n\njasonjabbour\n\n\n\n\nDouwe den Blanken\n\n\nshanzehbatool\n\n\nkai4avaya\n\n\nElias Nuwara\n\n\nJared Ping\n\n\n\n\nSara Khosravi\n\n\nMatthew Stewart\n\n\nItai Shapira\n\n\nMaximilian Lam\n\n\nJayson Lin\n\n\n\n\nSophia Cho\n\n\nAndrea\n\n\nJeffrey Ma\n\n\nAlex Rodriguez\n\n\nKorneel Van den Berghe\n\n\n\n\nColby Banbury\n\n\nZishen Wan\n\n\nAbdulrahman Mahmoud\n\n\nSrivatsan Krishnan\n\n\nDivya Amirtharaj\n\n\n\n\nEmeka Ezike\n\n\nAghyad Deeb\n\n\nHaoran Qiu\n\n\nmarin-llobet\n\n\nEmil Njor\n\n\n\n\nAditi Raju\n\n\nJared Ni\n\n\nMichael Schnebly\n\n\noishib\n\n\nELSuitorHarvard\n\n\n\n\nHenry Bae\n\n\nJae-Won Chung\n\n\nYu-Shun Hsiao\n\n\nMark Mazumder\n\n\nMarco Zennaro\n\n\n\n\nEura Nofshin\n\n\nAndrew Bass\n\n\nPong Trairatvorakul\n\n\nJennifer Zhou\n\n\nShvetank Prakash\n\n\n\n\nAlex Oesterling\n\n\nArya Tschand\n\n\nBruno Scaglione\n\n\nGauri Jain\n\n\nAllen-Kuang\n\n\n\n\nFin Amin\n\n\nSercan Aygün\n\n\nFatima Shah\n\n\ngnodipac886\n\n\nAbenezer\n\n\n\n\nBaldassarre Cesarano\n\n\nEmmanuel Rassou\n\n\nBilge Acun\n\n\nyanjingl\n\n\nYang Zhou\n\n\n\n\nabigailswallow\n\n\nJason Yik\n\n\nhappyappledog\n\n\nCurren Iyer\n\n\nJessica Quaye\n\n\n\n\nSonia Murthy\n\n\nShreya Johri\n\n\nVijay Edupuganti\n\n\nThe Random DIY\n\n\nCostin-Andrei Oncescu\n\n\n\n\nAnnie Laurie Cook\n\n\nJothi Ramaswamy\n\n\nBatur Arslan\n\n\nFatima Shah\n\n\na-saraf\n\n\n\n\nsonghan\n\n\nZishen", + "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, time, and support to improve both the content and codebase of this project. This includes not only those who have directly contributed through code and writing but also those who have helped by identifying issues, providing feedback, and offering suggestions. Below, you will find a list of all contributors. 
If you would like to contribute to this project, please visit our GitHub page for more information.\n\n\n\n\n\n\n\n\nVijay Janapa Reddi\n\n\nIkechukwu Uchendu\n\n\nNaeem Khoshnevis\n\n\nMarcelo Rovai\n\n\njasonjabbour\n\n\n\n\nDouwe den Blanken\n\n\nshanzehbatool\n\n\nKai Kleinbard\n\n\nElias Nuwara\n\n\nJared Ping\n\n\n\n\nSara Khosravi\n\n\nMatthew Stewart\n\n\nItai Shapira\n\n\nMaximilian Lam\n\n\nJayson Lin\n\n\n\n\nSophia Cho\n\n\nAndrea\n\n\nJeffrey Ma\n\n\nAlex Rodriguez\n\n\nKorneel Van den Berghe\n\n\n\n\nColby Banbury\n\n\nZishen Wan\n\n\nAbdulrahman Mahmoud\n\n\nSrivatsan Krishnan\n\n\nDivya Amirtharaj\n\n\n\n\nEmeka Ezike\n\n\nAghyad Deeb\n\n\nHaoran Qiu\n\n\nmarin-llobet\n\n\nEmil Njor\n\n\n\n\nAditi Raju\n\n\nJared Ni\n\n\nMichael Schnebly\n\n\noishib\n\n\nELSuitorHarvard\n\n\n\n\nHenry Bae\n\n\nJae-Won Chung\n\n\nYu-Shun Hsiao\n\n\nMark Mazumder\n\n\nMarco Zennaro\n\n\n\n\nEura Nofshin\n\n\nAndrew Bass\n\n\nPong Trairatvorakul\n\n\nJennifer Zhou\n\n\nShvetank Prakash\n\n\n\n\nAlex Oesterling\n\n\nArya Tschand\n\n\nBruno Scaglione\n\n\nGauri Jain\n\n\nAllen-Kuang\n\n\n\n\nFin Amin\n\n\nSercan Aygün\n\n\nFatima Shah\n\n\ngnodipac886\n\n\nAbenezer\n\n\n\n\nBaldassarre Cesarano\n\n\nEmmanuel Rassou\n\n\nBilge Acun\n\n\nyanjingl\n\n\nYang Zhou\n\n\n\n\nabigailswallow\n\n\nJason Yik\n\n\nhappyappledog\n\n\nCurren Iyer\n\n\nJessica Quaye\n\n\n\n\nSonia Murthy\n\n\nShreya Johri\n\n\nVijay Edupuganti\n\n\nThe Random DIY\n\n\nCostin-Andrei Oncescu\n\n\n\n\nAnnie Laurie Cook\n\n\nJothi Ramaswamy\n\n\nBatur Arslan\n\n\nFatima Shah\n\n\na-saraf\n\n\n\n\nsonghan\n\n\nZishen", "crumbs": [ "Contributors & Thanks" ] @@ -1424,7 +1424,7 @@ "href": "contents/core/privacy_security/privacy_security.html#privacy-preserving-ml-techniques", "title": "14  Security & Privacy", "section": "14.8 Privacy-Preserving ML Techniques", - "text": "14.8 Privacy-Preserving ML Techniques\nMany techniques have been developed to preserve privacy, each addressing different aspects and data security challenges. These methods can be broadly categorized into several key areas: Differential Privacy, which focuses on statistical privacy in data outputs; Federated Learning, emphasizing decentralized data processing; Homomorphic Encryption and Secure Multi-party Computation (SMC), both enabling secure computations on encrypted or private data; Data Anonymization and Data Masking and Obfuscation, which alter data to protect individual identities; Private Set Intersection and Zero-Knowledge Proofs, facilitating secure data comparisons and validations; Decentralized Identifiers (DIDs) for self-sovereign digital identities; Privacy-Preserving Record Linkage (PPRL), linking data across sources without exposure; Synthetic Data Generation, creating artificial datasets for safe analysis; and Adversarial Learning Techniques, enhancing data or model resistance to privacy attacks.\nGiven the extensive range of these techniques, it is not feasible to dive into each in depth within a single course or discussion, let alone for anyone to know it all in its glorious detail. Therefore, we will explore a few specific techniques in relative detail, providing a deeper understanding of their principles, applications, and the unique privacy challenges they address in machine learning. 
This focused approach will give us a more comprehensive and practical understanding of key privacy-preserving methods in modern ML systems.\n\n14.8.1 Differential Privacy\n\nCore Idea\nDifferential Privacy is a framework for quantifying and managing the privacy of individuals in a dataset (Dwork et al. 2006). It provides a mathematical guarantee that the privacy of individuals in the dataset will not be compromised, regardless of any additional knowledge an attacker may possess. The core idea of differential Privacy is that the outcome of any analysis (like a statistical query) should be essentially the same, whether any individual’s data is included in the dataset or not. This means that by observing the analysis result, one cannot determine whether any individual’s data was used in the computation.\n\nDwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. “Calibrating Noise to Sensitivity in Private Data Analysis.” In Theory of Cryptography, edited by Shai Halevi and Tal Rabin, 265–84. Berlin, Heidelberg: Springer Berlin Heidelberg.\nFor example, let’s say a database contains medical records for 10 patients. We want to release statistics about the prevalence of diabetes in this sample without revealing one patient’s condition. To do this, we could add a small amount of random noise to the true count before releasing it. If the true number of diabetes patients is 6, we might add noise from a Laplace distribution to randomly output 5, 6, or 7 each with some probability. An observer now can’t tell if any single patient has diabetes based only on the noisy output. The query result looks similar to whether each patient’s data is included or excluded. This is differential Privacy. More formally, a randomized algorithm satisfies ε-differential Privacy if, for any neighbor databases D and Dʹ differing by only one entry, the probability of any outcome changes by at most a factor of ε. A lower ε provides stronger privacy guarantees.\nThe Laplace Mechanism is one of the most straightforward and commonly used methods to achieve differential Privacy. It involves adding noise that follows a Laplace distribution to the data or query results. Apart from the Laplace Mechanism, the general principle of adding noise is central to differential Privacy. The idea is to add random noise to the data or the results of a query. The noise is calibrated to ensure the necessary privacy guarantee while keeping the data useful.\nWhile the Laplace distribution is common, other distributions like Gaussian can also be used. Laplace noise is used for strict ε-differential Privacy for low-sensitivity queries. In contrast, Gaussian distributions can be used when Privacy is not guaranteed, known as (ϵ, 𝛿)-Differential Privacy. In this relaxed version of differential Privacy, epsilon and delta define the amount of Privacy guaranteed when releasing information or a model related to a dataset. Epsilon sets a bound on how much information can be learned about the data based on the output. At the same time, delta allows for a small probability of the privacy guarantee to be violated. 
The choice between Laplace, Gaussian, and other distributions will depend on the specific requirements of the query and the dataset and the tradeoff between Privacy and accuracy.\nTo illustrate the tradeoff of Privacy and accuracy in (\\(\\epsilon\\), \\(\\delta\\))-differential Privacy, the following graphs in Figure 14.12 show the results on accuracy for different noise levels on the MNIST dataset, a large dataset of handwritten digits (Abadi et al. 2016). The delta value (black line; right y-axis) denotes the level of privacy relaxation (a high value means Privacy is less stringent). As Privacy becomes more relaxed, the accuracy of the model increases.\n\n\n\n\n\n\nFigure 14.12: Privacy-accuracy tradeoff. Source: Abadi et al. (2016).\n\n\nAbadi, Martin, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. “Deep Learning with Differential Privacy.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–18. CCS ’16. New York, NY, USA: ACM. https://doi.org/10.1145/2976749.2978318.\n\n\nThe key points to remember about differential Privacy are the following:\n\nAdding Noise: The fundamental technique in differential Privacy is adding controlled random noise to the data or query results. This noise masks the contribution of individual data points.\nBalancing Act: There’s a balance between Privacy and accuracy. More noise (lower ϵ) in the data means higher Privacy but less accuracy in the model’s results.\nUniversality: Differential Privacy doesn’t rely on assumptions about what an attacker knows. This makes it robust against re-identification attacks, where an attacker tries to uncover individual data.\nApplicability: It can be applied to various types of data and queries, making it a versatile tool for privacy-preserving data analysis.\n\n\n\nTradeoffs\nThere are several tradeoffs to make with differential Privacy, as is the case with any algorithm. But let’s focus on the computational-specific tradeoffs since we care about ML systems. There are some key computational considerations and tradeoffs when implementing differential Privacy in a machine-learning system:\nNoise generation: Implementing differential Privacy introduces several important computational tradeoffs compared to standard machine learning techniques. One major consideration is the need to securely generate random noise from distributions like Laplace or Gaussian that get added to query results and model outputs. High-quality cryptographic random number generation can be computationally expensive.\nSensitivity analysis: Another key requirement is rigorously tracking the sensitivity of the underlying algorithms to single data points getting added or removed. This global sensitivity analysis is required to calibrate the noise levels properly. However, analyzing worst-case sensitivity can substantially increase computational complexity for complex model training procedures and data pipelines.\nPrivacy budget management: Managing the privacy loss budget across multiple queries and learning iterations is another bookkeeping overhead. The system must keep track of cumulative privacy costs and compose them to explain overall privacy guarantees. This adds a computational burden beyond just running queries or training models.\nBatch vs. 
online tradeoffs: For online learning systems with continuous high-volume queries, differentially private algorithms require new mechanisms to maintain utility and prevent too much accumulated privacy loss since each query can potentially alter the privacy budget. Batch offline processing is simpler from a computational perspective as it processes data in large batches, where each batch is treated as a single query. High-dimensional sparse data also increases sensitivity analysis challenges.\nDistributed training: When training models using distributed or federated approaches, new cryptographic protocols are needed to track and bound privacy leakage across nodes. Secure multiparty computation with encrypted data for differential Privacy adds substantial computational load.\nWhile differential Privacy provides strong formal privacy guarantees, implementing it rigorously requires additions and modifications to the machine learning pipeline at a computational cost. Managing these overheads while preserving model accuracy remains an active research area.\n\n\nCase Study\nApple’s implementation of differential Privacy in iOS and MacOS provides a prominent real-world example of how differential Privacy can be deployed at large scale. Apple wanted to collect aggregated usage statistics across their ecosystem to improve products and services, but aimed to do so without compromising individual user privacy.\nTo achieve this, they implemented differential privacy techniques directly on user devices to anonymize data points before sending them to Apple servers. Specifically, Apple uses the Laplace mechanism to inject carefully calibrated random noise. For example, suppose a user’s location history contains [Work, Home, Work, Gym, Work, Home]. In that case, the differentially private version might replace the exact locations with a noisy sample like [Gym, Home, Work, Work, Home, Work].\nApple tunes the Laplace noise distribution to provide a high level of Privacy while preserving the utility of aggregated statistics. Increasing noise levels provides stronger privacy guarantees (lower ε values in DP terminology) but can reduce data utility. Apple’s privacy engineers empirically optimized this tradeoff based on their product goals.\nApple obtains high-fidelity aggregated statistics by aggregating hundreds of millions of noisy data points from devices. For instance, they can analyze new iOS apps’ features while masking any user’s app behaviors. On-device computation avoids sending raw data to Apple servers.\nThe system uses hardware-based secure random number generation to sample from the Laplace distribution on devices efficiently. Apple also had to optimize its differentially private algorithms and pipeline to operate under the computational constraints of consumer hardware.\nMultiple third-party audits have verified that Apple’s system provides rigorous differential privacy protections in line with their stated policies. Of course, assumptions around composition over time and potential re-identification risks still apply. Apple’s deployment shows how differential Privacy can be realized in large real-world products when backed by sufficient engineering resources.\n\n\n\n\n\n\nExercise 14.1: Differential Privacy - TensorFlow Privacy\n\n\n\n\n\nWant to train an ML model without compromising anyone’s secrets? Differential Privacy is like a superpower for your data! In this Colab, we’ll use TensorFlow Privacy to add special noise during training. 
This makes it way harder for anyone to determine if a single person’s data was used, even if they have sneaky ways of peeking at the model.\n\n\n\n\n\n\n\n14.8.2 Federated Learning\n\nCore Idea\nFederated Learning (FL) is a type of machine learning in which a model is built and distributed across multiple devices or servers while keeping the training data localized. It was previously discussed in the Model Optimizations chapter. Still, we will recap it here briefly to complete it and focus on things that pertain to this chapter.\nFL trains machine learning models across decentralized networks of devices or systems while keeping all training data localized. Figure 14.13 illustrates this process: each participating device leverages its local data to calculate model updates, which are then aggregated to build an improved global model. However, the raw training data is never directly shared, transferred, or compiled. This privacy-preserving approach allows for the joint development of ML models without centralizing the potentially sensitive training data in one place.\n\n\n\n\n\n\nFigure 14.13: Federated Learning lifecycle. Source: Jin et al. (2020).\n\n\nJin, Yilun, Xiguang Wei, Yang Liu, and Qiang Yang. 2020. “Towards Utilizing Unlabeled Data in Federated Learning: A Survey and Prospective.” arXiv Preprint arXiv:2002.11545.\n\n\nOne of the most common model aggregation algorithms is Federated Averaging (FedAvg), where the global model is created by averaging all of the parameters from local parameters. While FedAvg works well with independent and identically distributed data (IID), alternate algorithms like Federated Proximal (FedProx) are crucial in real-world applications where data is often non-IID. FedProx is designed for the FL process when there is significant heterogeneity in the client updates due to diverse data distributions across devices, computational capabilities, or varied amounts of data.\nBy leaving the raw data distributed and exchanging only temporary model updates, federated learning provides a more secure and privacy-enhancing alternative to traditional centralized machine learning pipelines. This allows organizations and users to benefit collaboratively from shared models while maintaining control and ownership over sensitive data. The decentralized nature of FL also makes it robust to single points of failure.\nImagine a group of hospitals that want to collaborate on a study to predict patient outcomes based on their symptoms. However, they cannot share their patient data due to privacy concerns and regulations like HIPAA. Here’s how Federated Learning can help.\n\nLocal Training: Each hospital trains a machine learning model on patient data. This training happens locally, meaning the data never leaves the hospital’s servers.\nModel Sharing: After training, each hospital only sends the model (specifically, its parameters or weights ) to a central server. It does not send any patient data.\nAggregating Models: The central server aggregates these models from all hospitals into a single, more robust model. This process typically involves averaging the model parameters.\nBenefit: The result is a machine learning model that has learned from a wide range of patient data without sharing sensitive data or removing it from its original location.\n\n\n\nTradeoffs\nThere are several system performance-related aspects of FL in machine learning systems. It would be wise to understand these tradeoffs because there is no “free lunch” for preserving Privacy through FL (Li et al. 
2020).\n\nLi, Tian, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. “Federated Learning: Challenges, Methods, and Future Directions.” IEEE Signal Process Mag. 37 (3): 50–60. https://doi.org/10.1109/msp.2020.2975749.\nCommunication Overhead and Network Constraints: In FL, one of the most significant challenges is managing the communication overhead. This involves the frequent transmission of model updates between a central server and numerous client devices, which can be bandwidth-intensive. The total number of communication rounds and the size of transmitted messages per round need to be reduced to minimize communication further. This can lead to substantial network traffic, especially in scenarios with many participants. Additionally, latency becomes a critical factor — the time taken for these updates to be sent, aggregated, and redistributed can introduce delays. This affects the overall training time and impacts the system’s responsiveness and real-time capabilities. Managing this communication while minimizing bandwidth usage and latency is crucial for implementing FL.\nComputational Load on Local Devices: FL relies on client devices (like smartphones or IoT devices, which especially matter in TinyML) for model training, which often have limited computational power and battery life. Running complex machine learning algorithms locally can strain these resources, leading to potential performance issues. Moreover, the capabilities of these devices can vary significantly, resulting in uneven contributions to the model training process. Some devices process updates faster and more efficiently than others, leading to disparities in the learning process. Balancing the computational load to ensure consistent participation and efficiency across all devices is a key challenge in FL.\nModel Training Efficiency: FL’s decentralized nature can impact model training’s efficiency. Achieving convergence, where the model no longer significantly improves, can be slower in FL than in centralized training methods. This is particularly true in cases where the data is non-IID (non-independent and identically distributed) across devices. Additionally, the algorithms used for aggregating model updates play a critical role in the training process. Their efficiency directly affects the speed and effectiveness of learning. Developing and implementing algorithms that can handle the complexities of FL while ensuring timely convergence is essential for the system’s performance.\nScalability Challenges: Scalability is a significant concern in FL, especially as the number of participating devices increases. Managing and coordinating model updates from many devices adds complexity and can strain the system. Ensuring that the system architecture can efficiently handle this increased load without degrading performance is crucial. This involves not just handling the computational and communication aspects but also maintaining the quality and consistency of the model as the scale of the operation grows. A key challenge is designing FL systems that scale effectively while maintaining performance.\nData Synchronization and Consistency: Ensuring data synchronization and maintaining model consistency across all participating devices in FL is challenging. Keeping all devices synchronized with the latest model version can be difficult in environments with intermittent connectivity or devices that go offline periodically. 
Furthermore, maintaining consistency in the learned model, especially when dealing with a wide range of devices with different data distributions and update frequencies, is crucial. This requires sophisticated synchronization and aggregation strategies to ensure that the final model accurately reflects the learnings from all devices.\nEnergy Consumption: The energy consumption of client devices in FL is a critical factor, particularly for battery-powered devices like smartphones and other TinyML/IoT devices. The computational demands of training models locally can lead to significant battery drain, which might discourage continuous participation in the FL process. Balancing the computational requirements of model training with energy efficiency is essential. This involves optimizing algorithms and training processes to reduce energy consumption while achieving effective learning outcomes. Ensuring energy-efficient operation is key to user acceptance and the sustainability of FL systems.\n\n\nCase Studies\nHere are a couple of real-world case studies that can illustrate the use of federated learning:\n\nGoogle Gboard\nGoogle uses federated learning to improve predictions on its Gboard mobile keyboard app. The app runs a federated learning algorithm on users’ devices to learn from their local usage patterns and text predictions while keeping user data private. The model updates are aggregated in the cloud to produce an enhanced global model. This allows for providing next-word predictions personalized to each user’s typing style while avoiding directly collecting sensitive typing data. Google reported that the federated learning approach reduced prediction errors by 25% compared to the baseline while preserving Privacy.\n\n\nHealthcare Research\nThe UK Biobank and American College of Cardiology combined datasets to train a model for heart arrhythmia detection using federated learning. The datasets could not be combined directly due to legal and Privacy restrictions. Federated learning allowed collaborative model development without sharing protected health data, with only model updates exchanged between the parties. This improved model accuracy as it could leverage a wider diversity of training data while meeting regulatory requirements.\n\n\nFinancial Services\nBanks are exploring using federated learning for anti-money laundering (AML) detection models. Multiple banks could jointly improve AML Models without sharing confidential customer transaction data with competitors or third parties. Only the model updates need to be aggregated rather than raw transaction data. This allows access to richer training data from diverse sources while avoiding regulatory and confidentiality issues around sharing sensitive financial customer data.\nThese examples demonstrate how federated learning provides tangible privacy benefits and enables collaborative ML in settings where direct data sharing is impossible.\n\n\n\n\n14.8.3 Machine Unlearning\n\nCore Idea\nMachine unlearning is a fairly new process that describes how the influence of a subset of training data can be removed from the model. Several methods have been used to perform machine unlearning and remove the influence of a subset of training data from the final model. A baseline approach might consist of simply fine-tuning the model for more epochs on just the data that should be remembered to decrease the influence of the data “forgotten” by the model. 
Since this approach doesn’t explicitly remove the influence of data that should be erased, membership inference attacks are still possible, so researchers have adopted other approaches to unlearn data from a model explicitly. One type of approach that researchers have adopted includes adjusting the model loss function to treat the losses of the “forget set explicitly” (data to be unlearned) and the “retain set” (remaining data that should still be remembered) differently (Tarun et al. 2022; Khan and Swaroop 2021). Figure 14.14 illustrates some of the applications of Machine-unlearning.\n\nTarun, Ayush K, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. 2022. “Deep Regression Unlearning.” ArXiv Preprint abs/2210.08196. https://arxiv.org/abs/2210.08196.\n\nKhan, Mohammad Emtiyaz, and Siddharth Swaroop. 2021. “Knowledge-Adaptation Priors.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, edited by Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, 19757–70. https://proceedings.neurips.cc/paper/2021/hash/a4380923dd651c195b1631af7c829187-Abstract.html.\n\n\n\n\n\n\nFigure 14.14: Applications of Machine Unlearning. Source: BBVA OpenMind\n\n\n\n\n\nCase Study\nSome researchers have demonstrated a real-life example of machine unlearning approaches applied to SOTA machine learning models through training an LLM, LLaMA2-7b, to unlearn any references to Harry Potter (Eldan and Russinovich 2023). Though this model took 184K GPU hours to pre-train, it only took 1 GPU hour of fine-tuning to erase the model’s ability to generate or recall Harry Potter-related content without noticeably compromising the accuracy of generating content unrelated to Harry Potter. Figure 14.15 demonstrates how the model output changes before (Llama-7b-chat-hf column) and after (Finetuned Llama-b column) unlearning has occurred.\n\n\n\n\n\n\nFigure 14.15: Llama unlearning Harry Potter. Source: Eldan and Russinovich (2023).\n\n\nEldan, Ronen, and Mark Russinovich. 2023. “Who’s Harry Potter? Approximate Unlearning in LLMs.” ArXiv Preprint abs/2310.02238. https://arxiv.org/abs/2310.02238.\n\n\n\n\nOther Uses\n\nRemoving adversarial data\nDeep learning models have previously been shown to be vulnerable to adversarial attacks, in which the attacker generates adversarial data similar to the original training data, where a human cannot tell the difference between the real and fabricated data. The adversarial data results in the model outputting incorrect predictions, which could have detrimental consequences in various applications, including healthcare diagnosis predictions. Machine unlearning has been used to unlearn the influence of adversarial data to prevent these incorrect predictions from occurring and causing any harm.\n\n\n\n\n14.8.4 Homomorphic Encryption\n\nCore Idea\nHomomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, generating an encrypted result that, when decrypted, matches the result of operations performed on the plaintext. For example, multiplying two numbers encrypted with homomorphic encryption produces an encrypted product that decrypts the actual product of the two numbers. 
This means that data can be processed in an encrypted form, and only the resulting output needs to be decrypted, significantly enhancing data security, especially for sensitive information.\nHomomorphic encryption enables outsourced computation on encrypted data without exposing the data itself to the external party performing the operations. However, only certain computations like addition and multiplication are supported in partially homomorphic schemes. Fully homomorphic encryption (FHE) that can handle any computation is even more complex. The number of possible operations is limited before noise accumulation corrupts the ciphertext.\nTo use homomorphic encryption across different entities, carefully generated public keys must be exchanged for operations across separately encrypted data. This advanced encryption technique enables previously impossible secure computation paradigms but requires expertise to implement correctly for real-world systems.\n\n\nBenefits\nHomomorphic encryption enables machine learning model training and inference on encrypted data, ensuring that sensitive inputs and intermediate values remain confidential. This is critical in healthcare, finance, genetics, and other domains, which are increasingly relying on ML to analyze sensitive and regulated data sets containing billions of personal records.\nHomomorphic encryption thwarts attacks like model extraction and membership inference that could expose private data used in ML workflows. It provides an alternative to TEEs using hardware enclaves for confidential computing. However, current schemes have high computational overheads and algorithmic limitations that constrain real-world applications.\nHomomorphic encryption realizes the decades-old vision of secure multiparty computation by allowing computation on ciphertexts. Conceptualized in the 1970s, the first fully homomorphic cryptosystems emerged in 2009, enabling arbitrary computations. Ongoing research is making these techniques more efficient and practical.\nHomomorphic encryption shows great promise in enabling privacy-preserving machine learning under emerging data regulations. However, given constraints, one should carefully evaluate its applicability against other confidential computing approaches. Extensive resources exist to explore homomorphic encryption and track progress in easing adoption barriers.\n\n\nMechanics\n\nData Encryption: Before data is processed or sent to an ML model, it is encrypted using a homomorphic encryption scheme and public key. For example, encrypting numbers \\(x\\) and \\(y\\) generates ciphertexts \\(E(x)\\) and \\(E(y)\\).\nComputation on Ciphertext: The ML algorithm processes the encrypted data directly. For instance, multiplying the ciphertexts \\(E(x)\\) and \\(E(y)\\) generates \\(E(xy)\\). More complex model training can also be done on ciphertexts.\nResult Encryption: The result \\(E(xy)\\) remains encrypted and can only be decrypted by someone with the corresponding private key to reveal the actual product \\(xy\\).\n\nOnly authorized parties with the private key can decrypt the final outputs, protecting the intermediate state. However, noise accumulates with each operation, preventing further computation without decryption.\nBeyond healthcare, homomorphic encryption enables confidential computing for applications like financial fraud detection, insurance analytics, genetics research, and more. It offers an alternative to techniques like multiparty computation and TEEs. 
Ongoing research improves the efficiency and capabilities.\nTools like HElib, SEAL, and TensorFlow HE provide libraries for exploring implementing homomorphic encryption in real-world machine learning pipelines.\n\n\nTradeoffs\nFor many real-time and embedded applications, fully homomorphic encryption remains impractical for the following reasons.\nComputational Overhead: Homomorphic encryption imposes very high computational overheads, often resulting in slowdowns of over 100x for real-world ML applications. This makes it impractical for many time-sensitive or resource-constrained uses. Optimized hardware and parallelization can alleviate but not eliminate this issue.\nComplexity of Implementation The sophisticated algorithms require deep expertise in cryptography to be implemented correctly. Nuances like format compatibility with floating point ML models and scalable key management pose hurdles. This complexity hinders widespread practical adoption.\nAlgorithmic Limitations: Current schemes restrict the functions and depth of computations supported, limiting the models and data volumes that can be processed. Ongoing research is pushing these boundaries, but restrictions remain.\nHardware Acceleration: Homomorphic encryption requires specialized hardware, such as secure processors or coprocessors with TEEs, which adds design and infrastructure costs.\nHybrid Designs: Rather than encrypting entire workflows, selective application of homomorphic encryption to critical subcomponents can achieve protection while minimizing overheads.\n\n\n\n\n\n\nExercise 14.2: Homomorphic Encryption\n\n\n\n\n\nReady to unlock the power of encrypted computation? Homomorphic encryption is like a magic trick for your data! In this Colab, we’ll learn how to do calculations on secret numbers without ever revealing them. Imagine training a model on data you can’t even see – that’s the power of this mind-bending technology.\n\n\n\n\n\n\n\n14.8.5 Secure Multiparty Communication\n\nCore Idea\nThe overarching goal of Multi-Party Communication (MPC) is to enable different parties to jointly compute a function over their inputs while keeping those inputs private. For example, two organizations may want to collaborate on training a machine learning model by combining their respective data sets. Still, they cannot directly reveal that data due to Privacy or confidentiality constraints. MPC provides protocols and techniques that allow them to achieve the benefits of pooled data for model accuracy without compromising the privacy of each organization’s sensitive data.\nAt a high level, MPC works by carefully splitting the computation into parts that each party can execute independently using their private input. The results are then combined to reveal only the final output of the function and nothing about the intermediate values. Cryptographic techniques are used to guarantee that the partial results remain private provably.\nLet’s take a simple example of an MPC protocol. One of the most basic MPC protocols is the secure addition of two numbers. Each party splits its input into random shares that are secretly distributed. They exchange the shares and locally compute the sum of the shares, which reconstructs the final sum without revealing the individual inputs. 
14.8 Privacy-Preserving ML Techniques

Many techniques have been developed to preserve privacy, each addressing different aspects and data security challenges. These methods can be broadly categorized into several key areas: Differential Privacy, which focuses on statistical privacy in data outputs; Federated Learning, emphasizing decentralized data processing; Homomorphic Encryption and Secure Multi-party Computation (SMC), both enabling secure computations on encrypted or private data; Data Anonymization, and Data Masking and Obfuscation, which alter data to protect individual identities; Private Set Intersection and Zero-Knowledge Proofs, facilitating secure data comparisons and validations; Decentralized Identifiers (DIDs) for self-sovereign digital identities; Privacy-Preserving Record Linkage (PPRL), linking data across sources without exposure; Synthetic Data Generation, creating artificial datasets for safe analysis; and Adversarial Learning Techniques, enhancing data or model resistance to privacy attacks.

Given the extensive range of these techniques, it is not feasible to cover each one in depth here. Therefore, we will explore a few specific techniques in relative detail, providing a deeper understanding of their principles, applications, and the unique privacy challenges they address in machine learning. This focused approach will give us a more comprehensive and practical understanding of key privacy-preserving methods in modern ML systems.

14.8.1 Differential Privacy

Core Idea

Differential Privacy is a framework for quantifying and managing the privacy of individuals in a dataset (Dwork et al. 2006). It provides a mathematical guarantee that the privacy of individuals in the dataset will not be compromised, regardless of any additional knowledge an attacker may possess.
The core idea of differential privacy is that the outcome of any analysis (like a statistical query) should be essentially the same whether any individual's data is included in the dataset or not. This means that by observing the analysis result, one cannot determine whether any individual's data was used in the computation.

Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. “Calibrating Noise to Sensitivity in Private Data Analysis.” In Theory of Cryptography, edited by Shai Halevi and Tal Rabin, 265–84. Berlin, Heidelberg: Springer Berlin Heidelberg.

For example, let's say a database contains medical records for 10 patients. We want to release statistics about the prevalence of diabetes in this sample without revealing any single patient's condition. To do this, we could add a small amount of random noise to the true count before releasing it. If the true number of diabetes patients is 6, we might add noise from a Laplace distribution to randomly output 5, 6, or 7, each with some probability. An observer now cannot tell whether any single patient has diabetes based only on the noisy output; the query result looks essentially the same whether any given patient's data is included or excluded. This is differential privacy. More formally, a randomized algorithm satisfies ε-differential privacy if, for any neighboring databases D and D′ differing by only one entry, the probability of any outcome changes by at most a multiplicative factor of \(e^{\epsilon}\). A lower ε provides stronger privacy guarantees.

The Laplace Mechanism is one of the most straightforward and commonly used methods to achieve differential privacy. It involves adding noise drawn from a Laplace distribution to the data or query results. Apart from the Laplace Mechanism, the general principle of adding noise is central to differential privacy: random noise is added to the data or the results of a query, calibrated to ensure the necessary privacy guarantee while keeping the data useful.

While the Laplace distribution is common, other distributions like the Gaussian can also be used. Laplace noise provides strict ε-differential privacy for low-sensitivity queries. In contrast, Gaussian noise is used for a relaxed guarantee known as (ϵ, 𝛿)-differential privacy. In this relaxed version of differential privacy, epsilon and delta together define the amount of privacy guaranteed when releasing information or a model derived from a dataset. Epsilon bounds how much information can be learned about the data from the output, while delta allows a small probability that the privacy guarantee is violated. The choice between Laplace, Gaussian, and other distributions depends on the specific requirements of the query and dataset, and on the tradeoff between privacy and accuracy.

To illustrate the tradeoff between privacy and accuracy in (\(\epsilon\), \(\delta\))-differential privacy, the graphs in Figure 14.12 show model accuracy for different noise levels on the MNIST dataset, a large dataset of handwritten digits (Abadi et al. 2016). The delta value (black line; right y-axis) denotes the level of privacy relaxation (a high value means privacy is less stringent). As privacy becomes more relaxed, the accuracy of the model increases.

Figure 14.12: Privacy-accuracy tradeoff. Source: Abadi et al. (2016).

Abadi, Martin, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. “Deep Learning with Differential Privacy.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–18. CCS '16. New York, NY, USA: ACM. https://doi.org/10.1145/2976749.2978318.
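To make the Laplace Mechanism concrete, the following minimal sketch releases a differentially private count for a query with sensitivity 1 (a counting query). The dataset, the function name, and the epsilon values are assumptions chosen for illustration, not part of any production system.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy using the Laplace mechanism."""
    true_count = int(np.sum(values))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)  # noise scale = sensitivity / epsilon
    return true_count + noise

# Toy example: 10 patients, 6 with diabetes (1 = has the condition).
records = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: noisy count = {dp_count(records, eps):.2f}")
# A smaller epsilon means more noise: stronger privacy but less accurate answers.
```

Replacing the Laplace draw with Gaussian noise scaled to both ε and δ would yield the relaxed (ϵ, 𝛿) guarantee discussed above.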
The key points to remember about differential privacy are the following:

Adding Noise: The fundamental technique in differential privacy is adding controlled random noise to the data or query results. This noise masks the contribution of individual data points.
Balancing Act: There is a balance between privacy and accuracy. More noise (lower ϵ) means higher privacy but less accuracy in the model's results.
Universality: Differential privacy does not rely on assumptions about what an attacker knows. This makes it robust against re-identification attacks, where an attacker tries to uncover individual data.
Applicability: It can be applied to various types of data and queries, making it a versatile tool for privacy-preserving data analysis.

Tradeoffs

As with any technique, differential privacy involves several tradeoffs. Here we focus on the computation-specific ones, since our concern is ML systems. Key computational considerations when implementing differential privacy in a machine learning system include:

Noise generation: Implementing differential privacy introduces several important computational tradeoffs compared to standard machine learning techniques. One major consideration is the need to securely generate random noise from distributions like Laplace or Gaussian that gets added to query results and model outputs. High-quality cryptographic random number generation can be computationally expensive.

Sensitivity analysis: Another key requirement is rigorously tracking the sensitivity of the underlying algorithms to single data points being added or removed. This global sensitivity analysis is required to calibrate the noise levels properly. However, analyzing worst-case sensitivity can substantially increase computational complexity for complex model training procedures and data pipelines.

Privacy budget management: Managing the privacy loss budget across multiple queries and learning iterations is another bookkeeping overhead. The system must keep track of cumulative privacy costs and compose them to account for the overall privacy guarantee. This adds a computational burden beyond just running queries or training models.

Batch vs. online tradeoffs: For online learning systems with continuous high-volume queries, differentially private algorithms require new mechanisms to maintain utility and prevent too much accumulated privacy loss, since each query can consume part of the privacy budget. Batch offline processing is simpler from a computational perspective, as it processes data in large batches, where each batch is treated as a single query. High-dimensional sparse data also increases sensitivity analysis challenges.

Distributed training: When training models using distributed or federated approaches, new cryptographic protocols are needed to track and bound privacy leakage across nodes. Secure multiparty computation on encrypted data for differential privacy adds substantial computational load.

While differential privacy provides strong formal privacy guarantees, implementing it rigorously requires additions and modifications to the machine learning pipeline at a computational cost.
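The privacy budget bookkeeping mentioned above can be illustrated with a minimal sketch. The accountant class and its simple additive composition rule are illustrative assumptions; production systems rely on tighter composition methods such as moments or Rényi accountants.

```python
class BasicPrivacyAccountant:
    """Tracks cumulative privacy loss under simple additive sequential composition."""

    def __init__(self, epsilon_budget):
        self.epsilon_budget = epsilon_budget
        self.spent = 0.0

    def charge(self, epsilon):
        """Record the cost of one query, refusing it if the budget would be exceeded."""
        if self.spent + epsilon > self.epsilon_budget:
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.spent += epsilon
        return self.epsilon_budget - self.spent  # remaining budget

accountant = BasicPrivacyAccountant(epsilon_budget=1.0)
for query_epsilon in (0.3, 0.3, 0.3):
    print("remaining budget:", round(accountant.charge(query_epsilon), 2))
# A fourth query costing 0.3 would exceed the total budget of 1.0 and be rejected.
```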
Managing these overheads while preserving model accuracy remains an active research area.

Case Study

Apple's implementation of differential privacy in iOS and macOS provides a prominent real-world example of how differential privacy can be deployed at large scale. Apple wanted to collect aggregated usage statistics across their ecosystem to improve products and services, but aimed to do so without compromising individual user privacy.

To achieve this, they implemented differential privacy techniques directly on user devices to anonymize data points before sending them to Apple servers. Specifically, Apple uses the Laplace mechanism to inject carefully calibrated random noise. For example, suppose a user's location history contains [Work, Home, Work, Gym, Work, Home]. In that case, the differentially private version might replace the exact locations with a noisy sample like [Gym, Home, Work, Work, Home, Work].

Apple tunes the Laplace noise distribution to provide a high level of privacy while preserving the utility of aggregated statistics. Increasing noise levels provides stronger privacy guarantees (lower ε values in DP terminology) but can reduce data utility. Apple's privacy engineers empirically optimized this tradeoff based on their product goals.

Apple obtains high-fidelity aggregated statistics by aggregating hundreds of millions of noisy data points from devices. For instance, they can analyze feature usage in new iOS apps while masking any individual user's app behavior. On-device computation avoids sending raw data to Apple servers.

The system uses hardware-based secure random number generation to sample from the Laplace distribution on devices efficiently. Apple also had to optimize its differentially private algorithms and pipeline to operate under the computational constraints of consumer hardware.

Multiple third-party audits have verified that Apple's system provides rigorous differential privacy protections in line with their stated policies. Of course, assumptions around composition over time and potential re-identification risks still apply. Apple's deployment shows how differential privacy can be realized in large real-world products when backed by sufficient engineering resources.

Exercise 14.1: Differential Privacy - TensorFlow Privacy

Differential privacy makes it possible to train an ML model without exposing any individual's data. In this Colab, TensorFlow Privacy is used to add calibrated noise during training, making it substantially harder to determine whether any single person's data was included, even for an adversary probing the trained model.

14.8.2 Federated Learning

Core Idea

Federated Learning (FL) is a type of machine learning in which a model is built and distributed across multiple devices or servers while keeping the training data localized. It was previously discussed in the Model Optimizations chapter; we recap it briefly here for completeness, focusing on the aspects that pertain to this chapter.

FL trains machine learning models across decentralized networks of devices or systems while keeping all training data localized. Figure 14.13 illustrates this process: each participating device leverages its local data to calculate model updates, which are then aggregated to build an improved global model. However, the raw training data is never directly shared, transferred, or compiled. This privacy-preserving approach allows for the joint development of ML models without centralizing the potentially sensitive training data in one place.

Figure 14.13: Federated Learning lifecycle. Source: Jin et al. (2020).

Jin, Yilun, Xiguang Wei, Yang Liu, and Qiang Yang. 2020. “Towards Utilizing Unlabeled Data in Federated Learning: A Survey and Prospective.” arXiv Preprint arXiv:2002.11545.

One of the most common model aggregation algorithms is Federated Averaging (FedAvg), where the global model is created by averaging the parameters of the locally trained models. While FedAvg works well with independent and identically distributed (IID) data, alternative algorithms like Federated Proximal (FedProx) are crucial in real-world applications where data is often non-IID. FedProx is designed for the FL process when there is significant heterogeneity in the client updates due to diverse data distributions across devices, computational capabilities, or varied amounts of data.
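A minimal sketch of the FedAvg aggregation step is shown below. The linear-regression client update, the weighting by local sample counts, and all function names are simplifying assumptions for illustration, not a faithful reproduction of a production FL stack.

```python
import numpy as np

def local_update(global_params, local_data, lr=0.1):
    """One client's local training step (toy linear-regression gradient step);
    returns updated parameters and the local sample count."""
    X, y = local_data
    grad = 2 * X.T @ (X @ global_params - y) / len(y)
    return global_params - lr * grad, len(y)

def fedavg_round(global_params, clients):
    """One FedAvg round: average client parameters weighted by local dataset size."""
    updates = [local_update(global_params, data) for data in clients]
    total = sum(n for _, n in updates)
    return sum(params * (n / total) for params, n in updates)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]  # 4 clients, data stays local
params = np.zeros(3)
for _ in range(5):
    params = fedavg_round(params, clients)
print("global parameters after 5 rounds:", params)
```

FedProx would modify each client's local objective with a proximal term that penalizes drifting too far from the current global parameters, which helps when client data is non-IID.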
By leaving the raw data distributed and exchanging only temporary model updates, federated learning provides a more secure and privacy-enhancing alternative to traditional centralized machine learning pipelines. This allows organizations and users to benefit collaboratively from shared models while maintaining control and ownership over sensitive data. The decentralized nature of FL also makes it robust to single points of failure.

Imagine a group of hospitals that want to collaborate on a study to predict patient outcomes based on their symptoms. However, they cannot share their patient data due to privacy concerns and regulations like HIPAA. Here is how Federated Learning can help:

Local Training: Each hospital trains a machine learning model on its own patient data. This training happens locally, meaning the data never leaves the hospital's servers.
Model Sharing: After training, each hospital sends only the model (specifically, its parameters or weights) to a central server. It does not send any patient data.
Aggregating Models: The central server aggregates these models from all hospitals into a single, more robust model. This process typically involves averaging the model parameters.
Benefit: The result is a machine learning model that has learned from a wide range of patient data without sharing sensitive data or removing it from its original location.

Tradeoffs

There are several system performance-related aspects of FL in machine learning systems. It is important to understand these tradeoffs because there is no “free lunch” for preserving privacy through FL (Li et al. 2020).

Li, Tian, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. “Federated Learning: Challenges, Methods, and Future Directions.” IEEE Signal Process Mag. 37 (3): 50–60. https://doi.org/10.1109/msp.2020.2975749.

Communication Overhead and Network Constraints: In FL, one of the most significant challenges is managing the communication overhead. This involves the frequent transmission of model updates between a central server and numerous client devices, which can be bandwidth-intensive. The total number of communication rounds and the size of transmitted messages per round need to be reduced to minimize communication. This can lead to substantial network traffic, especially in scenarios with many participants. Additionally, latency becomes a critical factor: the time taken for these updates to be sent, aggregated, and redistributed can introduce delays.
This affects the overall training time and impacts the system's responsiveness and real-time capabilities. Managing this communication while minimizing bandwidth usage and latency is crucial for implementing FL.

Computational Load on Local Devices: FL relies on client devices, such as smartphones or IoT devices (which especially matter in TinyML), for model training, and these devices often have limited computational power and battery life. Running complex machine learning algorithms locally can strain these resources, leading to potential performance issues. Moreover, the capabilities of these devices can vary significantly, resulting in uneven contributions to the model training process. Some devices process updates faster and more efficiently than others, leading to disparities in the learning process. Balancing the computational load to ensure consistent participation and efficiency across all devices is a key challenge in FL.

Model Training Efficiency: FL's decentralized nature can impact the efficiency of model training. Achieving convergence, where the model no longer significantly improves, can be slower in FL than in centralized training methods. This is particularly true when the data is non-IID across devices. Additionally, the algorithms used for aggregating model updates play a critical role in the training process. Their efficiency directly affects the speed and effectiveness of learning. Developing and implementing algorithms that can handle the complexities of FL while ensuring timely convergence is essential for the system's performance.

Scalability Challenges: Scalability is a significant concern in FL, especially as the number of participating devices increases. Managing and coordinating model updates from many devices adds complexity and can strain the system. Ensuring that the system architecture can efficiently handle this increased load without degrading performance is crucial. This involves not just handling the computational and communication aspects but also maintaining the quality and consistency of the model as the scale of the operation grows. A key challenge is designing FL systems that scale effectively while maintaining performance.

Data Synchronization and Consistency: Ensuring data synchronization and maintaining model consistency across all participating devices in FL is challenging. Keeping all devices synchronized with the latest model version can be difficult in environments with intermittent connectivity or devices that go offline periodically. Furthermore, maintaining consistency in the learned model, especially when dealing with a wide range of devices with different data distributions and update frequencies, is crucial. This requires sophisticated synchronization and aggregation strategies to ensure that the final model accurately reflects the learnings from all devices.

Energy Consumption: The energy consumption of client devices in FL is a critical factor, particularly for battery-powered devices like smartphones and other TinyML/IoT devices. The computational demands of training models locally can lead to significant battery drain, which might discourage continuous participation in the FL process. Balancing the computational requirements of model training with energy efficiency is essential. This involves optimizing algorithms and training processes to reduce energy consumption while achieving effective learning outcomes.
Ensuring energy-efficient operation is key to user acceptance and the sustainability of FL systems.

Case Studies

Here are a few real-world case studies that illustrate the use of federated learning:

Google Gboard

Google uses federated learning to improve predictions on its Gboard mobile keyboard app. The app runs a federated learning algorithm on users' devices to learn from their local usage patterns and text predictions while keeping user data private. The model updates are aggregated in the cloud to produce an enhanced global model. This provides next-word predictions personalized to each user's typing style while avoiding directly collecting sensitive typing data. Google reported that the federated learning approach reduced prediction errors by 25% compared to the baseline while preserving privacy.

Healthcare Research

The UK Biobank and American College of Cardiology combined datasets to train a model for heart arrhythmia detection using federated learning. The datasets could not be combined directly due to legal and privacy restrictions. Federated learning allowed collaborative model development without sharing protected health data, with only model updates exchanged between the parties. This improved model accuracy because the model could leverage a wider diversity of training data while meeting regulatory requirements.

Financial Services

Banks are exploring federated learning for anti-money laundering (AML) detection models. Multiple banks could jointly improve AML models without sharing confidential customer transaction data with competitors or third parties. Only the model updates need to be aggregated rather than raw transaction data. This allows access to richer training data from diverse sources while avoiding regulatory and confidentiality issues around sharing sensitive financial customer data.

These examples demonstrate how federated learning provides tangible privacy benefits and enables collaborative ML in settings where direct data sharing is impossible.

14.8.3 Machine Unlearning

Core Idea

Machine unlearning is a relatively new technique for removing the influence of a subset of training data from a trained model. Several methods have been proposed to achieve this. A baseline approach might consist of simply fine-tuning the model for more epochs on just the data that should be remembered, to decrease the influence of the data “forgotten” by the model. Since this approach does not explicitly remove the influence of the data that should be erased, membership inference attacks are still possible, so researchers have adopted approaches that unlearn data from a model explicitly. One such approach adjusts the model loss function to explicitly treat the losses of the “forget set” (data to be unlearned) and the “retain set” (remaining data that should still be remembered) differently (Tarun et al. 2022; Khan and Swaroop 2021). Figure 14.14 illustrates some of the applications of machine unlearning.

Tarun, Ayush K, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. 2022. “Deep Regression Unlearning.” ArXiv Preprint abs/2210.08196. https://arxiv.org/abs/2210.08196.

Khan, Mohammad Emtiyaz, and Siddharth Swaroop. 2021. “Knowledge-Adaptation Priors.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, edited by Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, 19757–70. https://proceedings.neurips.cc/paper/2021/hash/a4380923dd651c195b1631af7c829187-Abstract.html.

Figure 14.14: Applications of Machine Unlearning. Source: BBVA OpenMind
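To make the loss-adjustment idea concrete, here is a PyTorch-style sketch of a single fine-tuning step that descends on the retain-set loss while ascending on the forget-set loss. The function name, the weighting factor alpha, and the toy linear model are illustrative assumptions, not the specific procedures proposed by Tarun et al. (2022) or Khan and Swaroop (2021).

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, retain_batch, forget_batch, alpha=0.5):
    """One update that preserves accuracy on the retain set while pushing up
    the loss on the forget set (gradient ascent on the forgotten examples)."""
    optimizer.zero_grad()
    rx, ry = retain_batch
    fx, fy = forget_batch
    retain_loss = F.cross_entropy(model(rx), ry)   # keep performance here
    forget_loss = F.cross_entropy(model(fx), fy)   # degrade memorization here
    loss = retain_loss - alpha * forget_loss
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()

# Toy usage with random data and a linear classifier.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
retain = (torch.randn(32, 10), torch.randint(0, 2, (32,)))
forget = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
print(unlearning_step(model, opt, retain, forget))
```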
Case Study

Researchers have demonstrated machine unlearning on state-of-the-art models by fine-tuning an LLM, LLaMA2-7b, to unlearn any references to Harry Potter (Eldan and Russinovich 2023). Though this model took 184K GPU hours to pre-train, it took only 1 GPU hour of fine-tuning to erase the model's ability to generate or recall Harry Potter-related content, without noticeably compromising the accuracy of generating content unrelated to Harry Potter. Figure 14.15 demonstrates how the model output changes before (Llama-7b-chat-hf column) and after (Finetuned Llama-7b column) unlearning has occurred.

Figure 14.15: Llama unlearning Harry Potter. Source: Eldan and Russinovich (2023).

Eldan, Ronen, and Mark Russinovich. 2023. “Who's Harry Potter? Approximate Unlearning in LLMs.” ArXiv Preprint abs/2310.02238. https://arxiv.org/abs/2310.02238.

Other Uses

Removing adversarial data

Deep learning models have previously been shown to be vulnerable to adversarial attacks, in which the attacker generates adversarial data similar to the original training data, such that a human cannot tell the difference between the real and fabricated data. The adversarial data results in the model outputting incorrect predictions, which could have detrimental consequences in various applications, including healthcare diagnosis predictions. Machine unlearning has been used to remove the influence of adversarial data to prevent these incorrect predictions from occurring and causing harm.

14.8.4 Homomorphic Encryption

Core Idea

Homomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, generating an encrypted result that, when decrypted, matches the result of the same operations performed on the plaintext. For example, multiplying two numbers encrypted with homomorphic encryption produces an encrypted product that decrypts to the actual product of the two numbers. This means that data can be processed in encrypted form, and only the resulting output needs to be decrypted, significantly enhancing data security, especially for sensitive information.

Homomorphic encryption enables outsourced computation on encrypted data without exposing the data itself to the external party performing the operations. However, only certain computations, like addition and multiplication, are supported in partially homomorphic schemes. Fully homomorphic encryption (FHE), which can handle arbitrary computations, is even more complex. In many schemes, the number of operations that can be performed is limited before accumulated noise corrupts the ciphertext.

To use homomorphic encryption across different entities, carefully generated public keys must be exchanged to allow operations across separately encrypted data.
This advanced encryption technique enables secure computation paradigms that were previously impossible, but it requires expertise to implement correctly for real-world systems.

Benefits

Homomorphic encryption enables machine learning model training and inference on encrypted data, ensuring that sensitive inputs and intermediate values remain confidential. This is critical in healthcare, finance, genetics, and other domains that increasingly rely on ML to analyze sensitive and regulated datasets containing billions of personal records.

Homomorphic encryption thwarts attacks like model extraction and membership inference that could expose private data used in ML workflows. It provides an alternative to trusted execution environments (TEEs) that use hardware enclaves for confidential computing. However, current schemes have high computational overheads and algorithmic limitations that constrain real-world applications.

Homomorphic encryption realizes a decades-old vision of secure computation by allowing operations directly on ciphertexts. The idea was conceptualized in the 1970s, but the first fully homomorphic cryptosystem emerged only in 2009, enabling arbitrary computations. Ongoing research is making these techniques more efficient and practical.

Homomorphic encryption shows great promise in enabling privacy-preserving machine learning under emerging data regulations. However, given its constraints, one should carefully evaluate its applicability against other confidential computing approaches. Extensive resources exist to explore homomorphic encryption and track progress in easing adoption barriers.

Mechanics

Data Encryption: Before data is processed or sent to an ML model, it is encrypted using a homomorphic encryption scheme and public key. For example, encrypting numbers \(x\) and \(y\) generates ciphertexts \(E(x)\) and \(E(y)\).
Computation on Ciphertext: The ML algorithm processes the encrypted data directly. For instance, multiplying the ciphertexts \(E(x)\) and \(E(y)\) generates \(E(xy)\). More complex model training can also be done on ciphertexts.
Result Decryption: The result \(E(xy)\) remains encrypted and can only be decrypted by someone with the corresponding private key to reveal the actual product \(xy\).

Only authorized parties with the private key can decrypt the final outputs, protecting the intermediate state. However, noise accumulates with each operation, limiting how much computation can be done without decryption.

Beyond healthcare, homomorphic encryption enables confidential computing for applications like financial fraud detection, insurance analytics, genetics research, and more. It offers an alternative to techniques like multiparty computation and TEEs. Ongoing research continues to improve its efficiency and capabilities.

Tools like HElib, SEAL, and TensorFlow HE provide libraries for exploring and implementing homomorphic encryption in real-world machine learning pipelines.
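As a concrete illustration of the additive case, the following deliberately tiny Paillier-style sketch shows how multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The hard-coded toy primes and helper names are assumptions for illustration only; the scheme as written is far too small to be secure.

```python
import math
import random

def L(u, n):
    return (u - 1) // n

def keygen(p, q):
    """Toy Paillier key generation with tiny primes (illustration only, not secure)."""
    n = p * q
    n2 = n * n
    g = n + 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(L(pow(g, lam, n2), n), -1, n)               # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return (L(pow(c, lam, n * n), n) * mu) % n

pub, priv = keygen(61, 53)                 # n = 3233
cx, cy = encrypt(pub, 42), encrypt(pub, 99)
c_sum = (cx * cy) % (pub[0] ** 2)          # additive homomorphism: E(x) * E(y) = E(x + y)
print(decrypt(pub, priv, c_sum))           # -> 141, computed without ever decrypting x or y
```

Fully homomorphic schemes extend this idea to support both addition and multiplication on ciphertexts, at a much higher computational cost.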
Tradeoffs

For many real-time and embedded applications, fully homomorphic encryption remains impractical for the following reasons.

Computational Overhead: Homomorphic encryption imposes very high computational overheads, often resulting in slowdowns of over 100x for real-world ML applications. This makes it impractical for many time-sensitive or resource-constrained uses. Optimized hardware and parallelization can alleviate but not eliminate this issue.

Complexity of Implementation: The sophisticated algorithms require deep expertise in cryptography to be implemented correctly. Nuances like format compatibility with floating point ML models and scalable key management pose hurdles. This complexity hinders widespread practical adoption.

Algorithmic Limitations: Current schemes restrict the functions and depth of computations supported, limiting the models and data volumes that can be processed. Ongoing research is pushing these boundaries, but restrictions remain.

Hardware Acceleration: Homomorphic encryption often requires specialized hardware, such as secure processors or coprocessors with TEEs, which adds design and infrastructure costs.

Hybrid Designs: Rather than encrypting entire workflows, selective application of homomorphic encryption to critical subcomponents can achieve protection while minimizing overheads.

Exercise 14.2: Homomorphic Encryption

The power of encrypted computation is unlocked through homomorphic encryption – a transformative approach in which calculations are performed directly on encrypted data, ensuring privacy is preserved throughout the process. This Colab explores the principles of computing on encrypted numbers without exposing the underlying data. Imagine a scenario where a machine learning model is trained on data that cannot be directly accessed – such is the strength of homomorphic encryption.

14.8.5 Secure Multiparty Computation

Core Idea

Secure Multi-Party Computation (MPC) enables multiple parties to jointly compute a function over their inputs while ensuring that each party's inputs remain confidential. For instance, two organizations can collaborate on training a machine learning model by combining datasets without revealing sensitive information to each other. MPC protocols are essential where privacy and confidentiality regulations restrict direct data sharing, such as in the healthcare or financial sectors.

MPC divides computation into parts that each participant executes independently using their private data. These results are then combined to reveal only the final output, preserving the privacy of intermediate values. Cryptographic techniques are used to provably guarantee that the partial results remain private.

Let's take a simple example of an MPC protocol. One of the most basic MPC protocols is the secure addition of two numbers. Each party splits its input into random shares that are secretly distributed. They exchange the shares and locally compute the sum of the shares, which reconstructs the final sum without revealing the individual inputs. For example, if Alice has input x and Bob has input y:

Alice generates random \(x_1\) and sets \(x_2 = x - x_1\)
Bob generates random \(y_1\) and sets \(y_2 = y - y_1\)
Alice sends \(x_1\) to Bob, Bob sends \(y_1\) to Alice (keeping \(x_2\) and \(y_2\) secret)
Alice computes \(x_2 + y_1 = s_1\), Bob computes \(x_1 + y_2 = s_2\)
\(s_1 + s_2 = x + y\) is the final sum, without revealing \(x\) or \(y\).

Alice's and Bob's individual inputs (\(x\) and \(y\)) remain private, and each party only reveals one number associated with their original inputs. The random shares ensure that no information about the original numbers is disclosed.
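The two-party addition above can be sketched in a few lines of code. The modulus, the share-splitting helper, and the variable names are illustrative assumptions rather than a hardened protocol implementation:

```python
import secrets

Q = 2**31 - 1  # public prime modulus; a share by itself is just a uniform random value

def make_shares(value):
    """Split a private value into two additive shares modulo Q."""
    share1 = secrets.randbelow(Q)
    share2 = (value - share1) % Q
    return share1, share2

alice_x, bob_y = 46, 17              # private inputs, never exchanged directly
x1, x2 = make_shares(alice_x)        # Alice keeps x2 and sends x1 to Bob
y1, y2 = make_shares(bob_y)          # Bob keeps y2 and sends y1 to Alice

s1 = (x2 + y1) % Q                   # computed locally by Alice
s2 = (x1 + y2) % Q                   # computed locally by Bob
print((s1 + s2) % Q)                 # -> 63, the sum, with neither input revealed
```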
Secure Comparison: Another basic operation is a secure comparison of two numbers, determining which is greater than the other. This can be done using techniques like Yao's Garbled Circuits, where the comparison circuit is encrypted to allow joint evaluation of the inputs without leaking them.

Secure Matrix Multiplication: Matrix operations like multiplication are essential for machine learning. MPC techniques like additive secret sharing can be used to split matrices into random shares, compute products on the shares, and then reconstruct the result.

Secure Model Training: Distributed machine learning training algorithms like federated averaging can be made secure using MPC. Model updates computed on partitioned data at each node are secretly shared between nodes and aggregated to train the global model without exposing individual updates.

The core idea behind MPC protocols is to divide the computation into steps that can be executed jointly without revealing intermediate sensitive data. This is accomplished by combining cryptographic techniques like secret sharing, homomorphic encryption, oblivious transfer, and garbled circuits. MPC protocols enable collaborative computation on sensitive data while providing provable privacy guarantees. This privacy-preserving capability is essential for many machine learning applications today involving multiple parties that cannot directly share their raw data.

The main approaches used in MPC include:

Homomorphic encryption: Special encryption allows computations to be carried out on encrypted data without decrypting it.
Secret sharing: The private data is divided into random shares distributed to each party. Computations are done locally on the shares and finally reconstructed.
Oblivious transfer: A protocol where a receiver obtains a subset of data from a sender, but the sender does not know which specific data was transferred.
Garbled circuits: The function to be computed is represented as a Boolean circuit that is encrypted (“garbled”) to allow joint evaluation without revealing inputs.

Tradeoffs

While MPC protocols provide strong privacy guarantees, they come at a high computational cost compared to plain computations. Every secure operation, like addition, multiplication, or comparison, requires orders of magnitude more processing than the equivalent unencrypted operation. This overhead stems from the underlying cryptographic techniques:

In partially homomorphic encryption, each computation on ciphertexts requires costly public-key operations. Fully homomorphic encryption has even higher overheads.
Secret sharing divides data into multiple shares, so even basic operations require manipulating many shares.
Oblivious transfer and garbled circuits add masking and encryption to hide data access patterns and execution flows.
MPC systems require extensive communication and interaction between parties to jointly compute on shares/ciphertexts.

As a result, MPC protocols can slow down computations by 3-4 orders of magnitude compared to plain implementations. This becomes prohibitively expensive for large datasets and models. Therefore, training machine learning models on encrypted data using MPC remains infeasible today for realistic dataset sizes due to the overhead. Clever optimizations and approximations are needed to make MPC practical.

Ongoing MPC research is closing this efficiency gap through cryptographic advances, new algorithms, trusted hardware like SGX enclaves, and leveraging accelerators like GPUs/TPUs.
However, in the foreseeable future, some degree of approximation and performance tradeoff is needed to scale MPC to meet the demands of real-world machine learning systems.

14.8.6 Synthetic Data Generation

Core Idea

Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically matches the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.

The primary challenge of synthesizing data is to ensure adversaries cannot re-identify the original dataset. A simple approach to producing synthetic data is adding noise to the original dataset, but this alone still risks privacy leakage. When noise is added in the context of differential privacy, sophisticated mechanisms based on the data's sensitivity are used to calibrate the amount and distribution of noise, and the resulting mathematically rigorous framework provides a quantifiable privacy guarantee, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, data scarcity, and the lack of examples needed for anomaly detection.

Researchers can freely share such synthetic data and collaborate on modeling without revealing private medical information. Well-constructed synthetic data protects privacy while providing utility for developing accurate models. Key techniques to prevent reconstruction of the original data include adding differential privacy noise during training, enforcing plausibility constraints, and using multiple diverse generative models. Here are some common approaches for generating synthetic data:

Generative Adversarial Networks (GANs): GANs are a class of machine learning models used in unsupervised learning in which two neural networks compete against each other in a game. Figure 14.16 gives an overview of the GAN system. The generator network (big red box) is responsible for producing the synthetic data, and the discriminator network (yellow box) evaluates the authenticity of the data by distinguishing between fake data created by the generator network and the real data. The generator and discriminator networks learn and update their parameters based on the results, and the discriminator acts as a measure of how similar the fake and real data are to one another. The approach is highly effective at generating realistic data and is popular for producing synthetic data.

Figure 14.16: Flowchart of GANs. Source: Rosa and Papa (2021).

Rosa, Gustavo H. de, and João P. Papa. 2021. “A Survey on Text Generation Using Generative Adversarial Networks.” Pattern Recogn. 119 (November): 108098. https://doi.org/10.1016/j.patcog.2021.108098.

Variational Autoencoders (VAEs): VAEs are neural networks capable of learning complex probability distributions while balancing data generation quality and computational efficiency. They encode data into a latent space, learn the distribution there, and decode samples from that space back into data.

Data Augmentation: This involves transforming existing data to create new, altered data. For example, flipping, rotating, and scaling (uniformly or non-uniformly) original images can help create a more diverse, robust image dataset before training an ML model.

Simulations: Mathematical models can simulate real-world systems or processes to mimic real-world phenomena. This is highly useful in scientific research, urban planning, and economics.
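The following sketch illustrates the general recipe with the simplest possible generative model: it fits a multivariate Gaussian to a hypothetical sensitive table and samples synthetic records from it. In practice a GAN or VAE would take the place of the Gaussian, but the workflow of fitting a model, sampling from it, and sharing only the synthetic output is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sensitive table: columns = (age, systolic blood pressure, cholesterol).
real = rng.normal(loc=[55, 130, 200], scale=[10, 15, 30], size=(500, 3))

# Fit a simple parametric generative model (mean and covariance) to the real data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records that follow the fitted distribution but are not real patients.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real column means:     ", np.round(mean, 1))
print("synthetic column means:", np.round(synthetic.mean(axis=0), 1))
```

Adding calibrated noise to the fitted statistics before sampling is one way to combine this recipe with the differential privacy protections discussed earlier.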
Benefits

While synthetic data may be necessary due to privacy or compliance risks, it is also widely used when available data is of poor quality, scarce, or inaccessible. Synthetic data offers more efficient and effective development by streamlining robust model training, testing, and deployment processes. It allows researchers to share models more widely without breaching privacy laws and regulations, and it facilitates collaboration among users of the same dataset, helping to broaden capabilities and advance ML research.

There are several motivations for using synthetic data in machine learning:

Privacy and compliance: Synthetic data avoids exposing personal information, allowing more open sharing and collaboration. This is important when working with sensitive datasets like healthcare records or financial information.
Data scarcity: When insufficient real-world data is available, synthetic data can augment training datasets. This improves model accuracy when limited data is a bottleneck.
Model testing: Synthetic data provides privacy-safe sandboxes for testing model performance, debugging issues, and monitoring for bias.
Data labeling: High-quality labeled training data is often scarce and expensive. Synthetic data can help auto-generate labeled examples.

Tradeoffs

While synthetic data tries to remove any evidence of the original dataset, privacy leakage is still a risk since the synthetic data mimics the original data. The statistical information and distribution are similar, if not the same, between the original and synthetic data, so by resampling from the distribution, adversaries may still be able to recover the original training samples. Due to their inherent learning processes and complexity, neural networks might accidentally reveal sensitive information about the original training data.

A core challenge with synthetic data is the potential gap between synthetic and real-world data distributions. Despite advancements in generative modeling techniques, synthetic data may only partially capture real data's complexity, diversity, and nuanced patterns. This can limit the utility of synthetic data for robustly training machine learning models. Rigorously evaluating synthetic data quality through adversarial methods and comparing model performance to real-data benchmarks helps assess and improve fidelity. However, synthetic data inherently remains an approximation.

Another critical concern is the privacy risk of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable the reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential privacy can help safeguard privacy, but they come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.
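One simple way to probe for the leakage described above is a nearest-neighbor memorization check: if synthetic records sit unusually close to individual real records, the generator may be reproducing its training data. The sketch below is a rough heuristic under assumed variable names, not a formal privacy audit.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[55, 130, 200], scale=[10, 15, 30], size=(500, 3))
synthetic = rng.multivariate_normal(real.mean(axis=0), np.cov(real, rowvar=False), size=500)

def nearest_distances(a, b, exclude_self=False):
    """For each row of a, the Euclidean distance to its closest row in b."""
    dists = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    if exclude_self:
        np.fill_diagonal(dists, np.inf)  # ignore each record's zero distance to itself
    return dists.min(axis=1)

d_synthetic = nearest_distances(synthetic, real)
d_baseline = nearest_distances(real, real, exclude_self=True)

print("median nearest-real distance, synthetic records:", float(np.median(d_synthetic)))
print("median nearest-real distance, real records:     ", float(np.median(d_baseline)))
# Synthetic records sitting much closer to real records than real records sit to each
# other would be a warning sign that the generator is memorizing its training inputs.
```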
Additional pitfalls of synthetic data include amplified biases, mislabeling, the computational overhead of training generative models, storage costs, and failure to account for out-of-distribution novel data. While these are secondary to the core synthetic-real gap and privacy risks, they remain important considerations when evaluating the suitability of synthetic data for particular machine learning tasks. As with any technique, the advantages of synthetic data come with inherent tradeoffs and limitations that require thoughtful mitigation strategies.

14.8.7 Summary

While all the techniques we have discussed thus far aim to enable privacy-preserving machine learning, they involve distinct mechanisms and tradeoffs. Factors like computational constraints, required trust assumptions, threat models, and data characteristics help guide the selection process for a particular use case. However, finding the right balance between privacy, accuracy, and efficiency necessitates experimentation and empirical evaluation for many applications. Table 14.2 compares the key privacy-preserving machine learning techniques and their pros and cons.

Table 14.2: Comparing techniques for privacy-preserving machine learning.

Differential Privacy
Pros: Strong formal privacy guarantees; robust to auxiliary data attacks; versatile for many data types and analyses.
Cons: Accuracy loss from noise addition; computational overhead for sensitivity analysis and noise generation.

Federated Learning
Pros: Allows collaborative learning without sharing raw data; data remains decentralized, improving security; no need for encrypted computation.
Cons: Increased communication overhead; potentially slower model convergence; uneven client device capabilities.

Secure Multi-Party Computation
Pros: Enables joint computation on sensitive data; provides cryptographic privacy guarantees; flexible protocols for various functions.
Cons: Very high computational overhead; complexity of implementation; algorithmic constraints on function depth.

Homomorphic Encryption
Pros: Allows computation on encrypted data; prevents intermediate state exposure.
Cons: Extremely high computational cost; complex cryptographic implementations; restrictions on function types.

Synthetic Data Generation
Pros: Enables data sharing without leakage; mitigates data scarcity problems.
Cons: Synthetic-real gap in distributions; potential for reconstructing private data; biases and labeling challenges.