# Masking HR data {#masking-data}
```{r masking-data, include=FALSE}
chap <- 20
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
The ability to quickly anonymise data for test and development environments has an important place in the toolbox of every HR analytics expert: you want to protect confidential data from inappropriate access.

Ensure all needed libraries are installed:
```{r include=FALSE}
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(randNames)) install.packages("randNames")
if(!require(smoothmest)) install.packages("smoothmest")
```
```{r}
library(tidyverse)
library(randNames)
library(smoothmest)
```
### Whitehouse dataset

Let us load a dataset with first and last names and preview it.
```{r eval=FALSE}
whitehouse <- read_csv("https://hranalytics.netlify.com/data/2016-Report-White-House-Staff.csv")
```
```{r read_data7, echo=FALSE, warning=FALSE, message=FALSE}
whitehouse <- read_csv("data/2016-Report-White-House-Staff.csv")
```
Let us replace the original names with fake names.
```{r}
# Create a set of fake names equal in length to the original dataset
fake <- nrow(whitehouse) %>%
  rand_names(nat = "US")
# Function to capitalise the first letter
firstup <- function(x) {
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}
# Create a data frame without the original names column
whitehouse_with_no_names <- dplyr::select(whitehouse, -Name)
# Create a column of fake "Last, First" names
fake$newnames <- paste0(firstup(fake$name.last), ", ", firstup(fake$name.first))
# Keep exactly as many fake names as there are rows in the data frame
result <- fake[1:nrow(whitehouse_with_no_names), ]
# Bind the dataset without names to the fake names
whitehouse_masked <- cbind(result$newnames, whitehouse_with_no_names)
# Rename the first column to "Name"
colnames(whitehouse_masked)[1] <- "Name"
```
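A related approach (a sketch, not part of the pipeline above; the `pseudonymise()` helper and the "Person 001" ID format are hypothetical) is to build a lookup table so that the same real name always receives the same pseudonym, which keeps joins across tables consistent:

```{r}
# Sketch: consistent pseudonyms via a lookup table, using base R only
pseudonymise <- function(names) {
  set.seed(76)
  unique_names <- unique(names)
  # One random "Person NNN" label per distinct name
  ids <- sprintf("Person %03d", sample(seq_along(unique_names)))
  lookup <- setNames(ids, unique_names)
  unname(lookup[names])
}

nm <- c("Smith, John", "Doe, Jane", "Smith, John")
pseudonymise(nm)  # both "Smith, John" entries map to the same pseudonym
```

Because the mapping is deterministic for a given input, the same person is masked consistently across tables, unlike drawing fresh fake names each time.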
Let us replace the original names with random numbers.
```{r}
# Set seed for reproducibility of the random numbers
set.seed(42)
# Replace names with unique random numbers between 1 and 1000
whitehouse_no_names <- whitehouse_masked %>%
mutate(Name = sample(1:1000, nrow(whitehouse_masked)))
```
Let us anonymise the salary data. We can use four different methods:

* Rounding brings Salary to the nearest ten thousand.
* Top coding consists of bringing data to an upper limit.
* Bottom coding consists of bringing data to a lower limit.
* Transformation applies a mathematical function to the data; the square root transformation is demonstrated later in this chapter.
```{r}
# Rounding: round Salary to the nearest ten thousand
whitehouse_no_identifiers <- whitehouse_no_names %>%
  mutate(Salary = round(Salary, digits = -4))
# Top coding: convert the salaries into three categories
whitehouse.gen <- whitehouse_masked %>%
  mutate(Salary = ifelse(Salary < 50000, 0,
                         ifelse(Salary >= 50000 & Salary < 100000, 1, 2)))
# Bottom coding: raise all salaries below 45000 to 45000
whitehouse.bottom <- whitehouse_masked %>%
  mutate(Salary = ifelse(Salary <= 45000, 45000, Salary))
```
### Fertility dataset

Let us explore a deeper anonymisation option: generating synthetic data. We will do so on the basis of the fertility dataset available from the UCI website:

https://archive.ics.uci.edu/ml/datasets/Fertility

![UCI logo](images/UCI_logo.png)

Load the dataset first and preview it.
```{r eval=FALSE}
fertility <- read_csv("https://hranalytics.netlify.com/data/fertility_Diagnosis.txt", col_names = FALSE)
```
```{r read_data10, echo=FALSE, warning=FALSE, message=FALSE}
fertility <- read_csv("data/fertility_Diagnosis.txt", col_names = FALSE)
```
```{r}
# Assign meaningful column titles
colnames(fertility) <- c("Season", "Age", "Child_Disease", "Accident_Trauma",
                         "Surgical_Intervention", "High_Fevers", "Alcohol_Consumption",
                         "Smoking_Habit", "Hours_Sitting", "Diagnosis")
# View fertility data
fertility
```
Let us examine the dataset. More on the data attributes is available here:

https://archive.ics.uci.edu/ml/datasets/Fertility
```{r}
# Number of participants with Surgical_Intervention, by Diagnosis
fertility %>%
  group_by(Diagnosis) %>%
  summarise_at(vars(Surgical_Intervention), sum)
# Mean and standard deviation of Age
fertility %>%
  summarise_at(vars(Age), list(mean = mean, sd = sd))
# Counts of the groups in High_Fevers
fertility %>%
  count(High_Fevers)
# Counts of the groups in Child_Disease and Accident_Trauma
fertility %>%
  count(Child_Disease, Accident_Trauma)
# Calculate the average of Child_Disease
fertility %>%
  summarise_at(vars(Child_Disease), mean)
```
In the following we will generate synthetic data by sampling from a normal distribution, through these steps:

1. Create a new dataset called "fert" after applying a log transformation to the hours sitting variable.
2. Calculate the average and the standard deviation.
3. Set a seed for reproducibility.
4. Generate new, normally distributed data for the hours sitting variable.
5. Transform the log variable back using the exponential.
6. Hard-bound data falling outside the valid range.
7. Recheck the range.
8. Substitute the synthetic data back into the initial fert dataset.
```{r}
# 1. Log-transform the hours sitting variable
fert <- fertility %>%
  mutate(Hours_Sitting = log(Hours_Sitting))
# 2. Calculate the average and the standard deviation
fert %>%
  summarise_at(vars(Hours_Sitting), list(mean = mean, sd = sd))
# 3. Set a seed for reproducibility
set.seed(42)
# 4. Generate normally distributed data on the log scale
hours.sit <- rnorm(100, -1.01, 0.50)
# 5. Transform back using the exponential
hours.sit <- exp(hours.sit)
# 6. Hard-bound values outside [0, 1]
hours.sit[hours.sit < 0] <- 0
hours.sit[hours.sit > 1] <- 1
# 7. Recheck the range
range(hours.sit)
# 8. Substitute the synthetic data back into the dataset
fert$Hours_Sitting <- hours.sit
```
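The recipe above can be sanity-checked on simulated data; the `rbeta()` draw below is a hypothetical stand-in for a bounded (0, 1) variable, not the fertility data:

```{r}
# Standalone sketch of the moment-matching recipe on simulated (0, 1) data
set.seed(42)
original <- rbeta(100, 2, 3)              # hypothetical bounded variable
logged <- log(original)                   # 1. log transform
m <- mean(logged)                         # 2. average...
s <- sd(logged)                           #    ...and standard deviation
synthetic <- exp(rnorm(100, m, s))        # 4./5. draw on the log scale, back-transform
synthetic <- pmin(pmax(synthetic, 0), 1)  # 6. hard-bound to [0, 1]
range(synthetic)                          # 7. recheck the range
```

The synthetic values share the original's log-scale mean and standard deviation while containing no real observations.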
Various types of transformations exist in data analysis, such as the reciprocal, logarithm, cube root, square root and square; in the following we demonstrate only the square root transformation.
```{r}
# Square root transformation of Salary
whitehouse.salary <- whitehouse_masked %>%
  mutate(Salary = sqrt(Salary))
# Calculate the mean and standard deviation
stats <- whitehouse.salary %>%
  summarise_at(vars(Salary), list(mean = mean, sd = sd))
# Generate synthetic data with the same mean and standard deviation
salary_transformed <- rnorm(nrow(whitehouse_masked),
                            mean(whitehouse.salary$Salary),
                            sd(whitehouse.salary$Salary))
# Back-transform by squaring
salary_original <- salary_transformed^2
# Hard-bound negative values at zero
salary <- ifelse(salary_original < 0, 0, salary_original)
```
Let us now introduce the concept of differential privacy, a mathematical framework used by organisations such as Google, Apple and the US Census Bureau.

Why differential privacy? It quantifies the privacy loss via a privacy budget, called epsilon. It assumes the worst-case scenario about the data intruder. A smaller privacy budget means less information, that is, a noisier answer; epsilon, however, cannot be zero or lower.
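As a minimal sketch (base R only; `laplace_noise()` and `dp_answer()` are illustrative helpers, not part of any package), the Laplace mechanism adds noise whose scale is the query's sensitivity divided by epsilon; a difference of two exponential draws follows a Laplace distribution:

```{r}
# Laplace(0, b) noise as the difference of two Exp(rate = 1/b) draws
laplace_noise <- function(n, scale) rexp(n, 1 / scale) - rexp(n, 1 / scale)

# Privatise a true answer: noise scale = global sensitivity / epsilon
dp_answer <- function(true_value, sensitivity, epsilon) {
  true_value + laplace_noise(1, sensitivity / epsilon)
}

set.seed(76)
dp_answer(87, 1, 10)    # generous budget: answer stays close to 87
dp_answer(87, 1, 0.1)   # tight budget: far noisier answer
```

The smoothmest package used below provides the same distribution via `rdoublex()`; this sketch only makes the sensitivity / epsilon trade-off explicit.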
The amount of noise required depends on the global sensitivity of the query: how much one individual's data can change the answer. With n the total number of observations, a the lower bound and b the upper bound of the data, the global sensitivities of some common queries are:

* Counting: 1
* Proportion: 1 / n
* Mean: (b - a) / n
* Variance: (b - a)^2 / n

A small global sensitivity results in less noise; a large global sensitivity results in more noise.

```{r}
# Number of observations
n <- nrow(fertility)
# Global sensitivity of counts
gs.count <- 1
# Global sensitivity of proportions
gs.prop <- 1 / n
# Lower bound
a <- 0
# Upper bound
b <- 1
# Global sensitivity of the mean
gs.mean <- (b - a) / n
# Global sensitivity of the variance
gs.var <- (b - a)^2 / n
```
```{r}
# True (non-private) sum of Child_Disease: the answer we will privatise
fertility %>%
  summarise_at(vars(Child_Disease), sum)
library(smoothmest)
# rdoublex(n, mu, lambda) is a random number generator: it draws n values
# from the double exponential (Laplace) distribution with location mu and
# scale lambda.
```
The double exponential distribution is also commonly referred to as the Laplace distribution. The following is the plot of the double exponential probability density function.
![plot of double exponential probability density function](images/dexpdf.png)
```{r}
set.seed(42)
# Small scale (sensitivity 1, epsilon 10): the noisy answer stays close to 87
rdoublex(1, 87, 1 / 10)
#[1] 87.01983
set.seed(42)
# Large scale (sensitivity 1, epsilon 0.1): the noisy answer drifts further away
rdoublex(1, 87, 1 / 0.1)
#[1] 88.98337
```
### Sequential composition

Suppose a set of privacy mechanisms M is performed sequentially on the same dataset, each providing an epsilon privacy guarantee. Sequential composition determines the privacy guarantee for such a sequence of differentially private computations: the final privacy guarantee is the sum of the individual privacy budgets. Because we ask two queries below (a mean and a variance), the privacy budget must be divided by two, so each query receives epsilon / 2.
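Numerically, the budget split can be sketched as follows (the variable names are illustrative):

```{r}
# Sequential composition: queries on the same data consume additive budgets
eps_total <- 0.1
eps_each <- eps_total / 2         # two queries, so each gets epsilon / 2
gs.mean <- 0.01
scale_each <- gs.mean / eps_each  # Laplace scale per query: 0.01 / 0.05 = 0.2
eps_each + eps_each               # total privacy loss adds back up to 0.1
```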
```{r}
# Each of the two queries gets half of the total budget of 0.1
eps <- 0.1 / 2
# Global sensitivity of the mean and of the variance
gs.mean <- 0.01
gs.var <- 0.01
# Apply the Laplace mechanism
set.seed(42)
rdoublex(1, 0.41, gs.mean / eps)
#[1] 0.4496674
rdoublex(1, 0.19, gs.var / eps)
#[1] 0.2466982
```
### Parallel composition

Suppose a set of privacy mechanisms M is performed on disjoint subsets of a dataset, each providing an epsilon privacy guarantee. Because every record is touched by at most one mechanism, the final privacy guarantee is not the sum of the budgets but the maximum: the query with the largest epsilon sets the budget for the data, and the privacy budget does not need to be divided.
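A quick numeric sketch (illustrative variable names) of the contrast with sequential composition:

```{r}
# Parallel composition: mechanisms on disjoint subsets do not add up
eps_fever <- 0.1              # budget spent on participants with high fevers
eps_no_fever <- 0.1           # budget spent on the remaining participants
max(eps_fever, eps_no_fever)  # overall budget is the maximum, 0.1, not 0.2
```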
```{r}
# Mean of Hours_Sitting for participants with high fevers
fertility %>%
  filter(High_Fevers >= 0) %>%
  summarise_at(vars(Hours_Sitting), mean)
#Hours_Sitting 0.3932967
# Mean of Hours_Sitting for participants without high fevers
fertility %>%
  filter(High_Fevers == -1) %>%
  summarise_at(vars(Hours_Sitting), mean)
#Hours_Sitting 0.5433333
# Set the value of epsilon; the subsets are disjoint, so no division is needed
eps <- 0.1
# Global sensitivity of the mean for Hours_Sitting
gs.mean <- 1 / 100
# Apply the Laplace mechanism to each subset
set.seed(42)
rdoublex(1, 0.39, gs.mean / eps)
#[1] 0.4098337
rdoublex(1, 0.54, gs.mean / eps)
#[1] 0.5683491
```
### Prepping up data
```{r}
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
fertility %>%
  count(Smoking_Habit)
#Smoking_Habit  n
#           -1 56
#            0 23
#            1 21
# Apply the Laplace mechanism, splitting the budget over two counts
set.seed(42)
smoking1 <- rdoublex(1, 56, gs.count / eps / 2) %>%
  round()
smoking2 <- rdoublex(1, 23, gs.count / eps / 2) %>%
  round()
# Post-process: derive the third count from the previous noisy answers
smoking3 <- nrow(fertility) - smoking1 - smoking2
# Check the noisy answers
smoking1
#[1] 60
smoking2
#[1] 29
smoking3
#[1] 11
```
### Impossible and inconsistent answers
```{r}
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
# Count participants with an abnormal diagnosis (Diagnosis == "O")
Number_abnormal <- fertility %>%
  filter(Diagnosis == "O") %>%
  summarise(sum_x1 = sum(Diagnosis == "O"))
# Negative counts: the Laplace mechanism can return an impossible answer
set.seed(22)
rdoublex(1, 12, gs.count / eps) %>%
  round()
#[1] -79
# Truncate the noisy answer at zero
set.seed(22)
rdoublex(1, 12, gs.count / eps) %>%
  round() %>%
  max(0)
#[1] 0
# A different seed can return an answer larger than the number of observations
set.seed(12)
noisy_answer <- rdoublex(1, 12, gs.count / eps) %>%
  round() %>%
  max(0)
n <- nrow(fertility)
# Clamp the noisy answer to at most n
ifelse(noisy_answer > n, n, noisy_answer)
#[1] 100
# Normalising
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
fertility %>%
  count(Smoking_Habit)
#Smoking_Habit  n
#           -1 56
#            0 23
#            1 21
# Apply the Laplace mechanism, truncating negative answers at zero
set.seed(42)
smoking1 <- rdoublex(1, 56, gs.count / eps / 2) %>%
  max(0)
smoking2 <- rdoublex(1, 23, gs.count / eps / 2) %>%
  max(0)
smoking3 <- rdoublex(1, 21, gs.count / eps / 2) %>%
  max(0)
# Check the noisy answers
smoking <- c(smoking1, smoking2, smoking3)
smoking
#[1] 65.91684 37.17455 0.00000
# Normalise so the noisy counts sum to the number of observations
normalized <- (smoking / sum(smoking)) * nrow(fertility)
# Round the values
round(normalized)
#[1] 64 36 0
```