# Masking HR data {#masking-data}
```{r masking-data, include=FALSE}
chap <- 20
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
The ability to quickly anonymise data for test and development environments has an important place in the toolbox of every HR analytics expert: you want to protect confidential data from inappropriate access.

Ensure all needed libraries are installed:
```{r include=FALSE}
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(randNames)) install.packages("randNames")
if(!require(smoothmest)) install.packages("smoothmest")
```
```{r}
library(tidyverse)
library(randNames)
library(smoothmest)
```
### Whitehouse dataset

Let us load a dataset with first and last names and preview it.
```{r eval=FALSE}
whitehouse <- read_csv("https://hranalytics.netlify.com/data/2016-Report-White-House-Staff.csv")
```
```{r read_data7, echo=FALSE, warning=FALSE, message=FALSE}
whitehouse <- read_csv("data/2016-Report-White-House-Staff.csv")
```
Let us replace the original names with fake names.
```{r}
# Create a set of fake names equal in length to the original dataset
fake <- nrow(whitehouse) %>%
  rand_names(nat = "US")
# Function to capitalise the first letter
firstup <- function(x) {
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}
# Create a data frame without the original names column
whitehouse_with_no_names <- dplyr::select(whitehouse, -Name)
# Create a column of fake "Last, First" names
fake$newnames <- paste0(firstup(fake$name.last), ", ", firstup(fake$name.first))
# Keep exactly as many fake names as there are rows in the data frame
result <- fake[1:nrow(whitehouse_with_no_names), ]
# Bind the dataset without names to the fake names
whitehouse_masked <- cbind(result$newnames, whitehouse_with_no_names)
# Rename the first column to "Name"
colnames(whitehouse_masked)[1] <- "Name"
```
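A related approach (a sketch, not part of the pipeline above; the `pseudonymise()` helper and the "Person 001" ID format are hypothetical) is to build a lookup table so that the same real name always receives the same pseudonym, which keeps joins across tables consistent:

```{r}
# Sketch: consistent pseudonyms via a lookup table, using base R only
pseudonymise <- function(names) {
  set.seed(76)
  unique_names <- unique(names)
  # One random "Person NNN" label per distinct name
  ids <- sprintf("Person %03d", sample(seq_along(unique_names)))
  lookup <- setNames(ids, unique_names)
  unname(lookup[names])
}

nm <- c("Smith, John", "Doe, Jane", "Smith, John")
pseudonymise(nm)  # both "Smith, John" entries map to the same pseudonym
```

Because the mapping is deterministic for a given input, the same person is masked consistently across tables, unlike drawing fresh fake names each time.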
Let us replace the original names with random numbers.
```{r}
# Set seed for reproducibility of the random numbers
set.seed(42)
# Replace names with unique random numbers between 1 and 1000
whitehouse_no_names <- whitehouse_masked %>%
mutate(Name = sample(1:1000, nrow(whitehouse_masked)))
```
Let us anonymise the salary data. We can use four different methods:

* Rounding brings Salary to the nearest ten thousand.
* Top coding consists of bringing data to an upper limit.
* Bottom coding consists of bringing data to a lower limit.
* Transformation applies a mathematical function to the data; the square root transformation is demonstrated later in this chapter.
```{r}
# Rounding: round Salary to the nearest ten thousand
whitehouse_no_identifiers <- whitehouse_no_names %>%
  mutate(Salary = round(Salary, digits = -4))
# Top coding: convert the salaries into three categories
whitehouse.gen <- whitehouse_masked %>%
  mutate(Salary = ifelse(Salary < 50000, 0,
                         ifelse(Salary >= 50000 & Salary < 100000, 1, 2)))
# Bottom coding: raise all salaries below 45000 to 45000
whitehouse.bottom <- whitehouse_masked %>%
  mutate(Salary = ifelse(Salary <= 45000, 45000, Salary))
```
### Fertility dataset

Let us explore a deeper anonymisation option: generating synthetic data. We will do so on the basis of the fertility dataset available from the UCI website:

https://archive.ics.uci.edu/ml/datasets/Fertility

![UCI logo](images/UCI_logo.png)

Load the dataset first and preview it.
```{r eval=FALSE}
fertility <- read_csv("https://hranalytics.netlify.com/data/fertility_Diagnosis.txt", col_names = FALSE)
```
```{r read_data10, echo=FALSE, warning=FALSE, message=FALSE}
fertility <- read_csv("data/fertility_Diagnosis.txt", col_names = FALSE)
```
```{r}
# Assign meaningful column titles
colnames(fertility) <- c("Season", "Age", "Child_Disease", "Accident_Trauma",
                         "Surgical_Intervention", "High_Fevers", "Alcohol_Consumption",
                         "Smoking_Habit", "Hours_Sitting", "Diagnosis")
# View fertility data
fertility
```
Let us examine the dataset. More on the data attributes is available here:

https://archive.ics.uci.edu/ml/datasets/Fertility
```{r}
# Number of participants with Surgical_Intervention, by Diagnosis
fertility %>%
  group_by(Diagnosis) %>%
  summarise_at(vars(Surgical_Intervention), sum)
# Mean and standard deviation of Age
fertility %>%
  summarise_at(vars(Age), list(mean = mean, sd = sd))
# Counts of the groups in High_Fevers
fertility %>%
  count(High_Fevers)
# Counts of the groups in Child_Disease and Accident_Trauma
fertility %>%
  count(Child_Disease, Accident_Trauma)
# Calculate the average of Child_Disease
fertility %>%
  summarise_at(vars(Child_Disease), mean)
```
In the following we will generate synthetic data by sampling from a normal distribution, through these steps:

1. Create a new dataset called "fert" after applying a log transformation to the hours sitting variable.
2. Calculate the average and the standard deviation.
3. Set a seed for reproducibility.
4. Generate new, normally distributed data for the hours sitting variable.
5. Transform the log variable back using the exponential.
6. Hard-bound data falling outside the valid range.
7. Recheck the range.
8. Substitute the synthetic data back into the initial fert dataset.
```{r}
# 1. Log-transform the hours sitting variable
fert <- fertility %>%
  mutate(Hours_Sitting = log(Hours_Sitting))
# 2. Calculate the average and the standard deviation
fert %>%
  summarise_at(vars(Hours_Sitting), list(mean = mean, sd = sd))
# 3. Set a seed for reproducibility
set.seed(42)
# 4. Generate normally distributed data on the log scale
hours.sit <- rnorm(100, -1.01, 0.50)
# 5. Transform back using the exponential
hours.sit <- exp(hours.sit)
# 6. Hard-bound values outside [0, 1]
hours.sit[hours.sit < 0] <- 0
hours.sit[hours.sit > 1] <- 1
# 7. Recheck the range
range(hours.sit)
# 8. Substitute the synthetic data back into the dataset
fert$Hours_Sitting <- hours.sit
```
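The recipe above can be sanity-checked on simulated data; the `rbeta()` draw below is a hypothetical stand-in for a bounded (0, 1) variable, not the fertility data:

```{r}
# Standalone sketch of the moment-matching recipe on simulated (0, 1) data
set.seed(42)
original <- rbeta(100, 2, 3)              # hypothetical bounded variable
logged <- log(original)                   # 1. log transform
m <- mean(logged)                         # 2. average...
s <- sd(logged)                           #    ...and standard deviation
synthetic <- exp(rnorm(100, m, s))        # 4./5. draw on the log scale, back-transform
synthetic <- pmin(pmax(synthetic, 0), 1)  # 6. hard-bound to [0, 1]
range(synthetic)                          # 7. recheck the range
```

The synthetic values share the original's log-scale mean and standard deviation while containing no real observations.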
Various types of transformations exist in data analysis, such as the reciprocal, logarithm, cube root, square root and square; in the following we demonstrate only the square root transformation.
```{r}
# Square root transformation of Salary
whitehouse.salary <- whitehouse_masked %>%
  mutate(Salary = sqrt(Salary))
# Calculate the mean and standard deviation
stats <- whitehouse.salary %>%
  summarise_at(vars(Salary), list(mean = mean, sd = sd))
# Generate synthetic data with the same mean and standard deviation
salary_transformed <- rnorm(nrow(whitehouse_masked),
                            mean(whitehouse.salary$Salary),
                            sd(whitehouse.salary$Salary))
# Back-transform by squaring
salary_original <- salary_transformed^2
# Hard-bound negative values at zero
salary <- ifelse(salary_original < 0, 0, salary_original)
```
Let us now introduce the concept of differential privacy, a mathematical framework used by organisations such as Google, Apple and the US Census Bureau.

Why differential privacy? It quantifies the privacy loss via a privacy budget, called epsilon. It assumes the worst-case scenario about the data intruder. A smaller privacy budget means less information, that is, a noisier answer; epsilon, however, cannot be zero or lower.
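As a minimal sketch (base R only; `laplace_noise()` and `dp_answer()` are illustrative helpers, not part of any package), the Laplace mechanism adds noise whose scale is the query's sensitivity divided by epsilon; a difference of two exponential draws follows a Laplace distribution:

```{r}
# Laplace(0, b) noise as the difference of two Exp(rate = 1/b) draws
laplace_noise <- function(n, scale) rexp(n, 1 / scale) - rexp(n, 1 / scale)

# Privatise a true answer: noise scale = global sensitivity / epsilon
dp_answer <- function(true_value, sensitivity, epsilon) {
  true_value + laplace_noise(1, sensitivity / epsilon)
}

set.seed(76)
dp_answer(87, 1, 10)    # generous budget: answer stays close to 87
dp_answer(87, 1, 0.1)   # tight budget: far noisier answer
```

The smoothmest package used below provides the same distribution via `rdoublex()`; this sketch only makes the sensitivity / epsilon trade-off explicit.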
The amount of noise required depends on the global sensitivity of the query: how much one individual's data can change the answer. With n the total number of observations, a the lower bound and b the upper bound of the data, the global sensitivities of some common queries are:

* Counting: 1
* Proportion: 1 / n
* Mean: (b - a) / n
* Variance: (b - a)^2 / n

A small global sensitivity results in less noise; a large global sensitivity results in more noise.

```{r}
# Number of observations
n <- nrow(fertility)
# Global sensitivity of counts
gs.count <- 1
# Global sensitivity of proportions
gs.prop <- 1 / n
# Lower bound
a <- 0
# Upper bound
b <- 1
# Global sensitivity of the mean
gs.mean <- (b - a) / n
# Global sensitivity of the variance
gs.var <- (b - a)^2 / n
```
```{r}
# True (non-private) sum of Child_Disease: the answer we will privatise
fertility %>%
  summarise_at(vars(Child_Disease), sum)
library(smoothmest)
# rdoublex(n, mu, lambda) is a random number generator: it draws n values
# from the double exponential (Laplace) distribution with location mu and
# scale lambda.
```
The double exponential distribution is also commonly referred to as the Laplace distribution. The following is the plot of the double exponential probability density function.
![plot of double exponential probability density function](images/dexpdf.png)
```{r}
set.seed(42)
# Small scale (sensitivity 1, epsilon 10): the noisy answer stays close to 87
rdoublex(1, 87, 1 / 10)
#[1] 87.01983
set.seed(42)
# Large scale (sensitivity 1, epsilon 0.1): the noisy answer drifts further away
rdoublex(1, 87, 1 / 0.1)
#[1] 88.98337
```
### Sequential composition

Suppose a set of privacy mechanisms M is performed sequentially on the same dataset, each providing an epsilon privacy guarantee. Sequential composition determines the privacy guarantee for such a sequence of differentially private computations: the final privacy guarantee is the sum of the individual privacy budgets. Because we ask two queries below (a mean and a variance), the privacy budget must be divided by two, so each query receives epsilon / 2.
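Numerically, the budget split can be sketched as follows (the variable names are illustrative):

```{r}
# Sequential composition: queries on the same data consume additive budgets
eps_total <- 0.1
eps_each <- eps_total / 2         # two queries, so each gets epsilon / 2
gs.mean <- 0.01
scale_each <- gs.mean / eps_each  # Laplace scale per query: 0.01 / 0.05 = 0.2
eps_each + eps_each               # total privacy loss adds back up to 0.1
```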
```{r}
# Each of the two queries gets half of the total budget of 0.1
eps <- 0.1 / 2
# Global sensitivity of the mean and of the variance
gs.mean <- 0.01
gs.var <- 0.01
# Apply the Laplace mechanism
set.seed(42)
rdoublex(1, 0.41, gs.mean / eps)
#[1] 0.4496674
rdoublex(1, 0.19, gs.var / eps)
#[1] 0.2466982
```
### Parallel composition

Suppose a set of privacy mechanisms M is performed on disjoint subsets of a dataset, each providing an epsilon privacy guarantee. Because every record is touched by at most one mechanism, the final privacy guarantee is not the sum of the budgets but the maximum: the query with the largest epsilon sets the budget for the data, and the privacy budget does not need to be divided.
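A quick numeric sketch (illustrative variable names) of the contrast with sequential composition:

```{r}
# Parallel composition: mechanisms on disjoint subsets do not add up
eps_fever <- 0.1              # budget spent on participants with high fevers
eps_no_fever <- 0.1           # budget spent on the remaining participants
max(eps_fever, eps_no_fever)  # overall budget is the maximum, 0.1, not 0.2
```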
```{r}
# Mean of Hours_Sitting for participants with high fevers
fertility %>%
  filter(High_Fevers >= 0) %>%
  summarise_at(vars(Hours_Sitting), mean)
#Hours_Sitting 0.3932967
# Mean of Hours_Sitting for participants without high fevers
fertility %>%
  filter(High_Fevers == -1) %>%
  summarise_at(vars(Hours_Sitting), mean)
#Hours_Sitting 0.5433333
# Set the value of epsilon; the subsets are disjoint, so no division is needed
eps <- 0.1
# Global sensitivity of the mean for Hours_Sitting
gs.mean <- 1 / 100
# Apply the Laplace mechanism to each subset
set.seed(42)
rdoublex(1, 0.39, gs.mean / eps)
#[1] 0.4098337
rdoublex(1, 0.54, gs.mean / eps)
#[1] 0.5683491
```
### Prepping up data
```{r}
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
fertility %>%
  count(Smoking_Habit)
#Smoking_Habit  n
#           -1 56
#            0 23
#            1 21
# Apply the Laplace mechanism, splitting the budget over two counts
set.seed(42)
smoking1 <- rdoublex(1, 56, gs.count / eps / 2) %>%
  round()
smoking2 <- rdoublex(1, 23, gs.count / eps / 2) %>%
  round()
# Post-process: derive the third count from the previous noisy answers
smoking3 <- nrow(fertility) - smoking1 - smoking2
# Check the noisy answers
smoking1
#[1] 60
smoking2
#[1] 29
smoking3
#[1] 11
```
### Impossible and inconsistent answers
```{r}
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
# Count participants with an abnormal diagnosis (Diagnosis == "O")
Number_abnormal <- fertility %>%
  filter(Diagnosis == "O") %>%
  summarise(sum_x1 = sum(Diagnosis == "O"))
# Negative counts: the Laplace mechanism can return an impossible answer
set.seed(22)
rdoublex(1, 12, gs.count / eps) %>%
  round()
#[1] -79
# Truncate the noisy answer at zero
set.seed(22)
rdoublex(1, 12, gs.count / eps) %>%
  round() %>%
  max(0)
#[1] 0
# A different seed can return an answer larger than the number of observations
set.seed(12)
noisy_answer <- rdoublex(1, 12, gs.count / eps) %>%
  round() %>%
  max(0)
n <- nrow(fertility)
# Clamp the noisy answer to at most n
ifelse(noisy_answer > n, n, noisy_answer)
#[1] 100
# Normalising
# Set the value of epsilon
eps <- 0.01
# Global sensitivity of counts
gs.count <- 1
fertility %>%
  count(Smoking_Habit)
#Smoking_Habit  n
#           -1 56
#            0 23
#            1 21
# Apply the Laplace mechanism, truncating negative answers at zero
set.seed(42)
smoking1 <- rdoublex(1, 56, gs.count / eps / 2) %>%
  max(0)
smoking2 <- rdoublex(1, 23, gs.count / eps / 2) %>%
  max(0)
smoking3 <- rdoublex(1, 21, gs.count / eps / 2) %>%
  max(0)
# Check the noisy answers
smoking <- c(smoking1, smoking2, smoking3)
smoking
#[1] 65.91684 37.17455 0.00000
# Normalise so the noisy counts sum to the number of observations
normalized <- (smoking / sum(smoking)) * nrow(fertility)
# Round the values
round(normalized)
#[1] 64 36 0
```