-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
293 lines (214 loc) · 10.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# rrclust <img src="man/figures/logo.png" align="right" height="139" />
<!-- badges: start -->
<!-- badges: end -->
The goal of rrclust is to cluster the Swiss Pension Register (CCO/FSIO) using the [`kamila`](https://github.com/ahfoss/kamila) R package.
The anonymous data of the Swiss Pension Register (CCO/FSIO) are typically used to estimate and project (in the short, middle and long term) the revenues and the expenditures of the Old-Age and Survivors’ Insurance (OASI). In this perspective, it is essential to have a clear look at the register's main statistical features. To better understand it and benefit more from its richness, we propose analysing the raw data by an appropriate clustering method.
We face three main difficulties:
i) As not only continuous but also nominal or categorical variables structure the register, we have to choose a clustering method that considers any types of variables;
ii) The a priori number of clusters should be in the first step determined, and thus the question of how to fix it is essential;
iii) The method should run over big data.
Recently, [A. Foss et al. (2016)](https://doi.org/10.1007/s10994-016-5575-7) and [A. H. Foss and Markatou (2018)](https://doi.org/10.18637/jss.v083.i13) proposed the kamila Method (KAy-means for MIxed LArge data), which is specifically designed to manage a clustering process for mixed distributions.
Furthermore, a simple rewriting of the kamila's algorithm permits an easy implementation in a map-reduce framework like Hadoop, thus being run on very large data sets.
On the other hand, [Tibshirani and Walther (2005)](https://www.jstor.org/stable/27594130) advocate the use of the "Prediction Strength" as a measure to find the optimal number of clusters.
We applied the kamila clustering method on the more than 2 000 000 observations of the Swiss Pension Register (CCO/FSIO) data.
The technique allows us to determine the optimal number of clusters.
On this basis, we can analyse the partition of our data. Indeed, each cluster is then analysed, and its principal features are described.
As a result, it becomes possible to recognise the similarities and dissimilarities between the OASI pensioners subgroups according to their socio-demographic characteristics.
These pieces of information are crucial to predicting revenues and expenditures of the OASI.
## Installation
You can install the development version of rrclust from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("asam-group/rrclust")
```
## Code of Conduct
Please note that the rrclust project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
## Related Paper
One needs to have the proper Swiss Pension Register (CCO/FSIO) data in order to run this code. Since
this register is not public, you can find how it is used and learn more about the
results in this [Working Paper](https://folia.unifr.ch/unifr/documents/324081).
## Future work
The next step is to implement some classification methods in the package [rrml](https://github.com/asam-group/rrml) which will be applied
to the kamila-clustered Swiss Pension Register (CCO/FSIO) thanks to the package [rrclust](https://github.com/asam-group/rrclust).
## Flow
The global workflow of the rrclust package is depicted in the figure shown below.
The green ellipses correspond to modules defined as functions accepting only a
certain class of inputs, namely `tibbles`.
These `tibbles` can enter the modules individually or in the form of lists of `tibbles`, namely `tidylists`, containing the `tibbles` in a `tidy` form needed
by the modules.
The red rectangles give the name of the `tibbles` which either are inputs or
outputs of the modules. Therefore, they are the outputs of a transformation of the initial `tibbles`.
The blue rectangles depict the top level outputs, such as `LOG` indicating the run
`log` including the rrclust package version, the `dplyr` library version,
the date and the time of the code execution.
The arrows indicate the direction of the process. If there are two arrows between an ellipse and a rectangle in both directions, this means that an input has been renamed especially for this module and is given back as an output. This is the case of `FULL_CONT_DF` and `FULL_CATEG_DF` which are renamed `tibbles` of resp. `CONT_DF` and `CATEG_DF` and which contains the outcome variables `aadr` and `monthly_pension` opposite to their siblings.
```{r message=FALSE, warning=FALSE, include=FALSE}
library(rrclust)
# directory for the output
path_out <- tempdir()
# generate random demo data
path_random_data <- gen_demo_data(data_size = 100)
# read the demo data
demo_data <- tidylist_read(path_random_data)
# relative path to the params container
path <- file.path(getwd(), "inst", "extdata", "params_kamila_large")
# read PARAM_GLOBAL
tl_PARAM_GLOBAL <- param_tidylist_read(path)
# replace empty variable with the temporary path
tl_PARAM_GLOBAL$PARAM_GLOBAL[["path_data"]] <- stringr::str_remove(
dirname(path_random_data), "/all"
)
PARAM_GLOBAL <- tl_PARAM_GLOBAL$PARAM_GLOBAL |>
dplyr::mutate_all(as.character) |>
tidyr::pivot_longer(
tidyselect::everything(),
names_to = "key",
values_to = "value"
)
# rewrite PARAM_GLOBAL with the demo data path
tidylist_write(tidylist(PARAM_GLOBAL), path = path)
# trace flow
TF0 <- trace_flow({
run_kamila(path = path, path_out = path_out)
})
# remove the parameters files
TF <- TF0 |>
# rename df into df_renamed to avoid the confusion with the function df()
dplyr::mutate(df_renamed = df) |>
dplyr::filter(!grepl("^PARAM", df_renamed)) |>
dplyr::select(-df_renamed)
# five a name to the plot
name_plot <- "rrclust_flow.png"
# define a path for the graph
path_graphs <- file.path("man", "figures")
# create the directory if it does not exist
if (!file.exists(path_graphs)) {
dir.create(path_graphs, recursive = TRUE)
}
# create the plot
plot <- draw_flow(TF)
# render the plot as SVG
tmp0 <- DiagrammeRsvg::export_svg(plot)
# convert to a raw vector
tmp <- charToRaw(tmp0) # flatten
# render svg image into a high quality bitmap
rsvg::rsvg_png(tmp, file.path(path_graphs, name_plot))
# go to the graphs directory
browseURL(file.path(path_graphs, name_plot))
```
<img src="man/figures/rrclust_flow.png" align="center"/>
## Examples
Since the Swiss Pension Register (CCO/FSIO) data are not public, we offer two examples with randomly
generated data which are used to demonstrate the workflow of this package.
### Example step by step with randomly generated data
```{r demo-step-by-step, message=FALSE, warning=FALSE, eval=FALSE}
# enumerate required packages and test if they are installed
required_pkgs <- c(
"data.table", "DiagrammeRsvg", "dplyr", "fs", "magick", "memoise",
"rlang", "rrclust", "rstudioapi", "rsvg", "shiny", "tidyr", "utils"
)
for (pkg in required_pkgs) {
requireNamespace(pkg)
}
# load rrclust
library(rrclust)
# directory for the output
path_out <- tempdir()
# generate random demo data
path_random_data <- gen_demo_data()
# read the demo data
demo_data <- tidylist_read(path_random_data)
# relative path to the params container
path <- file.path(getwd(), "inst", "extdata", "params_kamila_large")
# read PARAM_GLOBAL
tl_PARAM_GLOBAL <- param_tidylist_read(path)
# replace empty variable with the temporary path
tl_PARAM_GLOBAL$PARAM_GLOBAL[["path_data"]] <- stringr::str_remove(
dirname(path_random_data), "/all"
)
PARAM_GLOBAL <- tl_PARAM_GLOBAL$PARAM_GLOBAL |>
dplyr::mutate_all(as.character) |>
tidyr::pivot_longer(
everything(),
names_to = "key",
values_to = "value"
)
# rewrite PARAM_GLOBAL with the demo data path
tidylist_write(tidylist(PARAM_GLOBAL), path = path)
# input
tl_inp_kamila <- mod_inp_kamila(path = path, method_name = "kamila")
# computations
tl_out_kamila <- wrap_kamila(tl_inp_kamila = tl_inp_kamila)
# output
path_out_identifier <- mod_out_kamila(
path = path,
path_out = path_out,
tl_inp_kamila = tl_inp_kamila,
tl_out_kamila = tl_out_kamila
)
# write the csv files
tidylist_write(
c(tl_out_kamila, mod_log()),
path_out_identifier
)
# store parameters (as a reference)
copy_param(path, path_out_identifier)
# access the output
browseURL(path_out_identifier)
```
### Example in one step with randomly generated data
```{r demo-all-in-one, message=FALSE, warning=FALSE, eval=FALSE}
# enumerate required packages and test if they are installed
required_pkgs <- c(
"data.table", "DiagrammeRsvg", "dplyr", "fs", "magick", "memoise",
"rlang", "rrclust", "rstudioapi", "rsvg", "shiny", "tidyr", "utils"
)
for (pkg in required_pkgs) {
requireNamespace(pkg)
}
# load rrclust
library(rrclust)
# directory for the output
path_out <- tempdir()
# generate random demo data
path_random_data <- gen_demo_data()
# read the demo data
demo_data <- tidylist_read(path_random_data)
# relative path to the params container
path <- file.path(getwd(), "inst", "extdata", "params_kamila_large")
# read PARAM_GLOBAL
tl_PARAM_GLOBAL <- param_tidylist_read(path)
# replace empty variable with the temporary path
tl_PARAM_GLOBAL$PARAM_GLOBAL[["path_data"]] <- stringr::str_remove(
dirname(path_random_data), "/all"
)
PARAM_GLOBAL <- tl_PARAM_GLOBAL$PARAM_GLOBAL |>
dplyr::mutate_all(as.character) |>
tidyr::pivot_longer(
everything(),
names_to = "key",
values_to = "value"
)
# rewrite PARAM_GLOBAL with the demo data path
tidylist_write(tidylist(PARAM_GLOBAL), path = path)
# execute the whole workflow
rrclust::run_kamila(path = path, path_out = path_out)
# access the output
browseURL(path_out)
```
## Results
You can see the results in the CSV file named `KM_RES_FINAL.csv`. The best number of
clusters, i.e. the parameter `kstar`, is given by the value of the variable `num_clust`.
The CSV file `KM_RES.csv` shows you all the prediction strengths values for each number of clusters for which you ran the program (see parameter `numberofclusters` in `PARAM_KAMILA.csv`) and for each iteration.