-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathREADME.Rmd
143 lines (110 loc) · 6.54 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
output: github_document
---
[![Build Status](https://travis-ci.org/mitre/sparklyr.nested.svg?branch=master)](https://travis-ci.org/mitre/sparklyr.nested) [![CRAN Status Badge](http://www.r-pkg.org/badges/version/sparklyr.nested)](https://cran.r-project.org/package=sparklyr.nested) ![downloads](http://cranlogs.r-pkg.org/badges/grand-total/sparklyr.nested)
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
A package to extend the capabilities available in the `sparklyr` package with support for working with nested data.
## Installation & Documentation
To install:
```{r, eval=FALSE}
install.packages("sparklyr.nested")
```
Or to get the development version:
```{r, eval=FALSE}
devtools::install_github("mitre/sparklyr.nested")
```
Note that per the `sparklyr` installation instructions, you will need to install Spark if you have not already done so or are not using a cluster where it is already installed.
Full documentation is available here: https://mitre.github.io/sparklyr.nested/
## Nested Operations
The `sparklyr` package makes working with Spark in R easy.
The goal of this package is to extend `sparklyr` so that working with nested data is easy.
The flagship functions are `sdf_select`, `sdf_explode`, `sdf_unnest`, and `sfd_schema_viewer`.
### Schema Viewer
Suppose I have data about aircraft phase of flight (e.g., climb, cruise, descent).
The data is somewhat complex, storing radar data points marked as the start and end points of a given phase.
Furthermore, the data is structured such that for a given flight, there are several phases (disjoint in time) in a nested array.
This is a data set that is not very natural for more R use cases (though the `tidyr` package helps close this gap) but is fairly typical for Hadoop storage (e.g., using Avro or Parquet).
The schema viewer (coupled with a json schema getter `sdf_schema_json`) makes understanding the structure of the data simple.
Suppose that `spark_data` is a Spark data frame.
The structure may be understood by expanding/collapsing the schema via
```{r, eval=FALSE}
spark_data %>%
sdf_schema_viewer()
```
![schema viewer](./README-images/schema_viewer.png)
It is also handy to use the schema viewer to quickly inspect what a pipeline of operations is going to return.
This enables you to anticipate the structure of the output you are going to get (and make sure the operations are valid with respect to schema modification) without doing any actual computation on your data.
For example:
```{r, eval=FALSE}
spark_data %>%
sdf_unnest(phases) %>%
select(aircraft_id, start_point) %>%
sdf_schema_viewer()
```
### Nested Select
The `sdf_select` function makes it possible to select nested elements.
For example, given the schema displayed above, we may be interested in only the phase start point; and within that, only the time and altitude fields.
Grabbing these elements is not possible with a simple `select`, but can be done via:
```{r, eval=FALSE}
spark_data %>%
sdf_select(aircraft_id, time=phases.start_point.time, altitude=phases.start_point.altitude)
```
In java dots are not valid characters in a field name so the dot-operator is handled.
However, since the dollar sign is typically used for this purpose in R, that is supported as well.
The following will trigger the same operation as the above:
```{r, eval=FALSE}
spark_data %>%
sdf_select(aircraft_id, time=phases$start_point$time, altitude=phases$start_point$altitude)
```
What if you want to keep all top level fields, except phases, from which you still want to start point time and altitude?
This can be accomplished via:
```{r, eval=FALSE}
spark_data %>%
sdf_select(aircraft_id, phase_sequence, primary_key,
time=phases$start_point$time, altitude=phases$start_point$altitude)
```
In this particular example things are not so bad.
But imagine a case where you have a wide data set.
Listing out each top level field would quickly become cumbersome.
To ease this pain, the `dplyr` selection helpers are supported, so the above operation may be accomplished using:
```{r, eval=FALSE}
spark_data %>%
sdf_select(everything(), time=phases$start_point$time, altitude=phases$start_point$altitude)
```
### Explode
The example above would work, but would return something of a mess.
The `time` and `altitude` fields would contain vectors of time and altitude values (respectively) for each row-wise element.
Thus for a single `aircraft_id` you would have N values in the corresponding `time` and `altitude` fields.
The explode functionality will flatten the data in the sense that it will replicate the top level record once for each element in a nested array along which you are exploding.
It will *not* change the schema beyond transforming arrays to structures.
The above example could be repeated like so to get something more typical for R where there is a single value per field per "row" (record).
In this case the `aircraft_id` values would be replicated N times instead of having vectors in the `time`/`altitude` fields.
```{r, eval=FALSE}
spark_data %>%
sdf_explode(phases) %>%
sdf_select(aircraft_id, time=phases$start_point$time, altitude=phases$start_point$altitude)
```
### Unnest
Unnesting here works essentially the same way as in the `tidyr` package.
It is necessary here because `tidyr` functions cannot be called directly on spark data frames.
An unnest operation is a combination of an explode with a corresponding nested select to promote the (exploded) nested fields up one level of the data schema.
Thus the following operations are equivalent:
```{r, eval=FALSE}
spark_data %>%
sdf_unnest(phases)
```
and
```{r, eval=FALSE}
spark_data %>%
sdf_explode(phases) %>%
sdf_select(aircraft_id, terrain_fusion_key, vertical_taxonomy_key, sequence,
start_point=phases.start_point, start_point=phases.start_point, phase=phases.phase,
takeoff_mode=phases.takeoff_model, landing_model=phases.landing_model,
phases_primary_key=phases.primary_key, primary_key)
```
There are a few things to notice about how `sdf_unnest` does things:
- It will only dig one level deep into the schema. If there are fields nested 2 levels deep, they will still be nested (albeit only down 1 level) after `sdf_unnest` is executed. Therefore you are *not* guaranteed totally flat data after calling `sdf_unnest`.
- It will promote *every* one-level-deep field, even if you only care about a few of them.
- In the event of a name conflict (e.g., there is a top level `primary_key` and a `primary_key` nested inside the `phases` field) then all of the nested fields will be disambiguated using the name of the field in which it was nested.