forked from randrescastaneda/joyn
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
180 lines (116 loc) · 7.48 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
output: github_document
editor_options:
markdown:
wrap: 72
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# joyn
<!-- badges: start -->
`r badger::badge_cran_checks("joyn")` `r badger::badge_cran_release("joyn", "orange")` `r badger::badge_devel("randrescastaneda/joyn", "blue")` `r badger::badge_codecov("randrescastaneda/joyn")` `r badger::badge_lifecycle("maturing", "green")`
<!-- badges: end -->
`joyn` empowers you to assess the results of joining data frames, making it easier and more efficient to combine your tables. Similar in philosophy to the `merge` command in `Stata`, `joyn` offers matching key variables and detailed join reports to ensure accurate and insightful results.
## Motivation
Merging tables in R can be tricky. Ensuring accuracy and understanding the joined data fully can be tedious tasks. That's where `joyn` comes in. Inspired by Stata's informative approach to merging, `joyn` makes the process smoother and more insightful.
While standard R merge functions are powerful, they often lack features like assessing join accuracy, detecting potential issues, and providing detailed reports. `joyn` fills this gap by offering:
* **Intuitive join handling:** Whether you're dealing with one-to-one, one-to-many, or many-to-many relationships, `joyn` helps you navigate them confidently.
* **Informative reports:** Get clear insights into the join process with helpful reports that identify duplicate observations, missing values, and potential inconsistencies.
## What makes `joyn` special?
While standard R merge functions offer basic functionality, `joyn` goes above and beyond by providing comprehensive tools and features tailored to your data joining needs:
**1. Flexibility in join types:** Choose your ideal join type ("left", "right", or "inner") with the `keep` argument. Unlike R's default, `joyn` performs a full join by default, ensuring all observations are included, but you have full control to tailor the results.
**2. Seamless variable handling:** No more wrestling with duplicate variable names! `joyn` offers multiple options:
* **Update values:** Use `update_values` or `update_NA` to automatically update conflicting variables in the left table with values from the right table.
* **Keep both (with different names):** Enable `keep_common_vars = TRUE` to retain both variables, each with a unique suffix.
* **Selective inclusion:** Choose specific variables from the right table with `y_vars_to_keep`, ensuring you get only the data you need.
**3. Relationship awareness:** `joyn` recognizes one-to-one, one-to-many, many-to-one, and many-to-many relationships between tables. While it defaults to many-to-many for compatibility, **remember this is often not ideal**. **Always specify the correct relationship using `by` arguments** for accurate and meaningful results.
**4. Join success at a glance:** Get instant feedback on your join with the automatically generated reporting variable. Identify potential issues like unmatched observations or missing values to ensure data integrity and informed decision-making.
By addressing these common pain points and offering enhanced flexibility, `joyn` empowers you to confidently and effectively join your data frames, paving the way for deeper insights and data-driven success.
## Performance and flexibility
### The cost of Reliability
While raw speed is essential, understanding your joins every step of the way is equally crucial. `joyn` prioritizes providing **insightful information** and preventing errors over solely focusing on speed. Unlike other functions, it adds:
* **Meticulous checks:** `joyn` performs comprehensive checks to ensure your join is accurate and avoids potential missteps, like unmatched observations or missing values.
* **Detailed reporting:** Get a clear picture of your join with a dedicated report, highlighting any issues you should be aware of.
* **User-friendly summary:** Quickly grasp the join's outcome with a concise overview presented in a clear table.
These valuable features contribute to a slightly slower performance compared to functions like `data.table::merge.data.table()` or `collapse::join()`. However, the benefits of **preventing errors and gaining invaluable insights** far outweigh the minor speed difference.
### Know your needs, choose your tool
* **Speed is your top priority for massive datasets?** Consider using `data.table` or `collapse` directly.
* **Seek clear understanding and error prevention for your joins?** `joyn` is your trusted guide.
### Protective by design
`joyn` intentionally restricts certain actions and provides clear messages when encountering unexpected data configurations. This might seem **opinionated**, but it's designed to **protect you from accidentally creating inaccurate or misleading joins**. This "safety net" empowers you to confidently merge your data, knowing `joyn` has your back.
### Flexibility
Currently, `joyn` focuses on the most common and valuable join types. Future development might explore expanding its flexibility based on user needs and feedback.
## `joyn` as wrapper: Familiar Syntax, Familiar Power
While `joyn::join()` offers the core functionality and Stata-inspired arguments, you might prefer a syntax more aligned with your existing workflow. `joyn` has you covered!
**Embrace base R and `data.table`:**
* `joyn::merge()`: Leverage familiar base R and `data.table` syntax for seamless integration with your existing code.
**Join with flair using `dplyr`:**
* `joyn::{dplyr verbs}()`: Enjoy the intuitive [verb-based](https://dplyr.tidyverse.org/reference/mutate-joins.html) syntax of `dplyr` for a powerful and expressive way to perform joins.
**Dive deeper:** Explore the corresponding vignettes to unlock the full potential of these alternative interfaces and find the perfect fit for your data manipulation style.
## Installation
You can install the stable version of `joyn` from
[CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("joyn")
```
The development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("randrescastaneda/joyn")
```
## Examples
```{r example}
library(joyn)
library(data.table)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
t = c(1L, 2L, 1L, 2L, NA_integer_),
x = 11:15)
y1 = data.table(id = c(1,2, 4),
y = c(11L, 15L, 16))
x2 = data.table(id = c(1, 4, 2, 3, NA),
t = c(1L, 2L, 1L, 2L, NA_integer_),
x = c(16, 12, NA, NA, 15))
y2 = data.table(id = c(1, 2, 5, 6, 3),
yd = c(1, 2, 5, 6, 3),
y = c(11L, 15L, 20L, 13L, 10L),
x = c(16:20))
# using common variable `id` as key.
joyn(x = x1,
y = y1,
match_type = "m:1")
# keep just those observations that match
joyn(x = x1,
y = y1,
match_type = "m:1",
keep = "inner")
# Bad merge for not specifying by argument
joyn(x = x2,
y = y2,
match_type = "1:1")
# good merge, ignoring variable x from y
joyn(x = x2,
y = y2,
by = "id",
match_type = "1:1")
# update NAs in var x in table x from var x in y
joyn(x = x2,
y = y2,
by = "id",
update_NAs = TRUE)
# update values in var x in table x from var x in y
joyn(x = x2,
y = y2,
by = "id",
update_values = TRUE)
# do not bring any variable from y into x, just the report
joyn(x = x2,
y = y2,
by = "id",
y_vars_to_keep = NULL)
```