---
title: "learningR"
author: "Alejandro Hagan"
date: "2022-08-27"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Working directory and file manipulation concepts
- Before you start working, you need to set up the space you are working in.
- If you don't have a Git or other version-control repository, that space is your working directory.
- When you save a file or read a data file or other input, relative folder paths generally branch from the working directory.
- `getwd()` tells you the current working directory.
- `setwd()` changes the working directory.

Other popular system functions:
- List the files or folders in a folder path: `list.files(pattern = "regex", recursive = TRUE)`, where `pattern` is a regular expression.
- Check whether a file or folder exists, and take paths apart: `file.exists()`, `dir.exists()`, `basename("c:/filedir/file.txt")` -> `"file.txt"`, `dirname("c:/filedir/file.txt")` -> `"c:/filedir"`.
- Create and open files: `dir.create("./newdir")`, `file.create("filename.txt")`, `shell.exec("filename.txt")` (Windows only).
- Move, copy, rename, choose, and delete files:
  - `filesstrings::move_files("c:/home/filename.txt", "c:/newdir")`
  - `file.copy("filename", "destination")`
  - `file.rename("existing file name", "new file name")`; relative paths are resolved against `getwd()`
  - `file.choose()`
  - `base::unlink("filename", recursive = TRUE)`; `recursive` must be `TRUE` to delete directories
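The base R file helpers listed above can be tried safely in a temporary folder. This is a minimal sketch: the folder name `file_demo` and the file names are made up for illustration, and only base R functions are used.

```r
# Work inside a throwaway folder so nothing in the real project is touched.
sandbox <- file.path(tempdir(), "file_demo")   # hypothetical folder name
dir.create(sandbox, showWarnings = FALSE)

csv_path <- file.path(sandbox, "cars.csv")     # hypothetical file name
write.csv(head(mtcars), csv_path)

exists_before <- file.exists(csv_path)         # check that the file was written
base_name <- basename(csv_path)                # file name without the folder
csv_files <- list.files(sandbox, pattern = "\\.csv$", recursive = TRUE)

# file.rename() takes the old path first and the new path second.
renamed_path <- file.path(sandbox, "cars_renamed.csv")
file.rename(csv_path, renamed_path)

# recursive = TRUE is required to delete a directory.
unlink(sandbox, recursive = TRUE)
exists_after <- dir.exists(sandbox)
```

Working under `tempdir()` keeps the experiment away from the working directory, so `setwd()` never needs to change.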
Link to learn more
# How to connect to GitHub
While all this may seem new, that is mostly because we were never taught good versioning and collaboration practices, so we rely on saving to a LAN drive and amending the file name as a form of version control.

1. Create a new GitHub repository on GitHub.
2. Go to your global settings (not the repository settings), open developer settings, and click personal access tokens.
3. Generate a new token and copy it (consider saving it, as you can only see it once).

- `usethis::git_default_branch()` checks the local default branch.
- `gert::git_remote_ls()` gets information about the remote.

Resources
```{r}
usethis::use_git_remote("origin", url = NULL, overwrite = TRUE)
usethis::use_github()
usethis::gh_token_help()
usethis::use_git_remote(name = "origin", url = "https://github.com/alejandrohagan/learningR.git", overwrite = TRUE)
gitcreds::gitcreds_set(url = "https://github.com/alejandrohagan/learningR.git")
usethis::use_github()
usethis::git_default_branch()
gh::gh_whoami()
usethis::git_remotes()
usethis::pr_push()
usethis::pr_fetch()
usethis::pr_pull()
usethis::pr_init(branch = "main")
usethis::use_git_remote(name = "origin", url = "https://github.com/alejandrohagan/learningR.git", overwrite = TRUE)
```
- To create a personal access token, call `create_github_token()`
- To store a token for current and future use, call `gitcreds::gitcreds_set()`

- 5 steps to change GitHub default branch from master to main | R-bloggers
- Don't Lose your HEAD over Default Branches | R-bloggers
- Git: Moving from Master to Main | R-bloggers

Push, commit, and commenting
# How to manipulate & tidy data
- how to import data
  - `read.csv()`
  - `read_csv()`
  - `fread()`
- how to change column types
- how to create data
  - `seq()`, `runif()`, `sample(x, size, replace)`, `c()`, `data.frame()`
- how to create data with basic loops / repetition
- types of data
  - vector, data.frame, list
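The data-creation functions and the three data types listed above can be sketched together in base R. The variable names below are illustrative only.

```r
set.seed(42)                                   # make the random draws repeatable

ids    <- seq(1, 10, by = 1)                   # regular sequence from 1 to 10
scores <- runif(10, min = 0, max = 100)        # 10 uniform random numbers
groups <- sample(c("a", "b", "c"), size = 10, replace = TRUE)

# A data frame is rectangular: every column has the same length.
df <- data.frame(id = ids, score = scores, group = groups)

# A list can mix types and lengths, including whole data frames.
lst <- list(ids = ids, frame = df, note = "lists can mix types")

# The same kind of vector built with a basic loop, pre-allocating the output.
squares <- vector("double", length(ids))
for (i in seq_along(ids)) {
  squares[i] <- ids[i]^2
}
```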
- some base R basics that will be helpful to you as you read the forums
- how to import files
  - by folder
  - by website
  - Excel spreadsheet
  - Power BI model
- how to find files by their type
- how to append files together
- how to automate file importation
- file column names
- how to change column names
  - statically
  - dynamically
- best practices when naming columns
- how to change column types
  - statically
  - dynamically
- how to check data structure
  - `unique()`
- how to clean data
- how to shape data
- how to subset data
  - filter
    - dynamic
    - static
  - select
    - dynamic
    - static
- group by
- summarize
- mutate
- fill / blanks / replace values
- recode
- pivot / unpivot
- merge
- iterate
  - apply a single function (single input or multiple inputs) to every column, some columns, or columns that meet a criterion, and collect its outputs in a data frame (to be accessed later)
  - apply a single function with multiple inputs
  - apply functions to nested data frames
- exploratory analysis
  - start with variance analysis
- how to extract information out of a table
`for (i in range) { do something }`

Basic framework:
1. Define the vector that you want as an output, and its size, with `vector("type", vector_size)`.
2. Assign names to that vector with `names()` so that the data has headings.
3. Create the loop with `for (i in range)`.
4. Define the task the loop will perform on each element of the dataset.
5. Assign the result to the output vector.

Tips & tricks:
- Check the task against a single element first to ensure it works.
- You can extract other elements (typically row or column names) with other techniques, such as `names()`, and assign them to the output; this doesn't need to happen inside the loop.
- Use `seq_along()`, `ncol()`, and `[[ ]]` to define the loop range and extract elements.
```{r loop_example, message=FALSE, warning=FALSE}
output <- vector("double", ncol(mtcars)) # pre-allocate the output vector
cols_names <- names(mtcars)              # store the column names of the data frame
names(output) <- cols_names              # name the output elements
for (i in cols_names) {                  # seq_along() is sometimes useful here as well
  # the double brackets extract exactly one column; if each step returns
  # more than one item, make the output a list instead
  output[i] <- mean(mtcars[[i]])
}
output
```
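The same loop can be written over positions with `seq_along()`. The second loop shows the list variant mentioned in the tips for when each step returns more than one value. This is a sketch in base R; the name `ranges` is illustrative.

```r
# Column means, looping over positions instead of names.
output <- vector("double", ncol(mtcars))
names(output) <- names(mtcars)

for (i in seq_along(mtcars)) {        # seq_along() gives 1, 2, ..., ncol(mtcars)
  output[[i]] <- mean(mtcars[[i]])
}

# When each step returns more than one value, collect the results in a list.
ranges <- vector("list", ncol(mtcars))
names(ranges) <- names(mtcars)

for (i in seq_along(mtcars)) {
  ranges[[i]] <- range(mtcars[[i]])   # two values per column: min and max
}
```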
### Categorical Variables
In general you will need to distinguish whether your character values are plain character strings or categorical values with levels (factors).
This becomes critical as you look to create categories and relationships in your data.
forcats is the go-to package, in particular to:
1) Rename
- `recode()` to change values in a column. In forcats this is `fct_recode(col, new_value = "old_value")`; `dplyr::recode()` takes the opposite order, `recode(col, old_value = "new_value")`.
2) Reorder
- `fct_relevel(col, level1, level2, etc)`
- `fct_reorder(col, col_to_reorder_by, function)`
3) Group variables into another group
- `case_when()`
  - typically used in combination with `mutate()`
  - can reference multiple conditions
  - `case_when(col1 == var1 ~ val1, col1 == var2 & col3 == var3 ~ val2, is.na(col1) ~ "missingvalue", TRUE ~ "defaultvalue")`
- `cut()`
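The three steps above (rename, reorder, group) can be sketched on `mtcars`. This assumes the dplyr and forcats packages are installed; the new column names and the weight cut-offs are made up for illustration.

```r
library(dplyr)
library(forcats)

cars_tidy <- mtcars %>%
  mutate(
    cyl_f = factor(cyl),
    # rename: new level name on the left, existing level on the right
    cyl_f = fct_recode(cyl_f, four = "4", six = "6", eight = "8"),
    # reorder: move one level to the front
    cyl_f = fct_relevel(cyl_f, "eight"),
    # reorder: sort gear levels by the median mpg within each level
    gear_f = fct_reorder(factor(gear), mpg, .fun = median),
    # group: bucket a numeric column with several conditions
    size = case_when(
      is.na(wt) ~ "missing",
      wt < 2.5 ~ "light",
      wt < 3.5 & hp < 200 ~ "medium",
      TRUE ~ "heavy"
    )
  )

levels(cars_tidy$cyl_f)
```

`case_when()` checks the conditions in order, so the `is.na()` test is placed first and `TRUE` acts as the default.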
## purrr
- `{ }` around the right-hand side of a `%>%` pipe stops the data frame from being passed in as the first argument.
- `.` is a placeholder for the data frame.

```{r}
list.len = 3
str(mpg, list.len = 3)
str(mpg)
listviewer::jsonedit(mpg)
```
# patchwork
Plots can be combined with a simple convention using `+`, `/`, and `|`. For example, `(p1 | p2) / p3` puts two charts on top and one chart beneath, where `p1`, `p2`, and `p3` stand for existing plots.

The layout can also be supplemented with additional functions.

`plot_layout()` can also arrange by rows, for example `plot_layout(nrow = 3, byrow = FALSE)`. Arguments:
- `widths =` changes the relative widths of the columns; given a numeric such as `c(2, 1)`, graphs in the first column are twice as wide as those in the second column
- `heights =` changes the relative row heights
- `ncol =` numeric, changes the number of columns
- `guides = "collect"` removes duplicate guides
- `theme(legend.position = 'bottom')` moves the legend position
- `guide_area()` creates an area that the guides collected with `guides = "collect"` move into

https://patchwork.data-imaginist.com/reference/plot_layout.html

`plot_annotation()` adds annotation:
- `title = 'The surprising story about mtcars'`
- `tag_levels = 'I'`, `"A"`, or `"1"` sets the tag on each plot
- `caption = "Text"`
- `theme = theme(plot.title = element_text(size = 16))`

To add a text tile next to a plot use `grid::textGrob('Some really important text')`, or to add a table use `gridExtra::tableGrob(mtcars[1:10, c('mpg', 'disp')])`.

`plot_spacer()` inserts an empty plot.

`inset_element()` inserts a sub-graph on top of another one, positioned with `left = 0.6, bottom = 0.6, right = 1, top = 1` and `align_to = 'full'`.

Helpful tips:

When creating a patchwork, the resulting object remains a ggplot object referencing the last added plot. This means that you can continue to add objects such as geoms, scales, etc. to it as you would a normal ggplot, for example `+ geom_jitter(aes(gear, disp))`.

Often, especially when it comes to theming, you want to modify everything at once. patchwork provides two additional operators that facilitate this. `&` will add the element to all subplots in the patchwork, and `*` will add the element to all the subplots in the current nesting level. As with `|` and `/`, be aware that operator precedence must be kept in mind.
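The operators and helper functions above can be combined as in the following sketch. It assumes ggplot2 and patchwork are installed; the three plots and their names are illustrative.

```r
library(ggplot2)
library(patchwork)

# Three small plots of mtcars; p1, p2 and p3 are illustrative names.
p1 <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(gear), disp)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(hp, wt, colour = factor(cyl))) + geom_point()

# Two charts on top, one beneath; the top row is twice as tall as the bottom.
combined <- (p1 | p2) / p3 +
  plot_layout(heights = c(2, 1), guides = "collect") +
  plot_annotation(
    title = "The surprising story about mtcars",
    tag_levels = "A"
  ) &
  theme(legend.position = "bottom")   # & applies the theme to every subplot

combined
```

The parentheses matter: `/` binds more tightly than `|`, so without them `p2 / p3` would be stacked first.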
`str_replace_all()`: the regular expression `\\s+` matches any run of whitespace characters
# How to write tables
Font and alignment:
1. Numerical data is right-aligned.
2. Textual data is left-aligned.
3. Headers are aligned with their data.

3½. Don't use center alignment.

# visual guide
## axis title
- axis title always all caps
- align top y axis or left axis
- color to match axis color

## graph title
- left alignment
# across(), if_any(), if_all()
summarize / mutate / pivot_longer / pivot_wider

Selection helpers for `across()`:
- character based
  - `starts_with()`
  - `ends_with()`
  - `contains()`
  - `matches()`
  - `num_range()`
  - `last_col()`
- `where()`, with a function that returns a boolean condition, e.g. `is.numeric`

Arguments:
- the first argument selects columns by name, position, or type (type requires the `where()` wrapper): `c(column names)`, a position, or `where()`
- the second argument is a function, or a named list of functions such as `list(function1 = function1, function2 = function2)`
- `{}` references previously declared variables in the glue package, or in functions that rely on glue
  - some attributes have special references, such as `{.col}` for the column name and `{.fn}` for the function name
  - these are used in the `.names` argument of `across()`

`across()` doesn't work with `select()` or `rename()`; it works with `mutate()`, `group_by()`, `count()`, `distinct()`, and `summarize()`.

`filter()` is excluded: use `if_any()` and `if_all()` instead, with the exception of the older form `filter(across(everything(), ~ fn(.x)))`, which recent dplyr versions deprecate in favour of `if_all()`.
Examples for filter
* `if_any()` keeps the rows where the predicate is true for *at least one* selected
column:
```{r}
starwars %>%
filter(if_any(everything(), ~ !is.na(.x)))
```
* `if_all()` keeps the rows where the predicate is true for *all* selected columns:
```{r}
starwars %>%
filter(if_all(everything(), ~ !is.na(.x)))
```
* Find all rows where no variable has missing values:
```{r}
starwars %>% filter(across(everything(), ~ !is.na(.x)))
```
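A minimal sketch of `across()` with a named list of functions and the `.names` argument, assuming dplyr is installed. The grouping column and the chosen functions are illustrative.

```r
library(dplyr)

summary_tbl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(
      c(mpg, hp),                    # columns selected by name
      list(mean = mean, sd = sd),    # named list of functions
      .names = "{.fn}_{.col}"        # glue spec: function name, then column name
    ),
    .groups = "drop"
  )

names(summary_tbl)
```

The output has one column per combination of selected column and function, named from the glue specification.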
Need to investigate `rename_with()` and its impact on `select()`, as it appears to be superseded.
dplyr/colwise.Rmd at main · tidyverse/dplyr · GitHub
# Glamour of graphics
- alignment
  - title top-left aligned to the left edge of the chart (`plot.title.position = "plot"`)
- `add_count(dim, name = "text") %>% mutate(colname = glue::glue("{col}{text}"))`
- rotate labels, either by swapping the axes or removing an axis altogether
- remove borders
- remove gridlines
- left / right align text to create clean borders
- indicate the legend in the title
# Graphing tips and tricks
If you want to plot a subset of the data but show a trend against the full data, use the `data` argument in each individual geom rather than defining the data globally.
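A minimal sketch of that tip, assuming ggplot2 is installed. The trend line uses the full `mtcars` data while the points use only a subset; the choice of subset is illustrative.

```r
library(ggplot2)

# Subset to highlight; the filter condition is an arbitrary example.
highlight <- subset(mtcars, cyl == 4)

p <- ggplot(mtcars, aes(wt, mpg)) +
  # trend fitted on the full data inherited from ggplot()
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "grey50") +
  # points drawn from the subset only, via the geom's own data argument
  geom_point(data = highlight, colour = "firebrick")

p
```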
R-Ladies Freiburg (English) - Level up your ggplot: Adding labels, arrows and other annotations - YouTube
`geom_curve()`
- `aes(x, y, xend, yend)`
- `arrow = arrow(length = unit(x, "inch"))`
- `size`
- `colour`
- `curvature` (0 is a straight line, positive is a right-hand curve, negative is a left-hand curve)

The ggforce package has advanced annotation options:
- `geom_mark_circle()`
- `geom_mark_rect()`
- `geom_mark_hull()`
- `geom_mark_ellipse()`

with these settings:
- `aes(label, filter, description)`
- `expand`
- `label.lineheight`
- `label.fontsize`
- `show.legend`

ggfx package
- `with_blur()` can blur geoms (you then need to separately map the geoms that you do want to show sharply)
- wrap the `geom_jitter()` command in `with_blur()`: `with_blur(geom_jitter(), sigma = unit(<number>, "mm"))`, where `sigma` controls the strength of the blur

ggforce package
- `facet_zoom()` takes the full dataset and adds a zoomed-in panel
- `facet_zoom(<axis> = <column> <condition>)`
- `facet_zoom(x = country == "spain")`
- `facet_zoom(y = length < 20)`
# Tidy models
1. Data load
2. rsample
   - assign to training df: split, based on proportions and strata
   - assign to training df: assign training
   - assign cross-validation sets (if necessary)
   - assign to testing df: assign testing
   - `initial_split()`
   - `training()`
   - `testing()`
3. recipes
   - prepare the data with preprocessing
   - assign variables
   - other transformation steps
   - `recipe()`
   - `update_role()`
4. parsnip
   - specify and fit the model
   - model (`decision_tree()`)
   - `set_engine()`
   - `set_mode()`
5. workflow
   - `workflow()`
   - `add_recipe()`
   - `add_model()`
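The outline above can be sketched end to end on `mtcars`. This assumes the tidymodels meta-package and the rpart engine are installed. The 75% split, the normalization step, and the regression mode are illustrative choices, not part of the original notes.

```r
library(tidymodels)

set.seed(123)

# rsample: split the data, then pull out the training and testing sets.
car_split <- initial_split(mtcars, prop = 0.75)
train_df  <- training(car_split)
test_df   <- testing(car_split)

# recipes: declare the outcome and predictors, then add preprocessing steps.
car_recipe <- recipe(mpg ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors())

# parsnip: specify the model, its engine, and its mode.
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# workflows: bundle the recipe and the model, then fit and predict.
car_workflow <- workflow() %>%
  add_recipe(car_recipe) %>%
  add_model(tree_spec)

car_fit <- fit(car_workflow, data = train_df)
preds   <- predict(car_fit, new_data = test_df)
```

Passing `strata = <column>` to `initial_split()` would keep the distribution of that column similar in both sets, as the outline suggests.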
# Tidy evaluation
Programming with dplyr • dplyr (tidyverse.org)
Argument type: tidy-select — dplyr_tidy_select • dplyr (tidyverse.org)
Tidy evaluation is not all-or-nothing, it encompasses a wide range of features and techniques. Here are a few techniques that are easy to pick up in your workflow:
- Passing expressions through `{{` and `...`.
- Passing column names to `.data[[` and `one_of()`.
All these techniques make it possible to reuse existing comp
When creating formulas, how do you reference names?
- Start with fixed names (only if you are sure they won't change) and try wrapping that in a test to ensure the columns exist.
- Double curly braces `{{ }}`
  - When you want to reference a data variable through an environment variable (a function argument), pass the data frame to the function and wrap the argument as `{{ var }}` so it is looked up in the data frame (instead of writing `data$x`).
  - `where()` for search parameters
- `.data[[var]]`
  - If your environment variable is a character string, you must use `.data[[var]]`.
  - `all_of()` or `any_of()` for character vectors as search parameters
```r
compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }
  data %>% transmute(bmi = mass / height^2)
}
```
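The two dynamic referencing styles described above, `{{ }}` for bare column names and `.data[[ ]]` for character strings, can be sketched as follows. This assumes dplyr is installed; the function names `mean_by()` and `mean_by_name()` are hypothetical helpers, not part of any package.

```r
library(dplyr)

# Bare column names: embrace the arguments with {{ }}.
# The glue-style string on the left of := builds the output name from the argument.
mean_by <- function(data, group_var, var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise("mean_{{ var }}" := mean({{ var }}, na.rm = TRUE), .groups = "drop")
}

# Column names stored as character strings: subset the .data pronoun.
mean_by_name <- function(data, group_name, var_name) {
  data %>%
    group_by(.data[[group_name]]) %>%
    summarise(mean = mean(.data[[var_name]], na.rm = TRUE), .groups = "drop")
}

res_bare   <- mean_by(mtcars, cyl, mpg)
res_string <- mean_by_name(mtcars, "cyl", "mpg")
```

Both calls compute the same group means; only the way the caller names the columns differs.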
How to use in output names:
`"mean_{{ var }}" := mean({{ var }})`
Open questions
- When do you use `data` and when do you use `.data`? (Okay, the answer apparently: when a function takes `...`, you start the other argument names with "." (e.g. `.data`) to avoid conflicts; see 20 Dot prefix | Tidyverse design guide.)

If you want the user to provide a set of data-variables that are then transformed, use `across()`:
```r
my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}
starwars %>%
  group_by(species) %>%
  my_summarise(c(mass, height))
#> # A tibble: 38 × 3
#>   species   mass height
#>   <chr>    <dbl>  <dbl>
#> 1 Aleena      15     79
#> 2 Besalisk   102    198
#> 3 Cerean      82    198
#> 4 Chagrian   NaN    196
#> # … with 34 more rows
```

You can use this same idea for multiple sets of input data-variables:

```r
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean))
}
```

Use the `.names` argument to `across()` to control the names of the output.

```r
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}
```
Action verbs to know how to use
Argument type: tidy-select — dplyr_tidy_select • dplyr (tidyverse.org)
- `everything()`: Matches all variables.
- `last_col()`: Select last variable, possibly with an offset.

These helpers select variables by matching patterns in their names:
- `starts_with()`: Starts with a prefix.
- `ends_with()`: Ends with a suffix.
- `contains()`: Contains a literal string.
- `matches()`: Matches a regular expression.
- `num_range()`: Matches a numerical range like x01, x02, x03.

These helpers select variables from a character vector:
- `all_of()`: Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
- `any_of()`: Same as `all_of()`, except that no error is thrown for names that don't exist.

This helper selects variables with a function:
- `where()`: Applies a function to all variables and selects those for which the function returns `TRUE`.

`arrange()`, `count()`, `filter()`, `group_by()`, `mutate()`, and `summarise()` use data masking so that you can use data variables as if they were variables in the environment (i.e. you write `my_variable` not `df$my_variable`).

`across()`, `relocate()`, `rename()`, `select()`, and `pull()` use tidy selection, so you can choose variables based on their position, name, or type.

- `rowwise()`
- `colwise()`
# Tricks
`mean()` of a logical condition inside `summarize()` gives the proportion of rows in each group for which the condition is true.
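A minimal sketch of that trick, assuming dplyr is installed. The column `am` in `mtcars` is 1 for manual cars, so the mean of `am == 1` is the share of manual cars per group; the output column name is illustrative.

```r
library(dplyr)

share_tbl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    share_manual = mean(am == 1),   # TRUE counts as 1, FALSE as 0
    .groups = "drop"
  )

share_tbl
```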
# purrr
Resources
- 9 Basic map functions | Functional Programming (stanford.edu)
- Map and Nested Lists | R-bloggers

Pattern
1. Take one element: `.x <- list[[1]]`.
2. Write the formula based on that element.

Notes
- `set_names()` without an argument sets the names equal to the values.
- `map()` returns a list; control the output type with the map variants, e.g. `map_df()`, `map_dbl()`.
- If the function has more than one argument, define the function up front in the global environment and pass the second argument `y` explicitly in `map()` (follow up: how to do this in an anonymous way):
  - `map(1:5, custom_function, y = 2)`
- `pmap()` for more than one vector
- You can also pass functions through as objects, not just data:
  - `funs <- list(mean, median, sd)`
  - `map(funs, ~ map_dbl(mtcars, .x))`
  - start on the inside and then work your way to the outside
- `walk()` is similar to `map()` but is designed for functions that you run solely for their side effects.
  - So `walk()` always returns the original vector, e.g. `walk(.x, .f)` => `.x`, whereas `map(.x, .f)` => `.f(.x)`.
  - So why use `walk()`? When you want the function's side effect (e.g. saving a picture).
- `accumulate()`
  - applies the same function again and again
  - applies the function to the first two elements, then takes that result and applies the function to it and the next element
  - e.g. `accumulate(letters, paste)` produces a pyramid of values of all the letters
  - so `accumulate()` shows all the interim values
  - `reduce()` only shows the final value
  - this is recursive
- However, if you want pairwise actions (1*1, 2*2, etc.), then you need `map2()`.
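The notes above can be checked with a few small calls, assuming purrr is installed. `add_y()` is a hypothetical helper defined only for this sketch.

```r
library(purrr)

add_y <- function(x, y) x + y                     # hypothetical two-argument helper

# Extra arguments can be passed by name after the function ...
via_arg <- map_dbl(1:5, add_y, y = 2)
# ... or the call can be written anonymously with a formula.
via_lambda <- map_dbl(1:5, ~ add_y(.x, y = 2))

running  <- accumulate(1:5, `+`)                  # keeps every interim value
total    <- reduce(1:5, `+`)                      # keeps only the final value
pyramid  <- accumulate(letters[1:3], paste)       # "a", "a b", "a b c"
pairwise <- map2_dbl(1:3, 1:3, ~ .x * .y)         # 1*1, 2*2, 3*3

# walk() runs the function for its side effect and returns its input unchanged.
walked <- walk(1:3, ~ invisible(.x * 2))
```

The formula version answers the follow-up in the notes: `~ add_y(.x, y = 2)` is the anonymous equivalent of passing `y = 2` after the function.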
# tidytext
`unnest_tokens()` takes a string and breaks it into characters, words, or other n-grams.
From there, use the typical dplyr verbs to graph; a popular geom is `geom_text()`, to plot the words against their proportion.
Typical tokenization methodologies use ICU (International Components for Unicode), which defines word boundaries.
Chapter 2 Tokenization | Supervised Machine Learning for Text Analysis in R (smltar.com)

Packages
- tidytext
- tokenizers
- stopwords
- SnowballC for stemming
- hunspell, also for stemming / spell check

Types of token
- characters
- words
- sentences
- lines
- paragraphs
- n-grams

You can use the tidytext package or the tokenizers package, but here is your pattern:
1. Grab the text.
2. Add a dimension factor (e.g. chapter, author, book).
3. Nest the data by the dimension.
4. If the data is nested, use the `mutate(map())` pattern to perform transformations.
5. If tokenizing sentences or paragraphs, you may need to `paste()` the text and add paragraph breaks `"\n"` (paragraphs) or space breaks `" "` (sentences).
6. Then use either tidytext (returns a tibble) or tokenizers (returns lists) to do the unnesting work.
7. Unnest the data.
8. `anti_join(stop_words)`
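A compact version of the pattern above, assuming dplyr and tidytext are installed. The three sentences and the `chapter` dimension are made-up sample data, and the nesting steps are skipped because the data is small.

```r
library(dplyr)
library(tidytext)

# Made-up text with a dimension column (chapter).
text_df <- tibble(
  chapter = c(1, 1, 2),
  text = c(
    "The tree grew tall.",
    "Trees need water and light.",
    "The forest is full of trees."
  )
)

word_counts <- text_df %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(chapter, word, sort = TRUE)

word_counts
```

`unnest_tokens()` lowercases and strips punctuation by default, which is why the join against `stop_words` works without extra cleaning.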
Regex considerations
- `[:alpha:]` also matches letters outside the English alphabet, whereas `[a-zA-Z]` only matches unaccented English letters
- `?` makes the preceding item optional (it will match or not)
- `^` starts with
- `$` ends with
- `|` or
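The regex pieces above can be tried with base R's `grepl()`. The word list is made up for illustration. Note that `[:alpha:]` has to sit inside a bracket expression, so it is written `[[:alpha:]]`.

```r
words <- c("color", "colour", "colours", "cat", "dog")

optional_u <- grepl("^colou?r$", words)     # ? makes the "u" optional
starts_col <- grepl("^col", words)          # ^ anchors the match at the start
ends_s     <- grepl("s$", words)            # $ anchors the match at the end
cat_or_dog <- grepl("^(cat|dog)$", words)   # | means "or"

# Letters only: the digit in "abc1" makes the second test fail.
letters_only <- grepl("^[[:alpha:]]+$", c("abc", "abc1"))
```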
Stop words
- the stopwords package, with the snowball, ISO, and other sources

Stemming
- can stem words such as "tree" and "tree's" into a single word
- however, it also has the effect of creating new words to stem by
- the SnowballC package offers `wordStem()`
- `tokenizers::tokenize_word_stems()`
- `hunspell::hunspell_stem()`

More resources
- Chapter 2 Tokenization | Supervised Machine Learning for Text Analysis in R (smltar.com)
- tidytext
# Tidy evaluation
Resources
- 2 Why and how | Tidy evaluation (tidyverse.org)
- Programming with dplyr • dplyr (tidyverse.org)
- Implementing tidyselect interfaces • tidyselect (r-lib.org)
- Technical description of tidyselect • tidyselect (r-lib.org)
- 13 Tidy evaluation basics | Functional Programming (stanford.edu)

Principles

Data masking is when you delay the evaluation of code so that the code can find the relevant columns for computation.
This is why you can do:
```{r}
# this will work
starwars %>% filter(
  height < 200,
  gender == "male"
)
# but really the program needs this: a reference to the data frame and column
starwars[starwars$height < 200 & starwars$gender == "male", ]
```
- The technical term for delaying code is quoting.
- This delayed evaluation helps you when using code, but makes things more complicated when writing code.
- Vectorization can occur when each input either has length 1 or the same length as the input object, to ensure all columns have the same length.
- Some functions repeat (recycle) values if recycling completes the length.
- However, others, like the tidyverse family, don't (they only recycle length-1 values).
- `!!` takes a variable defined outside of a function and lets you use its value within the function, e.g. `x <- 1` and then `!!x + 1`; `qq_show()` allows you to see what is happening.
- `!!` works for passing in a variable's value, not a column name on its own.
- `!!` is similar to `:=`, but `:=` is only for the left-hand side, e.g. setting a name.
- When working with lists you need `!!!` to pull out each element and pass it through; otherwise `!!` just passes the list through as is. You also need to use `enquos()` instead of `enquo()`.
- `sym()` is how you quote column names given as character strings and `!!` is how you unquote them; `sym()` only works for character strings, so `"mpg"` rather than `mpg`.
- `enquo()` is used for non-character (bare) arguments and then `!!` is how you unquote them.
- `rlang::qq_show()` helps show how `!!` is being evaluated.

Patterns
- `enquo()` and `!!`
- `:=` and `!!`
- `enquos()` and `!!!`
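The three patterns can be sketched side by side, assuming dplyr and rlang are installed. The function names below are hypothetical helpers written only for this illustration.

```r
library(dplyr)
library(rlang)

# enquo() + !!, with := to set the output name on the left-hand side.
mean_of <- function(data, var) {
  var <- enquo(var)                                   # quote the bare column name
  data %>%
    summarise(!!paste0("mean_", as_label(var)) := mean(!!var, na.rm = TRUE))
}

# sym() + !! when the column name arrives as a character string.
mean_of_name <- function(data, var_name) {
  var <- sym(var_name)                                # turn "mpg" into the symbol mpg
  data %>% summarise(mean = mean(!!var, na.rm = TRUE))
}

# enquos() + !!! for several columns passed through ...
count_by <- function(data, ...) {
  vars <- enquos(...)                                 # a list of quoted columns
  data %>%
    group_by(!!!vars) %>%                             # splice each element in
    summarise(n = n(), .groups = "drop")
}

res_bare   <- mean_of(mtcars, mpg)
res_string <- mean_of_name(mtcars, "mpg")
res_counts <- count_by(mtcars, cyl, am)
```

In current dplyr code the embrace operator `{{ }}` usually replaces the `enquo()` and `!!` pair, but the explicit form shows what happens underneath.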