---
title: "Untitled"
author: "Gilad Ravid"
date: "11/14/2021"
output:
  html_document: default
  pdf_document: default
---
ggplot practice
These exercises use the datasets supplied with the ggplot2 package. You can list all of them with the command data(package = "ggplot2")
```{r setup}
library(ggplot2)
library(dplyr)
```
1. How would you describe the relationship between cty and hwy (mpg data)? Is there a problem with drawing conclusions from this plot?
To understand the relationship, we need to make a plot:
```{r e.2.3.1.1_cty_hwy_plot}
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
```
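There appears to be a positive, roughly linear relationship between cty and hwy. The catch is that both variables are recorded as whole numbers, so many cars share the same position and their points are drawn on top of each other; the plot hides how many observations sit at each location.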
2. What data does the plot
ggplot(mpg, aes(model, manufacturer)) + geom_point()
show? Is it useful? How could you modify the data and/or the plot to make it more informative?
```{r e.2.3.1.2_manu_model_plot}
ggplot(mpg, aes(model, manufacturer)) +
geom_point()
```
This plot has two problems. First, the model names on the x-axis are too long, so the labels
overlap and cannot be read in full. Second, the plot says little about the relationship
between model and manufacturer, because each point only records that a combination exists, and
a manufacturer can have several models (for example, toyota makes both the camry and the corolla).
A better approach is to count the cars for each manufacturer and model combination:
```{r e.2.3.1.2_manu_model}
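# Combine manufacturer and model into a single label
df <- mpg %>%
  mutate(manuModel = paste(manufacturer, model, sep = " "))
# Inspect the new column next to the two it was built from
df %>%
  select(manufacturer, model, manuModel)
# One bar per combination; flip the axes so the long labels stay readable
ggplot(df, aes(x = manuModel)) +
  geom_bar() +
  coord_flip()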
```
3. Describe the data, aesthetics, mappings, and layers used in each of the following plots.
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
```{r e.2.3.1.3_summry_plot}
summary(ggplot(mpg, aes(cty, hwy)) + geom_point())
```
The summary() function prints the full details of a plot object: its data, mappings,
faceting, and layers. In general, each plot above uses one dataset and one layer; the
first three map two variables (x and y), while the histogram maps only cty.
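The same description can also be read from the plot object itself. The sketch below assumes the plot is stored in a variable (the name p is only for illustration) and looks at two components that ggplot2 keeps on it: the plot-level mapping and the list of layers.

```{r}
library(ggplot2)

# Store the plot in a variable (p is an illustrative name)
p <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# Aesthetics declared in ggplot(): x is mapped to cty, y to hwy
names(p$mapping)

# Number of layers added to the plot
length(p$layers)

# Class of the geom used by the first layer
class(p$layers[[1]]$geom)[1]
```

Repeating this for the other three plots shows that they differ only in the dataset, the mapped variables, and the geom of their single layer.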
4. What happens when you map a continuous variable to the colour or size aesthetic? What happens with a categorical variable? What happens when you use more than one aesthetic?
Using the mpg dataset as an example, first map colour, size, and shape to a continuous
variable, starting with colour:
```{r e.2.4.1.1_plot1}
ggplot(mpg, aes(cty, hwy, color = +displ)) +
geom_jitter()
```
The result is a continuous colour gradient. Putting a + or - sign in front of the variable
changes the direction of the colour scale.
Colour and size work with continuous variables, but shape does not: a continuous variable
can take far more distinct values than there are shapes available to represent them.
```{r e.2.4.1.1_plot2, eval=FALSE}
ggplot(mpg, aes(cty, hwy, shape = displ)) +
geom_point()
# You get:
# Error: A continuous variable can not be mapped to shape
```
You can use more than one aesthetic in a plot, such as:
```{r e.2.4.1.1_plot3}
ggplot(mpg, aes(cty, hwy, size = displ, color = displ)) +
geom_point()
```
5. What happens when you map a continuous variable to shape? Why? What happens when you map trans (from the mpg data) to shape? Why?
The first part was answered in the previous question: a continuous variable cannot be mapped to shape.
For the second part, map trans to shape:
```{r e.2.4.1.2}
ggplot(mpg, aes(cty, hwy, shape = trans)) +
geom_point()
```
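ggplot2 issues a warning: the default shape palette handles at most 6 discrete values, because more than 6 shapes become hard to tell apart. trans has more values than that, so the remaining categories are not drawn.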
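One way to stay within the shape limit is to reduce the number of categories before plotting. The sketch below counts the distinct values of trans and then collapses them into two groups; the column name trans_type and the automatic/manual split are choices made for this illustration, not part of the original exercise.

```{r}
library(ggplot2)
library(dplyr)

# trans has more distinct values than the 6 shapes of the default palette
n_distinct(mpg$trans)

# Collapse trans into two groups; trans_type is an illustrative column name
mpg_trans <- mpg %>%
  mutate(trans_type = ifelse(grepl("^auto", trans), "automatic", "manual"))

# Two categories map cleanly onto two shapes
ggplot(mpg_trans, aes(cty, hwy, shape = trans_type)) +
  geom_point()
```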
6. How is drive train (drv) related to fuel economy in the city?
```{r e.2.4.1.3_plot1}
ggplot(mpg, aes(drv, cty)) +
geom_boxplot() +
scale_x_discrete(labels = c("Front wheel", "Rear wheel", "Four wheel"),
limits = c("f", "r", "4"))
```
Front-wheel drive cars appear to be the most efficient in the city; four-wheel and rear-wheel drive cars have clearly lower city miles per gallon.
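The reading of the boxplot can be cross-checked numerically. This is a small sketch that summarises city mileage per drive train with dplyr; the name drv_summary is only illustrative.

```{r}
library(ggplot2)
library(dplyr)

# Median and mean city mileage, and number of cars, for each drive train
drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(median_cty = median(cty), mean_cty = mean(cty), n = n())

drv_summary
```

The drive train with the highest median in this table is the one whose box sits highest in the plot.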
7. How is drive train related to engine size (displ) and class?
To relate drive train, engine size, and class, first reorder class by its median engine size,
then plot class on the x-axis and engine size on the y-axis, with drive train mapped to colour.
```{r e.2.4.1.3_plot2}
ggplot(mpg, aes(reorder(class, displ, FUN = median), displ, color = drv)) +
  geom_jitter(width = 0.5)
```
8. What happens when you facet by a continuous variable such as hwy? What about cyl? What are the differences?
```{r e.2.5.1.1_plot1}
ggplot(mpg, aes(x = cty, y = displ)) +
geom_point() +
facet_wrap(~ hwy)
```
Faceting by a continuous variable with facet_wrap() creates one panel for every distinct value,
so the plot becomes hard to read because there are too many panels.
Now facet by cyl instead:
```{r e.2.5.1.1_plot2}
ggplot(mpg, aes(displ, cty)) +
geom_point() +
facet_wrap(~ cyl)
```
cyl has only four distinct values, so this plot is much easier to read. The key
difference is that hwy takes many more distinct values than cyl does.
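The number of panels equals the number of distinct values of the faceting variable, which can be checked before plotting. If faceting by hwy is still wanted, one option (an assumption of this sketch, not part of the original answer) is to bin it first with cut_number() from ggplot2; hwy_bin is an illustrative column name.

```{r}
library(ggplot2)
library(dplyr)

# Number of panels each facet_wrap() call has to draw
n_distinct(mpg$hwy)
n_distinct(mpg$cyl)

# Bin hwy into 4 groups of roughly equal size before faceting
mpg_binned <- mpg %>%
  mutate(hwy_bin = cut_number(hwy, 4))

ggplot(mpg_binned, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~ hwy_bin)
```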
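9. Use faceting to explore the relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessment?
The pattern can be seen in the last plot above, which shows cty against displ with one panel per value of cyl.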
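A per-panel summary makes the pattern easier to state. The sketch below computes the average engine size and city mileage for each number of cylinders; cyl_summary is an illustrative name.

```{r}
library(ggplot2)
library(dplyr)

# Average engine size and city mileage for each number of cylinders
cyl_summary <- mpg %>%
  group_by(cyl) %>%
  summarise(mean_displ = mean(displ), mean_cty = mean(cty), n = n())

cyl_summary
```

Comparing the rows shows how much of the overall trend between displ and cty is a difference between cylinder groups rather than a trend within each group.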
10. What is the problem with the plot created by
ggplot(mpg, aes(cty, hwy)) + geom_point()
How can it be fixed?
```{r e.2.6.6.1_plot1}
ggplot(mpg, aes(cty, hwy)) +
geom_point()
```
The problem with this plot is overplotting: many observations share the same coordinates,
so the plot does not show every data point in the dataset.
One solution is geom_jitter(), which adds a small amount of random noise to each point:
```{r e.2.6.6.1_plot2}
ggplot(mpg, aes(cty, hwy)) +
geom_jitter()
```
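The amount of overplotting can be measured, and jittering is not the only remedy. As a sketch, the code below counts how many cars fall on each (cty, hwy) position and then uses geom_count(), which sizes each point by that count instead of moving points; overlap and n_cars are illustrative names.

```{r}
library(ggplot2)
library(dplyr)

# How many rows share each (cty, hwy) position
overlap <- count(mpg, cty, hwy, name = "n_cars")
nrow(overlap)  # number of distinct positions
nrow(mpg)      # number of cars

# Point size reflects the number of cars at each position
ggplot(mpg, aes(cty, hwy)) +
  geom_count()
```

geom_count() keeps every point at its true coordinates, whereas geom_jitter() shows individual cars at slightly shifted positions.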
11. Explore the distribution of the carat variable in the diamonds dataset. Which binwidth reveals the most interesting pattern?
We can generate several plots with different values of binwidth:
```{r e.2.6.6.3_plot1}
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = 1) +
ggtitle(expression(atop("Carat Barplot with", "binwidth = 1"))) +
xlab("Carat") +
ylab("Count/Number") +
theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
text = element_text(size = 9))
```
As we change the binwidth:
```{r e.2.6.6.3_plot2}
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = 0.5) +
ggtitle(expression(atop("Carat Barplot with", "binwidth = 0.5"))) +
xlab("Carat") +
ylab("Count/Number") +
theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
text = element_text(size = 9))
```
```{r e.2.6.6.3_plot3}
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.01) +
ggtitle(expression(atop("Carat Barplot with", "binwidth = 0.01"))) +
xlim(0.3, 3) +
xlab("Carat") +
ylab("Count/Number") +
theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
text = element_text(size = 9))
```
With binwidth = 0.01 the histogram shows sharp spikes rather than a smooth curve. I am not
familiar with the diamond industry, but the pattern is striking and presumably has a reason.
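To see where the peaks of the fine-grained histogram sit, the carat values can be tabulated directly. This is a minimal sketch using dplyr; carat_counts is an illustrative name.

```{r}
library(ggplot2)
library(dplyr)

# Frequency of each recorded carat value, most common first
carat_counts <- diamonds %>%
  count(carat, sort = TRUE)

head(carat_counts, 10)
```

The top rows of this table list the weights that produce the tallest bars in the binwidth = 0.01 plot.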
12. Explore the distribution of the price variable in the diamonds data. How does the distribution vary by cut?
To check this:
```{r e.2.6.6.4_plot1}
ggplot(diamonds, aes(x = cut, y = price, color = cut)) +
geom_boxplot()
```
```{r e.2.6.6.4_plot2}
ggplot(diamonds, aes(x = price, y = after_stat(density), color = cut)) +
geom_freqpoly(binwidth = 200)
```
Fair-cut diamonds tend to be priced higher than Very Good-cut diamonds. One possible reason
is that the Fair diamonds are larger, and buyers may be more willing to pay for size
than for cut, since not everyone is an expert in diamonds.
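The size explanation can be checked against the data. The sketch below compares the median price with the median carat for each cut; cut_summary is an illustrative name.

```{r}
library(ggplot2)
library(dplyr)

# Median price and median weight for each cut quality
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(median_price = median(price), median_carat = median(carat), n = n())

cut_summary
```

If the cuts with higher median prices also have higher median carat, weight is a plausible driver of the price differences seen in the plots.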
13. Draw a boxplot of hwy for each value of cyl, without turning cyl into a factor. What extra aesthetic do you need to add?
```{r e.3.5.5.1}
ggplot(mpg, aes(cyl, hwy, group = cyl)) +
geom_boxplot()
```
The extra aesthetic is group: adding group = cyl to the aes() call in ggplot() gives one boxplot per value of cyl.
14. Modify the command
ggplot(mpg, aes(displ, cty)) + geom_boxplot()
so that there is one boxplot for each integer value of displ
```{r e.3.5.5.2}
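# cut_width(displ, 1) bins displ into intervals of width 1, one per integer value;
# group = displ would instead draw a box for every distinct value of displ
ggplot(mpg, aes(displ, cty)) +
  geom_boxplot(aes(group = cut_width(displ, 1)))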
```
15. How many bars are there in each of the following plots?
ggplot(mpg, aes(drv)) + geom_bar()
ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + geom_bar()
library(dplyr)
mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
ggplot(mpg2, aes(drv, fill = hwy, group = id)) + geom_bar()
```{r}
ggplot(mpg, aes(drv)) + geom_bar(colour="white")
ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + geom_bar(colour="white")
library(dplyr)
mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
ggplot(mpg2, aes(drv, fill = hwy, group = id)) + geom_bar(colour="white")
```
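All three plots show 3 bars, one for each value of drv. In the second and third plots each bar is a stack of segments, one per group, which the white outlines make visible.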
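The number of rectangles that each plot actually draws can be counted with layer_data(), which returns the data computed for a layer. This is a sketch; the names p1, p2, p3, and rect_counts are illustrative.

```{r}
library(ggplot2)
library(dplyr)

mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))

p1 <- ggplot(mpg, aes(drv)) + geom_bar()
p2 <- ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + geom_bar()
p3 <- ggplot(mpg2, aes(drv, fill = hwy, group = id)) + geom_bar()

# layer_data() gives one row per rectangle drawn by the bar layer
rect_counts <- c(p1 = nrow(layer_data(p1)),
                 p2 = nrow(layer_data(p2)),
                 p3 = nrow(layer_data(p3)))
rect_counts
```

The first plot draws one rectangle per drive train, the second one per combination of drv and hwy, and the third one per car, all stacked at the same three x positions.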
16. Fix the plot created by the following commands. What is the problem with the plot?
library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n)) +
geom_line()
```{r e.3.5.5.5}
library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n)) +
geom_line()
```
We want to see how the number of newborn babies named "Hadley" changes over time. This plot does not
show that clearly, and the shape of the line is not a good representation of the counts.
The reason is in the data: the count is recorded separately for each sex, so a year can have
two rows and a single line jumps between them. As a result, this line plot cannot show how the
popularity of the name varies across the years. Starting in the 1960s, there are female babies
named "Hadley". Mapping sex to colour draws one line per sex:
```{r e.3.5.5.5_plot}
hadley
male <- hadley %>%
filter(sex == "M")
male
female <- hadley %>%
filter(sex == "F")
female
ggplot(hadley) +
geom_line(aes(year, n, color = sex))
```
17. Use the command
class <- mpg %>% group_by(class) %>% summarise(n = n(), hwy = mean(hwy))
to produce the plot
```{r e.5.3.1.2}
class <- mpg %>% group_by(class) %>% summarise(n = n(), hwy = mean(hwy))
ggplot(mpg, aes(class, hwy)) +
geom_jitter(width = 0.05, size = 2) +
geom_point(aes(y = hwy), data = class, size = 4, color = "red") +
geom_text(aes(y = 10, label = paste0("n = ", n)), data = class)
```
How the plot is built:
First, ggplot() defines the main dataset and aesthetics: hwy (y-axis) plotted against class (x-axis). This is the first line of the plotting code.
Second, ggplot2 builds a plot layer by layer, and the next line adds the first layer. Instead of a plain scatterplot with geom_point(), we use geom_jitter() to avoid overplotting. This layer uses the same dataset (mpg) and the same aesthetics (x = class, y = hwy) as declared in ggplot(), so nothing needs to be overridden. To resemble the original plot in the textbook, the jitter is kept narrow with width = 0.05 (try 0.1 or 0.2 to compare); the point size (size = 2) is a matter of taste.
Third, the red dots form the second layer. They are points, so geom_point() is used, but each dot marks the mean hwy of its class, a value that does not exist in mpg. This layer therefore overrides the data with data = class, the summary table computed above, while keeping y = hwy; in that table hwy holds one mean value per class. size = 4 and color = "red" make the dots large and red.
Finally, the labels form the third layer, drawn with geom_text(). aes(y = 10) places every label at a height of y = 10. The summary table only stores the count n, whereas the original plot shows labels of the form "n = count", so paste0("n = ", n) joins the text and the number. This layer also needs data = class, because it does not use the original mpg rows.
18. Simplify the following plot specifications
ggplot(mpg) + geom_point(aes(mpg$disp, mpg$hwy))
ggplot() + geom_point(mapping = aes(y = hwy, x = cty), data = mpg) + geom_smooth(data = mpg, mapping = aes(cty, hwy))
ggplot(diamonds, aes(carat, price)) + geom_point(aes(log(brainwt), log(bodywt)), data = msleep)
```{r e.5.4.3.1_1}
ggplot(mpg, aes(displ, hwy)) + geom_point()
```
```{r e.5.4.3.1_2}
# Declare the shared data and mapping once; both layers inherit them
ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth()
```
```{r e.5.4.3.1_3}
ggplot(msleep, aes(log(brainwt), log(bodywt))) + geom_point()
```
19. What does the following code do? Does it work? Does it make sense? Why or why not?
ggplot(mpg) + geom_point(aes(class, cty)) + geom_boxplot(aes(trans, hwy))
```{r}
ggplot(mpg) + geom_point(aes(class, cty)) + geom_boxplot(aes(trans, hwy)) + coord_flip()
```
This code starts from the mpg dataset without setting any plot-level aesthetics. The first
layer is a scatterplot with class on the x-axis and cty on the y-axis. The second layer is
a boxplot with trans on the x-axis and hwy on the y-axis.
The code runs, but the result does not make sense. First, the x-axis is too crowded;
coord_flip(), added above, is needed to read all the class and trans labels. Second, class
and trans values are mixed on the same axis, and a boxplot is mixed with a scatterplot.
Third, the y-axis carries two different quantities, hwy and cty. Both are continuous, but
they differ in nature: one is city miles per gallon, the other is highway miles per gallon.
Finally, the axis titles only show class and cty; trans and hwy are not labelled.
20. What happens if you use a continuous variable in one layer and then a categorical variable in another layer? What happens if you do it in the opposite order?
```{r e.5.4.3_2}
ggplot(mpg) +
geom_point(aes(drv, cty)) +
geom_point(aes(hwy, cyl))
```
If the first layer puts a categorical variable on the x-axis and a later layer puts a
continuous variable there, as in the code above, the plot is drawn but the result is not
meaningful. In the opposite order, with the continuous variable first, R reports an error.
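The opposite order can be tried without stopping the document by catching the error. This is a sketch: p_reversed and build_result are illustrative names, and ggplot_build() is used only to force the plot to be computed.

```{r}
library(ggplot2)

# Continuous x first (hwy), then a categorical x (drv)
p_reversed <- ggplot(mpg) +
  geom_point(aes(hwy, cyl)) +
  geom_point(aes(drv, cty))

# Building the plot trains the scales; capture the error message
# instead of letting it stop the run
build_result <- tryCatch(
  {
    ggplot_build(p_reversed)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)

build_result
```

The x scale is chosen from the first layer, so a continuous scale is already in place when the categorical values of drv arrive.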
21. Which geom would you use in each of the following situations: (a) to display how a variable has changed over time, (b) to show the distribution of a single variable, (c) to focus attention on the overall trend in a large dataset, (d) to draw a map, (e) to label outlying points?
- Display how a variable has changed over time.
geom_line()
- Show the detailed distribution of a single variable.
geom_histogram()
- Focus attention on the overall trend in a large dataset.
geom_smooth() (geom_line() or geom_area() can also work for ordered data)
- Draw a map.
geom_sf(), geom_polygon(), coord_quickmap()
- Label outlying points.
geom_point(), geom_text()
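As an illustration of the last item, the sketch below labels only a few unusual points by passing a filtered data frame to geom_text(), with geom_smooth() as one common way to show the overall trend. The cutoff of 40 for hwy and the name outliers are arbitrary choices made for this example.

```{r}
library(ggplot2)
library(dplyr)

# Treat cars with very high highway mileage as the points to label
outliers <- filter(mpg, hwy > 40)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth() +
  # Labels use only the filtered rows and sit slightly above their points
  geom_text(aes(label = model), data = outliers, nudge_y = 1.5)
```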
22. What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale?
Start with continuous variables, first on continuous scales and then on discrete scales:
```{r e.6.2.1.1_plot1}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous() +
scale_y_continuous()
```
```{r e.6.2.1.1_plot2}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_discrete() +
scale_y_discrete()
```
Compare the two plots. The first uses scale_x_continuous() and scale_y_continuous(); the
second uses scale_x_discrete() and scale_y_discrete(). The points are in the same positions
in both; what differs is the background and the axes. With discrete scales on continuous
variables, the axes show no units for hwy on the y-axis or for displ on the x-axis.
Next, try a continuous scale on a discrete variable, starting from the default plot:
```{r e.6.2.1.1_plot3}
ggplot(mpg, aes(class, hwy)) +
geom_jitter(width = 0.05, height = 0.05)
```
Now we change the scale on the previous plot:
```{r e.6.2.1.1_plot4, eval=FALSE}
ggplot(mpg, aes(class, hwy)) +
geom_jitter(width = 0.05, height = 0.05) +
scale_x_continuous()
# Error: Discrete value supplied to continuous scale
```
This fails with the error message "Discrete value supplied to continuous scale".
In conclusion, a discrete scale can be applied to a continuous variable, but not
vice versa.
23. Simplify the following plots so that they are easier to understand
ggplot(mpg, aes(displ)) + scale_y_continuous("Highway mpg") + scale_x_continuous() + geom_point(aes(y = hwy))
ggplot(mpg, aes(y = displ, x = class)) + scale_y_continuous("Displacement (l)") + scale_x_discrete("Car type") + scale_x_discrete("Type of car") + scale_colour_discrete() + geom_point(aes(colour = drv)) + scale_colour_discrete("Drive\ntrain")
The first plot can be simplified as follows:
```{r e.6.2.1.2_plot1}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
ylab("Highway mpg")
```
The second plot can be simplified to:
```{r e.6.2.1.2_plot2}
ggplot(mpg, aes(class, displ)) +
geom_point(aes(color = drv)) +
labs(x = "Type of car", y = "Displacement (l)", colour = "Drive\ntrain")
```
24. Recreate the following graph, and adjust the y-axis label so that the parentheses are the right size
```{r e.6.3.3.1}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Displacement",
breaks = c(2,3,4,5,6,7),
labels = c("2k", "3k", "4k", "5k", "6k", "7k")) +
scale_y_continuous(quote(Highway (Miles/Gallon)))
```
25. Recreate the following graph:
```{r e.6.3.3.3}
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point() +
scale_color_discrete(labels = c("4wd", "fwd", "rwd"))
```
26. What is the problem with the graph produced by
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = drv, shape = drv)) + scale_colour_discrete("Drive train")
How can it be fixed?
The plot has two legends on the right-hand side where one would be enough.
Quoting from the book: "In order for legends to be merged, they must have the same name. So if you
change the name of one of the scales, you’ll need to change it for all of them."
The original plot renames only the colour scale. The shape scale keeps its default name,
so the two legends cannot be merged. There are two ways to fix this
plot:
```{r e.6.4.4.2_plot1}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv, shape = drv)) +
labs(color = "Drive train", shape = "Drive train")
```
```{r e.6.4.4.2_plot2}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = drv, shape = drv)) +
scale_colour_discrete("Drive train") +
scale_shape_discrete("Drive train")
```
27. Create the graph
28. The following code produces two plots. Modify the code so that the axes and the legend are identical (do not use facets)
fwd <- subset(mpg, drv == "f")
rwd <- subset(mpg, drv == "r")
ggplot(fwd, aes(displ, hwy, colour = class)) + geom_point()
ggplot(rwd, aes(displ, hwy, colour = class)) + geom_point()
Use expand_limits() so that the colour legend of both plots contains every value
of class, mapped to the same set of colours.
Setting xlim() and ylim() to the same values in both plots makes the axes identical
as well.
```{r, e.6.5.1.1_plot1}
fwd <- subset(mpg, drv == "f")
rwd <- subset(mpg, drv == "r")
ggplot(fwd, aes(displ, hwy, colour = class)) +
geom_point() +
scale_color_discrete("Class") +
xlim(0, 10) +
ylim(0, 45) +
expand_limits(color = c("2seater", "compact", "midsize", "minivan",
"pickup", "subcompact", "suv"))
```
```{r, e.6.5.1.1_plot2}
ggplot(rwd, aes(displ, hwy, colour = class)) +
geom_point() +
scale_color_discrete("Class") +
xlim(0, 10) +
ylim(0, 45) +
expand_limits(color = c("2seater", "compact", "midsize", "minivan",
"pickup", "subcompact", "suv"))
```
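An alternative sketch avoids repeating the limits in each plot: derive them once from the full mpg data and store the scales in a list that is added to both plots. The names class_levels, shared_scales, p_fwd, and p_rwd, and the legend title "Class", are choices made for this illustration.

```{r}
library(ggplot2)

fwd <- subset(mpg, drv == "f")
rwd <- subset(mpg, drv == "r")

# Every class value, taken from the complete dataset
class_levels <- sort(unique(mpg$class))

# A list of scales can be added to any plot with +
shared_scales <- list(
  scale_colour_discrete("Class", limits = class_levels),
  scale_x_continuous(limits = range(mpg$displ)),
  scale_y_continuous(limits = range(mpg$hwy))
)

p_fwd <- ggplot(fwd, aes(displ, hwy, colour = class)) + geom_point() + shared_scales
p_rwd <- ggplot(rwd, aes(displ, hwy, colour = class)) + geom_point() + shared_scales

p_fwd
p_rwd
```

Because the limits come from mpg rather than from hard-coded numbers, both plots stay aligned even if the subsets change.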