-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathredWine.rmd
653 lines (501 loc) · 25.1 KB
/
redWine.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
---
title: "Red Wine Exploratory Data Analysis"
author: "Shrey Marwaha"
date: "September 20, 2017"
output: html_document
---
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Loading required libraries
library(ggplot2)
library(stringr)
library(GGally)
library(reshape2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(corrplot)
```
## Objectives
Through this project, I aim to explore and analyse the Red Wine Data Set, and infer the
factors on which the quality of wine depends. I will start by exploring the data to get a
feel of the information available and then, if possible, I hope to draw out the correlations
between the various variables and Wine Quality.
```{r echo=FALSE, Load_the_Data}
# Load the Data
rdw <- read.csv('wineQualityReds.csv')
# Wrangling the data
rdw$quality <- factor(rdw$quality, ordered = TRUE)
#Creating new ordered categorical variable Class
rdw$class <- factor(ifelse(rdw$quality < 5, 'poor',
ifelse(rdw$quality < 7, 'normal', 'exceptional')
))
rdw$class <- ordered(rdw$class, levels = c('poor', 'normal', 'exceptional'))
```
I have created a new variable **class** here, which is based on quality ratings. It has been divided as follows:
* *poor* : Quality rating 3 and 4
* *normal* : Quality rating 5 and 6
* *exceptional* : Quality rating 7 and 8
## Structure and summary of the Dataframe
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
str(rdw)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(rdw)
```
-----
# Univariate Plots Section
As a start, I see from the above statistics that there are a lot more number of
**normal** red wines in the data, than 'poor' and 'exceptional'. Let's have a
visual look at this by checking quality factor.
```{r echo=FALSE, Univariate_Plots}
#Data Quality wise
ggplot(data = rdw, aes(x = quality))+
geom_bar(fill = I('#40B0B5'))
```
That shows it quite clearly, as is mentioned in the Info file. I wonder if wines
with quality rating 8 (exceptional) and quality rating 3 (poor) are outliers.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Plotting all variables together, ommitting factor variables
rdw_compiled <- gather(rdw, "variable", "value", 2:12, factor_key = TRUE)
ggplot(aes(x = value), data = rdw_compiled)+
geom_histogram(fill = I('#40B0B5') )+
facet_wrap(~variable, scales = "free")
```
From the above initial plots, a lot of interesting things can be seen. Many variables
are showing some kind of a relationship which can be of particular interest to us.
Let's examine some of these further. I'll go about doing this category wise,
namely: Acidity, Chemicals and Other features.
###Acidity
From the data, we can see that pH, fixed.acidity, volatile.acidity and citric.acid
basically correspond to this category. Let's have a look at their distributions, together.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# acidity related plots
p1 <- ggplot(data = rdw, aes(x = fixed.acidity))+
geom_histogram(fill = '#338C90', binwidth = 0.1)+
coord_cartesian(xlim = c(4, 14))
p2 <- ggplot(data = rdw, aes(x = volatile.acidity))+
geom_histogram(fill = '#338C90', binwidth = 0.01)+
coord_cartesian(xlim = c(0, 1))
p3 <- ggplot(data = rdw, aes(x = citric.acid))+
geom_histogram(fill = '#338C90', binwidth = 0.01)
p4 <- ggplot(data = rdw, aes(x = pH))+
geom_histogram(fill = '#338C90')
grid.arrange(p1, p2, p3, p4, ncol = 1)
```
From the above plots, we can see that fixed.acidity has more of a positively skewed
distribution, it probably may be bimodal. The volatile.acidity variable also looks like a
skewed distribution, though we may want to look at it further. There isn't much to be
infered from the citric.acid plot, and the pH plot has a uniform distribution with most
values concentrated between 3.0 and 3.5. That's quite accurate as wines are mildly acidic.
The acidity adds to it's crispiness.
I would like to see some x-axis transformations in citric.acid plot to extract any
available information.
```{r echo=FALSE, message=FALSE}
#transformed citric acid to sqrt scale
ggplot(data = rdw, aes(x = citric.acid))+
geom_histogram(fill = '#338C90', binwidth = 0.01)+
scale_x_sqrt()
```
After a sqrt tranformation on x-axis, I can see that although there isn't
relation visible, but many of the values have zero amount of citric acid in them.
This probably depends on the flavor of the wine, and probably affects the quality.
###Chemicals
Let's check for presence of chemicals such as chlorides, sulphates & sulfur dioxide.
```{r echo=FALSE, message=FALSE}
#Chemical components
c1 <- ggplot(data = rdw, aes(x = chlorides))+
geom_histogram(fill = '#47979B')
c2 <- ggplot(data = rdw, aes(x = sulphates))+
geom_histogram(fill = '#47979B')
grid.arrange(c1, c2, ncol = 1)
```
Both these curves show skewed distribution, and a large number of outliers.
It's possible that these abnormal values may distinguish the quality/class of
the wine, as presence of long tailed data is visible.
```{r echo=FALSE, message=FALSE}
#Sulfur related plots
s1 <- ggplot(data = rdw, aes(x = free.sulfur.dioxide))+
geom_histogram(fill = '#7EB6B9')+
coord_cartesian(xlim = c(0, 60))
s2 <- ggplot(data = rdw, aes(x = total.sulfur.dioxide))+
geom_histogram(fill = '#7EB6B9')+
coord_cartesian(xlim = c(0, 200))
grid.arrange(s1, s2, heights = unit(1, 'npc'))
```
Both these variables show somewhat similar skewness. The free sulphur dioxide
amount peaks at two places, first at 7 which is the median, and then around 9-10.
Both have a small number of outliers that have been omitted from the visualizations.
###Other Parameters
Let's look at residual sugar, density and alcohol levels.
```{r echo=FALSE, message=FALSE}
# Residual Sugar
o1 <- ggplot(data = rdw, aes(x = residual.sugar))+
geom_histogram(fill = '#7eb6b9')
# Density
o2 <- ggplot(data = rdw, aes(x = density))+
geom_histogram(fill = '#7e99b9')
#Alcohol
o3 <- ggplot(data = rdw, aes(x = alcohol))+
geom_histogram(fill = '#7eb99f')
grid.arrange(o1, o2, o3, ncol = 1)
```
Residual sugar has very high outliers, while the 75th percentile is at 2.60 only.
The distribution is positively skewed with median around 2.20. Density seems to be
a uniform distribution, which can hence be a deciding factor in quality. Alcohol
percentages are mostly positively skewed.
# Univariate Analysis
### Structure of the Dataset
The Red Wine Dataset has 1599 rows and 13 columns. I've added a new column called
'class' based on the quality ratings. This is to help in distinguishing various trends between chemical properties across classes, as sometimes sensory perception of two people may not exactly mean the same thing. So this type of grouping helps to approximate the trends across adjacent ratings.
Also, both class and Quality have been transformed into factor variables for easy analysis. Final number of variables is 14.
### Feature(s) of Interest
Most data points belong to the **normal** class of wines. This is either a good
indication, or implies that data is incomplete. Analysis of multiple variables
with quality may clear this up further.
### Supporting Points of Interest
From the intial analysis it seems that variables such as factors related to acidity
and pH may be some deciding factors for quality, since the taste is affected by these.
By the same clause, chlorides and residual sugars may also affect the perception
of quality.
### Unusual Observations
1. The plot of citric.acid qauntity seemed to have no particular relationship going
on. A large number of values were seemingly zero, which is either an indication of
quality, or incomplete data. To better visualize it, I used the square root
transformation, because log transformation would have lead to non-finite values.
2. Other variables such as chlorides, sulphates and residual sugar have many
outliers. It intrigues me if these actually propose a measure of quality or are
just dirty values.
------
# Bivariate Plots Section
Although I have some bivariate analyses in mind, I will start by creating a
correlation plot to double check my thoughts.
```{r echo=FALSE, message=FALSE, Bivariate_Plots}
# Creating an all numeric df
rdw_new <- select(rdw, -X, -class)
rdw_new$quality <- as.numeric(rdw_new$quality)
# Finding correlation matrix
rdw_cor <- cor(x = rdw_new, method = 'pearson')
# Plotting correlation plot
col <- colorRampPalette(c("#4799d3", "#9ac3e0", "#FFFFFF", "#aaffc8", "#33ce6a"))
corrplot(rdw_cor, method="color", col = col(200), order="original",
addCoef.col = "black",
tl.col="black", tl.srt = 90, tl.cex = 0.9, # Variable Labels
number.cex = 0.7, number.digits = 2, # Correlation Coeffs.
cl.cex = 0.7, cl.align.text = "c", # Color Label Text
diag=TRUE)
```
From the above map, the following are some of the facts that intrigue me:
1. Quality is **highly correlated** with **Alcohol and Volatile Acidity**, although in
opposite manners.
2. Fixed acidity seems to be **highly correlated** with **citric acid, density**
**and pH**.
3. pH seems to be **positively correlated** with **Volatile Acidity**. I assumed it to
be the other way round, i.e. similar to fixed acidity.
4. Sulphates and Chlorides seem to be **moderately positively correlated**. This may
not lead to an actual relationship though between the two chemicals specifically,
but it may be the case that they are present in proportionate amounts in good
quality wines.
####Let's look at some plots to verify our above hypothesis.
```{r echo=FALSE, message=FALSE}
# Quality vs Alcohol
ggplot(data = rdw, aes(y = alcohol, x = quality))+
geom_boxplot(alpha = 1/3, color = "#2e70b2", fill = "#6fa9e2")+
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 4, size = 3)
```
So the alcohol content increases with quality, the exception in this plot being quality
rating 5 with a large number of outliers. The mean values as well as the median values
clearly express this fact. We can create percentage ranges for Alcohol levels and see the
distribution of quality rating in each range as well.
```{r echo=FALSE}
rdw.alcohol.ranges <- cut(rdw$alcohol, breaks = c(8,9,10,11,13,15))
table(rdw.alcohol.ranges)
```
```{r echo=FALSE}
# Percentage wise Quality distribution
qplot(x=rdw.alcohol.ranges, data = rdw, fill = ordered(quality, c(8,7,6,5,4,3)))+
scale_fill_brewer(type = "seq", direction = -1)
```
This graph shows that though the majority of **normal**(5 & 6) quality red wines
are present in the 9-11% alcohol content range, the **exceptional** ones (7 & 8)
tend to be on the higher side in alcohol levels.
Taking a look at volatile.acidity vs Quality, which is expected to have a high
negative correlation.
```{r, echo=FALSE}
# volatile acidity vs quality
ggplot(aes(x = quality, y = volatile.acidity, group = 1), data = rdw)+
geom_jitter(alpha = 1/3, color = "#37768e")+
stat_summary(geom = "line", fun.y = "quantile", fun.args = list(probs = 0.97),
color = "red", linetype = 2)+
stat_summary(geom = "line", fun.y = "median", color = "black")+
stat_summary(geom = "line", fun.y = "quantile", fun.args = list(probs = 0.03),
color = "red", linetype = 2)
```
From the median, 97th percentile, and the 3rd percentile, we clearly see a declining
relationship with quality. And volatile acidity values for normal quality are packed
mostly in the 0.2 to 0.8 range.
Let's have a look at fixed.acidity and variables correlated with it.
```{r echo=FALSE}
# citric acid vs fixed acidity
ggplot(aes(x = fixed.acidity, y = citric.acid), data = rdw)+
geom_jitter(color = "#37768e", alpha = 1/2)+
geom_smooth(method = "lm", color = "#ce1644")
```
Fixed acidity definitely increases with higher amounts of Citric acid. Probably
because citric acid may be the main constituent of acidic nature of the wine.
Volatile acidity has a negative relation, so we might need to investigate these
together in Multivariate Analyses section.
Let's check the relationship between fixed.acidity and density.
```{r echo=FALSE, message=FALSE}
# fixed acidity vs density
ggplot(aes(x = fixed.acidity, y = density), data = rdw)+
geom_point(color = "#163470", alpha = 1/4)+
geom_smooth(method = "lm", color = "#ce1644")
```
Clearly a positive relationship between these two factors. Let's see if density
affects quality majorly.
```{r echo=FALSE, message=FALSE}
# Density vs Quality
ggplot(aes(x = quality, y = density), data = rdw)+
geom_boxplot(alpha = 1/3, color = "#2e70b2", fill = "#6fa9e2")+
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 4, size = 3)
```
This plot doesn't really lead to any new conclusions. Density isn't affective
in determining quality.
What about acidity vs pH? Correlation coefficient for fixed.acidity is negative,
which is to be expected as higher acidity implies lower pH Value. Let's check for
volatile.acidity and citric.acid. We will keep the x axis in log scale to better
view the data.
```{r echo=FALSE, message=FALSE}
# Fixed acidity vs pH
ggplot(aes(x = fixed.acidity, y = pH), data = rdw)+
scale_x_log10()+
geom_point(color = "#163470", alpha = 1/4)+
geom_smooth(method = "lm", color = "#ce1644")
# Volatile acidity vs pH
ggplot(aes(x = volatile.acidity, y = pH), data = rdw)+
scale_x_log10()+
geom_point(color = "#163470", alpha = 1/4)+
geom_smooth(method = "lm", color = "#ce1644")
# Citric acid vs pH
ggplot(aes(x = citric.acid, y = pH), data = subset(rdw, rdw$citric.acid > 0))+
scale_x_log10()+
geom_point(color = "#163470", alpha = 1/4)+
geom_smooth(method = "lm", color = "#ce1644")
```
Clearly, volatile acidity has a positive relationship with pH as opposed to
fixed.acidity and citric.acid. This finding is crucial, and will be evaluated
further in the next section.
Let's have a look at Sulphates and Chlorides.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Sulphates vs Chlorides
ggplot(aes(x = sulphates, y = chlorides), data = rdw)+
geom_jitter(alpha = 1/3, color = "#163470")+
geom_smooth(method = "lm", color = "#ce1644")+
coord_cartesian(xlim = c(0.2, 1.5), ylim = c(0, 0.5))
```
There is a basic linear relationship between the two, but a lot of outliers do exist.
Let's compare them with Quality.
```{r echo =FALSE, message=FALSE}
# Sulphates vs Quality
ggplot(aes(x = quality, y = sulphates), data = rdw)+
geom_boxplot(alpha = 1/3)+
stat_summary(geom = "point", fun.y = mean, color = "red", shape = 3)
# Chlorides vs Quality
ggplot(aes(x = quality, y = chlorides), data = rdw)+
geom_boxplot(alpha = 1/3)+
stat_summary(geom = "point", fun.y = mean, color = "red", shape = 3)
```
Though for chlorides, the distribution across quality remains more or less constant,
we can see a large number of outliers for both chlorides and sulphates in the
normal (5 & 6) quality wines. Also, for quality rating 3 (lowest), we see a large
average quantity of chlorides, while it remains fairly same across higher ratings.
# Bivariate Analysis
### Observations with respect to Feature of Interest
Our main feature of interest is quality. So from our above analysis we can hypothesize
the following:
* Quality is majorly dependent on **Alcohol content, acidity and sulphates.**
* Volatile acidity has a negative correlation with quality, and a positive correlation
with pH.
* Citric acid probably contributes to fixed acidity, hence both of these lead to
decline in pH values, which increases the quality.
* Density does not play a major role in determining quality, but *exceptional quality*
*(7 & 8) wines* tend to have a **lower** density. That is probably due to
**more alcohol content** which reduces density.
* Chlorides are present in low amounts in Red wines, and their value above a certain
threshold hampers the quality.
### Other Interesting Features
I expected volatile acidity to add to the total acidity factor through pH. A reverse
trend was seen, with higher volatile acidity aiding in increasing the pH.
Also, it amazed me to find a lot of citric acid values to be zero, when its amount
was found to be higher in higher quality wines. That might be because of personal
prefernces, as it's mentioned in the information text that no values are missing from
the dataset.
### Strongest Relationship
The strongest relationship I found, measured on the basis of correlation coefficient,
is between pH and fixed.acidity. The r value is **-0.68**.
------
# Multivariate Plots Section
```{r echo=FALSE, Multivariate_Plots}
#volatile acidity vs total acidity faceted by quality
ggplot(aes(x = volatile.acidity, y = pH), data = rdw)+
scale_x_log10(breaks = c(0, 0.2, 0.4, 0.8, 1.2))+
geom_jitter(alpha = 1/4, color = "#3b67dd")+
geom_smooth(method= "lm", color="red")+
facet_wrap(~class)
```
There isn't much difference to the relation between pH and volatile acidity, across
quality. Let's check it w.r.t. alcohol levels as it seems to be a major factor.
```{r echo = FALSE, message=FALSE}
ggplot(aes(x = alcohol, y = volatile.acidity), data = rdw)+
geom_jitter(alpha = 1/4, color = "#3b67dd")+
geom_smooth(method = "lm", color = "red")+
facet_wrap(~class)
```
Clearly shows that higher volatile acidity content leads to poor quality. The declining
trend in **normal** class, confirms this. I will crosscheck it's scientific reason, and
document it at a later stage.
```{r echo = FALSE, message=FALSE}
# Alcohols and sulphates dsitributed by class
ggplot(aes(x = alcohol, y = sulphates), data = rdw)+
geom_jitter(alpha = 1/4, color = "#3b67dd")+
facet_wrap(~class)
```
Exceptional quality wines tend to have larger quantities of sulphates and alcohol,
relatively. It could be that higher levels of alcohol content coupled with the same
amount of sulphates may also be a factor here.
```{r echo=FALSE}
#Citric acid and pH across class and quality
ggplot(aes(x = citric.acid, y = pH, color = quality), data = rdw)+
geom_jitter(alpha = 1/4)+
geom_smooth(method = "lm")+
facet_wrap(~class)+
scale_color_brewer(type = "seq")+
theme_dark()
ggplot(aes(y = citric.acid, x = fixed.acidity, color = quality), data = rdw)+
geom_jitter(alpha = 1/2)+
geom_smooth(method = "lm")+
facet_wrap(~class)+
scale_color_brewer(type = "seq")+
theme_dark()
```
Clearly, fixed acidity is directly correlated with amount of citric acid. And it
leads to lower pH values which aid to better ratings.
Let's once check again the correlation coefficients of the above analysed variables
with respect to quality only.
```{r echo=FALSE}
#Subsetting data to calculate r vs quality
qual_factors <- select(rdw, alcohol, citric.acid, volatile.acidity, sulphates)
quality_cor <- cor(qual_factors, as.numeric(rdw$quality))
quality_cor
```
# Multivariate Analysis
### Interesting Observations
The final analysis yieldid a significant relationship between **Quality and **
**Alcohol levels, sulphates and volatile acidity**. To a certain extent,
**citric acid** also favours quality, by increasing the fixed acidity and hence the pH.
### Other Relevant Observations
1. Volatile acid finally cleared up to hamper quality with its increasing levels.
2. Citric acid is the major contributor to pH.
3. Sulphates, though generally preferable, may not add any value after a certain amount.
*The above mentioned observations cannot be finalised based on the data provided, but I
do feel that a larger dataset would've made them concrete.*
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
# Sulphates vs Quality
ggplot(aes(x = quality, y = sulphates, color = quality), data = rdw)+
geom_boxplot(alpha = 0.5)+
stat_summary(geom = "point", fun.y = mean, color = "red", shape = 3)+
scale_color_brewer(type = "seq")+
theme_dark()+
labs(title = "Sulphate Quantity vs Quality",
x = "Quality Rating",
y = "Potassium Sulphate (g/dm^3)")+
theme(plot.title = element_text(hjust = 0.5))
# Alcohols and sulphates dsitributed by class
ggplot(aes(x = alcohol, y = sulphates), data = rdw)+
geom_jitter(alpha = 1/4, color = "#3b67dd")+
geom_smooth(method = "lm", se = FALSE, color = "#d6224c")+
facet_wrap(~class)+
labs(title = "Alcohol vs Sulphates (per Quality Class)",
x = "Alcohol (% by Volume)", y = "Potassium Sulphate (gm/dm^3)")+
theme(plot.title = element_text(hjust = 0.5))
```
### Description One
These two plots together quite reasonably show the importance of sulphates in Wine Quality. The first plot has a clear **upward trend in Potassium sulphate** amount across quality. *The mean moves mostly along the median line.* This established this factor as an important ingredient. The second plot shows how crucial a role does Sulphate amount play across classes, along with the most prominent ingredient, alcohol. It's clear that for poor quality wines, the sulphate amount drops down, for normal quality wines, it shows an increasing trend with in increase in alcohol levels, but for exceptional classes, it doesn't really make a significant difference. At higher alcohol levels, a bare threshold amount of sulphate compounds is sufficient.
### Plot Two
```{r echo=FALSE, Plot_Two}
# volatile acidity vs quality
ggplot(aes(x = quality, y = volatile.acidity, group = 1), data = rdw)+
geom_jitter(alpha = 1/3, color = "#3b67dd")+
stat_summary(geom = "line", fun.y = "quantile", fun.args = list(probs = 0.97),
color = "#d6224c", linetype = 2)+
stat_summary(geom = "line", fun.y = "median", color = "black")+
stat_summary(geom = "line", fun.y = "quantile", fun.args = list(probs = 0.03),
color = "#d6224c", linetype = 2)+
labs(title = " Quality vs Volatile Acidity", x = "Quality Rating",
y = "Volatile Acidity (gm/dm^3)")+
theme(plot.title = element_text(hjust = 0.5))
#volatile acidity vs total acidity faceted by quality
ggplot(aes(x = volatile.acidity, y = pH), data = rdw)+
scale_x_log10(breaks = c(0, 0.2, 0.4, 0.8, 1.2 ))+
geom_jitter(alpha = 1/4, color = "#3b67dd")+
geom_smooth(method= "lm", color="#d6224c", se = FALSE)+
facet_wrap(~class)+
labs(title = "Volatile Acidity (log10) vs pH (per Quality Class)", y = "pH",
x = "Volatile Acidity (g/dm^3)")+
theme(plot.title = element_text(hjust = 0.5))
```
### Description Two
These two plots together serve the purpose of clarifying the mystery behind volatile
acidity. After I confirmed this relationship through these charts, I was sure about
one thing that Volatile Acidity (VA) is actually not contributing to the acidity of the wine!
And after searching the internet, I found out that increased levels of
**acetic acid, the source of volatile acidity**, implies overfermentation, which
spoils the wine. It's undesireable!
And secondly, a **low pH value inhibits development of microorganisms!** Hence,
at higher pH values, we can see higher VA values! So VA becomes a definite factor in deciding quality!
### Plot Three
```{r echo=FALSE, Plot_Three}
ggplot(aes(y = citric.acid, x = fixed.acidity, color = quality), data = rdw)+
geom_jitter(alpha = 1/4)+
geom_smooth(method = "lm", se = FALSE)+
facet_wrap(~class)+
theme_dark()+
scale_color_brewer(type = "seq")+
labs(title = "Fixed Acidity vs Citric Acid", y = "Citric Acid (g/dm^3)",
x = "Fixed Acidity (g/dm^3)", color = "Quality")+
theme(plot.title = element_text(hjust = 0.5))
```
### Description Three
Though not exactly giving a relationship towards quality, this graph quite clearly
depicts the linear relationship between fixed acidity and **citric acid**.
Fixed acidity is composed of **tartaric acid** in majority, and it seems that
citric acid is added in proportion, to enhance the fruitiness of the wine.
------
# Reflection
Having no previous perceptory or scientific knowledge of Red wines, it was initially
difficult for me to get a feel of the provided information. After the initial
analysis, it got cleared up to a certain degree. I approached this analysis with the
methodology of generating some conclusions through my analysis and then verifying them
scientifically. This strategy worked in my favour.
Throughout this analysis, I did feel that the outcomes were being hindered by the
amount of data. The dataset contains most values in the normal quality class, which
makes it biased to a certain extent. I couldn't decide if modelling with such a
dataset would yield conclusive results or not.
The main noteworthy finding for me was the presence of Volatile Acidity in Red Wines.
When I saw the declining trend with Quality in the Univariate section, it intrigued me
to see reversal in trends between volatile acidity and fixed acidity. Through further
analysis in Bivariate and Multivariate sections, I strengthened this hypothesis. And
the scientific reasoning for it confirmed that higher levels of Volatile acidity are
not preferred and it's increase is aided by higher pH values.
I do wish that this dataset was larger in size and without skewed data for a single
quality measure, for effective modelling purposes. Although I have never tried wine,
I also believe that these quality ratings may be subjective, and dependent on the
preferences of the ones grading them. So, for a better analysis, it would be prudent
to have multiple, universally agreed on parameters, to provide a score for. These can
be sweetness, sourness, density (ordered levels), freshness etc.. This, would enable
us data scientists to make a more thorough and exhaustive analysis, and create a more
robust model for predicting the quality of Red Wine.
-----------