---
title: "PS Final Research Project"
author: "Roman Mutel, Taras Yaroshko"
output: html_document
---
# Introduction
Since coffee was brought to Europe, it has become one of the most popular morning drinks and has given rise to a wide variety of cultures around growing, roasting, brewing and consuming coffee beans. Back in the day it replaced beer, fueling industrialization! And now it gives us an interesting idea for a research project for a probability and statistics course!
## Tested Hypotheses
The aim of this work is to test whether coffee consumption per capita is **normally distributed**. We will also check whether Europeans **consume more coffee** than Americans. Another interesting question is how strong the relation is between the amount of imported coffee beans and re-exported (roasted and packed) coffee: does the imported amount depend more on the consumers or on the roasters?
## Notes on the Data
The research is based on [International Coffee Organization](https://www.ico.org/new_historical.asp?section=Statistics) data, which covers coffee production, trade, consumption, etc. of ICO member countries from 1990 to 2019. We analyze only the 2019 statistics, since it is the latest available year and trade was not yet affected by COVID-19.
Another important data source is [World Development Indicators](https://data.worldbank.org/indicator/SP.POP.TOTL), which provides each country's population. Since we would like **identically distributed** random variables, using absolute values **is not correct**: a large country imports more coffee simply because it has more people. We therefore preprocess the data and transform it from *coffee-bags-per-country* into *kilograms-per-capita*. The data is preprocessed using the Python script one can find in the GitHub repository.
Another important point: Luxembourg's figures are abnormally large due to the size of the country and the high percentage of urban citizens. Therefore we decided to exclude it from our analysis to avoid anomalies. Sorry, Luxembourg.
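The per-capita conversion described above can be sketched as follows. The original preprocessing is a Python script that is not shown here, so this is only an illustrative R version. The function name `bags_to_kg_per_capita`, the 60 kg bag weight (the ICO's standard bag, assumed here to be the unit of the source tables) and the two example countries with their figures are all assumptions made for the example, not values from the dataset.

```{r}
# Convert a number of coffee bags per country into kilograms per capita.
# bag_kg = 60 is an assumption: the ICO's standard bag weight.
bags_to_kg_per_capita <- function(bags, population, bag_kg = 60) {
  stopifnot(all(population > 0))
  bags * bag_kg / population
}

# Hypothetical countries "A" and "B" (made-up figures, not ICO data)
demo_bags <- c(A = 1000000, B = 250000)
demo_population <- c(A = 10000000, B = 5000000)

demo_kg_per_capita <- bags_to_kg_per_capita(demo_bags, demo_population)
print(demo_kg_per_capita)
```

If the source table is expressed in thousands of bags, the input would have to be multiplied by 1000 first.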
# Data Overview
Our dataset contains data about 32 countries which import, consume and re-export coffee. The values are in kilograms per capita. Let us look at the main statistics of the dataframe and visualize it!
```{r}
# Load the preprocessed 2019 dataset (values in kg per capita)
coffee_data <- read.csv(file = "datasets/Coffee_Dataset_2019.csv")
coffee_data           # print the full table
summary(coffee_data)  # per-column summary statistics

# Histograms of the three per-capita series
hist(coffee_data$imports,
     col = "blanchedalmond",
     main = "Imports in kg per capita",
     xlab = "kg/person"
)
hist(coffee_data$reexports,
     col = "chocolate",
     main = "Re-exports in kg per capita",
     xlab = "kg/person"
)
hist(coffee_data$consumption,
     col = "brown",
     main = "Consumption in kg per capita",
     xlab = "kg/person"
)
```
At first glance, we can conclude that most of the countries re-export quite a small amount of coffee. This may be explained by tradition and/or by the location of the big ports (Belgium, Netherlands, Germany, and, unexpectedly, Slovenia and Switzerland are among the leaders).
Imports and consumption values look more "normally" distributed and there seem to be fewer anomalies. Let us proceed with the first test!
# Testing the Normality
As mentioned, imports and consumption look more "normally" distributed at first glance (even though Finland is far ahead of the other countries, most probably because of its short daylight hours). But we should not be fooled by appearances and should trust the numbers instead! To get the numbers, let us perform a **Kolmogorov-Smirnov (KS) goodness-of-fit test!**
Kolmogorov test for the **imports**:
```{r}
mu <- mean(coffee_data$imports)
sigma <- sd(coffee_data$imports)
ks.test(coffee_data$imports, "pnorm", mu, sigma)
```
The Kolmogorov test for normality of the imports gives a **p-value of 0.088**, which is small but still above the 0.05 significance level, so we cannot reject the null hypothesis that the data are normally distributed. Since the p-value is close to the threshold, however, it would not be wise to claim that the data are perfectly normal. Note also that `mu` and `sigma` are estimated from the same sample, which makes the reported p-value only approximate (it tends to be too large).
Kolmogorov test for the **consumption**:
```{r}
mu <- mean(coffee_data$consumption)
sigma <- sd(coffee_data$consumption)
ks.test(coffee_data$consumption, "pnorm", mu, sigma)
```
Unlike the case with the imports, the KS test on the consumption data results in a **p-value of 0.4**, which is large. Thus, we cannot reject the hypothesis that the data are normally distributed. We should keep in mind that the sample is small (32 observations), so the Kolmogorov test has limited power and can be **inaccurate**, and failing to reject does not prove normality. Still, the result is meaningful and shows that the **consumption of coffee is consistent with a normal distribution** among the selected countries.
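To make the Kolmogorov test less of a black box, the statistic it reports can be reproduced by hand: it is the largest vertical distance between the empirical distribution function of the sample and the hypothesized normal CDF. The sketch below uses simulated data as a stand-in (the variable names prefixed with `sim_` and the simulation parameters are assumptions for illustration, not the coffee dataset) and checks the manual value against `ks.test`.

```{r}
set.seed(1)
sim_x <- rnorm(32, mean = 5, sd = 2)  # simulated stand-in, not the coffee data

# Parameters estimated from the sample, as in the tests above
sim_mu_hat <- mean(sim_x)
sim_sigma_hat <- sd(sim_x)

# Hypothesized CDF evaluated at the ordered observations
sim_sorted <- sort(sim_x)
sim_n <- length(sim_sorted)
sim_cdf <- pnorm(sim_sorted, sim_mu_hat, sim_sigma_hat)

# The empirical CDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the largest gap has to be checked on both sides of each jump
sim_d_plus <- max(seq_len(sim_n) / sim_n - sim_cdf)
sim_d_minus <- max(sim_cdf - (seq_len(sim_n) - 1) / sim_n)
sim_d_manual <- max(sim_d_plus, sim_d_minus)

sim_d_builtin <- unname(ks.test(sim_x, "pnorm", sim_mu_hat, sim_sigma_hat)$statistic)
print(c(manual = sim_d_manual, builtin = sim_d_builtin))
```

The two numbers should agree, which confirms that the test only looks at the single largest deviation between the two curves.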
# Testing the equality of means
An interesting question is who drinks more coffee on average: Europeans or Americans? We all know that every American starts their day with a fresh cup of coffee, but, at the same time, there are lots of European countries that are famous for their delicious cappuccino or espresso. So, who is the winner here?
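To determine it, we will use the **t-test**, which is applicable when we want to compare a mean against a reference value and the population variance is unknown.
We state the null and alternative hypotheses as $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$, where $\mu$ is the mean per-capita consumption in the European countries and $\mu_0$ is the US value. For the one-sided alternative $H_1$ we use the critical region $C_{\alpha} = \{x \in \mathbb{R}^n \mid t(x) \geq t_{1-\alpha}^{(n-1)}\}$, where $t_{1-\alpha}^{(n-1)}$ is the $(1-\alpha)$-quantile of the Student distribution with $n-1$ degrees of freedom.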
```{r}
us_mean <- coffee_data$consumption[32]  # row 32 holds the United States
european_data <- coffee_data$consumption[1:31]  # rows 1 to 31 (R indexing starts at 1)
t.test(european_data, mu = us_mean, alternative = "greater")
```
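To make the critical region concrete, the one-sided test can also be carried out by hand: compute $t(x) = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and compare it with the quantile $t_{1-\alpha}^{(n-1)}$. The sketch below is only an illustration on simulated data. The sample, the reference value `sim_mu0` and the level `sim_alpha` are assumptions chosen for the example and are unrelated to the coffee dataset.

```{r}
set.seed(42)
sim_sample <- rnorm(31, mean = 5.5, sd = 2.5)  # hypothetical sample
sim_mu0 <- 4.5                                 # hypothetical reference mean
sim_alpha <- 0.05
sim_size <- length(sim_sample)

# Test statistic and critical value of the one-sided t-test
sim_t_stat <- (mean(sim_sample) - sim_mu0) / (sd(sim_sample) / sqrt(sim_size))
sim_t_crit <- qt(1 - sim_alpha, df = sim_size - 1)

# p-value: probability of a statistic at least this large under H0
sim_p_manual <- pt(sim_t_stat, df = sim_size - 1, lower.tail = FALSE)

# Reject H0 exactly when the statistic falls into the critical region
sim_reject <- sim_t_stat >= sim_t_crit

sim_builtin <- t.test(sim_sample, mu = sim_mu0, alternative = "greater")
print(c(t_manual = sim_t_stat, t_builtin = unname(sim_builtin$statistic)))
print(c(p_manual = sim_p_manual, p_builtin = sim_builtin$p.value))
```

Rejecting when the statistic exceeds the critical value is equivalent to rejecting when the p-value is at most $\alpha$, so both views lead to the same decision.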
## Summary
What does this tell us? Well, the **p-value** that we got is 0.1086, therefore we fail to reject our null hypothesis (which states that the mean coffee consumption in European countries is equal to the US consumption rate). This means that, although the daily life of those who consume coffee in Europe and the US may be different, our data give no significant evidence that Europeans drink more coffee per capita than Americans.
# Testing the relation between import and re-export
Who is more responsible for what happens with imported coffee: domestic consumers, or roasters, who process and resell coffee beans?
Let's start with defining our linear model:
```{r}
import <- coffee_data$imports
reexport <- coffee_data$reexports
import_reexport_lm <- lm(reexport~import)
plot(import, reexport, pch=19, col="blue")
abline(import_reexport_lm, col = "red")
print(summary(import_reexport_lm))
```
```{r}
cat("Correlation coefficient between re-export and import is: ", cor(reexport, import))
```
As we see, the correlation coefficient between re-export and import is very high (0.9196934). Recall that a correlation of -1.0 indicates a perfect negative linear relationship and a correlation of 1.0 a perfect positive one. We can therefore conclude that there is a surprisingly strong positive linear relationship between the two quantities: countries that import more coffee per capita also tend to re-export more. The correlation alone does not tell us by how much re-export changes per unit of import; that is given by the slope of the regression line fitted above.
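The link between the correlation coefficient and the regression slope can be checked numerically: in a simple linear regression the slope equals $r \cdot s_y / s_x$, so the same correlation can go with very different magnitudes of change. The following is a minimal sketch on simulated data. The variables `sim_imp` and `sim_reexp` and the numbers used to generate them are assumptions for illustration only.

```{r}
set.seed(3)
sim_imp <- runif(32, min = 2, max = 12)            # hypothetical imports
sim_reexp <- 0.6 * sim_imp + rnorm(32, sd = 0.8)   # hypothetical re-exports

sim_r <- cor(sim_reexp, sim_imp)
sim_fit <- lm(sim_reexp ~ sim_imp)

# Slope of the fitted line vs. the value implied by the correlation
sim_slope_lm <- unname(coef(sim_fit)["sim_imp"])
sim_slope_from_r <- sim_r * sd(sim_reexp) / sd(sim_imp)

print(c(correlation = sim_r, slope_lm = sim_slope_lm, slope_from_r = sim_slope_from_r))
```

Rescaling `sim_reexp` (for example, expressing it in grams instead of kilograms) would change the slope by the same factor but leave the correlation untouched.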
```{r}
consumption <- coffee_data$consumption
# Regress consumption on import (the model name now matches the variables used)
consumption_import_lm <- lm(consumption ~ import)
plot(import, consumption, pch = 19, col = "blue")
abline(consumption_import_lm, col = "red")
print(summary(consumption_import_lm))
```
```{r}
cat("Correlation coefficient between consumption and import is: ", cor(consumption, import))
```
## Summary
On the other hand, the correlation coefficient between consumption and import is only a little higher than **0.5**, so these quantities are *moderately correlated*. This is an interesting observation: we usually associate more import with more consumption, yet in this sample imports track re-exports (and hence the roasting industry) much more closely than they track domestic consumption.
Also, we have to remember that correlation does not imply causation: many other factors (often called confounding variables) may be involved, and the more accurate an explanation we want, the more variables we have to include in the analysis to understand the data better.
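One way to compare the two correlations reported above is through the share of variance explained: in a simple linear regression, $R^2$ equals the squared correlation coefficient. A correlation of about 0.92 thus corresponds to roughly 85% of the variance explained, while a correlation of about 0.5 corresponds to roughly 25%. The sketch below verifies the identity on simulated data. All `sim_` variables and their generating parameters are assumptions for illustration, not the coffee dataset.

```{r}
set.seed(7)
sim_pred <- runif(32, min = 2, max = 12)             # hypothetical predictor
sim_strong <- 0.8 * sim_pred + rnorm(32, sd = 1)     # tightly related response
sim_weak <- 0.3 * sim_pred + rnorm(32, sd = 2)       # loosely related response

# Squared correlations
sim_r2_from_cor <- c(strong = cor(sim_pred, sim_strong)^2,
                     weak = cor(sim_pred, sim_weak)^2)

# R-squared reported by the fitted linear models
sim_r2_from_lm <- c(strong = summary(lm(sim_strong ~ sim_pred))$r.squared,
                    weak = summary(lm(sim_weak ~ sim_pred))$r.squared)

print(round(rbind(sim_r2_from_cor, sim_r2_from_lm), 3))
```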