-
Notifications
You must be signed in to change notification settings - Fork 0
/
Assignment_1.qmd
128 lines (92 loc) · 6.07 KB
/
Assignment_1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
title: "Assignment 1"
author: "Gözde Uğur"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
toc: true
toc_depth: 1
number_sections: true
editor: visual
---
## 1.1 About Me
I'm **Gözde Uğur**. I've been working as Senior Data Analyst at Trendyol for almost 2 years. Since graduating from university, I have worked in various data analyst roles where I mainly used SQL. I want to make sophisticated analyzes and visualizations of data with what I learned in this program. You can check [My Linkedin Page](https://www.linkedin.com/in/gozde-ugur-000/) for further info.
## 1.2 Bridging the Gap between SQL and R
::: {style="text-align: center;"}
<iframe width="650" height="350" src="https://www.youtube.com/embed/JwP5KdWSgqE" frameborder="0" allowfullscreen style="display: inline-block;">
</iframe>
:::
The reason I chose this video called [*Bridging the Gap between SQL and R*](https://www.youtube.com/watch?v=JwP5KdWSgqE&list=PL9HYL-VRX0oTOK4cpbCbRk15K2roEgzVW&t=1094s) was that I was curious about ways to use SQL, which I use frequently in daily life, in R. In this video, I learned an R package, tidyquery, with which I can run SQL Queries directly.
## 1.3 Dataset
### Amazon Products Dataset 2023
```{r}
library(dplyr)
# Using read.csv()
myData = read.csv("/Users/gozde.ugur/Downloads/archive (3)/amazon_products.csv")
en_cok_satanlar <- myData %>%
arrange(desc(boughtInLastMonth)) %>% # satisadedi sütununa göre azalan sırada sırala
head(5) # İlk 5 kaydı getirilk_5_kayit <- head(myData, 5)
# Show only selected columns
secilen_sutunlar <- en_cok_satanlar[, c("title","stars", "price" ,"boughtInLastMonth")]
print(secilen_sutunlar)
```
Since I work in the e-commerce industry, [**Amazon Products Dataset 2023**](https://www.kaggle.com/datasets/asaniczka/amazon-products-dataset-2023-1-4m-products) attracted my attention. Reasons why I find this dataset useful for out course:
1\. We may have the opportunity to get more insight about the trends and behaviors in the e-commerce industry, where we are usually on the customer side.
2\. This dataset allows us to work with different data types such as boolean, integer, float and character.
3\. This dataset contains 1.4M records. Although we work with much larger datasets in real business life, working with this dataset can also be a good opportunity to learn and overcome the difficulties of large datasets.
## 1.4 R posts relevant to my interests
## 1.4.1 R Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data distribution. However, in a histogram values are grouped into consecutive intervals called bins. In a Histogram, continuous values are grouped and displayed in these bins whose size can be varied.
**Example:**
```{r}
# Histogram for Maximum Daily Temperature
data(airquality)
hist(airquality$Temp, main ="La Guardia Airport's\
Maximum Temperature(Daily)",
xlab ="Temperature(Fahrenheit)",
xlim = c(50, 125), col ="yellow",
freq = TRUE)
```
## 1.4.2 Hypothesis Testing in R Programming
hypothesis is made by the researchers about the data collected for any experiment or data set. A hypothesis is an assumption made by the researchers that are not mandatory true. In simple words, a hypothesis is a decision taken by the researchers based on the data of the population collected. [**Hypothesis Testing**](https://www.geeksforgeeks.org/understanding-hypothesis-testing/) in [**R Programming**](https://www.geeksforgeeks.org/introduction-to-r-programming-language/) is a process of testing the hypothesis made by the researcher or to validate the hypothesis. To perform hypothesis testing, a random sample of data from the population is taken and testing is performed. Based on the results of the testing, the hypothesis is either selected or rejected. This concept is known as **Statistical Inference**. In this article, we'll discuss the four-step process of hypothesis testing, One sample T-Testing, Two-sample T-Testing, Directional Hypothesis, one sample ![\\mu](https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-be5d4c2c19a6cdf008a0e995c1d7a684_l3.svg "Rendered by QuickLaTeX.com"){alt="\\mu "}-test, two samples ![\\mu](https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-be5d4c2c19a6cdf008a0e995c1d7a684_l3.svg "Rendered by QuickLaTeX.com"){alt="\\mu "}-test and correlation test in R programming.
#### One Sample T-Testing
One sample T-Testing approach collects a huge amount of data and tests it on random samples. To perform T-Test in R, normally distributed data is required. This test is used to test the mean of the sample with the population. For example, the height of persons living in an area is different or identical to other persons living in other areas.
> **Syntax:** t.test(x, mu) **Parameters:** **x:** represents numeric vector of data **mu:** represents true value of the mean
To know about more optional parameters of **t.test()**, try the below command:
```
help("t.test")
```
**Example:**
```{r}
# Defining sample vector
x <- rnorm(100)
# One Sample T-Test
t.test(x, mu = 5)
```
## 1.4.3 Reading a .csv File in R
**read.csv()**: read.csv() is used for reading "comma separated value" files (".csv"). In this also the data will be imported as a data frame.
> **Syntax:** read.csv(file, header = TRUE, sep = ",", dec = ".", ...)
>
> **Parameters:**
>
> - file: the path to the file containing the data to be imported into R.
>
> - header: logical value. If TRUE, read.csv() assumes that your file has a header row, so row 1 is the name of each column. If that's not the case, you can add the argument header = FALSE.
>
> - sep: the field separator character
>
> - dec: the character used in the file for decimal points.
>
> ```{r}
> library(dplyr)
> # Using read.csv()
> myData = read.csv("/Users/gozde.ugur/Downloads/movies.csv")
> # Show only first 5 record
> comedy_filmler <- myData %>% filter(Genre == "Comedy")
>
> ilk_5_kayit <- head(comedy_filmler, 5)
> # Show only selected columns
> secilen_sutunlar <- ilk_5_kayit[, c("Film", "Genre", "Year")]
> print(secilen_sutunlar)
>
> ```