a2-robethx-Rob-Chiocchio #50

Open
wants to merge 10 commits into
base: main
415 changes: 415 additions & 0 deletions .gitignore

Large diffs are not rendered by default.

133 changes: 19 additions & 114 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,140 +3,45 @@
Assignment 2 - Data Visualization, 5 Ways
===

Now that you have successfully made a "visualization" of shapes and lines using d3, your next assignment is to successfully make an *actual visualization*... 5 times.

The goal of this project is to gain experience with as many data visualization libraries, languages, and tools as possible.

I have provided a small dataset about cars, `cars-sample.csv`.
Each row contains a car and several variables about it, including miles-per-gallon, manufacturer, and more.

Your goal is to use 5 different tools to make the following chart:

![ggplot2](img/ggplot2.png)

These features should be preserved as much as possible in your replication:

- Data positioning: it should be a downward-trending scatterplot as shown. Weight should be on the x-axis and MPG on the y-axis.
- Scales: Note the scales do not start at 0.
- Axis ticks and labels: both axes are labeled and there are tick marks at 10, 20, 30, etcetera.
- Color mapping to Manufacturer.
- Size mapping to Weight.
- Opacity of circles set to 0.5 or 50%.
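
The scale, size, and opacity requirements above can be sketched in plain JavaScript with no charting library. This is only an illustration: the helper `linearScale`, the function `toCircle`, and the domain and pixel-range numbers are assumptions chosen for the example, not values taken from the reference chart.

```javascript
// Hypothetical helper: returns a function that maps a value in `domain`
// linearly onto `range` (the same idea as a linear chart scale).
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Assumed axis extents; note that neither domain starts at 0.
const xScale = linearScale([1500, 5000], [0, 900]); // Weight -> x pixels
const yScale = linearScale([5, 50], [450, 0]);      // MPG -> y pixels (inverted so larger MPG is drawn higher)
const rScale = linearScale([1500, 5000], [2, 10]);  // Weight -> circle radius

// Turn one data row into the visual attributes of a circle.
function toCircle(row) {
  return {
    cx: xScale(row.Weight),
    cy: yScale(row.MPG),
    r: rScale(row.Weight),
    opacity: 0.5, // 50% opacity, as required
  };
}

console.log(toCircle({ Weight: 3250, MPG: 27.5 }));
```

Keeping the scales separate from the drawing code makes it easy to note, per tool, which of the requirements could not be reproduced exactly.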

Other features are not required. This includes:

- The background grid.
- The legends.

Note that some software packages will make it **impossible** to perfectly preserve the above requirements.
Be sure to note where these deviate.

Improvements are also welcome as part of Technical and Design achievements.

Libraries, Tools, Languages
Matplotlib + Seaborn + Pandas + Jupyter
---

You are required to use 5 different tools or libraries.
Of the 5 tools, you must use at least 3 libraries (libraries require code of some kind).
This could be `Python, R, Javascript`, or `Java, Javascript, Matlab` or any other combination.
Dedicated tools (i.e. Excel) do not count towards the language requirement.

Otherwise, you should seek tools and libraries to fill out your 5.

Below are a few ideas. Do not limit yourself to this list!
Some may be difficult choices, like Matlab or SPSS, which require large installations, licenses, and occasionally difficult UIs.

I have marked a few that are strongly suggested.

- R + ggplot2 `<- definitely worth trying`
- Excel
- d3 `<- since the rest of the class uses this, we're requiring it`
- Matplotlib
- three.js `<- well, it's a 3d library. not really recommended, but could be "interesting"`
- p5js `<- good for playing around. not really a chart lib`
- Tableau
- Java 2d
- GNUplot
- Vega-lite `<- recently much better. look for the high level js implementations`
- Flourish `<- popular last year`
- PowerBI
- SPSS
Python data science tools are by far where the majority of my experience lies. Seaborn + Pandas in a Jupyter notebook is my go-to for assignments like this, as they provide an easy way to import datasets and generate plots. In fact, I have previously used this toolset to generate this exact plot with an unabridged version of the dataset. Seaborn is a data visualization library that extends Matplotlib to make nicer-looking charts that interface better with other tools like Pandas, my go-to dataset creation and manipulation library.

You may write everything from scratch, or start with demo programs from books or the web.
If you do start with code that you found, please identify the source of the code in your README and, most importantly, make non-trivial changes to the code to make it your own so you really learn what you're doing.
![matplotlib](img/matplotlib.png)

Tips
R + ggplot2 + R Markdown
---

- If you're using d3, key to this assignment is knowing how to load data.
You will likely use the [`d3.json` or `d3.csv` functions](https://github.com/mbostock/d3/wiki/Requests) to load the data you found.
Beware that these functions are *asynchronous*, meaning it's possible to "build" an empty visualization before the data actually loads.

- *For web languages like d3* Don't forget to run a local webserver when you're debugging.
See this [ebook](http://chimera.labs.oreilly.com/books/1230000000345/ch04.html#_setting_up_a_web_server) if you're stuck.
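
The asynchronous-loading pitfall from the first tip can be sketched without d3. In the example below, `loadCsv` is a hypothetical stand-in for a loader such as `d3.csv`: it resolves after a short delay with made-up inline sample rows instead of fetching a file. `parseCsv` is a deliberately simple parser that assumes no quoted fields. The only point is that drawing code must run after the promise resolves.

```javascript
// Simplified CSV parser: assumes a header row and no quoted or escaped commas.
function parseCsv(text) {
  const [header, ...lines] = text.trim().split("\n");
  const columns = header.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    return Object.fromEntries(columns.map((name, i) => [name, cells[i]]));
  });
}

// Hypothetical asynchronous loader standing in for d3.csv / d3.json.
// The two rows are illustrative sample values, not rows from cars-sample.csv.
function loadCsv() {
  const sample = "Manufacturer,MPG,Weight\nford,18,3504\nhonda,33,1795";
  return new Promise((resolve) => {
    setTimeout(() => resolve(parseCsv(sample)), 10);
  });
}

let rows = [];

loadCsv().then((data) => {
  rows = data;
  // Safe: the data has arrived, so the chart can be built here.
  console.log("rows after load:", rows.length);
});

// This line runs first: the promise has not resolved yet, so `rows` is still empty.
console.log("rows before load:", rows.length);
```

Building the chart inside the `.then` callback (or after an `await`) avoids drawing an empty visualization.
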
The R + ggplot2 + Rmd workflow is incredibly similar to the Python-based one described above. It's fairly simple and it allowed me to create the scatterplot quickly and with ease. Despite all this, I still prefer the previous method due to the surplus of quality data analysis tools available for Python. R + Rmd and Python + Jupyter fill a similar niche for me and, if given the choice between the two, it's unlikely I'll pick this method in the future.

![ggplot2](img/ggplot2.png)

Readme Requirements
d3
---

A good readme with screenshots and structured documentation is required for this project.
It should be possible to scroll through your readme to get an overview of all the tools and visualizations you produced.
My d3 visualization is just a basic scatterplot without any extra features like grid lines or a legend. While I don't find d3 hard to use, it can be very tedious, especially in small projects like this. I think d3 really starts to shine when paired with helper functions or libraries to simplify the SVG generation. As the goal was just one static chart, I do not think d3 was the best tool for the job. Despite d3's shortcomings on projects like this, I will continue to use d3 when I need to generate dynamic data visualizations in the future.

- Each visualization should start with a top-level heading (e.g. `# d3`)
- Each visualization should include a screenshot. Put these in an `img` folder and link through the readme (markdown command: `![caption](img/<imgname>)`).
- Write a paragraph for each visualization tool you use. What was easy? Difficult? Where could you see the tool being useful in the future? Did you have to use any hacks or data manipulation to get the right chart?
![d3](img/d3.png)
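
As one illustration of the kind of helper function mentioned above, a lookup table can stand in for a long `switch` statement when mapping manufacturers to colors. This is a sketch rather than the code in this pull request: the name `makeColorMapper` is hypothetical, while the manufacturer names and colors mirror the ones used in `d3/main.js`.

```javascript
// Hypothetical helper: build a category -> color function from a lookup table.
function makeColorMapper(palette, fallback = "black") {
  // A Map avoids accidental hits on inherited object keys such as "constructor".
  const lookup = new Map(Object.entries(palette));
  return (category) => (lookup.has(category) ? lookup.get(category) : fallback);
}

const pointColor = makeColorMapper({
  bmw: "green",
  ford: "blue",
  honda: "red",
  mercedes: "purple",
  toyota: "orange",
});

console.log(pointColor("honda"));   // a known manufacturer
console.log(pointColor("unknown")); // falls back to the default color
```

With a helper like this, adding a manufacturer is a one-line change to the table, and the same function can be passed directly to an attribute callback.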

Other Requirements
Excel
---

0. Your code should be forked from the GitHub repo.
1. Place all code, Excel sheets, etcetera in a named folder. For example, `r-ggplot, matlab, mathematica, excel` and so on.
2. Your writeup (readme.md in the repo) should also contain the following:

- Description of the Technical achievements you attempted with this visualization.
- Some ideas include interaction, such as mousing over to see more detail about the point selected.
- Description of the Design achievements you attempted with this visualization.
- Some ideas include consistent color choice, font choice, element size (e.g. the size of the circles).

GitHub Details
---
Excel was by far my least favorite tool to work with for this project. I have used Excel quite a bit in the past, and am well acquainted with its many limitations. I was initially going to attempt to control the size (and maybe color) of each point with a VBA script as there is no good built-in way to do so, but abandoned that idea after spending more time than I should have struggling to get the manufacturer labels to play nice with the plot. I should have spent more time figuring out bubble charts, but I desperately wanted to be done with Excel. I know I will have to use Excel more in the future, but I will always pursue alternative tools first.

- Fork the GitHub Repository. You now have a copy associated with your username.
- Make changes to fulfill the project requirements.
- To submit, make a [Pull Request](https://help.github.com/articles/using-pull-requests/) on the original repository.
![excel](img/excel.png)

Grading
SPSS
---

Grades on a 120 point scale.
24 points will be based on your Technical and Design achievements, as explained in your readme.

Make sure you include the files necessary to reproduce your plots.
You should structure these in folders if helpful.
We will choose some at random to run and test.

**NOTE: THE BELOW IS A SAMPLE ENTRY TO GET YOU STARTED ON YOUR README. YOU MAY DELETE THE ABOVE.**

# R + ggplot2 + R Markdown

R is a language primarily focused on statistical computing.
ggplot2 is a popular library for charting in R.
R Markdown is a document format that compiles to HTML or PDF and allows you to include the output of R code directly in the document.

To visualize the cars dataset, I made use of ggplot2's `geom_point()` layer, with aesthetics functions for the color and size.

While it takes time to find the correct documentation, these functions made the effort creating this chart minimal.

![ggplot2](img/ggplot2.png)

# d3...

(And so on...)
Visualizing the data using SPSS took surprisingly little time. I have never used SPSS before (only PSPP, a FOSS counterpart), but it was easy to import and plot the data. As an SPSS license is provided by WPI, I will definitely use it again in the future. The only quirk in importing `cars-sample.csv` into SPSS was that I had to select and re-label the columns to convert the dataset to SPSS's format.

![spss](img/spss.png)

---
## Technical Achievements
- **Proved P=NP**: Using a combination of...
- **Solved AI Forever**: ...
- Dynamically linked the CSV dataset to the Excel sheet

### Design Achievements
- **Re-vamped Apple's Design Philosophy**: As demonstrated in my colorscheme...
- Themed Matplotlib's usually ugly plots with Seaborn
- Generated a [nice-looking table for the Rmd notebook](ggplot2/ggplot2.html)
10 changes: 10 additions & 0 deletions d3/d3.html
@@ -0,0 +1,10 @@
<!DOCTYPE html>
<html lang="en">
<head>
<script src="https://d3js.org/d3.v7.min.js"></script>
</head>

<body>
<svg id="scatterplot"></svg>
<script src="main.js"></script>
</body>
86 changes: 86 additions & 0 deletions d3/main.js
@@ -0,0 +1,86 @@
console.log(d3);

//datapath = "../cars_example.csv";
// Loading from the raw GitHub URL is an easy fix for the CORS issue with a local path
const datapath = "https://raw.githubusercontent.com/cs480x-21c/02-DataVis-5Ways/main/cars-sample.csv";

var margin = {top: 20, right: 20, bottom: 30, left: 40},
    width = 960 - margin.left - margin.right,
    height = 500 - margin.top - margin.bottom;

var svg = d3.select("#scatterplot")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

function pointColor(manufacturer) {
  switch (manufacturer) {
    case "bmw":
      return "green";
    case "ford":
      return "blue";
    case "honda":
      return "red";
    case "mercedes":
      return "purple";
    case "toyota":
      return "orange";
    default:
      return "black";
  }
}

d3.csv(datapath).then(function(data) {
  // d3.csv yields strings, so convert the numeric columns on every row
  // (assigning to data.MPG would only set a property on the array itself)
  data.forEach(function(d) {
    d.MPG = parseFloat(d.MPG);
    d.Weight = parseInt(d.Weight, 10);
  });

  // drop rows whose MPG or Weight could not be parsed
  data = data.filter(function(d) {
    return !isNaN(d.MPG) && !isNaN(d.Weight);
  });

  var x = d3.scaleLinear() // x axis
      .domain([1500, 5000]) // d3.max(data, function(d) { return d.Weight; })
      .range([0, width]);
  svg.append("g")
      .call(d3.axisBottom(x))
      .attr("transform", "translate(0," + height + ")");
  svg.append("text")
      .attr("transform", "translate(" + (width / 2) + " ," + (height + margin.top + 5) + ")")
      .style("text-anchor", "middle")
      .text("Weight");

  var y = d3.scaleLinear() // y axis
      .domain([5, 50]) // d3.max(data, function(d) { return d.MPG; })
      .range([height, 0]);
  svg.append("g")
      .call(d3.axisLeft(y));
  svg.append("text")
      .attr("transform", "rotate(-90)")
      .attr("y", 0 - margin.left)
      .attr("x", 0 - (height / 2))
      .attr("dy", "1em")
      .style("text-anchor", "middle")
      .text("MPG");

  // radius grows linearly with weight
  var pointSize = d3.scaleLinear()
      .domain([0, d3.max(data, function(d) { return d.Weight; })])
      .range([2, 10]);

  svg.append("g") // points
    .selectAll("circle")
    .data(data)
    .enter()
    .append("circle")
      .attr("cx", function(d) { return x(d.Weight); })
      .attr("cy", function(d) { return y(d.MPG); })
      .attr("r", function(d) { return pointSize(d.Weight); })
      .attr("fill", function(d) { return pointColor(d.Manufacturer); })
      .attr("opacity", 0.5);
});

Binary file added excel/excel.xlsx
Binary file not shown.
29 changes: 29 additions & 0 deletions ggplot2/ggplot2.Rmd
@@ -0,0 +1,29 @@
---
title: "02-DataVis-5Ways R+ggplot2"
author: "Rob Chiocchio"
date: "8/13/2022"
output:
  html_document:
    df_print: paged
  pdf_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r load-packages, message=FALSE, warning=FALSE}
library(tibble)
library(ggplot2)
library(kableExtra)
```

```{r load-data}
df <- read.csv("../cars-sample.csv") %>% column_to_rownames(var="X")
kbl(head(df)) %>% kable_styling(bootstrap_options="striped", font_size=12)
```

```{r plot}
ggplot(data=df, aes(x=Weight, y=MPG, size=Weight, color=Manufacturer)) + geom_point(alpha=0.5)
```
483 changes: 483 additions & 0 deletions ggplot2/ggplot2.html

Large diffs are not rendered by default.

Binary file added img/d3.png
Binary file added img/excel.png
Binary file modified img/ggplot2.png
Binary file added img/matplotlib.png
Binary file added img/spss.png
19 changes: 19 additions & 0 deletions matplotlib/Pipfile
@@ -0,0 +1,19 @@
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
jupyter = "*"
matplotlib = "*"
ipympl = "*"
seaborn = "*"

[dev-packages]
nbconvert = "*"
pyppeteer = "*"

[requires]
python_version = "3.8"