index.Rmd

---
title: "Introduction to Data Science using Python"
subtitle: "A Hands-On Guide"
author: "Abhijit Dasgupta, Ph.D."
date: " Last updated: `r format(Sys.Date(), '%B %d, %Y')`"
github-repo: 'araastat/BIOF085'
site: bookdown::bookdown_site
output: bookdown::gitbook
papersize: letter
documentclass: scrbook
mainfont: "Alegreya"
monofont: "Source Code Pro"
monofont_options: "Scale=0.7"
geometry: "margin=1in"
# bibliography: [book.bib, packages.bib]
# biblio-style: apalike
# link-citations: yes
---

```{r index-1, include=FALSE}
reticulate::use_condaenv('ds', required = T)
knitr::opts_chunk$set(fig.width = 6, fig.height=6, 
                      fig.align='center',
                      out.width='90%',
                      cache=F, 
                      comment = "")
```

# About this guide

Data science is a broad field that covers data storage, organization, analysis, visualization and reporting. In this guide, we start with a basic primer on the Python language and scientific programming, then progress through the main Python tools for data ingestion, cleaning, manipulation, analyses and presentation. These will primarily consist of the basic usage of the following Python packages:

+ [pandas](https://pandas.pydata.org)
+ [statsmodels](https://www.statsmodels.org)
+ [scikit-learn](https://scikit-learn.org)
+ [matplotlib](https://matplotlib.org)
+ [seaborn](https://seaborn.pydata.org)

In my mind this forms the core packages in Python for data analyses and data science work that is applicable to a wide variety of domains.

There are obvious topics that I am not covering here:

+ Natural language processing (`nltk`)
+ Deep neural networks (`tensorflow`, `keras`, `PyTorch`)
+ Image analysis (`scikit-image`, `pillow`)

To my mind these are intermediate to advanced topics, though they are widely used, and so not as foundational to understanding how to use Python for Data Science.

#### How to use this manual {-}

This manual was originally developed to support a 3 day workshop on using Python for Data Science. Each chapter was covered over roughly half a day, using live coding through the examples. 

+ **Day 1: ** Chapters 2, 3, 4
+ **Day 2: ** Chapters 5, 6
+ **Day 3: ** Chapters 7, 8, 9

Each chapter has a fair bit of didactic content describing 
the methodology underlying the code, to help understand the 
context for the code. Several datasets were used, which are
available in the [Github repository](https://www.github.com/araastat/BIOF085.git) for the workshop. Since the workshop, I have discovered that many
of the data sets are available online through the Rdatasets package, and so could be loaded directly using `statsmodels`; the examples will slowly be modified accordingly.

Each chapter of this manual is also available as a Jupyter
notebook that can be run on your computer or on Binder, to 
allow you to run and modify the code that is presented. These 
notebooks are also available at the Github repository. 

## License

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>. The code samples in this manual are licensed under the [MIT License](https://mit-license.org)