Code for the IEEE ICMLA (International Conference on Machine Learning and Applications) session "The Data Science landscape: foundations, tools, and practical applications".
Quick "get started" guide:
- Clone this repository
cd
to the repository's directory- Optional: create a Python virtual environment
python3 -m venv env
source env/bin/activate
(Windows:env\Scripts\activate.bat
)python -m pip install --upgrade pip
pip install -r requirements.txt
jupyter lab
Notebook 1 is about understanding the pieces of information we have in the dataset, confirming that it has no missing values, and checking that each column has, in general, usable values (a few of them may need to be cleaned up; we will deal with that later).
We used:

- Pandas to read the data into a well-structured `DataFrame`.
- `shape`, `columns`, `dtypes`, and `head()` to investigate the basic structure of the dataset.
- `isnull()` to check for missing values.
- `describe()` and `unique()` to verify that columns are consistent with what we expect them to be.
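A minimal sketch of these inspection steps, assuming a hypothetical file and column name (`dataset.csv` and `some_column` are placeholders, not the actual data):

```python
import pandas as pd

# Load the data into a DataFrame (the file name is a placeholder)
df = pd.read_csv("dataset.csv")

# Basic structure: rows/columns, column names, types, and a first look at the data
print(df.shape)
print(df.columns)
print(df.dtypes)
print(df.head())

# Count missing values per column
print(df.isnull().sum())

# Check that the values are consistent with what we expect
print(df.describe())
print(df["some_column"].unique())  # hypothetical column name
```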
Notebook 2 describes how to clean up a dataset, removing outliers that are not relevant for the analysis. To do that, we first had to understand the domain of the data we were using (working-age population). We also removed attributes (columns) that were not relevant for the analysis.
Once we had a clean dataset, we collected enough evidence to call for action on possible gender discrimination by using:
- seaborn `distplot()` to review the distribution of dataset attributes.
- Box plots, with seaborn `boxplot()`, to inspect details of an attribute's distribution: its quartiles and outliers.
- Pandas `DataFrame` masks to filter out rows, for example to remove employees over a certain age or below an education level.
- seaborn `pairplot()` to view the relationships of all attributes of a dataset at a glance.
- Pandas' `cut()` to bin (group) attributes into larger categories.
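A minimal sketch of this clean-up workflow, with assumed names (`employees.csv`, `age`, and `education_years` are placeholders; `histplot()` stands in for `distplot()`, which is deprecated in recent seaborn versions):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder file and column names, not the actual dataset
df = pd.read_csv("employees.csv")

# Distribution of a single attribute (histplot() replaces the deprecated distplot())
sns.histplot(df["age"], kde=True)
plt.show()

# Box plot: quartiles and outliers of the same attribute
sns.boxplot(x=df["age"])
plt.show()

# DataFrame mask: keep only working-age rows with a minimum education level
working_age = df[(df["age"] >= 18) & (df["age"] <= 65) & (df["education_years"] >= 9)].copy()

# Relationships between all attributes at a glance
sns.pairplot(working_age)
plt.show()

# Bin a continuous attribute into larger categories
working_age["age_group"] = pd.cut(
    working_age["age"],
    bins=[18, 30, 45, 65],
    labels=["18-30", "31-45", "46-65"],
    include_lowest=True,
)
```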
Notebook 3 uses permutations of a dataset, with `np.random.permutation()`, to test hypotheses.
To prove (or disprove) a hypothesis, we:
- Inspected the dataset with `shape`, `columns`, `describe()`, and `info()`.
- Checked for possible duplicated entries with `nunique()`.
- Performed a domain check (suspiciously low literacy rates) to verify that the data make sense. We found out that they match a reliable source.
- To make the code clearer, we split out of the dataset only the pieces of information we need and transformed some of them into a more convenient format (`fertility` and `illiteracy`).
- Established that there is a correlation visually (with a scatter plot) and formally (with the Pearson correlation coefficient).
- Once we confirmed that there is a correlation, we performed a large number of experiments to check whether the correlation exists by chance (with `np.random.permutation()`).
- To make our experiments reproducible, we set a seed for the pseudorandom generator (`np.random.seed(42)`).
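A minimal sketch of such a permutation test, under assumed names (`countries.csv` is a placeholder file; `fertility` and `illiteracy` follow the attributes named above):

```python
import numpy as np
import pandas as pd

# Placeholder load: the real notebook derives `fertility` and `illiteracy`
# from its own dataset; the file name here is an assumption.
df = pd.read_csv("countries.csv")
fertility = df["fertility"].to_numpy()
illiteracy = df["illiteracy"].to_numpy()

# Observed correlation between the two attributes
observed_r = np.corrcoef(illiteracy, fertility)[0, 1]

# Permutation test: shuffle one attribute many times and see how often a
# correlation at least as large as the observed one appears by chance.
np.random.seed(42)  # make the experiments reproducible
n_experiments = 10_000
permuted_r = np.empty(n_experiments)
for i in range(n_experiments):
    shuffled = np.random.permutation(illiteracy)
    permuted_r[i] = np.corrcoef(shuffled, fertility)[0, 1]

# Empirical p-value: fraction of permutations with correlation >= observed
p_value = np.mean(permuted_r >= observed_r)
print(f"observed r = {observed_r:.3f}, p-value = {p_value:.4f}")
```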
Notebook 4 uses machine learning to build a model that achieved over 80% accuracy with a few lines of code and without resorting to feature engineering or other transformations.
Along the way we also:
- Verified that the dataset is imbalanced and adjusted the code accordingly (`value_counts()`).
- Used stratified sampling to split the dataset and preserve the class ratios (`train_test_split(..., stratify=...)`).
- Used precision and recall to understand where the model makes mistakes (`classification_report()`).
- Visualized the mistakes with a confusion matrix (`confusion_matrix()`).
- Established a baseline with a simple model.
- Switched to a more complex model, improving the baseline results.
- Found an even better model with grid search (`GridSearchCV()`).
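A minimal sketch of this kind of workflow, with placeholder data and models (`labeled_data.csv`, the `label` column, and the specific classifiers are assumptions, not necessarily what the notebook uses):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder dataset: X holds (numeric) features, y the imbalanced labels
df = pd.read_csv("labeled_data.csv")
X, y = df.drop(columns="label"), df["label"]
print(y.value_counts(normalize=True))  # check how imbalanced the classes are

# Stratified split preserves the class ratios in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline: a simple model
baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))

# A more complex model, tuned with grid search
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred))  # precision/recall per class
print(confusion_matrix(y_test, y_pred))       # where the mistakes are
```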
If you found this repository useful, you may also want to check out these repositories:
- Introduction to data science: the first formal training I had in data science, as part of my master's work. A series of assignments from the class, covering different aspects of data science.
- Answering Questions with Data, bridging the gap between technical analysis and the stakeholders' point of view with Jupyter notebooks: a lecture delivered as a guest speaker for the data science class at my alma mater. How to write Python/Jupyter code that is easy to understand, easy to modify, reliable, and (within reason) understandable by the stakeholders.