docs: Minor additions
- Include a quote
- Include some links
- Minor formatting fixes
booleanhunter committed Jan 21, 2024
1 parent 426ea3b commit 2e295d2
Showing 3 changed files with 31 additions and 52 deletions.
12 changes: 7 additions & 5 deletions nbs/about.ipynb
@@ -13,27 +13,29 @@
"id": "523fcc3e-88d9-46b3-a6ea-950d47e03bb4",
"metadata": {},
"source": [
"This project draws inspiration from Joel Spolsky's insightful blog post on the importance of a [development abstraction layer](https://www.joelonsoftware.com/2006/04/11/the-development-abstraction-layer-2/). While the article was intended mainly for a project manager, I think it applies well even for a developer relations role. Eliminate any obstacles that hinder programmers from doing their job is super important. And often accessing relevant documentation promptly is one of them!\n",
"This project draws inspiration from Joel Spolsky's insightful blog post on the importance of a [development abstraction layer](https://www.joelonsoftware.com/2006/04/11/the-development-abstraction-layer-2/). While the article was intended mainly for a project manager, I think it applies just as well to a developer relations role. Eliminating any obstacles that hinder programmers from doing their job is super important, and not being able to access relevant documentation promptly is often one of them!\n",
"\n",
"As a developer advocate, I understand the need to make it straightforward for newcomers to grasp the concepts and utilize the functionality being described. But along with creating the tutorial, I pondered, why not also explore ways to enhance documentation and improve accessibility? The features incorporated into this website are an experimental exploration of this very idea. Here are some of them:\n",
"\n",
"1. **Search Feature**: As a developer, I frequently find myself recalling that I've come across a particular API or method in a tutorial but struggling to pinpoint where. A search feature would be incredibly helpful, eliminating the need for me to remember the exact location of each section.\n",
"2. **Table of Contents**: Include a sticky table of contents for easy navigation through the material that can be collapsed by the reader.\n",
"3. **Sidenotes**: Utilize the large unused margins on the left and right to incorporate sidenotes. These sidenotes can be used to provide quick explanations of code aspects without causing distraction. Developers can choose to ignore or collapse them based on their preferences (see example on right).\n",
"3. **Sidenotes**: Utilize the large unused margins on the left and right to incorporate sidenotes. These sidenotes can be used to provide quick explanations of code aspects without distracting readers or interrupting their flow. Developers may choose to read or skip them based on their preferences (see the example on the right).\n",
"\n",
"::: {.column-margin}\n",
"<details>\n",
" <summary>Collapsible sidenote</summary>\n",
"- `df.age.isnull()`: This part of the code returns a boolean series where each value is `True` if the corresponding value in the `age` column of `raw_data` is missing, and `False` otherwise.\n",
"- `df.age.isnull()`: This part of the code returns a boolean series where each value is `True` if the corresponding value in the `age` column of `df` is missing, and `False` otherwise.\n",
"</details>\n",
":::\n",
"\n",
"The above 3 features allow you to achieve the *skimmable principle*, as outlined in [Write the Docs](https://www.writethedocs.org/guide/writing/docs-principles/#skimmable).\n",
" \n",
"::: {.callout-note}\n",
"## 4. Callouts\n",
"Employ callouts to highlight particularly important or interesting points within the documentation.\n",
"Use callouts to highlight particularly important or interesting points within the article.\n",
":::\n",
"\n",
"5. **Report an Issue**: Offer a user-friendly \"report an issue\" button that directs readers to the GitHub repository's issues tab, if they face any difficulties while navigating the tutorial / docs.\n",
"5. **Report an Issue**: Offer a user-friendly \"report an issue\" button that directs readers to the GitHub repository's issues tab if they face any difficulties while navigating the tutorial/docs (see the [participatory principle](https://www.writethedocs.org/guide/writing/docs-principles/#participatory) from *Write the Docs*).\n",
"6. **Visual Elements**: Enhance engagement by integrating mermaid diagrams and Plotly for visual representation and interactivity. This approach can encourage interaction with the content, transforming readers into participants.\n",
"7. A dark mode! 😎\n",
"\n",
Binary file added nbs/images/abraham-lossfunction.png
71 changes: 24 additions & 47 deletions nbs/index.ipynb
@@ -61,7 +61,11 @@
"\n",
"Imagine that the data science and machine learning team at FutureBank want to understand the key factors influencing a customer's decision to respond positively to term deposit subscriptions. By analyzing this data, the bank wants to refine its marketing strategies, tailor its offerings, and ultimately enhance customer satisfaction and retention.\n",
"\n",
"However, before the team can dive into algorithms and analytics, they need the data to be squeaky clean. That's where you come in! Your meticulous data cleaning will set the foundation for the team's success, empowering the team to craft a marketing masterpiece that resonates with its customers."
"However, before the team can dive into algorithms and analytics, they need the data to be squeaky clean. That's where you come in! Your meticulous data cleaning will set the foundation for the team's success, empowering the team to craft a marketing masterpiece that resonates with its customers.\n",
"\n",
"::: {.column-margin}\n",
"![Timeless quotes often require a modern twist](./images/abraham-lossfunction.png)\n",
":::"
]
},
{
@@ -1265,39 +1269,6 @@
"raw_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "4b3211ae-653e-4b1e-86d9-f3440b3fd9c3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"may, 2017 13747\n",
"jul, 2017 6888\n",
"aug, 2017 6240\n",
"jun, 2017 5335\n",
"nov, 2017 3968\n",
"apr, 2017 2931\n",
"feb, 2017 2646\n",
"jan, 2017 1402\n",
"oct, 2017 738\n",
"sep, 2017 576\n",
"mar, 2017 476\n",
"dec, 2017 214\n",
"Name: month, dtype: int64"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_data.month.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "c6a06a3a-f242-4332-87df-78809b7709ff",
@@ -2249,13 +2220,14 @@
"### Fix Units\n",
"\n",
"Ensure all observations under a single variable are expressed in a consistent unit. Here are some examples:\n",
"\n",
"- If you have weight data in both pounds (lbs) and kilograms (kg), choose one (preferably the one most commonly used in your dataset's context) and convert all data to that unit.\n",
"- If some data points are recorded on a different scale (e.g., a test score out of 50 vs. 100), convert them to a common scale, like a percentage.\n",
"- This uniformity prevents confusion and errors in analysis that can arise from unit discrepancies.\n",
"\n",
"### Fix Precision\n",
"\n",
"Maintain a consistent level of decimal precision for better readability and uniformity. Round off numerical values to a standard number of decimal places that makes sense for your analysis, e.g., changing 4.5312341 kg to 4.53 kg. It enhances data readability and prevents the overemphasis of minor differences that are not meaningful for the analysis.\n",
"Maintain a consistent level of decimal precision for better readability and uniformity. Round off numerical values to a standard number of decimal places that makes sense for your analysis, e.g., changing `4.5312341` kg to `4.53` kg. It enhances data readability and prevents the overemphasis of minor differences that are not meaningful for the analysis.\n",
"\n",
"### Remove extra characters\n",
"\n",
@@ -2267,7 +2239,7 @@
"\n",
"### Standardize Format\n",
"\n",
"Ensure consistency in the format of structured text data like dates and names. For example, changing date formats from '23/10/16' to '2016/10/23', or standardizing name formats. This uniformity is crucial for sorting, filtering, and correctly interpreting structured text data.\n",
"Ensure consistency in the format of structured text data like dates and names. For example, changing date formats from `23/10/16` to `2016/10/23`, or standardizing name formats. This uniformity is crucial for sorting, filtering, and correctly interpreting structured text data.\n",
"\n",
"Let's look at an example from our dataset:"
]
@@ -2380,51 +2352,56 @@
"id": "4dff9271-f9c7-46d7-aa57-a596a7ae6111",
"metadata": {},
"source": [
"\n",
"::: {.column-page}\n",
"| Data quality issue | Examples | How to resolve |\n",
"|---------------------------------|----------------------------------------------------------------|-----------------------------------------------------------|\n",
"| **Fix rows and columns** | | |\n",
"| Incorrect rows | Header rows, footer rows | Delete |\n",
"| Summary rows | Total, subtotal rows | Delete |\n",
"| Extra rows | Column numbers, indicators, blank rows | Delete |\n",
"| Missing Column Names | Column names as blanks, NA, XX etc. | Add the column names |\n",
"| Inconsistent column names | X1, X2,C4 which give no information about the column | Add column names that give some information about the data|\n",
"| Missing Column Names | Column names as blanks, `NA`, `XX` etc. | Add the column names |\n",
"| Inconsistent column names | `X1`, `X2`, `C4` which give no information about the column | Add column names that give some information about the data|\n",
"| Unnecessary columns | Unidentified columns, irrelevant columns, blank columns | Delete |\n",
"| Columns containing Multiple data values | E.g. address columns containing city, state, country | Split columns into components |\n",
"| No Unique Identifier | E.g. Multiple cities with same name in a column | Combine columns to create unique identifiers e.g. combine City with the State |\n",
"| Misaligned columns | Shifted columns | Align these columns |\n",
"| **Missing Values** | | |\n",
"| Disguised Missing values | \"blank strings, \"NA\", \"XX\", \"999\" etc\" | Set values as missing values |\n",
"| Disguised Missing values | blank strings `\" \"`, `\"NA\"`, `\"XX\"`, `\"999\"` | Set values as missing values |\n",
"| Significant number of Missing values in a row/column | | Delete rows, columns |\n",
"| Partial missing values | Missing time zone, century etc | Fill the missing values with the correct value |\n",
"| **Standardise Numbers** | | |\n",
"| Non-standard units | Convert lbs to kgs, miles/hr to km/hr | Standardise the observations so all of them have the same consistent units |\n",
"| Values with varying Scales | A column containing marks in subjects, with some subject marks out of 50 and others out of 100 | Make the scale common. E.g. a percentage scale |\n",
"| Over-precision | 4.5312341 kgs, 9.323252 meters | Standardise precision for better presentation of data. 4.5312341 kgs could be presented as 4.53 kgs |\n",
"| Over-precision | `4.5312341` kgs, `9.323252` meters | Standardise precision for better presentation of data. `4.5312341` kgs could be presented as `4.53` kgs |\n",
"| Remove outliers | Abnormally High and Low values | Correct if by mistake else Remove |\n",
"| **Standardise Text** | | |\n",
"| Extra characters | Common prefix/suffix, leading/trailing/multiple spaces | Remove the extra characters |\n",
"| Different cases of same words | Uppercase, lowercase, Title Case, Sentence case, etc | Standardise the case/bring to a common case |\n",
"| Non-standard formats | \"23/10/16 to 2016/10/20 “Modi, Narendra\" to “Narendra Modi\"\" | Correct the format/Standardise format for better readability in R |\n",
"| Non-standard formats | `23/10/16` to `2016/10/23`, `\"Reacher, Jack\"` to `\"Jack Reacher\"` | Correct the format/Standardise format for better readability and analysis |\n",
"| **Fix Invalid Values** | | |\n",
"| Encoding Issues | CP1252 instead of UTF-8 | Encode unicode properly |\n",
"| Incorrect data types | \"Number stored as a string: \"12,300\"\" | Convert to Correct data type |\n",
"| Encoding Issues | `CP1252` instead of `UTF-8` | Encode unicode properly |\n",
"| Incorrect data types | Number stored as a string: `\"12,300\"` | Convert to Correct data type |\n",
"| Correct values not in list | Non-existent country, PIN code | Delete the invalid values, treat as Missing |\n",
"| Wrong structure | Phone number with over 10 digits | |\n",
"| Correct values beyond range | Temperature less than -273° C ( K) | |\n",
"| Correct values beyond range | Temperature less than `-273` °C (`0` K) | |\n",
"| Validate internal rules | Gross sales > Net sales | |\n",
"| | Date of delivery > Date of ordering | |\n",
"| | \"If Title is \"Mr\" then Gender is \"M\"\" | |\n",
"| | If Title is `\"Mr\"` then Gender is `\"M\"` | |\n",
"| **Filter Data** | | |\n",
"| Duplicate data | Identical rows, rows where some columns are identical | Deduplicate Data/ Remove duplicated data |\n",
"| Extra/Unnecessary rows | Rows that are not required in the analysis. E.g if observations before or after a particular date only are required for analysis, other rows become unnecessary | Filter rows to keep only the relevant data. |\n",
"| Columns not relevant to analysis| Columns that are not needed for analysis e.g. Personal Detail columns such as Address, phone column in a dataset for | Filter columns-Pick columns relevant to analysis |\n",
"| Columns not relevant to analysis| Columns that are not needed for analysis e.g. Personal Detail columns such as Address, phone column in a dataset for | Filter columns and pick only the ones relevant to your analysis |\n",
"| Dispersed data | Parts of data required for analysis stored in different files or part of different datasets | Bring the data together, Group by required keys, aggregate the rest |\n",
"\n",
": The data-cleaning cheatsheet {.striped .hover}\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "6000c145",
"metadata": {},
"source": []
}
],
"metadata": {
Expand Down
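
A minimal sketch of a few cheatsheet rows from the diff above (disguised missing values, numbers stored as strings, over-precision, non-standard formats). It is written in plain Python rather than pandas so that each rule is visible on its own. The helper names (`to_missing`, `parse_number`, `standardize_date`, `standardize_name`) and the `MISSING_MARKERS` set are hypothetical and do not appear in the notebook; the sample values are the ones the cheatsheet itself uses.

```python
from datetime import datetime

# Assumed markers for disguised missing values, taken from the cheatsheet row.
# A blank string becomes "" once surrounding whitespace is stripped.
MISSING_MARKERS = {"", "NA", "XX", "999"}


def to_missing(value):
    """Return None for a disguised missing value, else the stripped string."""
    text = value.strip()
    return None if text in MISSING_MARKERS else text


def parse_number(text):
    """Convert a number stored as a string, such as "12,300", to a float."""
    return float(text.replace(",", ""))


def standardize_date(text):
    """Rewrite a dd/mm/yy date such as 23/10/16 as yyyy/mm/dd."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")


def standardize_name(text):
    """Rewrite "Last, First" as "First Last"."""
    last, first = (part.strip() for part in text.split(",", 1))
    return f"{first} {last}"


# Over-precision: keep two decimal places; missing entries stay missing.
raw_weights = ["4.5312341", " NA ", "9.323252"]
weights = []
for raw in raw_weights:
    cleaned = to_missing(raw)
    weights.append(None if cleaned is None else round(parse_number(cleaned), 2))

print(weights)
print(parse_number("12,300"))
print(standardize_date("23/10/16"))
print(standardize_name("Reacher, Jack"))
```

Which markers count as missing depends on the dataset: `"999"` is a sentinel in some columns and a legitimate value in others, so the set above should be treated as a per-column assumption rather than a global rule.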
