Skip to content

Latest commit

 

History

History
443 lines (280 loc) · 20 KB

README.md

File metadata and controls

443 lines (280 loc) · 20 KB

Python data cheatsheet

Things in Python, Pandas, GeoPandas and Jupyter that I've had to look up or weren't obvious in the documentation.

TODO: Break quick inline examples into individual notebooks.

Basic flow control

  • Basic loops and related concepts: for and while loops, looping over a range of numbers, using enumerate() to access both the item and the index inside a loop, unpacking tuples and lists and exiting early from a loop.

Loading and creating data

Geocoding addresses

I usually first pass the addresses through the Census Geocoder Batch API and then geocode the addresses that it can't handle using Geocodio.

Helpful packages:

I put a sample script in scripts/geocode_data.py. Note that I haven't tried to run this script. It's a quick and dirty combination of code from a project, but it should provide an idea of how to do this.

Examining data

Filtering, subsetting and indexing data

Filter a Series

s = s[s != 1]

or in a more chainable way:

s.where(lambda x : x != 1)

Filter a DataFrame using a function

You can specify a function inside []:

df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
df[lambda x: x['b'] > x['c']]

Filtering DataFrame based on whether a column's value is in a list

Use pandas.Series.isin.

df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})

df[df['A'].isin([3, 6])]

via Use a list of values to select rows from a pandas dataframe [duplicate].

Invert a boolean Series

Use the ~ operator. See Boolean indexing in the docs.

via How can I obtain the element-wise logical NOT of a pandas Series?

Filter a DataFrame column based on a string property

Use the Vectorized string functions for Series.

For example, using len:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': ['1234', '12345', '123', '12']})

In [3]: df
Out[3]:
       a
0   1234
1  12345
2    123
3     12

In [4]: df[df['a'].str.len() > 3]
Out[4]:
       a
0   1234
1  12345

Cleaning, transforming and manipulating data

Change types of columns

This StackOverflow answer offers a good rundown of options.

Overview of Pandas Data Types has a rundown of the different data types, and their relationship to Numpy types and Python types.

Find and replace across an entire dataframe

Use pandas.DataFrame.replace()

Add a rank column to a DataFrame

Use Series.rank.

via Ranking Rows Of Pandas Dataframes

Summing values across columns

Use pandas.DataFrame.sum.

It seems easy to just use the + operator to add values across columns:

In [1]: import pandas as pd

In [2]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ...:      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ...:

In [3]: df = pd.DataFrame(d)

In [4]: df
Out[4]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [5]: df['sum'] = df['one'] + df['two']

In [6]: df
Out[6]:
   one  two  sum
a  1.0  1.0  2.0
b  2.0  2.0  4.0
c  3.0  3.0  6.0
d  NaN  4.0  NaN

But this gives us NaN whenever one of the values is NaN. In many, but not all cases, we want to treat NaN as 0.

To get around this we can use pandas.DataFrame.sum on a slice of the columns we want to sum:

In [7]: df['sum'] = df[['one', 'two']].sum(axis=1)

In [8]: df
Out[8]:
   one  two  sum
a  1.0  1.0  2.0
b  2.0  2.0  4.0
c  3.0  3.0  6.0
d  NaN  4.0  4.0

Note that the axis=1 argument makes sum operate across columns. Also, if you want to not skip the NaN values, you can specify skipna=False.

h/t Pandas sum two columns, skipping NaN

Flatten MultiIndex columns into one

grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]

via python - Pandas dataframe with multiindex column - merge levels - Stack Overflow

Adding a quartile column

removals_inactive_migration['removed_inactive_rate_q'] = pd.qcut(
    removals_inactive_migration['removed_inactive_rate'],
    4,
    labels=[1, 2, 3, 4]
)

Add a column and return a new DataFrame

Use DataFrame.assign()

Dealing with the "Geometry is in a geographic CRS. Results form 'area' are likely incorrect." warning in GeoPandas

When you're trying to find the area of a geometry column of a GeoPandas GeoDataFrame, you may encounter this error:

Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

The message tells you what you need to do, but doesn't explain the issue.

Map Projection and Coordinate Reference Systems has an explaination of the difference between geographic coordinate reference systems and projected ones, as well as Python code, using GeoPandas, that shows how to reproject. Moreover, it offers some suggestions for which projected CRS to use. tl;dr, if you're working with spatial data that spans the US, Albers Equal Area Conic (EPSG = 5070 or 102003) is a good choice.

Parsing addresses

Address data can be messy in a way that makes parsing addresses yourself difficult. I recommend using the usaddress package, which uses a statistical model to try to parse components from addresses.

If you don't want to use the Python library, they have a web interface that lets you parse sets of 500 addresses.

Learning regular expressions

Regular expressions are powerful tools for searching and cleaning text data, but they can have a bit of a learning curve.

The documentation for the standard library re module isn't the worst place to start.

I find Pythex to be a really useful sandbox for testing out regular expressions. I find it often offers a faster feedback loop than testing it in the context of my programs. Once I have it working, I just copy the regex back into my code.

Aggregating data

Visualizing data

Jupyter notebooks

Render a Python variable in a Markdown cell

Use the Python Markdown extension which is part of the Unofficial Jupyter Notebook Extensions

Showing lower-level logging messages

%config Application.log_level="INFO"

TODO: Check if this actually works.

via Get Output From the logging Module in IPython Notebook

Avoiding having to manually reimport modules when you change code in them.

Use the autoreload extension.

%load_ext autoreload

%autoreload 2

from foo import some_function

some_function()

# open foo.py in an editor and change some_function to return 43

some_function()

Sharing code and variables between notebooks

The best way of doing this depends on what you're trying to do.

For passing tabular data, I like to write Pandas DataFrames to Feather files because they can be read by R programs/notebooks, are faster to read in than CSV or Pickle files and preserve data types. I store these in a conventional file structure. That way, one notebook can read in the DataFrame written by another notebook. Sometimes I'll reference the file paths with a variable in a project-level settings module, so if the file name/organization changes, I don't have to update my code in a bunch of places.

For individual variables, there is the store magic.

For code and (possibly) values, there are (somewhat complicated) methods for importing notebooks as modules.

Joe Germuska suggests using the %run magic for executing a Python script (or, apparently, another notebook) from within a notebook. The interactive namespace gets updated with all variables defined in the program. Joe says this is useful for boostrapping notebooks that share common initialization.

TODO: Experiment with store magic and importing notebooks as modules and perhaps make some example notebooks.

Show all rows of a Pandas DataFrame or Series

Use IPython.display.display() and pandas.option_context().

You can also set the Pandas options globally, but I've almost always wanted to do this on a case-by-case basis, which is what pandas.option_context() lets you do.

from IPython.display import display
import pandas as pd

with pd.option_context("display.max_rows", None):
    display(long_df)

Python environment

Installing Python in a good way

My Python Development Environment, 2020 Edition is a good guide. I prefer to use Pipenv instead of Poetry because pipenv run automatically sources .env files.

Python Wrangler is a GUI application for installing Python in a way that's similar to my setup. Even if you don't use it to install Python, it's useful for seeing all the different versions of Python that end up on your system.

I've never found much extra utility in using Anaconda, but it does offer one big benefit: it installs useful packages for working with data, like Pandas and Jupyter, all in one installation process, instead of installing them one at a time with pip or Pipenv.

Use a particular Python version with Pipenv

pipenv install --python=/path/to/your/python

Running a script and dropping to the debugger on an exception

python -m pdb -c continue myscript.py arg1 arg2 

Source: Starting python debugger automatically on error

Create a kernel for use with Jupyter

This makes it possible to have a single Jupyter Lab installation with separate kernels for the virtualenvs for each project.

pipenv run python -m ipykernel install --user --name=your-project-slug

Source: associatedpress/cookiecutter-python-project

Integrating code formatters and linters into your coding workflow

In many projects, I use Black to format my source code and pylint to check my code for both formatting issues and things that could cause errors or make my code more difficult to read and update.

There are ways to integrate these tools into your editor, such as vim or Visual Studio Code. However, I've found a good universal way to approach this is to install them as git hooks that run when trying to commit code. To do this, I use the pre-commit framework.

I've put an example .pre-commit-config.yml file that runs Black and pylint in this repo.

Tell pylint to ignore a rule

To disable a rule for a signle line, add a comment like # pylint: disable=wildcard-import, method-hidden to the end of the line.

Source: How to disable a particular message?

Listing the path to a virtual environment created with Pipenv

pipenv --venv

Specifying the Python environment used by Visual Studio code

This is important if you're running pylint on a particular project using VS Code integration and want to make sure it detects the packages installed in your project virtual environment.

To select a specific environment, use the Python: Select Interpreter command from the Command Palette (⇧⌘P on Mac). If the environment you want isn't listed, you can enter a full path to the Python executable.

Source: Using Python Environments in Visual Studio Code

Executing a Jupyter notebook from the command line.

Use nbconvert to execute the notebook:

jupyter nbconvert --to notebook --execute mynotebook.ipynb

I haven't run that particular version of the command in a while, but I think that either overwrites the input notebook or creates a similarly-named copy. Since I usually want an HTML version, similar to how RStudio automatically creates an HTML version of RMarkdown notebooks, I usually use a somewhat different set of parameters:

jupyter nbconvert \
	  --to html \
	  --output notebook.nb.html \
	  --execute \
	  notebook.ipynb

Source: Executing notebooks from the command line

Data sources

These are the data sources used as example data in this project.

This will be for examples of working with GeoPandas and Spatialite.

Installation of this project for local development

I've used the Jupytext Jupyter Lab extension to create "paired" .Rmd versions of the notebooks in this repository.

This was mostly because the git diffs for the Rmd files more clearly show the changes between versions. There may be other, potentially preferable ways to do this in 2021. I also just wanted to experiment with Jupytext.

You'll need to install this extension in the same environment as your Jupyter lab.

For me, since I installed Jupyter Lab using pipx, this looks like:

pipx inject jupyterlab jupytext