A cookiecutter template for data journalism projects using Python
The bond between data and journalism keeps growing stronger. In the era of big data, there is an expanding field dedicated to digging into digital content to uncover new stories.
That's why, although there is plenty of content for data science, we need tools and materials adapted to data journalism that emphasize the importance of reporting: it is not only a matter of analyzing and visualizing data, but of telling stories about the discoveries and humanizing that data.
Working with large amounts of data tends to produce many pivot tables and graphics, and inevitably different versions of our code and data. So when it comes to looking back through our own projects, it helps to have the names and locations of our files organized, so that we can find them easily and know what each of them contains.
A disorganized project is hard to explore and even harder to reproduce. When we deliver structured and documented data journalism projects, we make it easier for others to replicate them and to scrutinize methodological decisions that are not always well captured by the published stories.
Last but not least, sharing our data-driven methods and code helps other journalists reuse them in their own investigations, and it makes our research accountable by showing that the information is reported truthfully.
In short, if we want journalists to share their work, we need to change existing workflows. That means extra effort and, therefore, a time investment, so this template can serve as one tool, among others, to help data journalism achieve transparency.
- Features
- Installation
- Directory Structure
- Workflow
- Python Virtual Environment
- Python Packages
- Initialize Git
- Related Templates
This template standardizes projects for data journalism and speeds up their creation by automating repetitive work when a new project is generated.
- Brings the scaffolding of a project with the help of a directory structure designed around data pipelines and reporting stories.
- Improves the analysis process with established phases from a typical data journalism workflow.
- Automates the creation of a virtual environment in order to make an isolated and reproducible data project.
- Installs Python packages that are useful during data analysis, such as pandas.
- Initializes a local git repository for the purpose of managing version control of the project.
- Can be configured for Linux, MacOS and Windows.
- First you need to install cookiecutter, either with pip or conda.

  - Installing with `pip`:

    ```
    pip install cookiecutter
    ```

  - Installing with `conda`:

    ```
    conda config --add channels conda-forge
    conda install cookiecutter
    ```

  For more information about installing cookiecutter, read the documentation.
- Now install the data journalism template:

  ```
  cookiecutter https://github.com/DataCritica/cookiecutter-data-journalism
  ```
- Answer the prompts to configure the project:

  ```
  > Select a project name:
  > Select a project slug:
  > Write a project description:
  > Select a project author:
  > Select a license:
  1. MIT
  2. GNU General Public License v3
  > Select an operating system:
  1. Linux
  2. MacOS
  3. Windows
  > Select a setup project (Create a virtual environment and install packages):
  1. Yes
  2. No
  > Select initialize git:
  1. Yes
  2. No
  ```
- The template works with Jupyter notebooks. In case you don't have Jupyter set up, run the following command:

  ```
  pip install jupyterlab notebook
  ```
- Set up the project
- Process data
- Analyze data
- Visualize data
- Write a report
- Publish a story
```
├── data                       # Categorized data files
│   ├── processed              # Cleaned data
│   └── raw                    # Original data
│
├── docs                       # Explanatory materials
│   ├── data-dictionary.md     # Information about the data
│   ├── explore-data.md        # Questions to explore the data
│   ├── references             # Papers, manuals, articles, etc.
│   └── reports                # Report analysis as PDF, HTML, etc.
│
├── LICENSE                    # Project's license
│
├── notebooks                  # Jupyter notebooks
│   ├── 0.0-process.ipynb      # Data processing (fixing column types, data cleansing, etc.)
│   ├── 1.0-analyze.ipynb      # Exploratory data analysis
│   └── 2.0-visualize.ipynb    # Data visualization methods
│
├── outputs                    # Exports generated by notebooks
│   ├── figures                # Generated graphics, maps, etc. to be used in reporting
│   └── tables                 # Generated pivot tables to analyze data
│
├── .gitignore                 # Customized .gitignore for python projects
│
├── Pipfile                    # Project dependencies
│
└── README.md                  # Top-level README for this project
```
- `.gitignore`: This file contains a gitignore template customized for Python projects.
- `LICENSE`: Public repositories need an open source license in order to be used, modified and distributed. For this reason, with this template you can choose between an MIT License and a GNU General Public License v3.

  For more information on how to license your code, check out this site.
- `README.md`: A README is a markdown file that introduces and describes the project. It includes the information required to understand what the project is about.

  Here's a manual on how to create a README file, an article on how to write markdown and a link to an online editor where you can test it.
-
The data section contains two directories:
raw
andprocessed
:The original data files should remain intact and only be used for consultation purposes.
Everything related to data cleansing and polishing should go in this folder.
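A minimal, hypothetical sketch of that raw-to-processed step (the file name, columns and cleaning steps below are placeholders, not part of the template):

```python
import pandas as pd

# Read the untouched original file from data/raw (hypothetical file name)
raw = pd.read_csv("data/raw/survey.csv")

# Example cleaning steps: normalize column names and fix a column type
cleaned = raw.rename(columns=str.lower)
cleaned["year"] = pd.to_numeric(cleaned["year"], errors="coerce")

# Write the cleaned copy to data/processed, leaving the raw file intact
cleaned.to_csv("data/processed/survey-clean.csv", index=False)
```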
- `docs`: This category consists of two directories (`references` and `reports`) and two markdown files (`data-dictionary.md` and `explore-data.md`):

  - `references`: This folder contains all the documents that serve as references for the project, such as papers, articles, other journalistic publications, interviews, FOIA requests, data documentation, etc.
  - `reports`: Here go the reports that account for the analysis of the data, putting into words the results from the graphs and, in general, from all the outputs generated by the code.
  - `data-dictionary.md`: Information about the dataset or, in other words, metadata that puts the data in context, such as describing what each column refers to (see the sketch after this list).
  - `explore-data.md`: A template for making an exploratory analysis by treating our data as a source of information, asking it questions and finding out what the data are telling us. At this point we also need to interrogate the context of the data: who collected it, how it was collected and for what purpose, and, beyond that, consider possible data gaps or missing voices. This template was inspired by Putting data back into context.
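A purely illustrative way to seed `data-dictionary.md` (this helper, the file name and the output format are assumptions, not part of the template) is to write one table row per column and then fill in each description by hand:

```python
import pandas as pd

df = pd.read_csv("data/processed/survey-clean.csv")  # hypothetical file

# Build a markdown skeleton: one row per column with its dtype and an empty description
lines = ["| Column | Type | Description |", "| --- | --- | --- |"]
for column in df.columns:
    lines.append(f"| {column} | {df[column].dtype} | |")

with open("docs/data-dictionary.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```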
- `notebooks`: This part covers the Jupyter notebooks, divided into three categories: processing, analysis and visualization. Each of these may in turn have subcategories, which is why the file names start with a number to keep them ordered.

  - Processing: During processing we clean the data, correct the variable types and, in general, perform procedures that make the data categories comparable.
  - Analysis: In this stage, meaningful information is extracted from the data by grouping, filtering, comparing and calculating, among many other methods, in order to find patterns and relationships between categories (see the sketch after this item).
  - Visualization: After the exploratory analysis, we make visual representations of what has been discovered, choosing from a wide range of graphics to communicate this information.
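For example, an analysis notebook might group and filter the processed data to look for patterns. A minimal sketch, reusing the hypothetical file and columns from the earlier example:

```python
import pandas as pd

df = pd.read_csv("data/processed/survey-clean.csv")  # hypothetical file

# Count records per year to look for trends over time (the column name is an assumption)
by_year = df.groupby("year").size().sort_index()
print(by_year)

# Filter to a subset of interest and compare it against the whole dataset
recent = df[df["year"] >= 2020]
print(f"{len(recent)} of {len(df)} records are from 2020 onwards")
```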
- `outputs`: This section is composed of two directories, `tables` and `figures`:

  - `tables`: This folder contains simple tables and pivot tables generated by crossing different variables from the dataset.
  - `figures`: Here go the graphs, diagrams, maps and other types of visualizations generated in the notebooks (see the sketch after this item).
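A small, hypothetical sketch of exporting to these folders (matplotlib is assumed to be installed separately, and the file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/processed/survey-clean.csv")  # hypothetical file

# Cross two (assumed) variables and save the pivot table to outputs/tables
pivot = df.pivot_table(index="state", columns="year", values="cases", aggfunc="sum")
pivot.to_csv("outputs/tables/cases-by-state-and-year.csv")

# Plot the same table and save the figure to outputs/figures
pivot.plot(kind="bar")
plt.tight_layout()
plt.savefig("outputs/figures/cases-by-state-and-year.png", dpi=200)
```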
- `Pipfile`: A file created when the virtual environment is generated with `pipenv`. It lists all the packages used in the project.
During project generation you'll be asked whether you want to create a virtual environment. If you accept, pipenv will be installed and will create an environment for the project.
A virtual environment is a tool that separates the dependencies of different projects. That means we can have isolated projects with their own packages, and on top of that it helps make our research reproducible, since listing all the libraries necessary to reproduce an outcome should be part of our workflow.
Pipenv has several advantages over other tools like `virtualenv` or `virtualenvwrapper`. Its main features are that you no longer need to call `pip` directly, since it is already integrated into the `pipenv` command, and that its `Pipfile` is much easier to read and understand than a `requirements.txt` file.
For more information about `pipenv`, you can read the documentation.
If you accept the previous option, a library dedicated to data analysis will also be installed.

| Library | Documentation |
| --- | --- |
| Pandas | |

Besides this package, an IPython kernel will also be installed so that the notebooks can run inside the virtual environment.
Using git is a way to manage the different versions of a project and therefore to keep a backup of it. We can keep this history on our own computer through a local repository, or have it available at any time through a remote repository on servers (such as GitHub or GitLab), and synchronize these repositories as we make changes to them.
In case you don't have git installed, here's a brief guide on how to download it according to your operating system.
- Linux

  Debian and Ubuntu:

  ```
  sudo apt-get update && sudo apt-get upgrade
  sudo apt-get install git
  ```

  For other Linux distributions, check out this guide.
- MacOS

  You can run `git` in your terminal and, if you don't have it installed, it will prompt you to install it:

  ```
  git --version
  ```

  Furthermore, you have a few other options, like installing it with homebrew:

  ```
  brew install git
  ```
- Windows

  For Windows, you have to install git and git bash; here's a manual for the installation.
The current project was inspired by the following templates dedicated to data science and data journalism: