Home

Context

This is a centralized project managed by LITS that provides a Dockerized environment which research projects can use as a starting template and extend.

This project was developed in response to a request from the Mount Holyoke College Sociology Data Lab for assistance in creating a standardized and reproducible data science research environment.

The template repository was developed by Abby Drury in the Academic Technologies department in LITS in consultation with Sarah Oelker in the Educational Technology department in LITS and Benjamin Gebre-Medhin from the Department of Sociology and Anthropology and the Data Science Committee.

Need more help?

While we endeavor to make this documentation as helpful as possible, sometimes documentation is not enough.

For assistance with a specific environment that uses this template repository, contact your project sponsor or lead researcher.
LITS offers this standardized environment through collaboration with Data Science faculty at the College. User support is limited and users of this environment should be prepared to use widely available training materials, such as the LinkedIn Learning subscription available through LITS E-Resources, to get up to speed on Docker and JupyterHub as needed. Issue reports and pull requests are welcome.

Functional requirements that shaped the development of this project

Researchers should be able to:

Add their required Python/R packages and/or application extensions to Jupyter
Use the Jupyter UI to develop and run their notebooks
Access datasets within the Jupyter UI
Track changes to their notebooks and datasets
Share their notebooks with other researchers
Publish their environment in a reproducible way
Stop and start the Docker container in a reasonable time period; building the container may take longer
Ideally, also pull any changes from this parent repository into their child projects if needed

Getting started

Need to create a new project? Set up a brand new environment and repository from this template for your project / team.
Someone already set up an environment for your project? Clone their repository, then build and bring up JupyterLab.
Previously cloned the repository for your project and just need a refresher? Check out the quickstart guide.
Already have a cloned project and want to pull in more recent changes from the template repository? Take a look at the instructions for working with the template repository.

Working with this documentation

You will see references to SERVICE_NAME, CONTAINER_NAME, and PORT_NUMBER throughout this documentation. You'll need to get those values for your project when issuing commands containing these placeholders.

Getting your service name, container name, and port

Throughout the wiki, you will see references to SERVICE_NAME, CONTAINER_NAME, and PORT_NUMBER. In your project repository, you may wish to find and replace these references with the relevant values for readability.

Finding the SERVICE_NAME

This is in the environment/compose.yml file, right under the services: key.

In the template repository, SERVICE_NAME is datascience-notebook and is found on line 3 of the file.
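
If you would rather not open the file, the lookup above can be sketched as a small shell function. This is only an illustration: the function name `service_name` is hypothetical, and it assumes the first service is the first indented, non-comment line after the services: key.

```shell
# Hypothetical helper: print the first key found under "services:".
# Assumes the compose file is indented with spaces (as YAML requires).
service_name() {
  awk '
    /^services:/ { found = 1; next }      # start looking after the services: key
    found && /^ +[^ #]/ {                 # first indented, non-comment line
      sub(/^ +/, "")                      # drop the indentation
      sub(/:.*$/, "")                     # drop the trailing colon
      print
      exit
    }
  ' "$1"
}

# Usage, from the root of your project repository:
#   service_name environment/compose.yml
```

For the unmodified template repository this should print datascience-notebook.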

Finding the CONTAINER_NAME

Find your container's name using docker container ls. The container name will be in the last column and should contain SERVICE_NAME.

Note that, depending on your computer, Docker may use hyphens (-) or underscores (_) in container names. This means that the container name could be slightly different amongst your team.

In the template repository, CONTAINER_NAME is something like environment_datascience-notebook_1 or environment-datascience-notebook-1.
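
To avoid scanning the full table by eye, you could print only the container names and filter them by the service name. The sketch below is one way to do that; the function name `find_container` is hypothetical, and it assumes the container is currently running, since docker container ls lists only running containers by default.

```shell
# Hypothetical helper: list running containers whose name contains SERVICE_NAME.
find_container() {
  # --format '{{.Names}}' prints just the name column, one container per line.
  docker container ls --format '{{.Names}}' | grep -F -- "$1"
}

# Usage, with the template's service name (substitute your own):
#   find_container datascience-notebook
```

Because the match is on the service name only, it works whether Docker used hyphens or underscores in the container name.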

Finding the PORT_NUMBER

This is in the environment/compose.yml file under the ports: key, and is the first value in the colon (:) separated port numbers.

In the template repository, PORT_NUMBER is 10000 and is found on line 9 of the file.
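
The same lookup can be sketched in the shell. The function name `port_number` is hypothetical, and the sketch assumes the first entry under ports: is written in the two-part form described above, for example "10000:8888" (the second number here is only an illustration).

```shell
# Hypothetical helper: print the first value of the first port mapping.
port_number() {
  awk '
    /^ +ports:/ { found = 1; next }       # start looking after the ports: key
    found && /^ +- / {                    # first list entry under ports:
      gsub(/[^0-9:]/, "")                 # keep only digits and colons
      sub(/:.*$/, "")                     # keep the value before the first colon
      print
      exit
    }
  ' "$1"
}

# Usage, from the root of your project repository:
#   port_number environment/compose.yml
```

For the unmodified template repository this should print 10000.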

Project structure

`analysis/`

This is where your analysis should go.

These files are shared between the Docker container and your computer/the host machine, and are visible in JupyterLab.

Any data that should not be committed to the repository should be ignored using data/.gitignore.

Autosave files found in .ipynb_checkpoints/ are not available to version control in Git. You must be sure to save your work in JupyterLab prior to adding your changes to Git.

`analysis/.gitignore`

A classic .gitignore file for specifying what files Git should consider as not eligible for version control. This is how we exclude autosave information from our repository.
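
If you are unsure whether a particular file is excluded, Git can tell you which rule, if any, applies. The sketch below wraps the standard git check-ignore command; the function name `is_ignored` and the example path are hypothetical.

```shell
# Hypothetical helper: report whether Git ignores a path, and which rule does so.
# Exits with status 0 if the path is ignored and 1 if it is not.
is_ignored() {
  # -v prints the .gitignore file, line number, and pattern that matched.
  git check-ignore -v -- "$1"
}

# Usage, from inside your project repository (the path is only an example):
#   is_ignored analysis/.ipynb_checkpoints/my-notebook-checkpoint.ipynb
```

No output means that no ignore rule matches and the file is eligible for version control.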

`analysis/demo-notebooks`

These contain demo code and may be removed by the researcher.

`analysis/demo-notebooks/get-url-in-python.ipynb`

This notebook uses Python to retrieve a given URL (google.com), print the HTTP response code, and then print the HTML contents of that URL.

`analysis/demo-notebooks/hello-world-in-r.ipynb`

This notebook uses R to execute a classic "hello world" program.

`data/`

This is where your data should go. These files are shared between the Docker container and your computer/the host machine, and are visible in JupyterLab as data/. Any data that should not be committed to the repository should be ignored using data/.gitignore.

`data/.gitignore`