Abby Drury edited this page Feb 6, 2023 · 19 revisions

Context

This is a centralized project managed by LITS that provides a Dockerized environment for research projects to use as a starting template that can be extended.

This project was developed in response to a request from the Mount Holyoke College Sociology Data Lab for assistance in creating a standardized and reproducible data science research environment.

The template repository was developed by Abby Drury in the Academic Technologies department in LITS in consultation with Sarah Oelker in the Educational Technology department in LITS and Benjamin Gebre-Medhin from the Department of Sociology and Anthropology and the Data Science Committee.


Need more help?

While we endeavor to make this documentation as helpful as possible, sometimes documentation is not enough.

  • For assistance with a specific environment that uses this template repository, contact your project sponsor or lead researcher.

  • LITS offers this standardized environment through collaboration with Data Science faculty at the College. User support is limited and users of this environment should be prepared to use widely available training materials, such as the LinkedIn Learning subscription available through LITS E-Resources, to get up to speed on Docker and JupyterHub as needed. Issue reports and pull requests are welcome.


Functional requirements that shaped the development of this project

Researchers should be able to:

  • Add their required Python/R packages and/or application extensions to Jupyter
  • Use the Jupyter UI to develop and run their notebooks
  • Access datasets within the Jupyter UI
  • Track changes to their notebooks and datasets
  • Share their notebooks with other researchers
  • Publish their environment in a reproducible way
  • Stop and start the Docker container in a reasonable time period; building the container may take longer
  • Ideally, also pull any changes from this parent repository into their child projects if needed

Getting started


Working with this documentation

You will see references to SERVICE_NAME, CONTAINER_NAME, and PORT_NUMBER throughout this documentation. You'll need to get those values for your project when issuing commands containing these placeholders.

Getting your service name, container name, and port

Throughout the wiki, you will see references to SERVICE_NAME, CONTAINER_NAME, and PORT_NUMBER. In your project repository, you may wish to find and replace these placeholders with your project's values for readability.
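If you prefer to script the find and replace, the following is a minimal sketch in Python. The helper name fill_placeholders and the sample sentence are hypothetical, and the values shown are the template repository's defaults; substitute the values for your own project.

```python
# Template-repository defaults; replace these with your project's values.
PLACEHOLDER_VALUES = {
    "SERVICE_NAME": "datascience-notebook",
    "CONTAINER_NAME": "environment-datascience-notebook-1",
    "PORT_NUMBER": "10000",
}


def fill_placeholders(text, values):
    """Return text with every placeholder key replaced by its value."""
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    return text


# Hypothetical sentence, standing in for a line of your project's documentation.
sample = "The container CONTAINER_NAME serves SERVICE_NAME on port PORT_NUMBER."
print(fill_placeholders(sample, PLACEHOLDER_VALUES))
```

Because container names can differ between team members, you may prefer to leave CONTAINER_NAME as a placeholder and replace only the other two.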

Finding the SERVICE_NAME

This is in the environment/compose.yml file, right under the services: key.

In the template repository, SERVICE_NAME is datascience-notebook and is found on line 3 of the file.
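If you would rather look this up with a script, here is a rough sketch that scans compose file text for the first key nested under services:. The function name first_service_name and the sample text are hypothetical, and the sample is not the template's actual file. The scan is deliberately simple; a full YAML parser would be more robust for unusual files.

```python
def first_service_name(compose_text):
    """Return the first key nested under the top-level `services:` key, or None."""
    in_services = False
    for raw_line in compose_text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw_line.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        if line[0] not in " \t":
            # A new top-level key: remember whether it is `services:`.
            in_services = line == "services:"
            continue
        if in_services and line.endswith(":"):
            return line.strip()[:-1]
    return None


# Illustrative compose text only, NOT the template's actual contents.
SAMPLE_COMPOSE = """\
services:
  datascience-notebook:
    build: .
"""

print(first_service_name(SAMPLE_COMPOSE))  # prints: datascience-notebook
```

To try it on your project, you could pass in the text of environment/compose.yml, for example with pathlib.Path("environment/compose.yml").read_text().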

Finding the CONTAINER_NAME

Find your container's name by running docker container ls while the container is running. The container name will be in the last column (NAMES) and should contain SERVICE_NAME.

Note that, depending on the version of Docker Compose installed on your computer, Docker may use hyphens (-) or underscores (_) in container names. This means that the container name could be slightly different among members of your team.

In the template repository, CONTAINER_NAME is something like environment_datascience-notebook_1 or environment-datascience-notebook-1.
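Because the separator can vary, a small script can find the container regardless of which one Docker used. The sketch below is one possible approach, with hypothetical function names. It assumes Docker is installed and relies on the --format option of docker container ls to print only the names.

```python
import subprocess


def list_running_container_names():
    """Ask Docker for the names of running containers (requires Docker)."""
    result = subprocess.run(
        ["docker", "container", "ls", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def matching_containers(names, service_name):
    """Keep the names that contain service_name, treating - and _ as equal."""

    def normalize(value):
        return value.replace("_", "-")

    wanted = normalize(service_name)
    return [name for name in names if wanted in normalize(name)]


# Example usage (needs a running Docker daemon):
# matching_containers(list_running_container_names(), "datascience-notebook")
```

If more than one name comes back, you likely have several projects built from this template running at once, so check which one belongs to your project.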

Finding the PORT_NUMBER

This is in the environment/compose.yml file under the ports: key, and is the first value in the colon-separated (:) pair of port numbers.

In the template repository, PORT_NUMBER is 10000 and is found on line 9 of the file.
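As a small illustration of "the first value", the sketch below splits a port mapping on the colon. The function name host_port is hypothetical, and the second number in the sample mapping is an assumption made for illustration; only the leading 10000 comes from the template repository. Mappings that begin with an IP address are not handled here.

```python
def host_port(mapping):
    """Return the first value of a colon-separated port mapping as an int."""
    return int(str(mapping).split(":", 1)[0])


# "10000:8888" is an illustrative mapping; the 8888 is assumed, not confirmed.
print(host_port("10000:8888"))  # prints: 10000
```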


Project structure

analysis/

This is where your analysis should go.

These files are shared between the Docker container and your computer/the host machine, and are visible in JupyterLab.

Any data that should not be committed to the repository should be ignored using data/.gitignore.

Autosave files found in .ipynb_checkpoints/ are excluded from version control in Git. Be sure to save your work in JupyterLab before adding your changes to Git.

analysis/.gitignore

A classic .gitignore file for specifying what files Git should consider as not eligible for version control. This is how we exclude autosave information from our repository.

analysis/demo-notebooks

These contain demo code and may be removed by the researcher.

analysis/demo-notebooks/get-url-in-python.ipynb

This notebook uses Python to retrieve a given URL (google.com), print the HTTP response code, and then print the HTML contents of that URL.
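For orientation, here is a minimal sketch of the same idea using only Python's standard library. This is not the notebook's actual code, which may use a different library; the function name fetch and the timeout value are assumptions.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return (status_code, body_text) for the given URL."""
    with urlopen(url, timeout=timeout) as response:
        # HTTP responses expose a numeric status; other schemes may not.
        status = getattr(response, "status", None)
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return status, body


# Example usage (needs network access):
# status, body = fetch("https://www.google.com")
# print(status)
# print(body)
```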

analysis/demo-notebooks/hello-world-in-r.ipynb

This notebook uses R to execute a classic "hello world" program.

data/

This is where your data should go. These files are shared between the Docker container and your computer/the host machine, and are visible in JupyterLab as data/. Any data that should not be committed to the repository should be ignored using data/.gitignore.

data/.gitignore

A classic .gitignore file for specifying what files Git should consider as not eligible for version control. This is how we exclude autosave information from our repository.

environment/

This is where the Docker, Python package, and Jupyter server configuration lives. You can generally ignore most of it, but there are four files to be aware of.

  • Python packages are specified in environment/requirements.txt. See Python package requirements for more details.
  • R packages are specified in environment/r-packages.R. See R package requirements for more details.
  • JupyterLab extensions are specified in environment/jupyter-extensions.csv. See JupyterLab extension requirements for more details.
  • Jupyter server configuration can be managed in environment/jupyter_server_config.py. You probably won't need to modify this, but it's good to know about.
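To check whether the packages listed in environment/requirements.txt are actually installed in your environment, something like the following sketch may help. The function names and the sample package list are hypothetical, and the parsing assumes the common one-requirement-per-line format; it does not cover every form pip accepts.

```python
import re
from importlib import metadata


def requirement_names(requirements_text):
    """Return the package names listed in requirements.txt-style text."""
    names = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r
        # Cut the name off at the first version specifier, extra, or marker.
        names.append(re.split(r"[\s<>=!~;\[]", line, maxsplit=1)[0])
    return names


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical file contents, for illustration only.
sample = "numpy\n# plotting\nmatplotlib>=3\n"
for name in requirement_names(sample):
    print(name, installed_version(name))
```

Run it inside the container, for example from a notebook cell, so that it reports on the environment your notebooks actually use.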