Home
This is a centralized project managed by LITS that provides a Dockerized environment for research projects to use as a starting template that can be extended.
This project was developed in response to a request from the Mount Holyoke College Sociology Data Lab for assistance in creating a standardized and reproducible data science research environment.
The template repository was developed by Abby Drury of Academic Technologies in LITS, in consultation with Sarah Oelker of Educational Technology in LITS, Benjamin Gebre-Medhin of the Department of Sociology and Anthropology, and the Data Science Committee.
While we endeavor to make this documentation as helpful as possible, sometimes documentation is not enough.
- For assistance with a specific environment that uses this template repository, contact your project sponsor or lead researcher.
- LITS offers this standardized environment through collaboration with Data Science faculty at the College. User support is limited, and users of this environment should be prepared to use widely available training materials, such as the LinkedIn Learning subscription available through LITS E-Resources, to get up to speed on Docker and JupyterHub as needed. Issue reports and pull requests are welcome.
Researchers should be able to:
- Add their required Python/R packages and/or application extensions to Jupyter
- Use the Jupyter UI to develop and run their notebooks
- Access datasets within the Jupyter UI
- Track changes to their notebooks and datasets
- Share their notebooks with other researchers
- Publish their environment in a reproducible way
- Stop and start the Docker container in a reasonable time period; building the container may take longer
- Ideally, also pull any changes from this parent repository into their child projects if needed
- Need to create a new project? Set up a brand new environment and repository from this template for your project / team.
- Someone already set up an environment for your project? Clone their repository, then build and bring up JupyterLab (a rough sketch of that workflow appears after this list).
- Previously cloned the repository for your project and just need a refresher? Check out the quickstart guide.
- Already have a cloned project and want to pull in more recent changes from the template repository? Take a look at the instructions for working with the template repository.
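If you are cloning an existing project, the usual Compose workflow looks roughly like the sketch below. This assumes the standard `docker compose` CLI and the `environment/compose.yml` location described later on this page; your project's quickstart guide is the authoritative reference.

```bash
# Rough sketch only; see your project's quickstart for the exact steps.
git clone <your-project-repository-url>
cd <your-project-repository>

# Build and start JupyterLab using the Compose file in environment/
docker compose -f environment/compose.yml up --build

# Then open http://localhost:PORT_NUMBER in your browser.
```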
You will see references to `SERVICE_NAME`, `CONTAINER_NAME`, and `PORT_NUMBER` throughout this documentation. You'll need to look up those values for your project when issuing commands that contain these placeholders. In your project repository, you may wish to find and replace these references with the relevant values for readability.
The service name is defined in the `environment/compose.yml` file, right under the `services:` key. In the template repository, `SERVICE_NAME` is `datascience-notebook` and is found on line 3 of the file.
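For orientation, the top of that file looks roughly like the sketch below; your project's `compose.yml` may contain additional keys and the exact contents may differ.

```yaml
# Rough sketch of the top of environment/compose.yml (not the exact file contents)
services:
  datascience-notebook:   # this key is SERVICE_NAME
    # image/build settings, ports, volumes, and other keys follow here
```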
Find your container's name using `docker container ls`. The container name will be in the last column and should contain `SERVICE_NAME`.
Note that, depending on your computer, Docker may use hyphens (`-`) or underscores (`_`) in container names, so the container name could differ slightly across your team.
In the template repository, `CONTAINER_NAME` is something like `environment_datascience-notebook_1` or `environment-datascience-notebook-1`.
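As a quick shell sketch (the name shown in the comment is the template's example; yours may differ):

```bash
# List running containers; the container name is in the last (NAMES) column.
docker container ls

# Or print only the names column:
docker container ls --format '{{.Names}}'
# Example output (yours may use underscores instead of hyphens):
# environment-datascience-notebook-1
```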
The port number is set in the `environment/compose.yml` file under the `ports:` key, and is the first value in the colon-separated (`:`) port mapping. In the template repository, `PORT_NUMBER` is `10000` and is found on line 9 of the file.
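The relevant entry looks roughly like the sketch below; the container-side port (`8888`) is Jupyter's usual default and is an assumption here, so check your own file.

```yaml
# Rough sketch of the ports entry in environment/compose.yml
services:
  datascience-notebook:
    ports:
      - "10000:8888"   # PORT_NUMBER is the first (host-side) value
```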
This is where your analysis should go.
These files are shared between the Docker container and your computer (the host machine), and are visible in JupyterLab.
Any data that should not be committed to the repository should be ignored using `data/.gitignore`.
Autosave files found in `.ipynb_checkpoints/` are excluded from Git version control. Be sure to save your work in JupyterLab before adding your changes to Git.
A classic `.gitignore` file for specifying which files Git should consider ineligible for version control. This is how we exclude autosave information from our repository.
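As a minimal sketch, that exclusion can be a single pattern in a `.gitignore` file; the template's actual `.gitignore` files may contain additional entries.

```gitignore
# Ignore Jupyter's autosave/checkpoint directories anywhere in the repository
.ipynb_checkpoints/
```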
These contain demo code and may be removed by the researcher.
This notebook uses Python to retrieve a given URL (google.com), print the HTTP response code, and then print the HTML contents of that URL.
This notebook uses R to execute a classic "hello world" program.
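For reference, the behavior described for the Python demo notebook can be sketched as below; this is not the notebook's actual code, and the use of the `requests` library here is an assumption.

```python
# A sketch of the behavior described above, not the notebook's actual code.
import requests

response = requests.get("https://google.com")
print(response.status_code)  # the HTTP response code
print(response.text)         # the HTML contents of the page
```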
This is where your data should go. These files are shared between the Docker container and your computer (the host machine), and are visible in JupyterLab as `data/`. Any data that should not be committed to the repository should be ignored using `data/.gitignore`.
A classic `.gitignore` file for specifying which files Git should consider ineligible for version control. This is how we exclude autosave information from our repository.
This is where the Docker, Python package, and Jupyter server configuration live. You can generally ignore most of it, but there are four files to be aware of:
- Python packages are specified in `environment/requirements.txt`. See Python package requirements for more details (a brief illustration of the format follows this list).
- R packages are specified in `environment/r-packages.R`. See R package requirements for more details.
- JupyterLab extensions are specified in `environment/jupyter-extensions.csv`. See JupyterLab extension requirements for more details.
- Jupyter server configuration can be managed in `environment/jupyter_server_config.py`. You probably won't need to modify this, but it's good to know about.
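As a brief illustration of the `requirements.txt` format referenced above, packages are listed one per line in standard pip syntax; the names and versions below are hypothetical examples, not packages the template actually pins.

```text
pandas==2.2.2
requests>=2.31
```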