Authors
Kevin T. Chu <[email protected]>
Bonita Song
Srikar Munukutla
The SpectraML project team researches applications of machine learning to the analysis of spectroscopic data. We are currently focused on the following core areas:
- feature engineering (e.g., preprocessing algorithms for spectra);
- machine learning algorithms (e.g., artificial neural networks, CNNs); and
- performance evaluation framework (e.g., bootstrap, k-fold cross-validation).
As a model problem, we are developing a machine learning system for classifying reflectance spectra from the USGS Spectral Library Version 7 dataset.
- Python
  - See requirements.txt for the list of Python packages required for this project.
- autoenv
- virtualenv
- virtualenvwrapper
README.markdown
requirements.txt
bin/
config/
data/
docs/
lab-notebook/
lib/
reports/
- README.markdown: this file
- requirements.txt: pip requirements file containing Python packages for data science, testing, and assessing code quality
- bin: directory containing utility programs
- config: directory containing template configuration files (e.g., autoenv configuration file)
- data: directory where project datasets should be placed. Note: in general, datasets should not be committed to the git repository. Instead, datasets should be placed into this directory (either manually or using automation scripts) and referenced by Jupyter notebooks. See Section 2 for details.
- docs: directory containing project documentation and notes
- lab-notebook: directory containing Jupyter notebooks used for experimentation and development. Jupyter notebooks saved in this directory should (1) have a single author and (2) be dated.
- lib: directory containing source code developed to support the project
- reports: directory containing Jupyter notebooks that present and record final results. Jupyter notebooks saved in this directory should be polished, contain final analysis results, and be the work product of the entire data science team.
Template files and directories are indicated by the 'template' suffix. These files and directories are intended to simplify setup of the lab notebook. When appropriate, they should be renamed (with the 'template' suffix removed).
- Create a Python virtual environment for the project.

      $ mkvirtualenv -p /PATH/TO/PYTHON PROJECT_NAME

- Install the required Python packages.

      $ pip install -r requirements.txt

- Set up autoenv.
  - Copy config/env.template to .env in the project root directory.
  - Set the template variables in .env (indicated by {{ }} notation).
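The template step above can also be scripted. The following is a minimal sketch, not part of the project's tooling: it assumes that placeholders in config/env.template look like `{{ NAME }}`, and the function names (`render_template`, `write_env_file`) and the variable names in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Matches placeholders such as {{ DATA_DIR }} or {{DATA_DIR}}.
# The exact placeholder names used in config/env.template are an assumption.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text, values):
    """Replace each {{ NAME }} placeholder in text with values[NAME]."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for template variable {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)


def write_env_file(template_path, env_path, values):
    """Render the template at template_path and write the result to env_path."""
    rendered = render_template(Path(template_path).read_text(), values)
    Path(env_path).write_text(rendered)
    return rendered
```

From the project root, a call such as `write_env_file("config/env.template", ".env", {"DATA_DIR": "/path/to/data"})` would produce the .env file, provided the dictionary supplies every variable that the template actually uses. A missing variable raises `KeyError` rather than leaving a placeholder in place.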
- A zip file containing the full USGS Spectral Library (Version 7) is included in the data directory. To prepare the spectra data for use in Jupyter notebooks, use the following instructions.
  - Extract the data files in ASCIIdata_splib07a.zip.

        $ cd data
        $ unzip ASCIIdata_splib07a.zip

  - Generate standardized versions of the spectra by using the standardize-spectra script. standardize-spectra carries out the following operations:
    - fills in missing data points with interpolated values;
    - resamples spectra so that they all have the same abscissa values;
    - saves spectra to CSV files containing wavelength and reflectance values;
    - generates the spectra-metadata.csv database containing metadata for each spectrum; and
    - names each spectrum file using the unique ID (in spectra-metadata.csv) associated with the spectrum.
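To illustrate the first two operations (filling in missing points and resampling onto common abscissa values), here is a minimal sketch using linear interpolation. It is not the standardize-spectra implementation: it assumes missing reflectance values are marked as NaN, that wavelengths are strictly increasing, and that the common grid is evenly spaced. The function names are hypothetical. With NumPy available, `numpy.interp` would be the usual way to do the interpolation step.

```python
import math


def interpolate(x, xs, ys):
    """Linearly interpolate at x from points (xs, ys); xs must be increasing.

    Values outside [xs[0], xs[-1]] are clamped to the nearest endpoint.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def standardize(wavelengths, reflectances, num_wavelengths):
    """Drop missing (NaN) points, then resample onto an evenly spaced grid.

    num_wavelengths must be at least 2. Marking missing values with NaN
    is an assumption of this sketch.
    """
    valid = [(w, r) for w, r in zip(wavelengths, reflectances)
             if not math.isnan(r)]
    xs = [w for w, _ in valid]
    ys = [r for _, r in valid]
    step = (xs[-1] - xs[0]) / (num_wavelengths - 1)
    grid = [xs[0] + i * step for i in range(num_wavelengths)]
    return grid, [interpolate(w, xs, ys) for w in grid]
```

Because the missing points are dropped before resampling, any grid wavelength that falls in a gap receives a value interpolated from its nearest valid neighbors, which covers both operations in one pass.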
  Usage

  The following examples show how to use standardize-spectra. Note: if the standardize-spectra command cannot be found, check that bin is on your path.

  - Show help message.

        $ standardize-spectra --help

  - Basic usage uses the default output directory and wavelength values.

        $ cd data
        $ standardize-spectra ASCIIdata_splib07a spectrometers

  - Set a custom output directory by using the -o OUTPUT_DIR option.

        $ cd data
        $ standardize-spectra ASCIIdata_splib07a spectrometers -o custom-location

  - Set the number of wavelengths in the standardized spectra by using the --num-wavelengths NUM_WAVELENGTHS option.

        $ cd data
        $ standardize-spectra ASCIIdata_splib07a spectrometers \
              --num-wavelengths 2000
- Use lists of spectra IDs to define collections of spectra. Within Jupyter notebooks, use the following directory paths to facilitate access to spectra files.

      import os

      # Data directories
      data_dir = os.environ['DATA_DIR']
      spectra_data_dir = os.path.join(data_dir, 'ASCIIdata_splib07a')

      # Path to data file for spectrum with ID=12345
      spectrum_path = os.path.join(spectra_data_dir, '12345.csv')
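As a sketch of how a list of spectra IDs might define a collection, the helpers below build the path for each ID and load the corresponding files. The function names are hypothetical, and the CSV layout (a header row followed by wavelength and reflectance columns) is an assumption that should be checked against the generated files.

```python
import csv
import os


def path_for_spectrum(spectra_data_dir, spectrum_id):
    """Return the path to the CSV file for the spectrum with the given ID."""
    return os.path.join(spectra_data_dir, f"{spectrum_id}.csv")


def load_spectrum(path):
    """Read (wavelength, reflectance) rows from a two-column CSV file.

    Assumes the first row is a header; adjust if the files have none.
    """
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip the assumed header row
        return [(float(w), float(r)) for w, r in reader]


def load_collection(spectra_data_dir, spectrum_ids):
    """Load every spectrum in a collection, keyed by spectrum ID."""
    return {
        spectrum_id: load_spectrum(path_for_spectrum(spectra_data_dir, spectrum_id))
        for spectrum_id in spectrum_ids
    }
```

In a notebook, a collection could then be loaded with `load_collection(spectra_data_dir, ['12345'])`, where the ID list comes from whichever spectra the analysis needs.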
- J. Whitmore. "Jupyter Notebook Best Practices for Data Science" (2016/09).