Integrating Continuous Plankton Recorder data into NERACOOS & ERDDAP
This repository contains the necessary code and documentation to support hosting the Gulf of Maine Continuous Plankton Recorder Survey Data on ERDDAP.
This repository documents the data provenance for Continuous Plankton Recorder (CPR) data obtained from two scientific research agencies (NOAA and the MBA) and covering two sampling transects (the Gulf of Maine and the Mid-Atlantic Bight).
Raw data from all sources is contained in the data_raw/ directory.
Code that prepares the raw data for ERDDAP, along with any necessary documentation, is organized by the source the data was received from and can be found in the following sub-folders:
Sub-Folder | Description |
---|---|
GulfOfMaine_NOAA | Gulf of Maine CPR data obtained from NOAA |
GulfOfMaine_MBA | Gulf of Maine CPR data obtained from MBA |
MidAtlantic_NOAA | Mid-Atlantic Bight CPR data obtained from NOAA |
MidAtlantic_MBA | Mid-Atlantic Bight CPR data obtained from MBA |
These resources have been processed independently due to differences in measurement units and organizational structure. How each dataset was received and treated prior to uploading into ERDDAP is documented within each of the corresponding sub-folders.
The full processing pipeline, from the raw data to the final ERDDAP formats, has been implemented using the {targets} R package and can be recreated in full by running the following code in an active R session (assuming all required R packages are installed):
library(targets)
tar_make()
This will recreate the processing steps outlined in _targets.R that transform the raw files into the format uploaded onto ERDDAP:
The DAG above shows a simplified representation of the steps for the NOAA Continuous Plankton Recorder Survey’s Zooplankton data, where the taxonomic information found in the header is separated from the abundance information and later joined back after it has been reshaped. Similar cleanup paths exist for the data obtained from NOAA as well as the data obtained from the MBA.
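The split, reshape, and rejoin pattern described above can be sketched in base R. This is a minimal illustration only, not the code in _targets.R: the toy table, its two header rows, and all column names are hypothetical stand-ins for the raw file layout.

```r
# Toy stand-in for a raw CPR sheet: the first two rows are a taxonomic
# header (taxon name, development stage); the remaining rows are samples.
# Column names and values here are hypothetical.
raw <- data.frame(
  station = c("taxon", "stage", "S1", "S2"),
  V1      = c("Calanus finmarchicus", "adult", "12", "0"),
  V2      = c("Calanus finmarchicus", "copepodite", "40", "7"),
  stringsAsFactors = FALSE
)

# 1. Split the header rows from the abundance rows
header    <- raw[1:2, -1]
abundance <- raw[-(1:2), ]

# 2. Turn the header into a lookup table: one row per abundance column
taxa_key <- data.frame(
  column = names(header),
  taxon  = unlist(header[1, ], use.names = FALSE),
  stage  = unlist(header[2, ], use.names = FALSE),
  stringsAsFactors = FALSE
)

# 3. Reshape abundance from wide to long: one row per station and column
abundance_long <- data.frame(
  station   = rep(abundance$station, times = ncol(abundance) - 1),
  column    = rep(names(abundance)[-1], each = nrow(abundance)),
  abundance = as.numeric(unlist(abundance[, -1], use.names = FALSE)),
  stringsAsFactors = FALSE
)

# 4. Join the taxonomic information back onto the long table
cpr_long <- merge(abundance_long, taxa_key, by = "column")
```

Keeping the header as its own lookup table means the abundance values can be reshaped as plain numbers and the taxonomic labels are attached only once, at the end.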
Because of how CPR data is stored and maintained within these two institutions, conversion to a standard unit of measurement is necessary when working with CPR data from both sources jointly.
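As a rough sketch of what such a conversion can look like, suppose, purely for illustration, that one source reports counts per sample and the other reports densities per 100 m³. The function name, the units, and the default filtered volume of 3 m³ below are all assumptions; the units actually used by each source are documented in the corresponding sub-folders.

```r
# Convert counts per CPR sample to a density per 100 m^3.
# `volume_m3` is the volume of water filtered per sample; the default of
# 3 m^3 is an illustrative assumption, not a value taken from this repository.
per_sample_to_per_100m3 <- function(count_per_sample, volume_m3 = 3) {
  stopifnot(is.numeric(count_per_sample), volume_m3 > 0)
  count_per_sample / volume_m3 * 100
}

# Hypothetical counts per sample
example_counts  <- c(0, 3, 6)
example_density <- per_sample_to_per_100m3(example_counts)
```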
In addition to unit conversions, there are taxa and development stages that are recorded inconsistently across the two data sources and inconsistently through time. Working across the data sources requires additional data wrangling, which is accomplished using a key that maps the recorded stages to coarser development stage groupings.
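A key of this kind can be applied with a join followed by an aggregation. The sketch below uses made-up stage labels and groupings to show the pattern; the real key and its categories live in the repository and may be structured differently.

```r
# Hypothetical abundance records with source-specific stage labels
cpr <- data.frame(
  taxon     = c("Calanus", "Calanus", "Calanus", "Calanus"),
  stage     = c("CI-IV", "CV-VI", "copepodite i-iv", "adult"),
  abundance = c(10, 5, 8, 2),
  stringsAsFactors = FALSE
)

# Hypothetical key mapping each recorded stage to a coarser grouping
stage_key <- data.frame(
  stage        = c("CI-IV", "copepodite i-iv", "CV-VI", "adult"),
  stage_coarse = c("early", "early", "late", "late"),
  stringsAsFactors = FALSE
)

# Attach the coarse grouping; all.x = TRUE keeps records whose stage is
# missing from the key (they get NA), so gaps in the key can be spotted
cpr_keyed <- merge(cpr, stage_key, by = "stage", all.x = TRUE)

# Sum abundance within each taxon and coarse grouping
cpr_coarse <- aggregate(abundance ~ taxon + stage_coarse,
                        data = cpr_keyed, FUN = sum)
```

Note that `aggregate()` with a formula drops rows containing NA by default, so it is worth checking `cpr_keyed$stage_coarse` for missing values before summing.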
Information on resolving the differences between these two data resources can be found in the Full_Timeseries_Workup/ sub-folder, with examples of code working from ERDDAP as a starting point.
For those interested in working with a complete timeseries, we have made one available following minor data wrangling changes to the original datasets.
The complete timeseries can be accessed via ERDDAP here: NERACOOS ERDDAP
It can also be accessed using software packages like {rerddap} for R or {erddapy} for Python:
# Package to interface with ERDDAP
library(rerddap)
# 1. Zooplankton
# Get the tabledap information from the server link and dataset_id
cpr_info <- info(url = "http://ismn.erddap.neracoos.org/erddap",
                 datasetid = "gom_cpr_zooplankton_full")
# Use the tabledap function to import all the data (optionally add filters)
gom_zp <- tabledap(cpr_info)
# 2. Phytoplankton
# Get the tabledap information from the server link and dataset_id
cpr_info <- info(url = "http://ismn.erddap.neracoos.org/erddap",
                 datasetid = "gom_cpr_phytoplankton_full")
# Use the tabledap function to import all the data (optionally add filters)
gom_phyto <- tabledap(cpr_info)
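For readers who want to see what a filtered request looks like without {rerddap}, the sketch below builds an ERDDAP tabledap CSV URL in base R. The helper `build_tabledap_url()` is hypothetical and not part of any package, and the variable names and time constraint in the example are assumptions; check the dataset's ERDDAP page for the variables it actually exposes.

```r
# Build an ERDDAP tabledap CSV request URL from its parts.
# `variables` and `constraints` are supplied by the caller; the values used
# in the example below are hypothetical.
build_tabledap_url <- function(server, dataset_id, variables = character(),
                               constraints = character()) {
  # Percent-encode characters such as ">" and "<" in each constraint
  encoded <- vapply(constraints, utils::URLencode, character(1),
                    USE.NAMES = FALSE)
  # Query = comma-separated variables, then "&"-separated constraints
  query <- paste(c(paste(variables, collapse = ","), encoded), collapse = "&")
  paste0(server, "/tabledap/", dataset_id, ".csv?", query)
}

# Example: two assumed variables, records from 2000 onward
zp_url <- build_tabledap_url(
  server      = "http://ismn.erddap.neracoos.org/erddap",
  dataset_id  = "gom_cpr_zooplankton_full",
  variables   = c("time", "latitude"),
  constraints = "time>=2000-01-01T00:00:00Z"
)
```

The resulting URL can be opened in a browser or passed to `read.csv()`. ERDDAP's `.csv` output carries a units row directly beneath the column names, which usually needs to be dropped after reading.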
Details on how the complete timeseries was generated, with code and notes on data wrangling decisions, can be found here: www.github.com/gulfofmaine/neracoos_cpr_data/Full_Timeseries_Workup
Funding for making these resources available was provided through grant awards from the National Science Foundation and the Lenfest Ocean Program, with communication and support from the Northeast Fisheries Science Center and the Marine Biological Association.
Whenever working with different datasets, or differently managed versions of the same data, it is common to have to perform data reshaping steps in order to join across resources. CPR data made available via ERDDAP is no different. Below are a few common processing workflows that a user of this data may find helpful: