Project to integrate Continuous Plankton Recorder data into NERACOOS

gulfofmaine/NERACOOS_CPR_DATA

NERACOOS: Continuous Plankton Recorder Data Support

Integrating Continuous Plankton Recorder data into NERACOOS & ERDDAP

This repository contains the necessary code and documentation to support hosting the Gulf of Maine Continuous Plankton Recorder Survey Data on ERDDAP.

Organization:

This repository documents the data provenance for Continuous Plankton Recorder (CPR) data obtained from two scientific research agencies (NOAA and the Marine Biological Association, MBA) and covering two sampling transects (the Gulf of Maine and the Mid-Atlantic Bight).

Raw data from all sources is contained in the data_raw/ directory. Code that prepares the raw data for ERDDAP and any necessary documentation is specific to the source that the data was received from. This information can be found in the following sub-folders:

Sub-Folder          Description
GulfOfMaine_NOAA    Gulf of Maine CPR data obtained from NOAA
GulfOfMaine_MBA     Gulf of Maine CPR data obtained from MBA
MidAtlantic_NOAA    Mid-Atlantic Bight CPR data obtained from NOAA
MidAtlantic_MBA     Mid-Atlantic Bight CPR data obtained from MBA

These resources have been processed independently due to differences in measurement units and organizational structure. Each of the corresponding sub-folders documents how its dataset was received and treated prior to uploading into ERDDAP.

Reproducing the Data Transformations

The full processing pipeline from the raw data to their final ERDDAP formats has been implemented using the {targets} R package. Assuming all required R packages are installed, it can be recreated in full by running the following code in an active R session:

library(targets)
tar_make()

This will recreate the processing steps outlined in _targets.R that transform the raw files into the format uploaded onto ERDDAP.

[Figure: DAG of the processing steps defined in _targets.R]

The DAG shows a simplified representation of the steps for the NOAA Continuous Plankton Recorder Survey’s zooplankton data, where the taxonomic information found in the header is separated from the abundance information and joined back after the abundance data has been reshaped. Similar cleanup paths exist for the other data obtained from NOAA as well as for the data obtained from the MBA.

Abundance Unit Differences

Because of how the CPR data is stored and maintained within these two institutions, conversion to a standard unit of measurement is necessary when working jointly with CPR data from both sources.
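As a rough illustration of what such a conversion involves, the sketch below rescales a per-sample count to a count per 100 cubic meters. It assumes one source reports abundance per sample and the other per 100 m³; the function name and the filtered volume used in the example are hypothetical, and the actual units and conversion factors are those documented in each source's sub-folder.

```python
def to_per_100m3(count_per_sample, filtered_volume_m3):
    """Rescale a per-sample count to a count per 100 cubic meters.

    Assumption: `filtered_volume_m3` is the volume of water filtered for
    one sample. Take the real value from the source's documentation.
    """
    if filtered_volume_m3 <= 0:
        raise ValueError("filtered_volume_m3 must be positive")
    return count_per_sample * 100.0 / filtered_volume_m3


# Illustrative values only: 6 individuals counted in a sample of 3 m^3.
example = to_per_100m3(6, 3.0)
```

Keeping the filtered volume as an explicit argument, rather than a hard-coded constant, makes the assumption visible wherever the conversion is applied.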

Taxon Naming Differences

In addition to unit conversions, taxa and development stages are recorded inconsistently across the two data sources and have been used inconsistently through time. Working across the data sources therefore requires additional data wrangling, which is accomplished with a key that maps the recorded labels to coarser development stage groupings.

Information on resolving the differences between these two data resources can be found in the Full_Timeseries_Workup/ sub-folder, with code examples that use ERDDAP as a starting point.
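The idea of a stage key can be sketched as a lookup table that maps each recorded stage label to a coarser group, after which abundances are summed within each taxon and group. The labels and groupings below are hypothetical placeholders, not the key used in this repository, and the sketch assumes abundances have already been converted to a common unit.

```python
from collections import defaultdict

# Hypothetical key: recorded stage label -> coarser grouping.
STAGE_KEY = {
    "i-iv": "i-vi",
    "v-vi": "i-vi",
    "i-vi": "i-vi",
    "unstaged": "unstaged",
}


def coarsen_stages(records, key=STAGE_KEY):
    """Sum abundance per (taxon, coarse stage group).

    `records` is an iterable of (taxon, stage, abundance) tuples.
    A stage label missing from the key raises KeyError, so gaps in
    the key are noticed rather than silently dropped.
    """
    totals = defaultdict(float)
    for taxon, stage, abundance in records:
        totals[(taxon, key[stage])] += abundance
    return dict(totals)


# Illustrative records only.
example = coarsen_stages([
    ("calanus finmarchicus", "i-iv", 10.0),
    ("calanus finmarchicus", "v-vi", 5.0),
])
```

Raising on unknown labels is a deliberate choice in this sketch: a label that appears in only one source or one period is exactly the kind of inconsistency the key is meant to surface.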

Full Gulf of Maine Timeseries

For those interested in working with a complete timeseries, we have made one available following minor data wrangling changes to the original datasets.

The complete timeseries can be accessed via ERDDAP here: NERACOOS ERDDAP

It can also be accessed with software packages such as {rerddap} for R or {erddapy} for Python:

# Package to interface with ERDDAP
library(rerddap)

# 1. Zooplankton
# Get the tabledap information from the server link and dataset_id
cpr_info <- info(url = "http://ismn.erddap.neracoos.org/erddap", 
                 datasetid = "gom_cpr_zooplankton_full")

# Use the tabledap function to import all the data (optionally add filters)
gom_zp <- tabledap(cpr_info)


# 2. Phytoplankton
# Get the tabledap information from the server link and dataset_id
cpr_info <- info(url = "http://ismn.erddap.neracoos.org/erddap", 
                 datasetid = "gom_cpr_phytoplankton_full")

# Use the tabledap function to import all the data (optionally add filters)
gom_phyto <- tabledap(cpr_info)
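For Python users, the same requests can also be expressed as plain ERDDAP tabledap URLs. The sketch below uses only the standard library and does not contact the server. The server URL and dataset ids are taken from the R example above; the helper name and the `time` constraint are illustrative assumptions, so check the dataset's variable list on ERDDAP before filtering.

```python
from urllib.parse import quote

# Server URL and dataset ids as used in the R example above.
SERVER = "http://ismn.erddap.neracoos.org/erddap"


def tabledap_csv_url(server, dataset_id, variables=(), constraints=()):
    """Build an ERDDAP tabledap request URL for CSV output.

    `variables` is an optional list of column names to return;
    `constraints` is an optional list of filters such as
    "time>=2000-01-01" (the variable name here is an assumption).
    """
    base = f"{server.rstrip('/')}/tabledap/{dataset_id}.csv"
    # Result variables come first, comma separated.
    query = ",".join(variables)
    # Each constraint is prefixed with "&" and percent-encoded.
    query += "".join("&" + quote(c, safe="=") for c in constraints)
    return f"{base}?{query}" if query else base


zoo_url = tabledap_csv_url(SERVER, "gom_cpr_zooplankton_full")
phyto_url = tabledap_csv_url(
    SERVER, "gom_cpr_phytoplankton_full", constraints=["time>=2000-01-01"]
)
```

The resulting URL can be handed to any CSV reader. ERDDAP's `.csv` output places a units row directly under the header, so readers such as pandas typically need to skip that second row.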

Details on how the complete timeseries was generated, with code and notes on data wrangling decisions can be found here: www.github.com/gulfofmaine/neracoos_cpr_data/Full_Timeseries_Workup

Project Funding:

Funding for making these resources available was provided through grant awards from the National Science Foundation and from the Lenfest Ocean Program, with communication and support from the Northeast Fisheries Science Center and the Marine Biological Association.


Additional Resources (Under Development):

Whenever working with different datasets, or differently managed versions of the same data, it is common to have to perform data reshaping steps in order to join across resources. CPR data made available via ERDDAP is no different. Below are a few common processing workflows that a user of this data may find helpful:
