Spatial Modeling on NDBC Data

An empirical study comparing deterministic, statistical, and machine learning (ML) algorithms for spatial modeling of significant wave height. The values were collected by buoys and sea monitoring stations managed by the United States' National Data Buoy Center (NDBC), located near the coasts of the southern Atlantic regions of the United States, including the Gulf of Mexico and parts of the Caribbean.


Spatio-temporal interpolation of wave height in the study area at a given timestamp. Black dots mark the points that were actually sampled. Red circles mark points that were excluded from the training data.

Techniques Studied:

Deterministic methods: linear barycentric interpolation, Inverse Distance Weighting (IDW), and Radial Basis Function (RBF) interpolation.
Statistical methods: Kriging Interpolation (Gaussian Process Regression).
Machine Learning methods: LightGBM and Random Forests.
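
To make the simplest of the deterministic methods concrete, here is a minimal sketch of Inverse Distance Weighting. It is not the code used in this study: the function names, the `power` parameter, and the choice of great-circle (haversine) distance are assumptions made for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def idw(samples, lat, lon, power=2.0):
    """Estimate a value at (lat, lon) from samples given as (lat, lon, value).

    Each sample is weighted by 1 / distance**power, so nearby buoys
    dominate the estimate. `power=2.0` is an assumed default.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for s_lat, s_lon, value in samples:
        d = haversine_km(s_lat, s_lon, lat, lon)
        if d == 0.0:
            # The query coincides with a sampled point: return it exactly.
            return value
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total
```

Because the estimate is a weighted average, it can never fall outside the range of the sampled values. This is one reason a method of this kind may struggle when the target point lies far from every buoy.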

Technical Approach:

The experimental study was conducted on a large dataset of hourly wave and meteorological measurements, covering the 2010-2022 period, collected by buoys moored in the Mid-Atlantic near the southeast coast of the continental United States.
Data was downloaded directly from the historical standard meteorological archive of the NDBC. The location of each targeted buoy was obtained by scraping the webpage that lists its individual information (e.g. https://www.ndbc.noaa.gov/station_page.php?station=44008).


Timeseries of wave height measurements from buoy #42019

General preprocessing was implemented as a Kedro pipeline that detects and parses missing values, formats the columns, and converts the data to GeoParquet format (GeoPandas was used for read/write operations and to handle the data as geospatial data).
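
The missing-value step can be sketched as follows. This is a stdlib-only illustration, not the project's Kedro node. It assumes that missing readings in the historical files are encoded as runs of 9s (for example `99.00` for wave height, `WVHT`) and as `MM` in realtime files. The sentinel table and the reduced column set in the usage example are assumptions, and the sample rows are made up.

```python
# Assumed per-column sentinels for missing data; check them against the
# NDBC measurement descriptions before relying on them.
ASSUMED_SENTINELS = {"WVHT": 99.0, "WDIR": 999.0, "PRES": 9999.0}


def parse_stdmet(text, sentinels=ASSUMED_SENTINELS):
    """Parse whitespace-separated rows into dicts, mapping sentinels to None."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first line holds the column names, prefixed with '#'.
    header = lines[0].lstrip("#").split()
    records = []
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # skip the units line
        record = {}
        for name, token in zip(header, line.split()):
            if token == "MM":
                record[name] = None
                continue
            value = float(token)
            # A value equal to the column's sentinel means "not measured".
            record[name] = None if sentinels.get(name) == value else value
        records.append(record)
    return records
```

A usage example on two made-up rows:

```python
sample = """#YY  MM DD hh mm WDIR WVHT PRES
#yr  mo dy hr mn degT m hPa
2015 01 01 00 50 120 1.25 1015.2
2015 01 01 01 50 999 99.00 9999.0
"""
rows = parse_stdmet(sample)
# The second row has every measured column replaced by None.
```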
The data was then split into training and test sets. The test set consisted of several subsets, each used to evaluate the algorithms under the specific spatial configuration of the buoys available in that subset.
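
One plausible reading of this split is a per-period spatial holdout: within a period, all rows from a chosen group of buoys go to the test subset and the remaining buoys form the training data. The sketch below illustrates that idea only. The record layout (a `station` key) and the function name are hypothetical, and the station IDs in the usage example are placeholders.

```python
def split_by_held_out_buoys(records, held_out_stations):
    """Split the records of one period into (train, test) by station ID.

    records: iterable of dicts that each carry a "station" key (assumed layout).
    held_out_stations: IDs of the buoys withheld from training.
    """
    held_out = set(held_out_stations)
    train, test = [], []
    for record in records:
        # A buoy is entirely in train or entirely in test, never both,
        # so the test error measures prediction at unseen locations.
        (test if record["station"] in held_out else train).append(record)
    return train, test
```

A usage example with placeholder values:

```python
period_rows = [
    {"station": "42019", "wvht": 1.2},
    {"station": "42020", "wvht": 0.9},
    {"station": "42019", "wvht": 1.4},
]
train, test = split_by_held_out_buoys(period_rows, {"42020"})
```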


Test subsets evaluated in this area. Red circles mark the buoys that were withheld from the training set for each period mentioned.

Evaluation was conducted by writing an individual MLflow experiment for each algorithm and running it against each test subset in parallel (see the experiments/ directory for examples).
The results of the experiments were then analyzed by comparing the performance of the algorithms on the test sets.
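
A per-subset comparison of this kind can be based on the root mean squared error (RMSE) of each algorithm on each test subset. The following is a minimal sketch of that aggregation, assuming the predictions are available as `(subset_label, observed, predicted)` tuples. That layout and the function name are illustrative and not taken from the repository.

```python
import math
from collections import defaultdict


def rmse_per_subset(rows):
    """Return {subset_label: RMSE} for rows of (subset_label, observed, predicted)."""
    squared_error = defaultdict(float)
    count = defaultdict(int)
    for label, observed, predicted in rows:
        squared_error[label] += (observed - predicted) ** 2
        count[label] += 1
    # RMSE = sqrt(mean of squared errors), computed independently per subset.
    return {label: math.sqrt(squared_error[label] / count[label]) for label in squared_error}
```

Running this once per algorithm gives one RMSE per test subset, which can then be averaged or compared side by side.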

Results:

The results of the study favour ML algorithms over the other methods when they are paired with a strong feature set that captures the spatial distribution of the data well. ML algorithms achieve errors similar to those of the other algorithms on sets that test interpolation inside the convex hull of the data (sets A, B, C), but perform much better on points that require extrapolation outside the convex hull of the data (sets D, E, F).
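
Whether a target point calls for interpolation or extrapolation can be decided by testing it against the convex hull of the training buoys. The sketch below uses Andrew's monotone chain algorithm and treats (longitude, latitude) pairs as planar coordinates, a simplifying assumption that is reasonable only for a regional study area. The function names are illustrative and this is not the study's code.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, query):
    """True if query lies inside or on a counter-clockwise convex hull."""
    n = len(hull)
    if n < 3:
        return False
    # The point must lie on the left of (or on) every directed edge.
    return all(cross(hull[i], hull[(i + 1) % n], query) >= 0 for i in range(n))
```

A point for which `inside_hull` returns `False` corresponds to the extrapolation case described above.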


Overall error metrics	Avg RMSE per test set


Visual results per evaluated technique

Of the two ML methods, gradient boosting (LightGBM) was the most successful, both in accuracy and in inference time, where it ran 3x faster than Random Forest.
