An empirical study and comparison of deterministic, statistical, and ML algorithms for spatial modeling of significant wave height. The values were collected by buoys and sea monitoring stations managed by the United States' National Data Buoy Center (NDBC) and located near the coasts of the southern Atlantic regions of the United States, including the Gulf of Mexico and parts of the Caribbean.
Spatio-temporal interpolation of wave height over the area at a single timestamp. Black dots mark the points that were actually sampled. Red circles mark points that were excluded from the training data.
The experimental study was conducted on a large dataset of hourly wave and meteorological measurements covering the 2010-2022 period, collected by buoys moored in the Mid Atlantic near the southeast coast of the continental United States.
Timeseries of wave height measurements from buoy #42019
General preprocessing was implemented as a Kedro pipeline that detects and parses missing values, formats the columns, and converts the data to GeoParquet format. GeoPandas was used for read/write operations and for working with the data as geospatial data.
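The missing-value step can be illustrated with a minimal, standard-library sketch. It is not the pipeline's actual code: the function names, the row layout, and the sentinel values (`99.0` and `MM` as markers for a missing wave height) are assumptions made for illustration and should be checked against the NDBC file documentation.

```python
import math

# Assumed markers for a missing wave-height reading (hypothetical defaults;
# verify against the data source before relying on them).
MISSING_WVHT = 99.0
MISSING_TOKEN = "MM"


def parse_wave_height(raw: str, sentinel: float = MISSING_WVHT) -> float:
    """Return the wave height in metres, or NaN when the value is missing."""
    text = raw.strip()
    if not text or text.upper() == MISSING_TOKEN:
        return math.nan
    try:
        value = float(text)
    except ValueError:
        # Unparseable text is treated as missing rather than raising.
        return math.nan
    return math.nan if math.isclose(value, sentinel) else value


def clean_rows(rows):
    """Normalise raw string rows into typed records.

    Each input row is a dict with the hypothetical keys
    'station', 'lon', 'lat' and 'wvht', all holding strings.
    """
    cleaned = []
    for row in rows:
        cleaned.append(
            {
                "station": row["station"].strip(),
                "lon": float(row["lon"]),
                "lat": float(row["lat"]),
                "wvht": parse_wave_height(row["wvht"]),
            }
        )
    return cleaned
```

Records cleaned this way could then be loaded into a GeoPandas `GeoDataFrame` and written out with its `to_parquet` method, which is the role GeoPandas plays in the description above.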
The data was then split into training and test sets. The test set consisted of several subsets, each used to evaluate the algorithms under the specific spatial configuration of the buoys available in that subset.
Test subsets evaluated in this area. Red circles mark the buoys that were not available in the training set for each period shown.
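A spatial hold-out of this kind can be sketched as follows. This is an illustration under assumptions, not the repository's split logic: the function name, the record keys (`station`, `time`), and the half-open time window are all hypothetical.

```python
from datetime import datetime


def split_subset(records, held_out_stations, start: datetime, end: datetime):
    """Split the records of one period into (train, test) by station.

    Records whose 'time' falls in [start, end) are kept. Those coming from
    a held-out station go to the test list, the rest to the training list.
    """
    held_out = set(held_out_stations)
    train, test = [], []
    for rec in records:
        if not (start <= rec["time"] < end):
            continue  # outside the period covered by this subset
        if rec["station"] in held_out:
            test.append(rec)
        else:
            train.append(rec)
    return train, test
```

Holding out whole stations, rather than random rows, means every test point sits at a location the model never saw during training, which is what a spatial interpolation benchmark needs to measure.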
Evaluation was conducted by writing an individual MLflow experiment for each algorithm. The experiments were then executed in parallel against each subset of the test data (see the experiments/ directory for examples).
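The fan-out over algorithms and test subsets can be sketched with the standard library alone. This is a simplified illustration, not the code in experiments/: `run_experiments` and its arguments are hypothetical names, MLflow tracking is left out, and a thread pool stands in for whatever parallel runner the project actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiments(algorithms, subsets, evaluate, max_workers=4):
    """Evaluate every (algorithm, subset) pair concurrently.

    algorithms: dict mapping an algorithm name to a model object
    subsets:    dict mapping a subset name to its test data
    evaluate:   callable(model, data) returning a metric value
    Returns a dict keyed by (algorithm_name, subset_name).
    """
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for algo_name, model in algorithms.items():
            for subset_name, data in subsets.items():
                futures[(algo_name, subset_name)] = pool.submit(evaluate, model, data)
    # Leaving the 'with' block waits for all jobs; result() re-raises any error.
    return {key: future.result() for key, future in futures.items()}
```

In a tracked setup, each `evaluate` call would typically open its own run and log its metric there. For CPU-bound models, a process pool is usually a better fit than threads.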
The results of the experiments were then analyzed by comparing the performance of the algorithms on the test sets.
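Since the comparison is reported as RMSE per test set, a minimal sketch of that metric may help. The function names and the `(subset, observed, predicted)` row layout are assumptions for illustration.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def rmse_per_subset(rows):
    """Group (subset_name, observed, predicted) rows and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for name, observed, predicted in rows:
        grouped[name][0].append(observed)
        grouped[name][1].append(predicted)
    return {name: rmse(obs, pred) for name, (obs, pred) in grouped.items()}
```

Because RMSE squares the residuals, a few large misses (for example on extrapolated points) weigh more heavily than many small ones, which makes it a strict metric for this kind of comparison.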
Results:
The results of the study favour ML algorithms over the other methods when they are paired with a strong feature set that captures the spatial distribution of the data well. They achieve errors similar to those of the other algorithms on sets that test interpolation inside the convex hull of the data (sets A, B, C), but they perform much better on points that require extrapolation outside the convex hull (sets D, E, F).
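The interpolation/extrapolation distinction can be made concrete with a small geometric check. The sketch below builds the convex hull of the training buoy positions with Andrew's monotone chain algorithm and tests whether a target point falls inside it. It treats longitude/latitude pairs as planar coordinates, which is an assumption that is reasonable only for a region of limited extent, and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if the point is inside or on the boundary of a CCW hull."""
    n = len(hull)
    if n < 3:
        return False  # a degenerate hull encloses no area
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))
```

A test buoy for which `inside_hull` returns `False` requires extrapolation, the case in which the study reports the largest gap between the ML methods and the others.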
Overall error metrics
Avg RMSE per test set
Visual results per evaluated technique
Of the two ML methods, gradient boosting (LightGBM) was the more successful, not only in accuracy but also in inference time, where it ran about 3x faster than Random Forest.