Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#GSOC PR : Add Preprocess Function for Data Cleaning and Validation #3321

Merged
merged 99 commits into from
Aug 2, 2024
Merged
Changes from 12 commits
Commits
Show all changes
99 commits
Select commit Hold shift + click to select a range
dccd805
Add Preprocess Function for Data Cleaning and Validation
sambhavnoobcoder Jun 27, 2024
eaa0846
Rename Function to NA_preprocess and Add Roxygen Documentation
sambhavnoobcoder Jun 29, 2024
9a887d4
Merge branch 'PecanProject:develop' into Preprocess-Function
sambhavnoobcoder Jun 29, 2024
cbc0a34
added author name and fixed roxygen formatting slightly
sambhavnoobcoder Jun 29, 2024
2879e6a
updated code to work with CNN in place of random forest model
sambhavnoobcoder Jul 11, 2024
8de0277
runner code for the NA_preprocess and NA_Downscale function.
sambhavnoobcoder Jul 11, 2024
e3403e6
Printing Evaulation metrics for the model
sambhavnoobcoder Jul 11, 2024
be698d5
Prepare metrics data for multi-axis line plot visualization
sambhavnoobcoder Jul 11, 2024
35f0a6e
Create multi-metric line plot for ensemble performance visualization
sambhavnoobcoder Jul 11, 2024
48b7c50
Add R-squared plot and combine with MSE/MAE plot
sambhavnoobcoder Jul 11, 2024
ebb32fb
Add scatter plot comparing actual vs predicted values for ensemble mo…
sambhavnoobcoder Jul 11, 2024
8cc689f
Implement Taylor Diagram for ensemble model evaluation
sambhavnoobcoder Jul 11, 2024
7a4a68a
Merge branch 'develop' into Preprocess-Function
dlebauer Jul 11, 2024
064fc30
Updated NA_downscale.Rd with changes with regards to CNN implementation
sambhavnoobcoder Jul 11, 2024
04d439f
Created NA_preprocess.Rd
sambhavnoobcoder Jul 11, 2024
e167ca4
Updated the NA_preprocess to SDA_downscale_preprocess , NA_downscale …
sambhavnoobcoder Jul 12, 2024
f28ebb2
Merge branch 'PecanProject:develop' into Preprocess-Function
sambhavnoobcoder Jul 15, 2024
e524b5e
refactored code leaving only functions in the code
sambhavnoobcoder Jul 15, 2024
a3a92f2
Updated the description of the return type of SDA_preprocess function.
sambhavnoobcoder Jul 15, 2024
9226ef5
Update SDA_downscale function to use base R pipe operator |>
sambhavnoobcoder Jul 17, 2024
f2bab83
Add explicit namespaces for non-base functions
sambhavnoobcoder Jul 17, 2024
cac3c8e
Implement dynamic carbon pool naming in SDA_downscale function
sambhavnoobcoder Jul 17, 2024
ecd5aa1
Improve data scaling to ensure consistency across train and test sets
sambhavnoobcoder Jul 20, 2024
5ce2339
Improve date handling in SDA_downscale_preprocess function
sambhavnoobcoder Jul 20, 2024
bd7cfa5
Refactor SDA_downscale function to accept covariates as direct input
sambhavnoobcoder Jul 20, 2024
a870b93
Updated description for SDA_downscale parameters
sambhavnoobcoder Jul 20, 2024
ca14c09
Renaming variables according to nomenclature standards
sambhavnoobcoder Jul 20, 2024
832801f
Updated documentation wrt variable nomenclature change
sambhavnoobcoder Jul 20, 2024
ce4a597
Add model selection feature to SDA_downscale function
sambhavnoobcoder Jul 21, 2024
d02318f
Update SDA_downscale function documentation
sambhavnoobcoder Jul 21, 2024
ac572da
Refactor SDA_downscale function to remove metrics calculation
sambhavnoobcoder Jul 21, 2024
350278f
Add calculate_metrics function for downscaling results
sambhavnoobcoder Jul 21, 2024
0c4fb82
Add documentation comments to calculate_metrics function
sambhavnoobcoder Jul 21, 2024
7574abc
Refactor SDA_downscale function for improved efficiency
sambhavnoobcoder Jul 21, 2024
6acfd74
Optimize SDA_downscale function and improve covariate handling
sambhavnoobcoder Jul 21, 2024
5b6f577
Create SDA_downscale.Rd
sambhavnoobcoder Jul 21, 2024
50ee452
Create SDA_downscale_preprocess.Rd
sambhavnoobcoder Jul 21, 2024
f812daa
Create calculate_metrics.Rd
sambhavnoobcoder Jul 21, 2024
47656a3
Merge branch 'PecanProject:develop' into Preprocess-Function
sambhavnoobcoder Jul 23, 2024
f55c2de
Delete NA_downscale.Rd
sambhavnoobcoder Jul 23, 2024
d751ffc
Delete NA_preprocess.Rd
sambhavnoobcoder Jul 23, 2024
06bf26b
Renamed function from calculate_metrics to SDA_downscale_metrics
sambhavnoobcoder Jul 23, 2024
bb66142
Refactor SDA_downscale function data prep snippet for improved effici…
sambhavnoobcoder Jul 23, 2024
4d2c6a5
Update SDA_downscale function to make seed optional
sambhavnoobcoder Jul 23, 2024
7e97841
Update SDA_downscale function documentation to improve seeding method…
sambhavnoobcoder Jul 23, 2024
fe5699d
set default model type
sambhavnoobcoder Jul 23, 2024
a20389f
Updated documentation for Default argument
sambhavnoobcoder Jul 23, 2024
1dd9e6c
Removed extra roxygen block
sambhavnoobcoder Jul 23, 2024
35f0b3e
modified title of SDA_downscale function
sambhavnoobcoder Jul 23, 2024
91236ac
Keeping date as a Date type
sambhavnoobcoder Jul 23, 2024
d01f739
Refactor SDA_downscale_preprocess for consistent date handling
sambhavnoobcoder Jul 23, 2024
62a8e44
Updated documentation to suit date type
sambhavnoobcoder Jul 24, 2024
7f782f2
Update documentation for clarification of variable data
sambhavnoobcoder Jul 24, 2024
21a615a
added namespace to functions
sambhavnoobcoder Jul 24, 2024
c8c234a
Unify output structure for RF and CNN models in SDA_downscale function
sambhavnoobcoder Jul 24, 2024
19402db
removed extra description for preprocess function
sambhavnoobcoder Jul 24, 2024
f43a50a
Changed the documentation for predictors for downscale instead of CNN
sambhavnoobcoder Jul 24, 2024
62221d9
Update modules/assim.sequential/R/downscale_function.R
sambhavnoobcoder Jul 24, 2024
0af7df7
Update modules/assim.sequential/R/downscale_function.R
sambhavnoobcoder Jul 24, 2024
1e6a484
Update modules/assim.sequential/R/downscale_function.R
sambhavnoobcoder Jul 24, 2024
529fe6f
update carbon_data call
sambhavnoobcoder Jul 24, 2024
2227fd9
updated full_data preprocess call
sambhavnoobcoder Jul 24, 2024
80e5b2d
Revert "update carbon_data call"
sambhavnoobcoder Jul 24, 2024
9f6554b
Update SDA_downscale.Rd documentation
sambhavnoobcoder Jul 24, 2024
6001fad
Change date type to Date in preprocess function
sambhavnoobcoder Jul 24, 2024
909ae68
Update SDA_downscale.Rd
sambhavnoobcoder Jul 24, 2024
9c08465
Update SDA_downscale_preprocess.Rd
sambhavnoobcoder Jul 24, 2024
3889397
Delete calculate_metrics.Rd
sambhavnoobcoder Jul 24, 2024
71c1013
Create SDA_downscale_metrics.Rd
sambhavnoobcoder Jul 24, 2024
38c9e7a
modified namespaces
sambhavnoobcoder Jul 26, 2024
460672c
Merge branch 'develop' into Preprocess-Function
mdietze Jul 26, 2024
7859206
Update NAMESPACE
sambhavnoobcoder Jul 28, 2024
e8c40ac
Update DESCRIPTION with keras3
sambhavnoobcoder Jul 28, 2024
53a9cae
Update pecan_package_dependencies.csv
sambhavnoobcoder Jul 28, 2024
b9cc4fb
Update pecan_package_dependencies.csv for some changes
sambhavnoobcoder Jul 28, 2024
d659567
Update NAMESPACE
sambhavnoobcoder Jul 28, 2024
aacc890
Reverting NAMESPACE
sambhavnoobcoder Jul 28, 2024
2d248d9
Reverting pecan_package_dependencies.csv to original
sambhavnoobcoder Jul 28, 2024
c9fdd7b
Update DESCRIPTION removing keras3
sambhavnoobcoder Jul 29, 2024
172bf55
degraded roxygen version to 7.3.1
sambhavnoobcoder Jul 29, 2024
5e43b35
Revert to last successful version
sambhavnoobcoder Jul 29, 2024
fa209b4
added keras3 to the suggests
sambhavnoobcoder Jul 29, 2024
b1bd57f
Update DESCRIPTION
sambhavnoobcoder Jul 29, 2024
fa9ed04
Update NAMESPACE
sambhavnoobcoder Jul 29, 2024
0cf995a
Update DESCRIPTION
sambhavnoobcoder Jul 29, 2024
6686b8d
Update NAMESPACE
sambhavnoobcoder Jul 29, 2024
7a66814
Update NAMESPACE
mdietze Jul 30, 2024
a25e663
Update NAMESPACE
mdietze Jul 30, 2024
b7ac546
Update NAMESPACE
mdietze Jul 30, 2024
b61fd4e
Update modules/assim.sequential/DESCRIPTION
mdietze Jul 30, 2024
d976a02
Merge branch 'develop' into Preprocess-Function
mdietze Jul 30, 2024
95018e4
Merge branch 'PecanProject:develop' into Preprocess-Function
sambhavnoobcoder Jul 31, 2024
732b966
Update pecan_package_dependencies.csv
sambhavnoobcoder Jul 31, 2024
6389e93
Update pecan_package_dependencies.csv
sambhavnoobcoder Jul 31, 2024
91ef69f
Update modules/assim.sequential/DESCRIPTION
mdietze Aug 1, 2024
1d215b7
Update modules/assim.sequential/DESCRIPTION
mdietze Aug 1, 2024
7b9ec87
Update docker/depends/pecan_package_dependencies.csv
mdietze Aug 1, 2024
91d3da6
Merge branch 'develop' into Preprocess-Function
mdietze Aug 1, 2024
b32e5e3
Merge branch 'develop' into Preprocess-Function
mdietze Aug 2, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
338 changes: 296 additions & 42 deletions modules/assim.sequential/R/downscale_function.R
Original file line number Diff line number Diff line change
@@ -1,33 +1,90 @@
##' @title Preprocess Data for Downscaling
##' @name NA_preprocess
mdietze marked this conversation as resolved.
Show resolved Hide resolved
##' @author Sambhav Dixit
##'
##' @param data_path Character. File path for .rds containing ensemble data.
mdietze marked this conversation as resolved.
Show resolved Hide resolved
##' @param coords_path Character. File path for .csv file containing the site coordinates, with columns named "lon" and "lat".
##' @param date Character. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within the file supplied to 'data_path'.
mdietze marked this conversation as resolved.
Show resolved Hide resolved
##' @param C_pool Character. Carbon pool of interest. Name must match the carbon pool name found within the file supplied to 'data_path'.
mdietze marked this conversation as resolved.
Show resolved Hide resolved
##' @details This function ensures that the specified date and carbon pool are present in the input data. It also checks the validity of the site coordinates and aligns the number of rows between site coordinates and carbon data.
##'
##' @description This function reads and checks the input data, ensuring that the required date and carbon pool exist, and that the site coordinates are valid.
##'
##' @return A list containing The read .rds data , The cleaned site coordinates ,The extracted and possibly truncated carbon data.
mdietze marked this conversation as resolved.
Show resolved Hide resolved

# Preprocess function to check and clean the data
mdietze marked this conversation as resolved.
Show resolved Hide resolved
NA_preprocess <- function(data_path, coords_path, date, C_pool) {
# Read the input data and site coordinates
input_data <- readRDS(data_path)
site_coordinates <- read_csv(coords_path)

# Ensure the date exists in the input data
if (!date %in% names(input_data)) {
stop(paste("Date", date, "not found in the input data."))
}

# Extract the carbon data for the specified focus year
index <- which(names(input_data) == date)
data <- input_data[[index]]

# Ensure the carbon pool exists in the input data
if (!C_pool %in% names(data)) {
stop(paste("Carbon pool", C_pool, "not found in the input data."))
}

carbon_data <- as.data.frame(t(data[which(names(data) == C_pool)]))
names(carbon_data) <- paste0("ensemble", seq(ncol(carbon_data)))

# Ensure site coordinates have 'lon' and 'lat' columns
if (!all(c("lon", "lat") %in% names(site_coordinates))) {
stop("Site coordinates must contain 'lon' and 'lat' columns.")
}

# Ensure the number of rows in site coordinates matches the number of rows in carbon data
if (nrow(site_coordinates) != nrow(carbon_data)) {
message("Number of rows in site coordinates does not match the number of rows in carbon data.")
if (nrow(site_coordinates) > nrow(carbon_data)) {
message("Truncating site coordinates to match carbon data rows.")
site_coordinates <- site_coordinates[1:nrow(carbon_data), ]
} else {
message("Truncating carbon data to match site coordinates rows.")
carbon_data <- carbon_data[1:nrow(site_coordinates), ]
}
}

message("Preprocessing completed successfully.")
return(list(input_data = input_data, site_coordinates = site_coordinates, carbon_data = carbon_data))
}

##' @title North America Downscale Function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 35f0b3e commit

##' @name NA_downscale
##' @author Joshua Ploshay
##' @author Joshua Ploshay , Sambhav Dixit
##'
##' @param data In quotes, file path for .rds containing ensemble data.
##' @param coords In quotes, file path for .csv file containing the site coordinates, columns named "lon" and "lat".
##' @param preprocessed , In quotes, prepocessed data returned as an output for passing the raw data to the NA_preprocess function.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what "In quotes" means here. Also, update for change in preprocess function name

##' @param date In quotes, if SDA site run, format is yyyy/mm/dd, if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'data'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update to be a Date

##' @param C_pool In quotes, carbon pool of interest. Name must match carbon pool name found within file supplied to 'data'.
##' @param covariates SpatRaster stack, used as predictors in randomForest. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder
##' @param covariates_path SpatRaster stack, used as predictors in randomForest. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder
##' @details This function will downscale forecast data to unmodeled locations using covariates and site locations
##'
##' @description This function uses the randomForest model.
##' @description This function uses the Convolutional Neural Network(CNN) model.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per verbal discussion, it would be nice to support both models, not replace one with the other. Could we add an argument that lets you specify which one you're running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed issue in commit ce4a597

##'
##' @return It returns the `downscale_output` list containing lists for the training and testing data sets, models, and predicted maps for each ensemble member.


NA_downscale <- function(data, coords, date, C_pool, covariates){
NA_downscale <- function(preprocessed, date, C_pool, covariates_path) {
mdietze marked this conversation as resolved.
Show resolved Hide resolved

# Read the input data and site coordinates
input_data <- readRDS(data)
site_coordinates <- terra::vect(readr::read_csv(coords), geom=c("lon", "lat"), crs="EPSG:4326")
input_data <- preprocessed$input_data
sambhavnoobcoder marked this conversation as resolved.
Show resolved Hide resolved
site_coordinates <- preprocessed$site_coordinates
sambhavnoobcoder marked this conversation as resolved.
Show resolved Hide resolved
carbon_data <- preprocessed$carbon_data
mdietze marked this conversation as resolved.
Show resolved Hide resolved

# Extract the carbon data for the specified focus year
index <- which(names(input_data) == date)
data <- input_data[[index]]
carbon_data <- as.data.frame(t(data[which(names(data) == C_pool)]))
names(carbon_data) <- paste0("ensemble",seq(1:ncol(carbon_data)))
# Convert site coordinates to SpatVector
site_coordinates <- terra::vect(site_coordinates, geom = c("lon", "lat"), crs = "EPSG:4326")
sambhavnoobcoder marked this conversation as resolved.
Show resolved Hide resolved

# Load the covariates raster stack
covariates <- rast(covariates_path)
mdietze marked this conversation as resolved.
Show resolved Hide resolved

# Extract predictors from covariates raster using site coordinates
predictors <- as.data.frame(terra::extract(covariates, site_coordinates,ID = FALSE))
predictors <- as.data.frame(terra::extract(covariates, site_coordinates, ID = FALSE))

# Combine each ensemble member with all predictors
ensembles <- list()
Expand All @@ -37,52 +94,249 @@ NA_downscale <- function(data, coords, date, C_pool, covariates){

# Rename the carbon_data column for each ensemble member
for (i in 1:length(ensembles)) {
ensembles[[i]] <- dplyr::rename(ensembles[[i]], "carbon_data" = "carbon_data[[i]]")
colnames(ensembles[[i]])[1] <- "carbon_data"
mdietze marked this conversation as resolved.
Show resolved Hide resolved
}

# Split the observations in each data frame into two data frames based on the proportion of 3/4
ensembles <- lapply(ensembles, function(df) {
sample <- sample(1:nrow(df), size = round(0.75*nrow(df)))
train <- df[sample, ]
test <- df[-sample, ]
split_list <- list(train, test)
sample <- sample(1:nrow(df), size = round(0.75 * nrow(df)))
train <- df[sample, ]
test <- df[-sample, ]
split_list <- list(training = train, testing = test)
return(split_list)
})

# Rename the training and testing data frames for each ensemble member
# Train a CNN model for each ensemble member using the training data
cnn_output <- list()
for (i in 1:length(ensembles)) {
# names(ensembles) <- paste0("ensemble",seq(1:length(ensembles)))
names(ensembles[[i]]) <- c("training", "testing")
# Prepare data for CNN
x_train <- as.matrix(ensembles[[i]]$training[, c("tavg", "prec", "srad", "vapr")])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than hard coding the covariates in the model, can you detect them from the covariates object?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since the x data is identical for all ensemble members, is there really a need to make replicate copies for each ensemble member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed the detection of covariates automatically from the covariates object in commit 6acfd74 . the commit also addresses the issue of replicate copies for each ensemble member by creating a full_data frame for all the data .

y_train <- as.matrix(ensembles[[i]]$training$carbon_data)
x_test <- as.matrix(ensembles[[i]]$testing[, c("tavg", "prec", "srad", "vapr")])
y_test <- as.matrix(ensembles[[i]]$testing$carbon_data)

# Normalize the data
x_train <- scale(x_train)
x_test <- scale(x_test)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

train, test, and prediction data can't be rescaled separately otherwise they end up having different meanings. Whenever you standardize data you have to use the same constants (mean and sd) everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed rescaling in ecd5aa1 commit .


# Reshape data for CNN input (samples, timesteps, features)
x_train <- array_reshape(x_train, c(nrow(x_train), 1, ncol(x_train)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-base functions need an explicit namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in commit f2bab83

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow that commit isn't part of this code, as namespaces are still an issue

x_test <- array_reshape(x_test, c(nrow(x_test), 1, ncol(x_test)))

# Define the CNN model
model <- keras_model_sequential() %>%
mdietze marked this conversation as resolved.
Show resolved Hide resolved
layer_conv_1d(filters = 64, kernel_size = 1, activation = 'relu', input_shape = c(1, 4)) %>%
layer_flatten() %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 1)

# Compile the model
model %>% compile(
loss = 'mean_squared_error',
optimizer = optimizer_adam(),
metrics = c('mean_absolute_error')
)

# Train the model
model %>% fit(
x = x_train,
y = y_train,
epochs = 100,
batch_size = 32,
validation_split = 0.2,
verbose = 0
)

cnn_output[[i]] <- model
}

# Train a random forest model for each ensemble member using the training data
rf_output <- list()
for (i in 1:length(ensembles)) {
rf_output[[i]] <- randomForest::randomForest(ensembles[[i]][[1]][["carbon_data"]] ~ land_cover+tavg+prec+srad+vapr+nitrogen+phh2o+soc+sand,
data = ensembles[[i]][[1]],
ntree = 1000,
na.action = stats::na.omit,
keep.forest = T,
importance = T)
# Wrapper function to apply the trained model
predict_with_model <- function(model, data) {
data <- as.matrix(data[, c("tavg", "prec", "srad", "vapr")])
data <- scale(data)
mdietze marked this conversation as resolved.
Show resolved Hide resolved
data <- array_reshape(data, c(nrow(data), 1, ncol(data)))
predictions <- predict(model, data)
return(predictions)
}

# Generate predictions (maps) for each ensemble member using the trained models
maps <- list(ncol(rf_output))
for (i in 1:length(rf_output)) {
maps[[i]] <- terra::predict(object = covariates,
mdietze marked this conversation as resolved.
Show resolved Hide resolved
model = rf_output[[i]],na.rm = T)
maps <- list()
predictions <- list()
for (i in 1:length(cnn_output)) {
# Prepare data for prediction
x_pred <- as.matrix(predictors[, c("tavg", "prec", "srad", "vapr")])
x_pred <- scale(x_pred)
mdietze marked this conversation as resolved.
Show resolved Hide resolved
x_pred <- array_reshape(x_pred, c(nrow(x_pred), 1, ncol(x_pred)))

map_pred <- predict_with_model(cnn_output[[i]], as.data.frame(covariates[]))
map_pred <- rast(matrix(map_pred, nrow = nrow(covariates), ncol = ncol(covariates)), ext = ext(covariates), crs = crs(covariates))
maps[[i]] <- map_pred

# Generate predictions for testing data
predictions[[i]] <- predict_with_model(cnn_output[[i]], ensembles[[i]]$testing)
}

# Calculate performance metrics for each ensemble member
mdietze marked this conversation as resolved.
Show resolved Hide resolved
metrics <- list()
for (i in 1:length(predictions)) {
actual <- ensembles[[i]]$testing$carbon_data
predicted <- predictions[[i]]
mse <- mean((actual - predicted)^2)
mae <- mean(abs(actual - predicted))
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
metrics[[i]] <- list(MSE = mse, MAE = mae, R_squared = r_squared, actual = actual, predicted = predicted)
}

# Organize the results into a single output list
downscale_output <- list(ensembles, rf_output, maps)
downscale_output <- list(data = ensembles, models = cnn_output, maps = maps, metrics = metrics)

# Rename each element of the output list with appropriate ensemble numbers
for (i in 1:length(downscale_output)) {
names(downscale_output[[i]]) <- paste0("ensemble",seq(1:length(downscale_output[[i]])))
for (i in 1:length(downscale_output$data)) {
names(downscale_output$data)[i] <- paste0("ensemble", i)
names(downscale_output$models)[i] <- paste0("ensemble", i)
names(downscale_output$maps)[i] <- paste0("ensemble", i)
names(downscale_output$metrics)[i] <- paste0("ensemble", i)
}

# Rename the main components of the output list
names(downscale_output) <- c("data", "models", "maps")

return(downscale_output)
}

# Define file paths for the data
mdietze marked this conversation as resolved.
Show resolved Hide resolved
data_path <- " " # Replace with the actual path to your data file
coords_path <- " " # Replace with the actual path to your coordinates file
covariates_path <- " " # Replace with the actual path to your .tiff file

# Define parameters
date <- " " # Replace with the actual date you want to use
C_pool <- " " # Replace with the actual carbon pool name you want to use

# Preprocess the data
preprocessed_data <- preprocess(data_path, coords_path, date, C_pool)

# Run the NA_downscale function
result <- NA_downscale(preprocessed_data, date, C_pool, covariates_path)

# Print the result
print(result)

# Print the accuracy metrics
print("Accuracy Metrics for Each Ensemble:")
for (i in seq_along(result$metrics)) {
cat(paste0("Ensemble ", i, ":"))
print(result$metrics[[i]])
}

# Function to create a Taylor diagram
TaylorDiagram <- function(taylor_data, ...) {
# Convert data to the format required by the plot
taylor_data$color <- as.factor(taylor_data$Ensemble)

plot <- ggplot(taylor_data, aes(x = Modelled, y = Observed, color = color)) +
geom_point(size = 3) +
scale_color_manual(values = rainbow(length(unique(taylor_data$color)))) +
coord_fixed() +
theme_minimal() +
labs(title = "Taylor Diagram for Ensemble Members",
x = "Standard Deviation (normalised)",
y = "Standard Deviation (normalised)",
color = "Model") +
theme(legend.position = "right")

return(plot)
}

# Prepare data for Taylor diagram
taylor_data <- data.frame()

for (i in seq_along(result$metrics)) {
ensemble_name <- names(result$metrics)[i]
actual <- result$metrics[[i]]$actual
predicted <- result$metrics[[i]]$predicted

# Calculate required statistics
obs_sd <- sd(actual)
mod_sd <- sd(predicted)
correlation <- cor(actual, predicted)

# Normalize standard deviations
norm_obs_sd <- obs_sd / max(obs_sd, mod_sd)
norm_mod_sd <- mod_sd / max(obs_sd, mod_sd)

# Add to the data frame
taylor_data <- rbind(taylor_data, data.frame(
Ensemble = ensemble_name,
Observed = norm_obs_sd,
Modelled = norm_mod_sd,
r = correlation
))
}

# Create and display Taylor diagram
taylor_plot <- TaylorDiagram(taylor_data)
print(taylor_plot)

# Prepare metrics data for line plot with multiple y-axes
metrics_df <- do.call(rbind, lapply(seq_along(result$metrics), function(i) {
data.frame(Ensemble = paste0("ensemble", i),
MSE = result$metrics[[i]]$MSE,
MAE = result$metrics[[i]]$MAE,
R_squared = result$metrics[[i]]$R_squared)
}))

# Reshape data for ggplot
metrics_melted <- reshape2::melt(metrics_df, id.vars = "Ensemble")

# Create a line plot with multiple y-axes using ggplot2
p1 <- ggplot(metrics_melted[metrics_melted$variable %in% c("MSE", "MAE"), ], aes(x = Ensemble, y = value, color = variable, group = variable)) +
geom_line() +
geom_point() +
scale_y_continuous(name = "MSE and MAE") +
labs(title = "Performance Metrics for Each Ensemble",
x = "Ensemble",
color = "Metrics") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(metrics_melted[metrics_melted$variable == "R_squared", ], aes(x = Ensemble, y = value, color = variable, group = variable)) +
geom_line() +
geom_point() +
scale_y_continuous(name = "R_squared") +
labs(x = "Ensemble",
color = "Metrics") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Combine the plots
grid.arrange(p1, p2, ncol = 1)

# Scatter plot to compare actual and predicted values for each ensemble with one random instance
set.seed(123) # For reproducibility
sampled_data <- do.call(rbind, lapply(seq_along(result$metrics), function(i) {
ensemble_name <- names(result$metrics)[i]
actual <- result$metrics[[i]]$actual
predicted <- result$metrics[[i]]$predicted

# Sample one random instance
sample_index <- sample(1:length(actual), 1)
actual_sample <- actual[sample_index]
predicted_sample <- predicted[sample_index]

data.frame(Ensemble = ensemble_name, Actual = actual_sample, Predicted = predicted_sample)
}))

# Create scatter plot with lines connecting actual and predicted values
p3 <- ggplot(sampled_data, aes(x = Actual, y = Predicted, color = Ensemble)) +
geom_point(aes(x = Actual, y = Actual, shape = "Actual"), size = 3) + # Circle for actual values
geom_point(aes(x = Predicted, y = Predicted, shape = "Predicted"), size = 3) + # Square for predicted values
geom_segment(aes(x = Actual, y = Actual, xend = Actual, yend = Predicted), linetype = "dotted") + # Connect actual to predicted
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
geom_smooth(method = "lm", linetype = "solid", se = FALSE, color = "blue") + # Regression line
labs(title = "Actual vs. Predicted Scatter Plot for Random Samples",
x = "Actual",
y = "Predicted",
color = "Ensemble",
shape = "Type") +
scale_shape_manual(values = c("Actual" = 16, "Predicted" = 22)) + # Define shapes
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Display scatter plot
print(p3)