
#GSOC PR : Add Preprocess Function for Data Cleaning and Validation #3321

Merged · 99 commits into PecanProject:develop · Aug 2, 2024

Conversation

sambhavnoobcoder (Contributor)

Description:

This PR introduces a new preprocess function designed to streamline the data cleaning and validation process. This function reads input data and site coordinates, validates the presence of specified date and carbon pool, and ensures the consistency of data dimensions. It outputs a structured list containing the cleaned data, ready for further analysis. Below is an extended description of the new function and its components.

Motivation and Context

Function: preprocess

Purpose:

The preprocess function is created to read and validate input data and site coordinates, ensuring that the data is correctly formatted and consistent for further processing. It handles potential inconsistencies in the data, providing informative messages and adjustments where necessary.

Parameters:

data_path: Path to the RDS file containing the input data.
coords_path: Path to the CSV file containing site coordinates.
date: The specific date for which the carbon data is to be extracted.
C_pool: The specific carbon pool within the input data to focus on.

Process:

Reading Data:

Reads the input data from the provided RDS file.
Reads the site coordinates from the provided CSV file.

Validation:

Checks if the specified date exists in the input data. If not, the function stops with an error message.
Extracts the carbon data for the specified date and validates the existence of the specified carbon pool. If the carbon pool is not found, the function stops with an error message.

Data Transformation:

Transposes the extracted carbon data to a data frame format, ensuring each column represents an ensemble.
Renames the columns to a consistent naming convention (e.g., "ensemble1", "ensemble2", etc.).

Coordinate Validation:

Ensures that the site coordinates data contains 'lon' and 'lat' columns. If these columns are missing, the function stops and returns an error message.

Data Consistency:

Validates that the number of rows in the site coordinates matches the number of rows in the carbon data.
If there is a mismatch in the number of rows, the function truncates either the site coordinates or the carbon data to match the row counts, ensuring consistency.

Output:

The function returns a list containing:

input_data: The original input data read from the RDS file.
site_coordinates: The validated and possibly truncated site coordinates.
carbon_data: The validated and possibly truncated carbon data.

Messages:
The function provides informative messages during the preprocessing steps, alerting the user to any adjustments made to the data to ensure consistency.

Example Usage:

```r
preprocessed_data <- preprocess("path/to/input_data.rds", "path/to/site_coords.csv", "2022-01-01", "TotalCarbon")
```
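For context, a minimal sketch of the validation and truncation logic described above could look like the following. This is an illustration, not the merged implementation; the internal structure of input_data (a named list keyed by date, with one matrix per carbon pool) is an assumption.

```r
# Sketch only: the merged function (later renamed SDA_downscale_preprocess)
# lives in modules/assim.sequential/R/downscale_function.R
preprocess_sketch <- function(data_path, coords_path, date, C_pool) {
  input_data <- readRDS(data_path)
  site_coordinates <- readr::read_csv(coords_path)

  if (!date %in% names(input_data)) {
    stop("Date '", date, "' not found in input data.")
  }
  carbon_entry <- input_data[[date]]
  if (!C_pool %in% names(carbon_entry)) {
    stop("Carbon pool '", C_pool, "' not found for date '", date, "'.")
  }

  # Transpose so each column is one ensemble member, then rename
  carbon_data <- as.data.frame(t(carbon_entry[[C_pool]]))
  names(carbon_data) <- paste0("ensemble", seq_len(ncol(carbon_data)))

  if (!all(c("lon", "lat") %in% names(site_coordinates))) {
    stop("Site coordinates must contain 'lon' and 'lat' columns.")
  }

  # On a row-count mismatch, truncate to the shorter of the two
  if (nrow(site_coordinates) != nrow(carbon_data)) {
    n <- min(nrow(site_coordinates), nrow(carbon_data))
    message("Row count mismatch; truncating to ", n, " rows.")
    site_coordinates <- site_coordinates[seq_len(n), ]
    carbon_data <- carbon_data[seq_len(n), ]
  }

  list(input_data       = input_data,
       site_coordinates = site_coordinates,
       carbon_data      = carbon_data)
}
```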

Benefits:

Efficiency: Streamlines the data preparation process, reducing manual validation and transformation steps.
Error Handling: Provides clear error messages and handles common data issues, improving robustness.
Consistency: Ensures consistent data formats and dimensions, facilitating further analysis and modeling.

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Description

Renaming to NA_preprocess:

The function has been renamed from preprocess to NA_preprocess to better reflect its specific role in the North America downscaling process. This change ensures that the function name is consistent with the existing NA_downscale function, promoting a more organized and intuitive codebase.
Addition of Roxygen Documentation:

Comprehensive Roxygen documentation has been added to the NA_preprocess function. This documentation includes the following sections:

Title and Name:

Provides a clear title and the function's name for easy identification.
Description:

Offers a brief overview of the function's purpose and functionality.
Parameters:

Details each parameter, including data_path, coords_path, date, and C_pool, describing their expected input types and roles within the function.
Details:

Explains the specific tasks performed by the function, such as reading input data, validating the date and carbon pool, and ensuring the consistency of site coordinates.
Return Value:

Describes the structure and contents of the list returned by the function, which includes the input data, cleaned site coordinates, and extracted carbon data.
Added the author's name to the NA_preprocess function and slightly restructured its roxygen documentation to keep it consistent with the structure used throughout the repository and existing code.
@mdietze (Member) commented Jul 8, 2024

Requires a `make document` and a commit of the man file.

The existing code featured a random forest model used as the predictive mechanism for the NA_downscale function. This is now replaced by a basic CNN in the downscale function. Further commits will aim to optimise the performance of this model, while also assisting with measurement of its performance and its visualisation.
This commit initialises paths to all the necessary files and variables needed by both functions, then systematically passes them first to the NA_preprocess function and its outputs to the NA_downscale function, and finally prints the results.
This commit adds code that prints the accuracy metrics for each ensemble in the data passed to the model for prediction.
This commit prepares the metrics data from our ensemble model results for visualization in a line plot with multiple y-axes. The main changes are:

1. Data Transformation:
   - Convert the list of metrics for each ensemble into a single dataframe
   - Each row represents one ensemble's metrics
   - Columns include: Ensemble identifier, MSE, MAE, and R-squared

2. Data Reshaping:
   - Melt the dataframe to long format for easier plotting with ggplot2
   - This creates a tidy dataset with columns: Ensemble, variable, value

3. Key Steps:
   - Use lapply() with do.call(rbind, ...) to efficiently combine metrics
   - Create ensemble identifiers (ensemble1, ensemble2, etc.)
   - Use reshape2::melt() for data reshaping

4. Preparation for Visualization:
   - The resulting 'metrics_melted' dataframe is now ready for use with ggplot2
   - This format allows for easy creation of a multi-line plot with separate y-axes for different metrics
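Under the assumption that the per-ensemble metrics are collected in a list (here called metrics_list, a hypothetical name, with MSE, MAE, and R_squared fields), the combination and reshaping steps might look like this:

```r
# Hypothetical stand-in for the collected metrics (one list element per ensemble)
metrics_list <- list(
  list(MSE = 0.42, MAE = 0.51, R_squared = 0.88),
  list(MSE = 0.39, MAE = 0.47, R_squared = 0.90),
  list(MSE = 0.45, MAE = 0.55, R_squared = 0.86)
)

# Combine into one dataframe: one row per ensemble
metrics_df <- do.call(rbind, lapply(seq_along(metrics_list), function(i) {
  data.frame(Ensemble  = paste0("ensemble", i),
             MSE       = metrics_list[[i]]$MSE,
             MAE       = metrics_list[[i]]$MAE,
             R_squared = metrics_list[[i]]$R_squared)
}))

# Long format for ggplot2: columns Ensemble, variable, value
metrics_melted <- reshape2::melt(metrics_df, id.vars = "Ensemble")
```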
This commit implements a line plot using ggplot2 to visualize the performance metrics (MSE and MAE) across different ensembles. Key aspects of this visualisation include:

1. Data Selection:
   - Filters the melted metrics data to include only MSE and MAE
   - Excludes R-squared for this particular plot (likely due to scale differences)

2. Plot Structure:
   - Uses ggplot2 to create a line plot with points
   - X-axis represents different ensembles
   - Y-axis shows the values for MSE and MAE
   - Colors differentiate between MSE and MAE lines

3. Aesthetic Choices:
   - Implements both lines and points for clear trend visualization and specific value identification
   - Colors are automatically assigned to differentiate metrics
   - X-axis labels are rotated 45 degrees for better readability, especially with many ensembles

4. Labeling:
   - Title clearly states the purpose of the plot
   - X-axis labeled as "Ensemble"
   - Y-axis labeled as "MSE and MAE" to indicate the metrics shown
   - Legend title set to "Metrics" for clarity

5. Scale:
   - Uses a continuous scale for the y-axis, appropriate for MSE and MAE values
   - Single y-axis used for both metrics, assuming their scales are comparable
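A sketch of the plot described above, assuming the metrics_melted dataframe from the previous step:

```r
library(ggplot2)

# MSE and MAE only; R-squared is plotted separately (see the next commit)
p1 <- ggplot(subset(metrics_melted, variable %in% c("MSE", "MAE")),
             aes(x = Ensemble, y = value, color = variable, group = variable)) +
  geom_line() +
  geom_point() +
  labs(title = "Performance Metrics Across Ensembles",
       x = "Ensemble", y = "MSE and MAE", color = "Metrics") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p1)
```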
This commit extends our visualisation by creating a separate plot for R-squared values and combining it with the previously created MSE/MAE plot. Key aspects of this update include:

1. R-squared Plot Creation:
   - Filters the melted metrics data to include only R-squared values
   - Creates a new ggplot object (p2) specifically for R-squared
   - Maintains consistent styling with the MSE/MAE plot for visual coherence

2. Plot Structure:
   - Uses the same x-axis (Ensemble) as the MSE/MAE plot
   - Y-axis now represents R-squared values
   - Implements both line and point geometries for clear trend visualization

3. Aesthetic Consistency:
   - Maintains 45-degree rotation for x-axis labels
   - Uses consistent color scheme (though only one metric is present)
   - Keeps the same theme as the MSE/MAE plot

4. Labeling:
   - X-axis labeled as "Ensemble" (consistent with p1)
   - Y-axis explicitly labeled as "R_squared"
   - Legend title set to "Metrics" for consistency

5. Plot Combination:
   - Uses grid.arrange() from the gridExtra package to combine p1 and p2
   - Arranges plots vertically (ncol = 1) for easy metric comparison across ensembles

6. Advantages of This Approach:
   - Separates R-squared into its own plot, addressing potential scale differences with MSE/MAE
   - Allows for easy visual comparison of all three metrics (MSE, MAE, R-squared) across ensembles
   - Maintains readability by not overcrowding a single plot with three potentially differently-scaled metrics
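A sketch of the companion R-squared plot and the vertical combination, reusing p1 from the previous sketch:

```r
# R-squared gets its own panel, then both panels are stacked vertically
p2 <- ggplot(subset(metrics_melted, variable == "R_squared"),
             aes(x = Ensemble, y = value, color = variable, group = variable)) +
  geom_line() +
  geom_point() +
  labs(x = "Ensemble", y = "R_squared", color = "Metrics") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

gridExtra::grid.arrange(p1, p2, ncol = 1)
```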
…dels

This commit introduces a scatter plot to visualize the performance of different ensemble models by comparing actual and predicted values for randomly sampled instances. Key aspects of this update include:

1. Data Preparation:
   - Uses set.seed(123) for reproducibility of random sampling
   - Samples one random instance from each ensemble's predictions
   - Creates a dataframe (sampled_data) with columns for Ensemble, Actual, and Predicted values

2. Scatter Plot Creation:
   - Utilizes ggplot2 to create a comprehensive scatter plot (p3)
   - X-axis represents Actual values, Y-axis represents Predicted values
   - Each ensemble is represented by a unique color

3. Plot Elements:
   - Points: 
     * Circular points for actual values
     * Square points for predicted values
   - Lines:
     * Dotted vertical lines connecting actual to predicted values
     * Dashed diagonal line (y=x) representing perfect prediction
     * Solid blue line showing the overall regression trend
   - Shapes are manually defined for clear differentiation between actual and predicted values

4. Aesthetics and Labeling:
   - Title clearly describes the plot's purpose
   - Axes labeled as "Actual" and "Predicted"
   - Legend includes both Ensemble (color) and Type (shape)
   - Uses a minimal theme for clean presentation
   - X-axis labels rotated 45 degrees for better readability

5. Statistical Elements:
   - Includes a regression line (geom_smooth) to show overall trend
   - Omits confidence interval (se = FALSE) for clarity

6. Visualization Insights:
   - Allows for quick assessment of each ensemble's prediction accuracy
   - Facilitates easy comparison of prediction errors across ensembles
   - The y=x line helps in identifying over- or under-predictions

7. Usage:
   - The plot is immediately displayed using print(p3)

8. Potential Future Enhancements:
  - Revise color scheme to reflect data types rather than ensembles
   - Implement a binary color system: e.g., blue for actual data, red for predicted data
   - This change aligns with the conceptual role of ensembles as data groupings rather than distinct entities
   - Binary colouring would emphasise the comparison between actual and predicted values across all ensembles
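A sketch of this plot, assuming sampled_data is a dataframe with columns Ensemble, Actual, and Predicted as described; the long-format plot_data frame is one possible way to get the circle/square point types:

```r
library(ggplot2)

set.seed(123)  # reproducible sampling, as in the commit

# Hypothetical stand-in for the sampled data described above
sampled_data <- data.frame(
  Ensemble  = paste0("ensemble", 1:5),
  Actual    = c(2.1, 3.4, 1.8, 4.0, 2.9),
  Predicted = c(2.4, 3.1, 2.2, 3.7, 3.0)
)

# Long format so actual and predicted values get their own point shapes
plot_data <- rbind(
  data.frame(Ensemble = sampled_data$Ensemble, x = sampled_data$Actual,
             y = sampled_data$Actual,    Type = "Actual"),
  data.frame(Ensemble = sampled_data$Ensemble, x = sampled_data$Actual,
             y = sampled_data$Predicted, Type = "Predicted")
)

p3 <- ggplot(plot_data, aes(x = x, y = y)) +
  # dotted vertical segments connecting actual to predicted
  geom_segment(data = sampled_data,
               aes(x = Actual, xend = Actual, y = Actual, yend = Predicted),
               linetype = "dotted", inherit.aes = FALSE) +
  geom_point(aes(color = Ensemble, shape = Type), size = 3) +
  scale_shape_manual(values = c(Actual = 16, Predicted = 15)) +  # circle / square
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +   # y = x: perfect prediction
  geom_smooth(data = sampled_data, aes(x = Actual, y = Predicted),
              method = "lm", se = FALSE, color = "blue", inherit.aes = FALSE) +
  labs(title = "Actual vs Predicted Values Across Ensembles",
       x = "Actual", y = "Predicted", shape = "Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p3)
```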
This commit introduces a Taylor Diagram to provide a concise visual summary of how well ensemble models match observations. Key aspects of this update include:

1. Taylor Diagram Function:
   - Creates a custom function 'TaylorDiagram' using ggplot2
   - Visualizes three statistical parameters: standard deviation, correlation coefficient, and centered root-mean-square difference
   - Uses a color-coded system to differentiate between ensemble models

2. Data Preparation:
   - Iterates through each ensemble's metrics
   - Calculates required statistics: observed and modelled standard deviations, and correlation
   - Normalizes standard deviations for consistent scaling across ensembles
   - Compiles data into a single dataframe (taylor_data)

3. Plot Structure:
   - X-axis represents normalized modelled standard deviation
   - Y-axis represents normalized observed standard deviation
   - Color distinguishes different ensemble models
   - Uses fixed coordinate system for proper circular representation

4. Aesthetic Choices:
   - Employs rainbow color palette for easy distinction between ensembles
   - Uses minimal theme for clean presentation
   - Places legend on the right for clear model identification

5. Labeling:
   - Title clearly states "Taylor Diagram for Ensemble Members"
   - Axes labeled as "Standard Deviation (normalised)"
   - Color legend labeled as "Model"

6. Visualization Insights:
   - Allows simultaneous comparison of multiple statistical parameters
   - Facilitates easy identification of models closest to observations
   - Provides a compact way to evaluate model performance across ensembles

7. Usage:
   - The plot is created and immediately displayed using print(taylor_plot)

8. Potential Future Enhancements:
   - Revise the color scheme of the Taylor Diagram to better reflect the nature of ensembles
   - Implement a single color for all data points, recognizing that ensembles are essentially data bins rather than distinct entities
   - This change eliminates the implication of metadata that doesn't exist, as the current multi-color scheme suggests differences between ensembles that may not be meaningful
   - Consider using shape or size variations instead of color to distinguish points if needed
   - Add alternative methods to convey ensemble-specific information, such as labels or interactive tooltips
   - This modification will simplify the visual presentation and focus attention on the statistical relationships rather than arbitrary ensemble distinctions
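A compact sketch of the custom plot described above, assuming taylor_data holds one row per ensemble with hypothetical columns sd_obs_norm, sd_mod_norm, correlation, and Model:

```r
library(ggplot2)

# Hypothetical stand-in: normalised statistics, one row per ensemble
taylor_data <- data.frame(
  Model       = paste0("ensemble", 1:3),
  sd_obs_norm = c(1.00, 1.00, 1.00),   # observed SD normalised to itself
  sd_mod_norm = c(0.92, 1.05, 0.88),   # modelled SD / observed SD
  correlation = c(0.90, 0.86, 0.93)    # computed per the description
)

taylor_plot <- ggplot(taylor_data,
                      aes(x = sd_mod_norm, y = sd_obs_norm, color = Model)) +
  geom_point(size = 3) +
  coord_fixed() +                                          # fixed aspect ratio
  scale_color_manual(values = rainbow(nrow(taylor_data))) +
  labs(title = "Taylor Diagram for Ensemble Members",
       x = "Standard Deviation (normalised)",
       y = "Standard Deviation (normalised)",
       color = "Model") +
  theme_minimal() +
  theme(legend.position = "right")

print(taylor_plot)
```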
@dlebauer (Member) left a comment
I've just gone through the beginning - some of the comments / questions about design are more directed towards @PecanProject/reviewers starting with @mdietze

  1. ultimately do we want to separate out data source specific functions that generate [pecan or efi] standard netcdf files?
  2. if so, what is within the scope of the initial implementation (GSOC project) and what should wait to be motivated by future applications?

12 review threads on modules/assim.sequential/R/downscale_function.R (outdated, resolved)
Updated the .Rd file in accordance with the changes made to the original NA_downscale function, with regards to the CNN implementation in the function.
Created and committed the NA_preprocess.Rd file for the NA_preprocess function, which is used to preprocess the data and perform checks before passing anything into the NA_downscale function.
…to SDA_downscale

In line with previous discussion, there is nothing North America-specific about this code; it is based on the SDA runs. So, following the suggestions made, I've changed the NA_preprocess function to SDA_downscale_preprocess and NA_downscale to SDA_downscale. The preprocess name now also reflects that it is the preprocess step for this particular downscale.

Future scope:
This change in the code and roxygen will prompt a change in the .Rd files as well, so that should be kept in mind.
As suggested, the code in the assim.sequential/R folder should contain only functions and does not require the runner code; the runner code has therefore been refactored out and will eventually be added to the /inst folder as suggested.
Based on the recent suggestion, it appears more meaningful to replace the phrase "extracted and possibly truncated" with "preprocessed." The original wording is somewhat vague, while the latter term is more precise and appropriate for our context.
Making that change in this commit
Key changes:
- Replaced `%>%` with `|>` in the keras_model_sequential() chain
- Updated the model compilation step to use `|>`
- Modified the model fitting step to use `|>`

The SDA_downscale_preprocess function remains unchanged as it did not use any pipe operators.

This update improves code consistency and reduces dependency on external packages by leveraging R's built-in pipe operator. It also aligns the code with modern R programming practices.

Note: No functional changes were made to the algorithm itself; this is purely a syntactic update to improve code style and maintainability.
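A sketch of the updated chaining style (the layer sizes and x_train/y_train are illustrative, not the PR's actual architecture); note that the native pipe requires R >= 4.1:

```r
# x_train / y_train are hypothetical, pre-scaled training arrays
model <- keras::keras_model_sequential() |>
  keras::layer_dense(units = 64, activation = "relu",
                     input_shape = ncol(x_train)) |>
  keras::layer_dense(units = 1)

model |> keras::compile(
  optimizer = keras::optimizer_adam(),
  loss = "mean_squared_error"
)

model |> keras::fit(x_train, y_train, epochs = 100, verbose = 0)
```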
This commit improves code clarity and reduces potential naming conflicts
by adding explicit namespaces to all non-base R functions used in the
SDA_downscale_preprocess and SDA_downscale functions.
Key changes:
- Updated CSV reading to use readr::read_csv()
- Added terra:: namespace to all terra package functions
- Added keras:: namespace to all keras package functions
- Ensures correct function calls from respective packages
- Improves code maintainability and reduces risk of conflicts with other loaded packages
- No functional changes to the code logic or behavior
- Enhances reproducibility by making package dependencies more explicit
Key changes in this commit:

- Replace fixed "carbon_data" column with dynamic naming (paste0(C_pool, "_ens", i))
- Update CNN model training to use specific carbon pool column names
- Modify metrics calculation to use dynamic column names for each ensemble
- Adjust ensemble processing to handle variable carbon pool names

This change allows for more flexible handling of different carbon pools.
Core functionality remains the same, but now supports multiple named carbon pools.
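A small illustration of the naming scheme with stand-in data:

```r
# Stand-in dataframe with three ensemble columns
carbon_data <- data.frame(V1 = 1:2, V2 = 3:4, V3 = 5:6)
C_pool <- "TotalCarbon"

names(carbon_data) <- paste0(C_pool, "_ens", seq_len(ncol(carbon_data)))
names(carbon_data)
#> [1] "TotalCarbon_ens1" "TotalCarbon_ens2" "TotalCarbon_ens3"
```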
This commit addresses the issue of separate scaling for training and testing
data, which could lead to inconsistent data representations. The changes
allow for scaling all data together before splitting into train and test
sets, fixing the original query about rescaling data separately.

Extended description:
- Implemented a single scaling operation for all predictor data before
  splitting into train and test sets.
- Use the same scaling parameters across all ensembles to ensure consistency.
- Apply scaling to each ensemble using the global scaling parameters.
- Modified the prediction process to use the same scaling parameters for
  new data.
- Simplified the predict_with_model function as data is now pre-scaled.

These changes ensure that all data (training, testing, and prediction) are
scaled using the same parameters, addressing the potential issue of different
scaling across different subsets of the data. This approach maintains data
consistency throughout the model training and prediction pipeline, leading
to more reliable and comparable results across all stages of the process.
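A sketch of the single-scaling-pass idea using base R's scale(); the predictors and new_data matrices here are hypothetical stand-ins for the actual covariates:

```r
# Hypothetical predictor matrix (rows = sites, cols = covariates)
predictors <- matrix(rnorm(200), nrow = 50, ncol = 4,
                     dimnames = list(NULL, paste0("cov", 1:4)))

# One global scaling pass before any splitting
predictors_scaled <- scale(predictors)
center <- attr(predictors_scaled, "scaled:center")
stdev  <- attr(predictors_scaled, "scaled:scale")

# Train/test split happens only after scaling
train_idx <- sample(seq_len(nrow(predictors_scaled)),
                    size = floor(0.8 * nrow(predictors_scaled)))
x_train <- predictors_scaled[train_idx, ]
x_test  <- predictors_scaled[-train_idx, ]

# At prediction time, new data is transformed with the same stored parameters
new_data   <- matrix(rnorm(20), nrow = 5, ncol = 4)
new_scaled <- scale(new_data, center = center, scale = stdev)
```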
This commit enhances the flexibility and robustness of date handling
in the SDA_downscale_preprocess function. It addresses the following
issues and queries:

1. Flexible date format handling:
   - Converts input_data names to a standard YYYY-MM-DD format using
     lubridate::ymd(), allowing for various input date formats.
   - Standardizes the input 'date' parameter to the same format.

2. Preservation of non-date names:
   - Uses ifelse() to only convert valid date names, leaving non-date
     names in input_data unchanged.
   - This addresses the concern: "If input_data has names that aren't
     dates, we may not want to overwrite them".

3. Consistent date comparison:
   - Uses the standardized date format for checking and extracting data,
     ensuring consistency regardless of the original input format.

4. Error handling:
   - Maintains existing error checks for date existence and carbon pool
     presence, now using the standardized date format.

These changes make the function more versatile, allowing it to handle
both "yyyy/mm/dd" and "yyyy-mm-dd" formats (among others) while
preserving the integrity of non-date data in the input.
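A sketch of the standardisation logic, assuming input_data is a named list keyed by date strings in mixed formats:

```r
# Hypothetical input: date names in mixed formats, plus one non-date name
input_data <- list("2022/01/01" = 1, "2022-02-01" = 2, "metadata" = 3)

raw_names <- names(input_data)
parsed    <- suppressWarnings(lubridate::ymd(raw_names))  # non-dates parse to NA

# Only overwrite names that parsed as dates; keep non-date names intact
names(input_data) <- ifelse(is.na(parsed), raw_names, as.character(parsed))

# Standardise the requested date the same way before comparison
date <- as.character(lubridate::ymd("2022/01/01"))  # -> "2022-01-01"
if (!date %in% names(input_data)) {
  stop("Date '", date, "' not found in input data.")
}
```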
Since the code passed 11 of 13 GitHub Actions with the old NAMESPACE and pecan_package_dependencies, perhaps reverting to them one by one would help figure out the error here. Starting by reverting to the existing version of NAMESPACE.
Based on my past observation, I tried changing NAMESPACE to its original format in hopes of debugging it that way, but the actions still fail. Reverting this back should, in theory at least, pass 11 actions, since in the earlier observation the old NAMESPACE gave 11 passes and 2 failures. If this fails:
- it won't be clear what the reason for the error is
- it indicates some change between the initial run and this one, which needs to be reverted
- if no change is observed, a differential change can also help detect the solution
Reversion before the keras3 update. If this doesn't work, I'll add it to Imports instead of Suggests, hoping for improvements, and then follow up with step-by-step changes to NAMESPACE and dependencies as well.
Brought back the roxygen version in accordance with PEcAn support, hoping this gets the mass of tests to pass.
Revert to the last version of the PR's dependencies on the project.
2 review threads on modules/assim.sequential/DESCRIPTION (outdated, resolved)
sambhavnoobcoder and others added 8 commits July 29, 2024 21:16
Added keras3 to Suggests, as described in discussion, moving back to the base suggestions implemented one by one.
updated with correct date and version number
updated NAMESPACE after the updated DESCRIPTION
updated the version number of roxygen
regenerated NAMESPACE after the DESCRIPTION change to roxygen 7.3.1
2 review threads on modules/assim.sequential/DESCRIPTION (outdated, resolved)
@mdietze merged commit 43a975b into PecanProject:develop Aug 2, 2024
12 of 13 checks passed
@sambhavnoobcoder changed the title from "Add Preprocess Function for Data Cleaning and Validation" to "#GSOC PR : Add Preprocess Function for Data Cleaning and Validation" on Aug 25, 2024