improve data_prep.R file loading approach with config file to specify… #39

31 changes: 27 additions & 4 deletions data-raw/prep_data.R
@@ -1,27 +1,50 @@
# Load required libraries
library(tidyverse)
library(here)
library(yaml)

# Define constants
RAW_DATA_DIR <- here("data-raw")
OUTPUT_DIR <- here("data")

config_path <- file.path(RAW_DATA_DIR, "config.yml")
all_config <- yaml.load_file(config_path)

# Function to process data for a single ecoregion
process_ecoregion_data <- function(ecoregion) {
process_ecoregion_data <- function(ecoregion, config) {
# Read raw data files for the ecoregion
annex_data <- read.csv(file = paste0(RAW_DATA_DIR, "/", ecoregion, "/annex_table.csv"))

config <- all_config[[ecoregion]]

if (is.null(config)) {
stop(paste("Configuration for ecoregion", ecoregion, "not found"))
}

# Read raw data files for the ecoregion based on config
data_list <- lapply(names(config$files), function(file_key) {
file_path <- file.path(RAW_DATA_DIR, ecoregion, config$files[[file_key]])
if (!file.exists(file_path)) {

suggestion: Enhance error handling and logging for missing data files

Consider adding more detailed logging that states which ecoregion and file type are missing. This would help with debugging and with verifying data completeness.

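if (!file.exists(file_path)) {
  # Name the ecoregion and file key so a missing file can be traced back to the config
  log_message <- sprintf("File not found for ecoregion '%s', file type '%s': %s",
                         ecoregion, file_key, file_path)
  warning(log_message)
  logging::logwarn(log_message)
  return(NULL)
}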

warning(paste("File not found:", file_path))
return(NULL)
}
read.csv(file_path)
})
names(data_list) <- names(config$files)

# Remove any NULL entries (files that weren't found)
data_list <- data_list[!sapply(data_list, is.null)]

# Process the data

# Return a list of topic data frames
list(stock_annex_table = annex_data)
return(data_list)
}
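
As a rough illustration of the file-reading loop above combined with the reviewer's logging suggestion, the following sketch uses base R only. The helper name `read_configured_files`, the `north_sea` directory, and the file names are hypothetical and chosen for the example; they are not part of this PR.

```r
# Hypothetical helper: read every file listed in one ecoregion's config entry,
# warning once per missing file and dropping it from the result.
read_configured_files <- function(base_dir, ecoregion, files) {
  data_list <- lapply(names(files), function(file_key) {
    file_path <- file.path(base_dir, ecoregion, files[[file_key]])
    if (!file.exists(file_path)) {
      warning(sprintf("File not found for ecoregion '%s', file type '%s': %s",
                      ecoregion, file_key, file_path), call. = FALSE)
      return(NULL)
    }
    read.csv(file_path)
  })
  names(data_list) <- names(files)
  # Drop entries for files that were not found
  data_list[!vapply(data_list, is.null, logical(1))]
}

# Usage with a throwaway directory (assumed layout: <base_dir>/<ecoregion>/<file>)
base_dir <- tempfile("data-raw")
dir.create(file.path(base_dir, "north_sea"), recursive = TRUE)
write.csv(data.frame(stock = c("cod", "haddock")),
          file.path(base_dir, "north_sea", "annex_table.csv"),
          row.names = FALSE)

# "catch.csv" is deliberately absent to exercise the missing-file branch
files <- list(annex = "annex_table.csv", catch = "catch.csv")
result <- suppressWarnings(read_configured_files(base_dir, "north_sea", files))
```

Using `vapply(..., logical(1))` instead of `sapply` keeps the NULL filter type-stable when the config lists no files at all.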

# Get list of ecoregions (assuming each subdirectory in RAW_DATA_DIR is an ecoregion)
ecoregions <- list.dirs(RAW_DATA_DIR, full.names = FALSE, recursive = FALSE)

# Process data for each ecoregion
all_data <- map(ecoregions, process_ecoregion_data)
all_data <- map(ecoregions, process_ecoregion_data, all_config)

issue (bug_risk): Update function call to match new process_ecoregion_data signature

The process_ecoregion_data function now expects two arguments, but it's being called with only one. This will cause an error. Update the map call to pass both ecoregions and all_config.
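
One way to keep the call and the signature aligned is sketched below with base R's `lapply`, which forwards trailing arguments to each call in the same way as `purrr::map`. The config contents and ecoregion names are made up for illustration, and the second parameter is named `all_config` here (an assumption, not the PR's naming) so that the function does not shadow its own argument.

```r
# Hypothetical stand-in for what yaml.load_file(config_path) would return
all_config <- list(
  north_sea = list(files = list(annex = "annex_table.csv")),
  baltic    = list(files = list(annex = "annex_table.csv", catch = "catch.csv"))
)

# Receives the full config and looks up its own ecoregion entry
process_ecoregion_data <- function(ecoregion, all_config) {
  config <- all_config[[ecoregion]]
  if (is.null(config)) {
    stop(paste("Configuration for ecoregion", ecoregion, "not found"))
  }
  # Placeholder for the real work: return the configured file keys
  names(config$files)
}

ecoregions <- c("north_sea", "baltic")

# Arguments after the function are passed on to every call;
# purrr::map(ecoregions, process_ecoregion_data, all_config) behaves the same way
all_data <- lapply(ecoregions, process_ecoregion_data, all_config)
names(all_data) <- ecoregions
```

Passing the config explicitly rather than reading a global inside the function makes it possible to test the function with a small hand-built list like the one above.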

names(all_data) <- ecoregions

# Save processed data