Edits to spatial train/val/test, additional performance metrics #211
Conversation
jds485 commented Mar 10, 2023 (edited)
- I found that validation segments were appearing in the test set, and training segments were appearing in validation. I edited some if/else statements to ensure that only test segments are in testing and only validation segments are in validation. I'm not sure if the previous implementation was as intended or a bug.
- I added bi-weekly, monthly, and yearly performance metrics, and a by-year summary of daily data. I also added log-KGE and a recommendation that 21 be the minimum number of observations needed to compute a performance metric; for the 10th and 90th percentile metrics, that gives 2 samples each.
- I added args to preprocessing to filter the x_data by an earliest and/or latest date. With these arguments, the xarray of data can contain NAs/NaNs before or after these dates, respectively. I think this is useful when there are lagged attributes that would have NAs in the lags before the first non-NA point (see the sketch below).
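As a rough illustration of that date filtering (the helper and argument names here are mine for illustration, not necessarily what the PR uses):

```python
def filter_x_dates(x_data, time_idx_name="date", earliest_date=None, latest_date=None):
    """Trim an xarray Dataset to [earliest_date, latest_date] along its time index.

    Either bound can be None to leave that side open. This is the kind of trim that
    drops rows of NaNs introduced by lagged attributes before the first non-NA point.
    """
    if earliest_date is not None or latest_date is not None:
        x_data = x_data.sel({time_idx_name: slice(earliest_date, latest_date)})
    return x_data
```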
…dation sites are both specified and are the same
…nd a by-year summary for daily data
I left some comments throughout this - a couple for clarification, one where I think your suggested updates would change the functionality of the current code, and at least 1 or 2 places where I suggested getting broader input on how others are using the code. Let me know if you want to discuss any of these further.
river_dl/evaluate.py
Outdated
data = data[~data[spatial_idx_name].isin(train_sites)]
if val_sites and partition=='tst':
    data = data[~data[spatial_idx_name].isin(val_sites)]
if train_sites and partition == 'trn':
I agree that there is an omission in the original code and that the following code should be added to correct that:
if train_sites and partition=='val':
    data = data[~data[spatial_idx_name].isin(train_sites)]
If I'm understanding them correctly, I think the suggested edits change the functionality of this section. As I read the original code, it appears that train_sites (as well as test_sites and val_sites) were sites that only appeared in that partition, but they weren't necessarily the only sites in that partition. In the revised code, it appears that if train_sites is specified, it will use only those sites in the training evaluations (and remove those sites from the test and validation partition).
If the intention is to change the functionality of train_sites, etc., then it's probably good to have a broader conversation. I am not using that option in my projects currently, but I don't know if others are.
As I read the original code, it appears that train_sites (as well as test_sites and val_sites) were sites that only appeared in that partition, but they weren't necessarily the only sites in that partition.
Yes, I agree. I wasn't sure if this was intentional. I could add another explicit_spatial_split parameter here to allow for the previous functionality when False and this new functionality when True. I'll hold off on that until I receive feedback from others.
Sounds good.
Oh wow. Yeah. This was definitely a bug! 😮 Thanks for catching this.
@janetrbarclay: in a comment below, Jeff suggests we assume that sites within the train/val/test are the only sites in those partitions. That's also what I would expect. Do you know of anyone who is relying on the previous method wherein sites that are not within train_sites/val_sites/test_sites could be in the train/val/test partitions?
river_dl/evaluate.py
Outdated
.apply(calc_metrics)
.reset_index()
)
elif group == "biweekly":
I think the addition of biweekly and yearly options for the metrics is great.
If I'm reading this correctly, "biweekly", "monthly", and "yearly" all also use "seg_id_nat". For consistency with the other grouped metrics, it seems good to have that included in the group list. (so group = ['seg_id_nat','biweekly'])
(and as an aside, I'm noticing we should remove the hardcoded reference to seg_id_nat and replace it with spatial_idx_name. I think it's just the 3 references in this section. Would you want to fix that in this PR since you're already editing this section?)
Also, without running it (which I haven't done), I'm not sure how monthly and yearly are different from ['seg_id_nat','month'] and ['seg_id_nat','year'] since they are both grouping on the same things.
I'm not sure how monthly and yearly are different from ['seg_id_nat','month'] and ['seg_id_nat','year'] since they are both grouping on the same things
The biweekly, monthly and yearly options are resampling the daily timeseries to those time steps by taking the sum of the data within those time periods (only for the days with observations). I'm not sure that sum is the best option and am open to other suggestions.
The resulting performance metrics are computed over all reaches, not by reach as with the ['seg_id_nat','time'] options, so I can add the group option that reports these metrics by reach
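As a rough sketch of that resample-then-score flow (the index/column names are assumptions, and calc_metrics stands in for the existing metric function in evaluate.py):

```python
import pandas as pd
from river_dl.evaluate import calc_metrics  # assuming calc_metrics is importable this way

def metrics_on_monthly_sums(data, time_idx_name="date", spatial_idx_name="seg_id_nat"):
    """Resample daily obs/pred to reach-month sums, then score across all reach-months.

    `data` is assumed to be a DataFrame with a (time, reach) MultiIndex and obs/pred columns.
    """
    monthly = (data.dropna()  # only sum the days that actually have observations
                   .groupby([pd.Grouper(level=time_idx_name, freq="M"),
                             pd.Grouper(level=spatial_idx_name)])
                   .sum())
    # one set of metrics over all reach-months (not per reach)
    return calc_metrics(monthly.reset_index())
```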
I agree with removing the "seg_id_nat" comparison to be more generic, but that will affect snakemake workflows. For example, the workflow examples all have a function that defines group using seg_id_nat. It might be better to address this problem in a separate issue.
The resulting performance metrics are computed over all reaches, not by reach as with the ['seg_id_nat','time'] options, so I can add the group option that reports these metrics by reach
I'm not super familiar with the pandas grouper (so maybe that's the source of my confusion), but both monthly and yearly use 2 pandas groupers, 1 on time_idx_name and one on spatial_idx_name, right? So are you summing by reach and then calculating metrics across all the reaches?
So are you summing by reach and then calculating metrics across all the reaches?
yes, that's right
I can add the group option that reports these metrics by reach
Edit: I added metrics for reach-biweekly, reach-monthly and reach-yearly timeseries.
We could also have reach-biweekly-month (summarize the biweekly timeseries by month), reach-biweekly-year, and reach-monthly-year. reach-biweekly-time would require an additional function to define a biweekly index for Python datetime objects.
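For example, a biweekly index could come from a small helper like this (a sketch only; pairing ISO weeks is just one assumed way to label "biweekly"):

```python
import pandas as pd

def biweekly_index(timestamps):
    # Label each datetime with a biweekly period: ISO weeks 1-2 -> 1, weeks 3-4 -> 2, ...
    ts = pd.DatetimeIndex(timestamps)
    return (ts.isocalendar().week.to_numpy() + 1) // 2
```

That label could then be grouped on in the same way month and year are in the other options.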
Grr. I intended to click "request changes" before. Sorry for the change!!
Thanks for the review, @janetrbarclay! I think I addressed all of your comments. I think a few will require feedback from others before editing/reverting.
Agreed! Let me know when you want me to glance at those parts again and I'll be happy to approve them.
@jds485 - thanks for catching these things that have been slipping through the cracks! I have a few comments/suggestions that I think you/we should consider before making this merge.
river_dl/evaluate.py
Outdated
@@ -268,6 +335,99 @@ def partition_metrics(
.apply(calc_metrics)
.reset_index()
)
elif group == "year":
This is getting to be quite verbose. I suggest we split the group argument into two, maybe group_spatially and group_temporally. The group_spatially would be just a boolean. The group_temporally would be a str.
then the function could be something like:
if not group_spatially and not group_temporally:
    metrics = calc_metrics(data)
    # need to convert to dataframe and transpose so it looks like the others
    metrics = pd.DataFrame(metrics).T
elif group_spatially and not group_temporally:
    metrics = data.groupby(spatial_idx_name).apply(calc_metrics).reset_index()
elif not group_spatially and group_temporally:
    metrics = (data.groupby(pd.Grouper(level=time_idx_name, freq=group_temporally))
               .apply(calc_metrics)
               .reset_index())
elif group_spatially and group_temporally:
    metrics = (data.groupby([pd.Grouper(level=time_idx_name, freq=group_temporally),
                             pd.Grouper(level=spatial_idx_name)])
               .apply(calc_metrics)
               .reset_index())
I think that should work. We'd just have to document how the group_temporally argument needs to work.
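To make that concrete, the calls might end up looking something like this (hypothetical call shapes only; I'm assuming group_temporally takes a pandas offset alias and eliding the other arguments):

```python
# Hypothetical call shapes, assuming group_temporally accepts a pandas offset alias
metrics_overall    = partition_metrics(..., group_spatially=False, group_temporally=None)  # one row of metrics
metrics_reach      = partition_metrics(..., group_spatially=True,  group_temporally=None)  # metrics per reach
metrics_month      = partition_metrics(..., group_spatially=False, group_temporally="M")   # metrics per month
metrics_reach_biwk = partition_metrics(..., group_spatially=True,  group_temporally="2W")  # metrics per reach-biweek
```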
Definitely a bigger change, but I think it would be worth trying. It would also require propagating the change all the way up including any Snakefiles that are using this function.
Would also resolve #212
Thanks, that is much cleaner. I think one more argument would be needed for how to do the temporal aggregation. I think what you've programmed would compute metrics for the native timestep only (daily). I used a sum of the daily data to get biweekly, monthly, and yearly timesteps and compute metrics for those. Let me try out this edit because it will make the code much cleaner.
I addressed this change in the most recent commit. I needed 4 group args to replicate the previous functionality. I think it's now more generic to different timesteps of aggregation, but it's more difficult to understand how to apply the 4 args to compute the desired metrics. For reference, here is the function that I used in the snakemake file to assign the 4 new args to this function to compute different metrics for different groups. See the description of the 4 args in the function docstring.
#Order in the list is:
#group_spatially (bool), group_temporally (False or timestep to use), sum_aggregation (bool), site_based (bool)
def get_grp_arg(wildcards):
if wildcards.metric_type == 'overall':
return [False, False, False, False]
elif wildcards.metric_type == 'month':
return [False, 'M', False, False]
elif wildcards.metric_type == 'reach':
return [True, False, False, False]
elif wildcards.metric_type == 'month_reach':
return [True, 'M', False, False]
elif wildcards.metric_type == 'monthly_site_based':
return [False, 'M', True, True]
elif wildcards.metric_type == 'monthly_all_sites':
return [False, 'M', True, False]
elif wildcards.metric_type == 'monthly_reach':
return [True, 'M', True, False]
river_dl/evaluate.py
Outdated
elif group == ["seg_id_nat", "monthly"]: | ||
#filter the data to remove nans before computing the sum | ||
#so that the same days are being summed in the year. | ||
data_calc = (data_multiind.dropna() | ||
.groupby( | ||
[pd.Grouper(level=time_idx_name, freq='M'), | ||
pd.Grouper(level=spatial_idx_name)]) | ||
.sum() | ||
) | ||
data_calc = data_calc.reset_index() | ||
metrics = (data_calc | ||
.groupby(spatial_idx_name) | ||
.apply(calc_metrics) | ||
.reset_index() | ||
) |
Not sure why you would need to break it into two steps. Did you try this (below) and it didn't work?
elif group == ["seg_id_nat", "monthly"]: | |
#filter the data to remove nans before computing the sum | |
#so that the same days are being summed in the year. | |
data_calc = (data_multiind.dropna() | |
.groupby( | |
[pd.Grouper(level=time_idx_name, freq='M'), | |
pd.Grouper(level=spatial_idx_name)]) | |
.sum() | |
) | |
data_calc = data_calc.reset_index() | |
metrics = (data_calc | |
.groupby(spatial_idx_name) | |
.apply(calc_metrics) | |
.reset_index() | |
) | |
elif group == ["seg_id_nat", "monthly"]: | |
#filter the data to remove nans before computing the sum | |
#so that the same days are being summed in the year. | |
data_calc = (data_multiind.dropna() | |
.groupby( | |
[pd.Grouper(level=time_idx_name, freq='M'), | |
pd.Grouper(level=spatial_idx_name)]) | |
.apply(calc_metrics) | |
.reset_index() | |
) |
No additional . functions were allowed after .sum(), so I broke it into several lines.
Also, noting that the original and suggested code compute metrics for different groups. The original code creates a monthly timeseries for each reach and computes reach-based metrics using the monthly timeseries. The suggested code uses data in the native timestep (e.g., daily) to compute metrics for each month-reach. Both of these are options in the latest commit.
I guess my question was, why do you need to do .sum() in the first place?
I used .sum() to convert the daily timeseries to a biweekly, monthly, and yearly timeseries for each reach. I think this is analogous to the overall metrics computed for the native timestep, which is also indexed by date-reach. Performance metrics for these timeseries indicate how well the model is able to predict at those timesteps.
Okay. A little slow, but I think I got it now. So it's a two-step process because first you are aggregating by the time-space dimensions (e.g., site-biweekly) and then you are aggregating by the space (e.g., site). So you end up with just one number per site, but it's looking at bigger windows of time at a time (e.g., two weeks instead of every day).
Am I getting that right?
If that's right, why did you choose sum instead of mean? It seems like mean would be better.
For example, if I do an RMSE on the biweekly sums, I get
site_id
01466500 8.862246
01472104 23.094643
01472119 23.529100
014721254 20.738631
014721259 16.497109
01473499 17.143101
01473500 17.544453
01473675 17.654814
01473780 15.881052
01474500 12.225832
01480617 35.625263
01480870 36.393360
01481000 24.799420
01481500 26.861900
which is way high.
If I do it on the means:
site_id
01466500 0.633631
01472104 1.651097
01472119 1.682208
014721254 1.481826
014721259 1.179091
01473499 1.225882
01473500 1.254535
01473675 1.262302
01473780 1.136314
01474500 0.874386
01480617 2.546138
01480870 2.601188
01481000 1.772298
01481500 1.919950
A lot lower.
Is that what you'd expect?
So it's a two-step process because first you are aggregating by the time-space dimensions (e.g., site-biweekly) and then you are aggregating by the space (e.g., site). So you end up with just one number per site, but it's looking at bigger windows of time at a time
Yes, that's right.
why did you choose sum instead of mean?
Good point - the daily values are daily averages, not the total flow in a day. For some reason I was thinking of total flow, salinity, etc. I think the sum metrics will also be inflated compared to the mean. Here's an example for biweekly metrics for specific conductivity. Log RMSE is the same because log(sum_pred/n) - log(sum_obs/n) = log(sum_pred) - log(sum_obs)
metrics_sum
Out[126]:
rmse 10407.172144
nse 0.946307
rmse_top10 28507.780134
rmse_bot10 1128.871582
rmse_logged 0.162555
mean_bias -3456.110397
mean_bias_top10 -11188.212460
mean_bias_bot10 -618.612363
nse_top10 0.793622
nse_bot10 0.803559
nse_logged 0.963944
kge 0.899966
kge_logged 0.977389
kge_top10 0.881643
kge_bot10 0.875451
n_obs 966.000000
dtype: float64
metrics_mean
Out[127]:
rmse 47.075655
nse 0.649618
rmse_top10 121.434744
rmse_bot10 21.972157
rmse_logged 0.162555
mean_bias -21.274822
mean_bias_top10 -88.179638
mean_bias_bot10 -15.051474
nse_top10 -1.496829
nse_bot10 0.258352
nse_logged 0.698174
kge 0.756805
kge_logged 0.887974
kge_top10 0.220643
kge_bot10 0.732286
n_obs 966.000000
dtype: float64
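To spell out the log-RMSE point: within any aggregation period with n observed days, the sum and the mean differ only by the factor n, which cancels in log space, so the logged error for each period (and therefore rmse_logged) is identical:

$$\log\frac{\sum \hat{y}}{n} - \log\frac{\sum y}{n} = \left(\log\sum \hat{y} - \log n\right) - \left(\log\sum y - \log n\right) = \log\sum \hat{y} - \log\sum y$$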
river_dl/preproc_utils.py
Outdated
@@ -596,6 +596,7 @@ def prep_y_data(
    test_end_date=None,
    val_sites=None,
    test_sites=None,
    explicit_spatial_partition=False,
I tend to think that we should just assume that it is explicit. If I specify train/val/test sites, I expect those to be the only sites in the partition.
I agree. I added this argument to allow for the previous implementation. It sounds like the previous implementation was not intentional, so I'm happy to remove the explicit_spatial_partition arg here.
Okay. I just realized something when I was working on the DO project that may change this conversation.
What if you have sites that are in the training set for the training time period, but also have data in the validation time period?
As the code is currently written, the data at training sites but in the validation time period will be included in the validation set.
With these suggested changes, even if a training site has data in the validation time period, that data will not be used in the evaluation.
I think that's why I wrote the code as it is.
Okay, that makes sense. I think that's similar to @janetrbarclay's point here. Is this also why evaluate.py did not remove training sites from validation? And would you also want validation sites to be included within the test set for the same reason? Those sites are currently not removed from the test set.
Here are some edits I can make:
- Keep the explicit_spatial_partition option in this function (prep_y_data) to allow for the functionality you're describing.
- Add an explicit_spatial_partition option to the partition_metrics function in evaluate.py.
- When explicit_spatial_partition = FALSE (the current functionality), there should be a check in the code that there is a temporal split specified for train/val and val/test (whichever is applicable) so that data are not used in multiple splits.
- I think the prep_y_data function and partition_metrics function should have the same train_sites, val_sites, and test_sites args, and functionality for those args. They are currently different.
Let me know what you think
Jumping back into this after not thinking about it for a while, I'm wanting to make sure I have a clear understanding in my head of the options.
It seems we can have:
1. trn_sites (and val_sites and tst_sites) as the only sites for the respective partitions (i.e., remove all other sites from those partitions, but those sites can appear in other partitions), or
2. trn_sites (and val_sites and tst_sites) can appear only in those partitions (i.e., remove those sites from all other partitions, but other sites can appear in those partitions), or
3. trn_sites (and val_sites and tst_sites) are the only sites for those partitions and they only appear there (i.e., remove those sites from other partitions and remove other sites from those partitions).
Right?
I'm not quite sure which option explicit_spatial_partition == TRUE or FALSE maps to.
Also, would we ever want different behavior among the partitions? (ie trn_sites can appear in the test partition, but tst_sites can't appear in the training partition) It would get messy and complicated and likely isn't worth adding unless there's a clear need.
I like the idea of a check to make sure that data are not used in multiple partitions. Maybe that's an addition that is irrespective of the partition approach that's used? (just a final step in prepping the data)
option 1 would be the same as option 3 if all 3 partitions were specified (right?)
Yes, that's right (though I wasn't thinking of that yesterday). If both of these options remain, maybe a note goes in the comments pointing this out?
I'm trying to think about how I would use this. My inclination is that if I'm specifying training sites, then I want those to be the only training sites (otherwise what's the point of specifying sites?). So I guess I can see uses for options 1 and 3, but I'm not sure of the point of #2.
What do you think of keeping those options, with the explicit_spatial_partition as the flag that determines which one?
Sounds good to me! That's also how I would expect a function to use provided site partitions.
And if someone wanted something like the prior functionality, where some sites get removed from other partitions and some don't, then they can fully specify all the partitions and list things how they want. (Which means for option 3 it would be good for a specified list of sites for a partition to override removing sites specified for other partitions -> i.e., if trn_sites is specified, then those sites are removed from the validation partition, unless val_sites is also specified, in which case that takes precedence.)
Sure, I can work that option in. I'll plan to get to this next week
Thanks for cleaning this up for us!
…ly and group_spatially to compute metrics
…on instead of sum_aggregation
river_dl/preproc_utils.py
Outdated
@@ -1150,3 +1187,32 @@ def read_exclude_segs_file(exclude_file):
    with open(exclude_file, "r") as s:
        d = yaml.safe_load(s)
    return [val for key, val in d.items()]

def check_partitions(data):
@janetrbarclay Is this what you were envisioning for the partition check? y_obs uses the same ids_<partition> and times_<partition>, so I think only the x data dict has to be checked. But maybe each dataset has to be filtered to remove the site-days with NaN in y_obs because the sequences might overlap even though the data are NaN.
I decided to make this nan filter as well. Seems like a stricter check.
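Roughly, the check looks like this (a sketch only; the ids_<partition>/times_<partition>/y_obs_<partition> key names and a single observed variable are assumptions about the prepped data dict, not the actual check_partitions code):

```python
import numpy as np

def overlap_check(prepped, partitions=("trn", "val", "tst")):
    """Raise if any observed site-day appears in more than one partition."""
    seen = {}
    for part in partitions:
        ids = prepped[f"ids_{part}"].ravel()
        times = prepped[f"times_{part}"].ravel()
        obs = prepped[f"y_obs_{part}"].ravel()
        keep = ~np.isnan(obs)  # ignore site-days that are NaN in y_obs
        pairs = set(zip(ids[keep], times[keep]))
        for other_part, other_pairs in seen.items():
            overlap = pairs & other_pairs
            if overlap:
                raise ValueError(f"{len(overlap)} observed site-days are in both {other_part} and {part}")
        seen[part] = pairs
    return True
```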
Yes, though I hadn't thought through the implementation details, this is the functionality I was picturing. My only suggestion has to do with pretraining. I think we're not as worried about overlapping partitions in the pretraining data. It seems like good practice to have this check as the default, but maybe either add an option to not check the pretraining data and/or to print a warning rather than fail if there are overlaps?
I can change the hard-coded True/False to a user-specified value.
But why would overlapping partitions be okay for pretraining?
I would definitely keep it as the default to check, but maybe allow an override? I'm not sure when it would be used, and maybe I'm conflating partitioning and holdouts in my head. I know that for the gw project, we only kept our holdout reaches out of the fine-tuning training, we let them be in the pretraining. And actually, river-dl doesn't currently do anything with the validation and testing partitions of the pretraining data. (that's not a very clear answer :))
That holdout description makes sense to me. You're basically building an emulator of a process model for the study region, and I don't think it's cheating to use all reaches in that step because a process model can be run anywhere that has the necessary input data. However, it would be important to remove the real observations for holdout reaches when training the process model. Otherwise, some information from those observations would inform weights and biases of the ML model.
river-dl doesn't currently do anything with the validation and testing partitions of the pretraining data
It looks like the model training functions could use validation or testing data (e.g., for optimizing hyperparameters during pretraining). Is your statement more about how pretraining has been implemented in workflows?
Anyway, okay, I'll write in an argument that can override the pretraining check
@janetrbarclay I added a function argument that allows overriding the pretraining check. Let me know if this is good to merge.
data = data[~data[spatial_idx_name].isin(val_sites)]

if not group:
    if partition == 'trn':
@janetrbarclay here is the new handling of spatial data holdouts. It's similar to preproc_utils, except this first checks the partition of the data.
Looks good to me. Thanks for doing this!
@@ -671,14 +677,27 @@ def prep_y_data(
    )

-    # replace validation sites' (and test sites') data with np.nan
+    # replace trn, val and tst sites' data with np.nan
@janetrbarclay here is the new handling of spatial data holdouts in preproc_utils
Looks good too.
Looks good to me. Thanks for all your work on this! (and your patience with our slow responses)
Thanks for doing this, Jared.
I added 2 final commits based on discussion in #218. I am going to merge this because the commits have been reviewed there.