Allow util_corr_fit() to group by a variable #58

emcfalls · 2023-11-29T23:39:23Z

This extension will allow users to return correlation data for the numerical columns in their synthesis by a certain variable (i.e., gender, age, race, etc.). Therefore, users can assess the performance of their synthesis based how it preserves multivariate relationships for different subgroups in the population. I don't if we would want to group by multiple variables due to the complexity.

Right now, the util_corr_fit() function returns a list

return(
    list(
      correlation_original = original_lt,
      correlation_synthetic = synthetic_lt,
      correlation_difference = difference_lt,
      correlation_fit = correlation_fit,
      correlation_difference_mae = correlation_difference_mae,
      correlation_difference_rmse = correlation_difference_rmse
    )

I have two ideas for this extension

When grouping by another variable, the function would return a list of lists. Each element of the list would be the correlation list data for each subgroup of that variable
Example for variable sex:

return(
    list(
      male = list(
      correlation_original_m = original_lt_m,
      correlation_synthetic_m = synthetic_lt_m,
      correlation_difference_m = difference_lt_m,
      correlation_fit_m = correlation_fit_f,
      correlation_difference_mae_m = correlation_difference_mae_m,
      correlation_difference_rmse_m = correlation_difference_rmse_m
      ),
      female = list(
      correlation_original_f = original_lt_f,
      correlation_synthetic_f = synthetic_lt_f,
      correlation_difference_f = difference_lt_f,
      correlation_fit_f = correlation_fit_f,
      correlation_difference_mae_f = correlation_difference_mae_f,
      correlation_difference_rmse_f = correlation_difference_rmse_f
      )
  )

Instead of making a list of lists, we could include all of the correlations for each subgroup in one dataframe for correlation_original, correlation_synthetic, and correlation_difference_f. correlation_difference_f, correlation_difference_mae_f, and correlation_difference_rmse_f would be lists where each element corresponds to the metric for a specific subgroup
Example correlation matrix for a dataframe with sex, income (numeric), and height (numeric)

The text was updated successfully, but these errors were encountered:

Deckart2 · 2023-11-30T15:22:19Z

Hey @emcfalls ! Thank you for the very helpful write-up and for sharing your ideas (and also the picture :) ). I have a few thoughts:

I think both are viable alternatives, but I like option 2 better and would support us doing that.
I like option 2 for two reasons. 3. First, it maintains the same API experience as when a .group_by argument is not provided, so, either way, a user would be returned a named list where the values of the list are a data.frame. Second, I can imagine this being more easy for a user to work with than nested lists (I suspect a user would just unnest them and put them in a dataframe anyway).
One other thing I think is important is that we should maintain a consistent API across functions within syntheval. This would mean that for all functions to which you (or any of us) add a group_by argument, our strategy is to return dataframes with a new column for the column containing the subgroup and row bind different dataframes that have the results for each subgroup together to make one larger dataframe (in other words so they all look like the second chart Elyse drew). I looked through all of the utility metrics, and it appears that this strategy is viable for them. For the disclosure metrics, it looks like we generally return either a single float or a named list of floats. This would mean we would have to change that API, but we will have to change it regardless for any functions that get the group_by argument. This is a bummer but is probably worth it.

One alternative to the two options presented here would be to make any functions that get a group_by argument to take another optional parameter of group (or something similar):

The function would then return the exact same results as it does now, but just analyzes the one group, and you call the function multiple times. Internally, the function would work the same as it does now, but it would just start off by doing a filter(group_by_col == group) for both the synthetic and original data.
Then the user would iterate over all the values of the group in the group_by column.
The pro of this is that it maintains an identical API. The con is that this is not very elegant and users would have to put together result dataframes.

Deckart2 · 2023-11-30T15:23:28Z

One other note as a total aside, markdown is supported on Github issues, so you can get inline code and

code chunks

with one and three tic marks, respectively.

emcfalls · 2023-11-30T16:44:23Z

@Deckart2 I agree with not changing the format of the output and keeping it consistent with the other functions that use groupby! I also like your idea of using a group parameter so instead of returning the results for all groups we just return it for one. I think that would be the easiest as far as keeping the same function output, but I'm thinking it may be tedious for users who want to look at multiple groups. I think a best of both worlds would be to have both parameters (groupby and group), but that may be overkill.

Deckart2 · 2023-11-30T17:39:08Z

Sounds right to me, and I think it could be totally okay to have a groupby and group argument. I also agree it could be tedious, but if we go this route, we could add some documentation that could show how to do it in a few lines of code.

It may be harder than this, but at its core, we would need to do something like:

Identify all relevant groups: groups <- synthetic_data$group_var |> unique()
Call function for each group: results_by_group <- map_dfr(.f = util_corr_fit, .x = groups, ..., groupby = group_var)

Anyway, this is really thoughtful, and excited to discuss it synchronously with you @emcfalls and hear @awunderground 's thoughts :)!

Deckart2 · 2023-12-01T16:02:36Z

Brief Notes from Convo with Aaron:
Inspire by collect metrics from tidymodels: see documentation here - https://awunderground.github.io/data-science-for-public-policy2/11_ensembling.html
Go away from list of lists and instead go for tidy data of tibbles.

emcfalls self-assigned this Nov 30, 2023

awunderground mentioned this issue May 10, 2024

Version 0.0.4 #57

Closed

9 tasks

uijkelly unassigned emcfalls May 13, 2024

awunderground added this to the Version 0.0.4 milestone Jul 3, 2024

awunderground added the new-feature label Jul 10, 2024

awunderground modified the milestones: Version 0.0.4, Version 0.0.5 Sep 4, 2024

awunderground mentioned this issue Oct 30, 2024

Version 0.0.5 #77

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow util_corr_fit() to group by a variable #58

Allow util_corr_fit() to group by a variable #58

emcfalls commented Nov 29, 2023 •

edited

Loading

Deckart2 commented Nov 30, 2023

Deckart2 commented Nov 30, 2023 •

edited

Loading

emcfalls commented Nov 30, 2023

Deckart2 commented Nov 30, 2023

Deckart2 commented Dec 1, 2023

Allow util_corr_fit() to group by a variable #58

Allow util_corr_fit() to group by a variable #58

Comments

emcfalls commented Nov 29, 2023 • edited Loading

Deckart2 commented Nov 30, 2023

Deckart2 commented Nov 30, 2023 • edited Loading

emcfalls commented Nov 30, 2023

Deckart2 commented Nov 30, 2023

Deckart2 commented Dec 1, 2023

emcfalls commented Nov 29, 2023 •

edited

Loading

Deckart2 commented Nov 30, 2023 •

edited

Loading