Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow util_corr_fit() to group by a variable #58

Open
Tracked by #77
emcfalls opened this issue Nov 29, 2023 · 5 comments
Open
Tracked by #77

Allow util_corr_fit() to group by a variable #58

emcfalls opened this issue Nov 29, 2023 · 5 comments

Comments

@emcfalls
Copy link
Contributor

emcfalls commented Nov 29, 2023

This extension will allow users to return correlation data for the numerical columns in their synthesis by a certain variable (i.e., gender, age, race, etc.). Therefore, users can assess the performance of their synthesis based how it preserves multivariate relationships for different subgroups in the population. I don't if we would want to group by multiple variables due to the complexity.

Right now, the util_corr_fit() function returns a list

return(
    list(
      correlation_original = original_lt,
      correlation_synthetic = synthetic_lt,
      correlation_difference = difference_lt,
      correlation_fit = correlation_fit,
      correlation_difference_mae = correlation_difference_mae,
      correlation_difference_rmse = correlation_difference_rmse
    )

I have two ideas for this extension

  1. When grouping by another variable, the function would return a list of lists. Each element of the list would be the correlation list data for each subgroup of that variable
    Example for variable sex:
return(
    list(
      male = list(
      correlation_original_m = original_lt_m,
      correlation_synthetic_m = synthetic_lt_m,
      correlation_difference_m = difference_lt_m,
      correlation_fit_m = correlation_fit_f,
      correlation_difference_mae_m = correlation_difference_mae_m,
      correlation_difference_rmse_m = correlation_difference_rmse_m
      ),
      female = list(
      correlation_original_f = original_lt_f,
      correlation_synthetic_f = synthetic_lt_f,
      correlation_difference_f = difference_lt_f,
      correlation_fit_f = correlation_fit_f,
      correlation_difference_mae_f = correlation_difference_mae_f,
      correlation_difference_rmse_f = correlation_difference_rmse_f
      )
  )
  1. Instead of making a list of lists, we could include all of the correlations for each subgroup in one dataframe for correlation_original, correlation_synthetic, and correlation_difference_f. correlation_difference_f, correlation_difference_mae_f, and correlation_difference_rmse_f would be lists where each element corresponds to the metric for a specific subgroup
    Example correlation matrix for a dataframe with sex, income (numeric), and height (numeric)
    IMG_4652
@emcfalls emcfalls self-assigned this Nov 30, 2023
@Deckart2
Copy link
Contributor

Hey @emcfalls ! Thank you for the very helpful write-up and for sharing your ideas (and also the picture :) ). I have a few thoughts:

  1. I think both are viable alternatives, but I like option 2 better and would support us doing that.
  2. I like option 2 for two reasons. 3. First, it maintains the same API experience as when a .group_by argument is not provided, so, either way, a user would be returned a named list where the values of the list are a data.frame. Second, I can imagine this being more easy for a user to work with than nested lists (I suspect a user would just unnest them and put them in a dataframe anyway).
  3. One other thing I think is important is that we should maintain a consistent API across functions within syntheval. This would mean that for all functions to which you (or any of us) add a group_by argument, our strategy is to return dataframes with a new column for the column containing the subgroup and row bind different dataframes that have the results for each subgroup together to make one larger dataframe (in other words so they all look like the second chart Elyse drew). I looked through all of the utility metrics, and it appears that this strategy is viable for them. For the disclosure metrics, it looks like we generally return either a single float or a named list of floats. This would mean we would have to change that API, but we will have to change it regardless for any functions that get the group_by argument. This is a bummer but is probably worth it.

One alternative to the two options presented here would be to make any functions that get a group_by argument to take another optional parameter of group (or something similar):

  • The function would then return the exact same results as it does now, but just analyzes the one group, and you call the function multiple times. Internally, the function would work the same as it does now, but it would just start off by doing a filter(group_by_col == group) for both the synthetic and original data.
  • Then the user would iterate over all the values of the group in the group_by column.
  • The pro of this is that it maintains an identical API. The con is that this is not very elegant and users would have to put together result dataframes.

@Deckart2
Copy link
Contributor

Deckart2 commented Nov 30, 2023

One other note as a total aside, markdown is supported on Github issues, so you can get inline code and

code chunks

with one and three tic marks, respectively.

@emcfalls
Copy link
Contributor Author

@Deckart2 I agree with not changing the format of the output and keeping it consistent with the other functions that use groupby! I also like your idea of using a group parameter so instead of returning the results for all groups we just return it for one. I think that would be the easiest as far as keeping the same function output, but I'm thinking it may be tedious for users who want to look at multiple groups. I think a best of both worlds would be to have both parameters (groupby and group), but that may be overkill.

@Deckart2
Copy link
Contributor

Sounds right to me, and I think it could be totally okay to have a groupby and group argument. I also agree it could be tedious, but if we go this route, we could add some documentation that could show how to do it in a few lines of code.

It may be harder than this, but at its core, we would need to do something like:

  1. Identify all relevant groups: groups <- synthetic_data$group_var |> unique()
  2. Call function for each group: results_by_group <- map_dfr(.f = util_corr_fit, .x = groups, ..., groupby = group_var)

Anyway, this is really thoughtful, and excited to discuss it synchronously with you @emcfalls and hear @awunderground 's thoughts :)!

@Deckart2
Copy link
Contributor

Deckart2 commented Dec 1, 2023

Brief Notes from Convo with Aaron:
Inspire by collect metrics from tidymodels: see documentation here - https://awunderground.github.io/data-science-for-public-policy2/11_ensembling.html
Go away from list of lists and instead go for tidy data of tibbles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants