Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve expansion of reference records, parent records and sub-form records #136

Open
nickdickinson opened this issue Oct 16, 2024 · 5 comments

Comments

@nickdickinson
Copy link
Collaborator

nickdickinson commented Oct 16, 2024

For the convenience of the user and to reduce the load on the server, provide an explicit grammar to allow the user to expand reference fields, parent records and sub-form records. The current automatic expansion would be used to implement these but reduce the load on the server by only showing what users explicitly want to see.

There are two potential approaches that can be implemented.

Approach 1: ActivityInfo label/code based selection of columns.

This would be easiest for ActivityInfo users who dabble in scripts. It would be a wrapper for getRecords() and select() and rename() to allow immediately selecting variables in activityInfo style with the label and/or the code or id of each column and choose the resulting names.

In pseudo R to get all inhabitants of households with a reference to a Person form for all person fields:

df = selectFormVariables(
     inhabitantFormId, 
     vars = c(
          'Household ID' =  '@parent.[ID]', 
          'Inhabitant name' = '[Inhabitant].[Full name]',
          'Mother name' = '[Inhabitant].[Mother].[Full name]', 
          'Father name' = '[Inhabitant].[Father].[Full name]', 
          'HH head gender' = '@parent.[Head of household].[Gender]')
          'HH head age' = '@parent.[Head of household].[Age]')
)

Approach 2: Tidyverse verbs

This is the best from an R developer / data science perspective. We would implement unnest verbs. These verbs include unnest_wider() for reference fields and parent records and unnest_longer() for sub-form records and potentially unnest_auto() to automatically choose the most appropriate and potentially hoist() to be more specific in selection. The example below shows how one can use the tidy select functions to powerfully select exactly which variables are needed.

In pseudo R to get all inhabitants of households with a reference to a Person form for all person fields:

getRecords(inhabitantFormId) %>%
     unnest_wider(Inhabitant) %>%
     unnest_wider(Mother, names_sep = ' ') %>%
     unnest_wider(Father, names_sep = ' ') %>%
     unnest_wider(`@parent`, ' ') %>%
     rename(`Head of household` = `@parent Head of household`) %>% 
     unnest_wider(`Head of household`, names_sep = '') %>%
     select(`Full name`, `Mother Full name`, `Father Full name`, starts_with('@parent') & contains('ID'), contains('Head') & (contains('Gender') || contains('Age'))

Principles:

  • It would unnest columns that are unexpanded reference, parent or subform fields.
  • It would do nothing if these are already expanded.
  • Allow arbitrary expansion, even if cycles are in place.
  • One could still use getRecords(inhabitantFormId, allColumnsStyle(maxDepth = n)) to be able to use the ActivityInfo label/code to select columns at depth n but we would limit this practice due to the way that the number of columns can blow up and because we don't expand forms that have already been visited.

Combined

Both approaches are compatible and could be chained. It is mainly about exposing the most useful API.

@nickdickinson
Copy link
Collaborator Author

@jamiewhths Forgot to mention this. I worked out a couple of options for an implementation of what we discussed.

Option 1 would be easy to implement BUT only if we stick to our existing varNames() implementation which then limits the expansion in cycles. For example, [supervisor].[employee].[supervisor] would not work because we do not revisit forms during the variable expansion. If we parse the variable names and perhaps do a little re-write on varNames() then we could expand arbitrarily into cycles and would require more effort initially to code.

Is there a way for us to do this with the help of the server? For example, send the '[Inhabitant].[Mother].[Full name]' to the server and get back the field id if it is correct? For a whole list of columns?

Option 2 would allow arbitrary expansion into cycles. A bit more effort as we'll be implementing the dplyr verbs but more standardized from an R / data science perspective.

Want to place this in a 4.39 milestone?

@nickdickinson
Copy link
Collaborator Author

I think what makes most sense here is to add an argument to getRecords to leverage the ability to specify columns in queryTable():

inhabitantForm %>% getRecords(columns = c(
          'Household ID' =  '@parent.[ID]', 
          'Inhabitant name' = '[Inhabitant].[Full name]',
          'Mother name' = '[Inhabitant].[Mother].[Full name]', 
          'Father name' = '[Inhabitant].[Father].[Full name]', 
          'HH head gender' = '@parent.[Head of household].[Gender]')
          'HH head age' = '@parent.[Head of household].[Age]'))

It looks like, in order to do this, the following line in records.R, needs to explicitly set the vars argument and the columns argument when getRecords is passed explicit columns:
step = firstStep(formTree, style, totalRecords)

-->

step = firstStep(formTree, style, totalRecords, vars = columns, columns = columns)

A more sophisticated approach is to additionally modify select.tbl_activityInfoRemoteRecords() in order to first check for selection of variables that are not existing in the lazy data frame and then use a REST API call to check if that variable is legal for the table. If so, add it.

This would allow the following approach:

inhabitantForm %>% getRecords() %>% select(
          `mother_name` =  `[Inhabitant].[Mother].[Full name]`
)

@nickdickinson
Copy link
Collaborator Author

nickdickinson commented Dec 4, 2024

I think I cracked the nut by the way @jamiewhths. I can now do:

# create an officer column using an ActivityInfo variable expression in select()
x %>% select(`_lastEditTime`, officer = `ce8xfhgm3x4xtgr5.[Office administrator].[Field Office].[Office administrator].[Field Office].[Office administrator].[Field Office].[c1r6xktm49zvh2l2].field_office.c1r6xktm49zvh2l2.Name`) %>% filter(officer == "Person 1")

or even

# filter rows that only contain _id and _lastEditTime with a filter that uses an ActivityInfo variable expression
x %>% select(starts_with('_')) %>% filter(`ce8xfhgm3x4xtgr5.[Office administrator].[Field Office].[Office administrator].[Field Office].[Office administrator].[Field Office].[c1r6xktm49zvh2l2].field_office.c1r6xktm49zvh2l2.Name` == "Person 1")

It will error if the expression is incorrect before calling the API by checking the formTree.

With this, we can also simply use the select statements to add variables when fetching from a form and avoid all recursive styles.

We can also implement the mutate() dplyr verb to create new columns with ActivityInfo formulas.

@nickdickinson
Copy link
Collaborator Author

The implementation is on the arbitrary-columns-4.38 branch.

@nickdickinson
Copy link
Collaborator Author

The mutate verb now works with many ActivityInfo formulas. I've tried to include them all, but we'll need to check.

This is now supported:

x %>% mutate(a = 1) %>% mutate(b = a + a) %>% mutate(c = IF(CONCAT((2*2)+1)==TEXT(4), "It does!", "Nope"))

A limited number of R functions are automatically translated including:

  • grepl -> REGEXMATCH
  • paste0 -> CONCAT
  • Binary infix operators | and & -> || and &&

Allowing also:

x %>% mutate(a = 1) %>% mutate(b = a + a) %>% mutate(c = paste0("Name: ", `[Field.Office].[Office administrator].Name`))

This is great because it means specific columns and 'views' can be generated before collecting the data frame.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant