MergeFilterAndCollapse

Context

MergeFilterAndCollapse is used to execute a specific family of steps in the data processing of a multi-database study. The step is part of the ‘study variable’ process, labelled T2 in the conceptualization by Gini et al. (Gini et al, eGEMS 2016). The step computes study variables for the units of observation of the study, based on multiple observations recorded during routine healthcare. In Deliverable 7.5 of ConcePTION this step is analysed further, and T2 is broken down into T2.1 (extraction from the raw data), T2.2 (creation of components by processing at record level and within records of the same unit of observation) and T2.3 (creation of composites by processing components). The definition of component (both primary and secondary) can be found in Gini et al, 2020. See also the representation at this link. MergeFilterAndCollapse provides an interface for many of the most common steps of T2.2, and supports the production of both primary and secondary components.

Purpose

A dataset containing one record per unit of observation is merged with one or more longitudinal datasets. The result is then filtered according to some conditions (e.g. on the timeframe of the longitudinal observations) and then, optionally, collapsed to obtain one record per unit of observation again. Before collapsing, additional record-level variables can optionally be added, and the resulting dataset can be saved as an intermediate result. When secondary components are to be produced, both datasets can optionally be longitudinal.
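The merge, filter and collapse sequence can be illustrated with plain data.table operations. This is only a minimal sketch of the idea, not the function's own implementation: the toy datasets, the column names (person_id, study_entry_date, event_date) and the one-year lookback window are all assumptions made up for illustration.

```r
library(data.table)

# Hypothetical dataset with one record per unit of observation
cohort <- data.table(
  person_id = c("P1", "P2", "P3"),
  study_entry_date = as.Date(c("2020-01-01", "2020-06-01", "2020-03-01"))
)

# Hypothetical longitudinal dataset with multiple records per key
diagnoses <- data.table(
  person_id = c("P1", "P1", "P2", "P3"),
  event_date = as.Date(c("2019-11-15", "2018-01-10", "2020-07-01", "2019-12-31"))
)

# 1. Merge: one-to-many join on the key
merged <- merge(diagnoses, cohort, by = "person_id")

# 2. Filter: keep only records in the 365 days up to study entry
filtered <- merged[event_date >= study_entry_date - 365 &
                     event_date <= study_entry_date]

# 3. Collapse: back to one record per person, with summary statistics
collapsed <- filtered[, .(n_records = .N, last_date = max(event_date)),
                      by = person_id]
print(collapsed)
```

In this plain sketch, persons with no record satisfying the filter do not appear in the collapsed result.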

Structure of input data

listdatasetL a list of one or more data.table() datasets, containing multiple records per -key-.
datasetS (optional) a data.table() dataset, containing one record per -key- or, if -typemerge- is 2, multiple records per -key-

Main parameters

listdatasetL (list of data.table) a list of one or more data.table() datasets, containing multiple records per -key-. If the list contains more than one dataset, make sure the names of the -key- variables are the same across datasets.
datasetS (data.table) (optional) a data.table() dataset, containing one record per -key- or, if -typemerge- is 2, multiple records per -key-.
key (list of str) a vector containing the name(s) of the column(s) identifying the key for the merge between -listdatasetL- and -datasetS-. If the key variables have the same name in both, the name can be listed just once; otherwise list first the name in -listdatasetL- and second the name in -datasetS-. The key may be the unit of observation, or the same key may belong to multiple units of observation (for instance, the unit of observation may be a pregnancy and the key may be the identifier of the pregnant woman, who may contribute more than one pregnancy to the study).
condition (str) (optional) a string containing a condition on the rows of the product between -datasetS- and -datasetL-. Only rows of the product that comply with the condition will be further processed by the function.
typemerge a dichotomous parameter, by default set to 1, indicating a one-to-many merge. If it is 2 the merge will be many-to-many (this implies that -datasetS- contains more than one record per -key-).
saveintermediatedataset (boolean) a logical parameter, by default set to FALSE. If it is TRUE, the intermediate dataset obtained after -listdatasetL- is merged with -datasetS- and filtered with -condition- will be saved. If -additionalvar- is specified, the intermediate dataset will also contain the new variables. If -nameintermediatedataset- is not specified, the intermediate dataset is saved in the working directory with the name 'intermediatedataset'.
nameintermediatedataset (optional) a string specifying the file name of the intermediate dataset (including the path, if any).
additionalvar (list of lists) (optional) a list of lists containing additional variables to be created on the merged dataset before computing summary statistics. Each list is made up of three parts: the first is the name to give to the new variable, the second is the content of the variable, and the third is an optional condition filtering the rows to fill.
sorting (list of str) (optional) a vector containing the column(s) the dataset must be sorted by before computing summary statistics.
strata (list of str) a vector of column name(s) of -datasetS- and/or -datasetL- and/or -additionalvar- across which the dataset is collapsed.
summarystat (list of lists) a list of lists, each containing three elements: first, the summary statistic to be computed (values allowed are: mean, min, max, sd, mode, first, second, secondlast, last, exist, sum, count); second, the variable on which to compute it; and optionally, third, the name to give to the new variable.
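The parameters above might be assembled into a call along the following lines. This is a hedged sketch based only on the descriptions in this page: the toy datasets, the column names and every argument value are hypothetical, and the exact form the function expects for each argument should be checked against its source. The function itself is not defined in this snippet, so the call is made only if MergeFilterAndCollapse has already been loaded into the session.

```r
library(data.table)

# Hypothetical inputs: one record per person, and multiple records per person
cohort <- data.table(
  person_id = c("P1", "P2"),
  study_entry_date = as.Date(c("2020-01-01", "2020-06-01"))
)
diagnoses <- data.table(
  person_id = c("P1", "P1", "P2"),
  event_date = as.Date(c("2019-11-15", "2019-12-20", "2020-05-01"))
)

# Arguments named as in the parameter list; all values are illustrative
mfc_args <- list(
  listdatasetL = list(diagnoses),
  datasetS = cohort,
  key = "person_id",
  condition = "event_date >= study_entry_date - 365 & event_date <= study_entry_date",
  typemerge = 1,
  sorting = c("person_id", "event_date"),
  strata = "person_id",
  # Each inner list: statistic, variable to compute it on, optional new name
  summarystat = list(
    list("count", "event_date", "n_records"),
    list("last", "event_date", "last_date")
  )
)

# Call the function only if it is available in the current session
if (exists("MergeFilterAndCollapse")) {
  result <- do.call(MergeFilterAndCollapse, mfc_args)
  print(result)
}
```

Sorting by event_date before collapsing matters for order-dependent statistics such as first, second, secondlast and last.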
