RosaMGini edited this page Sep 2, 2022 · 9 revisions

MergeFilterAndCollapse

Context

MergeFilterAndCollapse is used to execute a specific family of steps in the data processing of a multi-database study. The step is part of the ‘study variable’ process, labelled T2 in the conceptualization by Gini et al. (eGEMS 2016). It computes study variables for the units of observation of the study, based on multiple observations that occurred during routine healthcare. In Deliverable 7.5 of the ConcePTION Project this step is analysed further, and T2 is conceptualised as a sequence of T2.1 (extraction from the raw data), T2.2 (creation of components by processing at record level and within records of the same unit of observation) and T2.3 (creation of composites by processing components). The definition of component (both primary and secondary) can be found in Gini et al., Vaccine 2020. See also the representation at this link. MergeFilterAndCollapse provides an interface for many of the most common steps of T2.2, and supports the production of both primary and secondary components.

Purpose

A dataset containing one record per unit of observation is merged with one or more longitudinal datasets. The result is then filtered according to some conditions (e.g. on the timeframe of the longitudinal observations) and, optionally, collapsed to obtain one record per unit of observation again. Before collapsing, additional record-level variables can optionally be added, and the resulting dataset can be saved as an intermediate result. When the production of secondary components is requested, both datasets can be longitudinal.

Structure of input data

  • listdatasetL a list of one or more data.table() datasets, containing multiple records per -key-.
  • datasetS (optional) a data.table() dataset, containing one record per -key- or, if -typemerge- is 2, multiple records per -key-

Main parameters

  • listdatasetL (str) a list of one or more data.table() datasets, containing multiple records per -key-. In case the list contains more than one dataset, make sure the names of the -key- variables are equal across datasets.
  • datasetS (str) (optional) a data.table() dataset, containing one record per -key- or, if -typemerge- is 2, multiple records per -key-
  • key (list of str) a vector containing the name(s) of the column(s) identifying the key for the merge between -listdatasetL- and -datasetS-. If the key variables have the same name in both, the name can be listed just once; otherwise list first the name in -listdatasetL- and second the name in -datasetS-. The key may coincide with the unit of observation, or the same key may belong to multiple units of observation (for instance, the unit of observation may be a pregnancy and the key the identifier of the pregnant woman, who may contribute more than one pregnancy to the study).
  • condition (str) (optional) a string containing a condition on the rows of the product between -datasetS- and -datasetL-. Only rows of the product that comply with the condition will be further processed by the function.
  • typemerge a dichotomous parameter, by default set to 1, indicating a one-to-many merge. If it is 2 the merge will be many-to-many (this implies that -datasetS- contains more than one record per -key-).
  • saveintermediatedataset (boolean) a logical parameter, by default set to FALSE. If it is TRUE, the intermediate dataset obtained after -listdatasetL- is merged with -datasetS- and filtered with -condition- will be saved. If -additionalvar- is specified, the intermediate dataset will also contain the new variables. If -nameintermediatedataset- is not specified, the intermediate dataset is saved in the working directory with name 'intermediatedataset'.
  • nameintermediatedataset (optional) a string specifying the file name of the intermediate dataset (including the path, if any).
  • additionalvar (list of lists) (optional) a list of lists containing additional variables to be created on the merged dataset before computing summary statistics. Each list is made up of three parts: the first is the name to give to the new variable, the second is the content of the variable, and the third is an optional condition filtering the rows to fill.
  • sorting (list of str) (optional) a vector containing the column(s) the dataset must be sorted by before computing summary statistics
  • strata (list of str) a vector of column name(s) of -datasetS- and/or -datasetL- and/or -additionalvar- across which the dataset is collapsed.
  • summarystat (a list of lists) a list of lists, each containing three elements: first, the summary statistic to be computed (values allowed are: mean, min, max, sd, mode, first, second, secondlast, last, exist, sum, count); second, the variable on which to compute it; and, optionally, third, the name to give to the new variable.
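The difference between the two values of -typemerge- can be illustrated with plain data.table code. The following is a minimal sketch, not the function's implementation: the datasets `pregnancies` and `prescriptions`, their columns and the 90-day window are all hypothetical, and dates are assumed to be stored with class Date. Because the dataset playing the role of -datasetS- has two records for the same key, the merge is many-to-many, which in data.table requires `allow.cartesian = TRUE`.

```r
library(data.table)

# Hypothetical -datasetS-: one woman (the key) contributes two pregnancies,
# so there are multiple records per key (the situation of typemerge = 2)
pregnancies <- data.table(
  person_id       = c("W1", "W1"),
  pregnancy_id    = c("PR1", "PR2"),
  pregnancy_start = as.Date(c("2019-03-01", "2020-09-01"))
)

# Hypothetical longitudinal dataset: three prescriptions of the same woman
prescriptions <- data.table(
  person_id = c("W1", "W1", "W1"),
  date      = as.Date(c("2019-04-10", "2020-10-05", "2020-11-20"))
)

# Many-to-many merge: every prescription is paired with every pregnancy
product <- merge(prescriptions, pregnancies, by = "person_id",
                 allow.cartesian = TRUE)

# Filter the product: keep prescriptions in the first 90 days of each pregnancy
filtered <- product[date >= pregnancy_start & date < pregnancy_start + 90]

# Collapse to one record per unit of observation (the pregnancy)
counts <- filtered[, .(n_prescriptions = .N), by = .(person_id, pregnancy_id)]
```

In this sketch the key (`person_id`) is not the unit of observation (`pregnancy_id`), which is why the collapse is done across both columns.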

Action

The function MergeFilterAndCollapse performs several steps:

  1. Merge -listdatasetL- with -datasetS- per -key- while filtering with -condition-. The merge may be one-to-many (default) or many-to-many (if -typemerge- is set to 2). As an option, -datasetS- may be missing; in this case only -listdatasetL- is filtered.
  2. If the parameter -additionalvar- is specified, additional variables are computed on the merged dataset and can then be used in the next step to compute summary statistics. This intermediate dataset may be saved for later processing.
  3. The merged dataset is collapsed across the strata defined by -strata-, and the summary statistics specified in the parameter -summarystat- are computed. The possible values are: minimum ("min"), maximum ("max"), mean ("mean"), standard deviation ("sd"), mode ("mode"), first element ("first"), second element ("second"), second last element ("secondlast"), last element ("last"), an element exists ("exist"), sum ("sum"), count ("count").
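Steps 2 and 3 can be sketched in plain data.table to show why -sorting- matters for the position-based statistics. This is an illustration under stated assumptions, not the function's internal code: the dataset `dt`, the variable `high_value` and its threshold are hypothetical, and the handling of strata with fewer than two rows (returning NA here) is an assumption rather than documented behaviour.

```r
library(data.table)

# Hypothetical merged and filtered dataset
dt <- data.table(
  person_id = c("A", "A", "A", "B"),
  date      = as.Date(c("2020-05-01", "2020-01-15", "2020-03-10", "2020-02-02")),
  value     = c(30, 10, 20, 40)
)

# Step 2: an additional variable filled only on rows meeting a condition;
# rows that do not meet the condition keep a missing value
dt[value >= 20, high_value := 1L]

# Sort before collapsing: "first", "second", "secondlast" and "last"
# depend on the order of the rows within each stratum
setorder(dt, person_id, date)

# Step 3: collapse across the strata and compute the summary statistics
collapsed <- dt[, .(
  first_value      = value[1],
  second_value     = value[2],                                  # NA if only one row
  secondlast_value = if (.N >= 2) value[.N - 1] else NA_real_,  # NA if only one row
  last_value       = value[.N],
  n_high           = sum(high_value, na.rm = TRUE),
  count            = .N
), by = person_id]
```

Without the call to `setorder`, the "first" value for person A would be 30 (the first row as stored) rather than 10 (the earliest by date).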

Structure of output

The function returns a data.table() dataset with one row for each level of the strata variable(s) and, as columns, the strata variable(s) and one column for each specified summary statistic. Moreover, if -saveintermediatedataset- is TRUE, an intermediate dataset is saved: this is the dataset obtained after -listdatasetL- is merged with -datasetS- and filtered with -condition-. If -additionalvar- is specified, the intermediate dataset will also contain the new variables.

Example

You have three persons in your cohort, all entering on 1 January 2021, stored in the dataset -cohort-

person_id,study_entry_date
P00000001,20210101
P00000002,20210101
P00000003,20210101

and there are two conditions of interest, diabetes and hypertension. The input is a set of longitudinal diagnoses, stored in the dataset -longitudinal_dataset- as follows

person_id,date,conceptset,n
P00000001,20200203,diabetes,1
P00000001,20200913,diabetes,1
P00000001,20200221,hypertension,1
P00000002,20191015,hypertension,1
P00000002,20201015,hypertension,1

Person P00000001 has both diabetes and hypertension; P00000002 has two diagnoses of hypertension, but one of them was recorded more than 365 days before the start of the study and must be discarded; P00000003 has no recorded diagnoses.

The purpose is to check whether there is at least one diagnosis in the 365 days before the start of the study, and to count the number of such diagnoses. The command is as follows

    output <- MergeFilterAndCollapse(listdatasetL = list(longitudinal_dataset),
                                     datasetS = cohort,
                                     condition = "date >= study_entry_date - 365 & date < study_entry_date",
                                     key = "person_id",
                                     saveintermediatedataset = FALSE,
                                     strata = c("person_id", "conceptset"),
                                     summarystat = list(
                                                        list(c("max"), "n", "at_least_one"),
                                                        list(c("sum"), "n", "number_of_diagnoses")
                                                       )
                                     )

and the dataset -output- is as follows

person_id,conceptset,at_least_one,number_of_diagnoses
P00000001,diabetes,1,2
P00000001,hypertension,1,1
P00000002,hypertension,1,1
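
To make the logic of the example explicit, the same result can be reproduced with plain data.table code. This is a sketch of an equivalent computation, not the implementation of MergeFilterAndCollapse; it assumes that the date columns are stored with class Date, so that subtracting 365 means subtracting 365 days. The object names `merged`, `filtered` and `by_hand` are introduced here for illustration only.

```r
library(data.table)

# The two input datasets of the example, with dates stored as Date (assumption)
cohort <- data.table(
  person_id        = c("P00000001", "P00000002", "P00000003"),
  study_entry_date = as.Date(rep("2021-01-01", 3))
)
longitudinal_dataset <- data.table(
  person_id  = c("P00000001", "P00000001", "P00000001", "P00000002", "P00000002"),
  date       = as.Date(c("2020-02-03", "2020-09-13", "2020-02-21",
                         "2019-10-15", "2020-10-15")),
  conceptset = c("diabetes", "diabetes", "hypertension",
                 "hypertension", "hypertension"),
  n          = rep(1, 5)
)

# Merge on the key: persons without longitudinal records (P00000003) drop out
merged <- merge(longitudinal_dataset, cohort, by = "person_id")

# Filter: keep diagnoses in the 365 days before study entry
filtered <- merged[date >= study_entry_date - 365 & date < study_entry_date]

# Collapse across the strata and compute the two summary statistics
by_hand <- filtered[, .(at_least_one        = max(n),
                        number_of_diagnoses = sum(n)),
                    by = .(person_id, conceptset)]
```

The dataset `by_hand` should contain the same three rows as -output- above. P00000003 is absent because the merge keeps only persons with at least one longitudinal record, and the 2019 diagnosis of P00000002 is removed by the filter.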