Challenge of large datasets -- how does this look at different sites? #531
-
Hi Eddie, At Sinai we use an OMOP/OHDSI DBMS. We have 10+ million patients and many years of data; e.g., the measurement table alone contains almost 1B records. Achieving satisfactory performance has been an ongoing challenge. I believe we have a good plan for how to enhance performance, but I've not had time to execute it. The basic approach is two-fold:
This outlines our plan for the first; these are our planned steps:
Sorry, I don't have time to write more now, but I'd be happy to chat with you and your team about it. Arthur
-
Thank you, Arthur, this is a great outline. Yes, I will definitely reach out to you in the future to discuss these issues in greater detail. Best, Eddie
-
When we attempt to add the dataset to the cohort patient list using the "add more data" feature, it keeps loading and fails to show any data. From the app server log, it appears that a query selects all data from one clinical domain using a WITH statement; with no limit on the number of records, it pulls every record in the domain. We believe this is causing the issue. Our SQL Server's maximum memory is currently 26 GB. Can someone confirm whether the backend query uses SQL Server's memory or its storage resources? If it uses memory, the query result definitely exceeded our total memory size.
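To the memory question: SQL Server reads data pages through its in-memory buffer pool, and sorts and hash joins take memory grants that spill to tempdb on disk when exceeded, so an unbounded query pressures both; the result set itself streams to the app server rather than being held entirely in server memory. Below is a minimal sketch of the pattern described, using hypothetical table names (dbo.observation, dbo.cohort_member) rather than Leaf's actual schema:

```sql
-- Hypothetical sketch of the pattern described above, NOT Leaf's actual query.
-- Unbounded form: the CTE joins the cohort to an entire clinical domain,
-- so every matching record is materialized and sent to the app server.
WITH cohort_rows AS (
    SELECT o.person_id, o.observation_date, o.value_as_string
    FROM dbo.observation AS o
    JOIN dbo.cohort_member AS c ON c.person_id = o.person_id
)
SELECT * FROM cohort_rows;

-- Bounded form: capping the result set keeps the memory grant and the
-- payload sent back to the app server small.
WITH cohort_rows AS (
    SELECT o.person_id, o.observation_date, o.value_as_string
    FROM dbo.observation AS o
    JOIN dbo.cohort_member AS c ON c.person_id = o.person_id
)
SELECT TOP (50000) *
FROM cohort_rows
ORDER BY observation_date DESC;  -- most recent records first
```

The TOP clause is only an illustration of bounding the working set; the appropriate cap would depend on what the "add more data" view actually needs.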
-
Here at CU, we have about 4 million patients, and we want our instance of Leaf to have access to 10 years of data. In addition, we host our instance of Leaf in the cloud. We've run up against query-timeout issues that we believe may be due to the sheer quantity of data we're querying. For example, while configuring datasets in Leaf, the "procedures" concept covers about 240 million records; this query runs for over an hour, and with the app server and our MSSQL DB communicating for that long, the pipeline shuts down. We are going to limit the quantity of data and attempt to optimize this function.
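As a sketch of the kind of restriction we have in mind, assuming an OMOP-style dbo.procedure_occurrence table and hypothetical index and view names (IX_proc_date, v_leaf_procedures); this is an illustration, not our production DDL:

```sql
-- Hypothetical names throughout: scope the dataset's source to the 10-year
-- window so the optimizer can seek an index instead of scanning ~240M rows.
CREATE NONCLUSTERED INDEX IX_proc_date
    ON dbo.procedure_occurrence (procedure_date)
    INCLUDE (person_id, procedure_concept_id);
GO

-- Point the Leaf dataset query at a view that applies the date filter up front.
CREATE OR ALTER VIEW dbo.v_leaf_procedures AS
SELECT person_id, procedure_concept_id, procedure_date
FROM dbo.procedure_occurrence
WHERE procedure_date >= DATEADD(YEAR, -10, CAST(GETDATE() AS date));
GO
```

Filtering inside the view means the dataset query never touches the older rows, and the covering index lets the optimizer seek on procedure_date rather than scan the full table.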
Are other sites using datasets this large? Can other sites share general experiences (or specific experiences with dataset configuration) related to overcoming similar challenges?