Challenge of large datasets -- how does this look at different sites? #531
-
Hi Eddie, At Sinai we use an OMOP/OHDSI DBMS. We have 10+ million patients and many years of data; e.g., the measurement table alone contains almost 1B records. Achieving satisfactory performance has been an ongoing challenge. I believe we have a good plan for how to enhance performance, but I've not had time to execute it. The basic approach is two-fold:
This outlines our plan for the first; these are our planned steps:
Sorry, I don't have time to write more now, but I'd be happy to chat with you and your team about it. Arthur
-
Thank you, Arthur, this is a great outline. Yes, I will definitely reach out to you in the future to discuss these issues in greater detail. Best, Eddie
-
When we attempt to add the dataset to the cohort patient list using the "add more data" feature, it keeps loading and fails to show any data. From the app server log, it appears that a query selects all data from one clinical domain using a WITH statement; with no limit on the number of records, it pulls every record in the domain. We believe this is causing the issue. Our SQL Server's maximum memory is currently 26 GB. Can someone confirm whether the backend query uses SQL Server's memory or its storage resources? If it uses memory, the query result definitely exceeded our total memory size.
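To the memory question: SQL Server reads data pages through its in-memory buffer pool, and sorts and hash joins take memory grants that spill to tempdb on disk when exceeded, so an unbounded query pressures both; the result set itself streams to the app server rather than being held entirely in server memory. Below is a minimal sketch of the pattern described, using hypothetical table names (dbo.observation, dbo.cohort_member) rather than Leaf's actual schema:

```sql
-- Hypothetical sketch of the pattern described above, NOT Leaf's actual query.
-- Unbounded form: the CTE joins the cohort to an entire clinical domain,
-- so every matching record is materialized and sent to the app server.
WITH cohort_rows AS (
    SELECT o.person_id, o.observation_date, o.value_as_string
    FROM dbo.observation AS o
    JOIN dbo.cohort_member AS c ON c.person_id = o.person_id
)
SELECT * FROM cohort_rows;

-- Bounded form: capping the result set keeps the memory grant and the
-- payload sent back to the app server small.
WITH cohort_rows AS (
    SELECT o.person_id, o.observation_date, o.value_as_string
    FROM dbo.observation AS o
    JOIN dbo.cohort_member AS c ON c.person_id = o.person_id
)
SELECT TOP (50000) *
FROM cohort_rows
ORDER BY observation_date DESC;  -- most recent records first
```

The TOP clause is only an illustration of bounding the working set; the appropriate cap would depend on what the "add more data" view actually needs.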
-
Here at CU, we have about 4 million patients, and we want our instance of Leaf to have access to 10 years of data. In addition, we host our instance of Leaf in the cloud. We've run up against query-timeout issues that we believe may be due to the sheer quantity of data we're querying. For example, while configuring datasets in Leaf, the "procedures" concept covers about 240 million records; this query runs for over an hour, and with the app server and our MSSQL DB communicating for that long, the pipeline shuts down. We are going to limit the quantity of data and attempt to optimize this function.
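As a sketch of the kind of restriction we have in mind, assuming an OMOP-style dbo.procedure_occurrence table and hypothetical index and view names (IX_proc_date, v_leaf_procedures); this is an illustration, not our production DDL:

```sql
-- Hypothetical names throughout: scope the dataset's source to the 10-year
-- window so the optimizer can seek an index instead of scanning ~240M rows.
CREATE NONCLUSTERED INDEX IX_proc_date
    ON dbo.procedure_occurrence (procedure_date)
    INCLUDE (person_id, procedure_concept_id);
GO

-- Point the Leaf dataset query at a view that applies the date filter up front.
CREATE OR ALTER VIEW dbo.v_leaf_procedures AS
SELECT person_id, procedure_concept_id, procedure_date
FROM dbo.procedure_occurrence
WHERE procedure_date >= DATEADD(YEAR, -10, CAST(GETDATE() AS date));
GO
```

Filtering inside the view means the dataset query never touches the older rows, and the covering index lets the optimizer seek on procedure_date rather than scan the full table.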
Are other sites using datasets this large? Can other sites share general experiences (or specific experiences with dataset configuration) related to overcoming similar challenges?