Handle memory limits dynamically #145

Open
EvanDietzMorris opened this issue Dec 12, 2022 · 1 comment

Comments

@EvanDietzMorris
Contributor

There are two places in the pipeline where we merge nodes and edges, removing duplicates and consolidating properties: during normalization of a data source, and during graph building, where multiple data sources are combined. These merges have historically caused memory issues because many edges or nodes are held in memory at once. To avoid memory issues, a hybrid on-disk merging technique was implemented.
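
For context, a minimal sketch of what the in-memory merge does conceptually, assuming records are plain dicts keyed by `id` (the names and merge rule here are illustrative, not the pipeline's actual code):

```python
# Illustrative sketch only -- not the pipeline's actual merge implementation.
# Duplicate nodes (same "id") are collapsed into one record, and properties
# from later duplicates are consolidated into the existing record.
def merge_nodes_in_memory(nodes):
    merged = {}
    for node in nodes:
        existing = merged.get(node["id"])
        if existing is None:
            merged[node["id"]] = dict(node)
        else:
            for key, value in node.items():
                # keep the first value seen for a property (illustrative rule)
                existing.setdefault(key, value)
    return list(merged.values())
```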

As of now, whether the on-disk merging technique is used is determined by a list of data sources considered "resource hogs". This is not perfect, because merging a large number of small data sources could still consume too much memory. It is also complicated by the inclusion of subgraphs, which may or may not include resource hogs. Additionally, there is currently no way to specify the amount of available RAM, or to have the set of sources considered hogs adapt to that configuration.

We should instead implement an environment variable specifying the amount of available memory and dynamically determine, before each merge, whether the specified amount of RAM is sufficient for merging in memory or whether merging on disk is required. This could be done by eliminating the resource hog list and instead checking the metadata available at the time for the sources being used, as in the sketch below.
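
A minimal sketch of what that check could look like, assuming the available memory is read from an environment variable and that each source's metadata can report an estimated size for the records being merged (the variable name, metadata key, and safety factor are all hypothetical):

```python
import os

# Hypothetical environment variable holding available memory in GB.
AVAILABLE_MEMORY_GB = float(os.environ.get("AVAILABLE_MEMORY_GB", "8"))

# Assume merging needs roughly this multiple of the raw data size (illustrative).
SAFETY_FACTOR = 2.0

def should_merge_in_memory(source_metadata_list):
    # Sum the estimated sizes (in GB) of every source involved in this merge,
    # using whatever size estimate the metadata provides (key is hypothetical).
    estimated_total_gb = sum(meta.get("estimated_size_gb", 0.0)
                             for meta in source_metadata_list)
    return estimated_total_gb * SAFETY_FACTOR <= AVAILABLE_MEMORY_GB
```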

@EvanDietzMorris
Contributor Author

EvanDietzMorris commented Apr 21, 2023

Additionally, the in-memory merging technique converts node and edge objects to JSON strings during the merging process to conserve memory, but data sources with a very high number of merges have made it apparent that this technique is unacceptably slow. The JSON strategy should be removed when we implement a dynamic way to recognize when too much memory would be consumed.
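
To illustrate the cost, a hedged sketch of the two approaches (the merge rule is made up for illustration): the JSON-string variant pays for a `loads()` and a `dumps()` on every merge, while keeping plain dicts avoids that per-merge overhead at the cost of holding larger objects in memory.

```python
import json

# Storing records as JSON strings saves memory, but every merge must
# deserialize both records and re-serialize the result.
def merge_json_strings(existing_json, new_json):
    existing = json.loads(existing_json)
    new = json.loads(new_json)
    for key, value in new.items():
        existing.setdefault(key, value)
    return json.dumps(existing)

# Keeping plain dicts skips the serialization round-trip on each merge.
def merge_dicts(existing, new):
    for key, value in new.items():
        existing.setdefault(key, value)
    return existing
```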
