There are two places in the pipeline where we merge nodes and edges, removing duplicates and consolidating properties: during normalization of a single data source, and during graph building, where multiple data sources are combined. These are the steps that have historically caused memory issues when many edges or nodes are held in memory at once. To avoid those issues, a hybrid on-disk merging technique was implemented.
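To make the merge step concrete, here is a minimal Python sketch of the kind of merge described above: deduplicating nodes by id and consolidating the properties of duplicates onto a single record. The function name and the dict-shaped node records are illustrative assumptions, not the pipeline's actual API.

```python
def merge_nodes(nodes):
    # Illustrative sketch only; not the pipeline's real merge implementation.
    merged = {}
    for node in nodes:
        existing = merged.get(node["id"])
        if existing is None:
            merged[node["id"]] = dict(node)
        else:
            for key, value in node.items():
                if key == "id":
                    continue
                if isinstance(existing.get(key), list) and isinstance(value, list):
                    # union list-valued properties, preserving order
                    existing[key] = list(dict.fromkeys(existing[key] + value))
                else:
                    # keep the first value seen for scalar properties
                    existing.setdefault(key, value)
    return list(merged.values())
```

Note that the `merged` dict holds every distinct node in memory until the merge finishes, which is exactly why large sources cause trouble.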
Currently, whether the on-disk merging technique is used is determined by a list of data sources considered "resource hogs". This is imperfect: merging a large number of small data sources can still consume too much memory. It is further complicated by the inclusion of subgraphs, which may or may not include resource hogs. Finally, there is no way to specify the amount of available RAM, or to have the set of sources treated as hogs adapt to that configuration.
We should instead implement an environment variable specifying the amount of available memory and, before each merge, dynamically determine whether that amount of RAM is sufficient to merge in memory or whether we should fall back to merging on disk. This could be done by eliminating the resource hog list and instead checking the metadata available at merge time for the sources involved.
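A rough sketch of what that check could look like. The `AVAILABLE_MEMORY_GB` variable name, the `record_count` metadata field, and the per-record size estimate are all hypothetical placeholders; none of them exist in the codebase yet.

```python
import os

# Hypothetical env var giving the RAM budget for merges (assumption, not an
# existing setting); defaults to 8 GB if unset.
AVAILABLE_MEMORY_BYTES = int(float(os.environ.get("AVAILABLE_MEMORY_GB", "8")) * 1024 ** 3)

# Rough placeholder estimate of the in-memory footprint of one node/edge.
ESTIMATED_BYTES_PER_RECORD = 1024

def should_merge_on_disk(source_metadata_list):
    """Decide before a merge whether the estimated footprint exceeds the RAM budget."""
    estimated_bytes = sum(
        meta["record_count"] * ESTIMATED_BYTES_PER_RECORD
        for meta in source_metadata_list
    )
    return estimated_bytes > AVAILABLE_MEMORY_BYTES
```

The caller would then pick the merge path per invocation, e.g. `merge_on_disk(sources) if should_merge_on_disk(metadata) else merge_in_memory(sources)`, removing any need for a hard-coded hog list.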
Additionally, the in-memory merging technique converts node and edge objects to JSON strings during the merging process to conserve memory, but data sources with a very high number of merges have made it apparent that this technique is unacceptably slow. The JSON strategy should be removed when we implement a dynamic way to recognize when too much memory would be consumed.
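For illustration, the JSON-string strategy looks something like the sketch below (names are assumptions, not the actual code): storing each record as a compact JSON string avoids the per-dict overhead of Python objects, but every merge then pays a full deserialize/re-serialize round trip, which dominates runtime for sources where the same id is merged many times.

```python
import json

def merge_into(stored_json, incoming):
    # Illustrative sketch of the JSON-string approach described above.
    existing = json.loads(stored_json)  # deserialize on every merge
    existing.update(incoming)           # consolidate properties
    return json.dumps(existing)         # re-serialize on every merge
```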