Optimize pattern frequency calculation #26

Open
ngeiswei opened this issue Jan 22, 2020 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@ngeiswei
Member

Problem

Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all matches of the pattern and dividing their count by the universe count. Such enumeration is costly, especially in RAM. On a real-world dataset, such as those used in

https://github.com/opencog/miner/tree/master/examples/miner/mozi-ai

or

https://github.com/ngeiswei/reasoning-bio-as-xp

it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such an estimate can be very wrong though, leading to under- or over-subsampling, and thus to inaccuracies or RAM explosions.
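To make the trade-off concrete, here is a minimal sketch in plain Python, not the miner's actual code. It assumes a toy model where the dataset is a list and a pattern is a predicate; the names `exact_frequency` and `subsampled_frequency` are hypothetical.

```python
import random

def exact_frequency(dataset, matches):
    """Enumerate every match, then divide by the universe count."""
    # The full list of matches is materialized, which is where the RAM goes.
    matched = [item for item in dataset if matches(item)]
    return len(matched) / len(dataset)

def subsampled_frequency(dataset, matches, sample_size, rng):
    """Estimate the frequency on a bootstrap sample (drawn with replacement)."""
    sample = rng.choices(dataset, k=sample_size)
    # Only a counter is kept, but the result is an estimate, not the exact value.
    return sum(1 for item in sample if matches(item)) / sample_size
```

If `sample_size` is too small the estimate is noisy, and if it is too large the cost approaches that of the exact enumeration, which is the under- or over-subsampling problem described above.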

Solutions

  1. Improve the subsampling/bootstrapping mechanism, maybe auto-tuned via binary search, etc.
  2. Introduce a dedicated pattern matcher callback that takes less memory, maybe only saving the atom hashes rather than the atoms themselves, or maybe saving nothing at all but still somehow guaranteeing not to recount matches.
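
One possible reading of the first option, sketched in plain Python under the same toy assumptions (a list as dataset, a predicate as pattern): binary-search the largest subsample whose match count stays within a budget. The function name and the `max_matches` budget are hypothetical and not part of the miner.

```python
def largest_affordable_sample(shuffled, matches, max_matches):
    """Return the largest prefix length whose match count fits the budget.

    `shuffled` is assumed to be the dataset in random order, so that any
    prefix is a uniform subsample. The match count of a prefix never
    decreases as the prefix grows, which is what makes binary search valid.
    """
    def match_count(n):
        return sum(1 for item in shuffled[:n] if matches(item))

    lo, hi = 0, len(shuffled)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always progresses
        if match_count(mid) <= max_matches:
            lo = mid              # mid is affordable, try larger
        else:
            hi = mid - 1          # mid is too expensive, try smaller
    return lo
```

Each probe recounts matches on a prefix, so this trades extra matching time for a bounded memory footprint; whether that is acceptable depends on how expensive a single matching pass is.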
@ngeiswei ngeiswei self-assigned this Jan 22, 2020
@ngeiswei ngeiswei added the enhancement New feature or request label Jan 22, 2020
@ngeiswei
Member Author

Bootstrapping seems to work fairly well; however, it is still too slow for large data sets, so introducing a dedicated pattern matcher callback could be welcome.
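
As an illustration of what such a callback could do, here is a minimal sketch in plain Python rather than the pattern matcher's real callback interface. The class `HashCountingCallback` is hypothetical; it assumes each match can be rendered with `repr` and keeps only an 8-byte digest per distinct match.

```python
import hashlib

class HashCountingCallback:
    """Count distinct matches while storing only fixed-size hashes."""

    def __init__(self):
        self._seen = set()

    def __call__(self, match):
        # Keep an 8-byte digest instead of the match itself.
        digest = hashlib.blake2b(repr(match).encode(), digest_size=8).digest()
        self._seen.add(digest)  # a repeated match adds nothing new

    @property
    def count(self):
        return len(self._seen)
```

Usage would be to call the object once per match and read `count` at the end. Memory then grows with the number of distinct matches times the digest size, not with the size of the matches, at the price of a small chance of undercounting when two different matches hash to the same digest.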

ngeiswei added a commit to ngeiswei/miner that referenced this issue Jan 28, 2021
Ignore test_ignore_var_2() till it gets fixed