Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify and remove redundant component class generation #90

Open
gaurav opened this issue Mar 17, 2021 · 1 comment
Open

Identify and remove redundant component class generation #90

gaurav opened this issue Mar 17, 2021 · 1 comment

Comments

@gaurav
Copy link
Member

gaurav commented Mar 17, 2021

In order to figure out the complexity of the minimum phyloref logical expression generation algorithm, I counted up the number of component classes being produced for a given number of internal specifiers (n) from n=2 to n=8 and worked through the algorithm to figure out equations for determining that count. There are two kinds of component classes:

  • Direct subclasses: Some component classes are direct subclasses of a phyloreference, and represent one topology that would satisfy that phyloreference. The number of direct subclasses is in the general form: 6(5(4(3+1)+1)+1) (for n=6), although that expression varies a bit depending on whether n is odd or even.
  • Cache classes: Other component classes are used to store a part of a complex logical expression (for example, when n=4, we generate four cache classes for A&B&C, A&B&D, A&C&D and B&C&D). These can then be reused by other component classes, reducing the complexity of the logical expressions being generated. The number of cache classes will always be equal to the sum of n choose k where k goes from (n-1) to 3. For example, this would be 6 choose 5 + 6 choose 4 + 6 choose 3 for n=6.

In running this analysis, I found that we currently generate redundant logical expressions in two ways:

  1. In most cases, we generate the same logical expression in two different ways: for example, when n=4, we generate both has_Child(<1&2&3> and excludes_lineage_to(test4)) and has_Child(excludes_lineage_to(test4) and <1&2&3>).
  2. In other cases, we generate two logical expressions that a human can identify as identical, but that a computer cannot: for example, when n=4, we generate both has_Child(excludes_lineage_to(has_Child(test4 ^ test1)) and has_Child(test3 ^ test2)) and has_Child(excludes_lineage_to(has_Child(test3 ^ test2)) and has_Child(test4 ^ test1)). Since it doesn't matter which branch is excluded and which is included, these two expressions will always resolve to the same node, and so they are redundant with each other.

At some point, it would be nice to dig into the algorithm and figure out why these redundant logical expressions are being generated. This will probably also allow us to support phyloreferences with more than eight internal specifiers, since the limiting factor is probably the large number of component classes that need to be generated.

@gaurav
Copy link
Member Author

gaurav commented Mar 19, 2021

Reducing the number of component classes will have an additional benefit: phyx.js currently defines every phyloreference from a Phyx file as a direct subclass of phyloref:Phyloreference, mainly for simplicity in debugging: this causes all phyloreferences to appear as direct children of phyloref:Phyloreference in Protege, making them easier to find. If we can reduce the number of component classes, this should become much more manageable and would allow us to remove that direct subclass relationship.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant