Identify and remove redundant component class generation #90

gaurav · 2021-03-17T20:21:51Z

In order to figure out the complexity of the minimum phyloref logical expression generation algorithm, I counted up the number of component classes being produced for a given number of internal specifiers (n) from n=2 to n=8 and worked through the algorithm to figure out equations for determining that count. There are two kinds of component classes:

Direct subclasses: Some component classes are direct subclasses of a phyloreference, and represent one topology that would satisfy that phyloreference. The number of direct subclasses is in the general form: 6(5(4(3+1)+1)+1) (for n=6), although that expression varies a bit depending on whether n is odd or even.
Cache classes: Other component classes are used to store a part of a complex logical expression (for example, when n=4, we generate four cache classes for A&B&C, A&B&D, A&C&D and B&C&D). These can then be reused by other component classes, reducing the complexity of the logical expressions being generated. The number of cache classes will always be equal to the sum of n choose k where k goes from (n-1) to 3. For example, this would be 6 choose 5 + 6 choose 4 + 6 choose 3 for n=6.

In running this analysis, I found that we currently generate redundant logical expressions in two ways:

In most cases, we generate the same logical expression in two different ways: for example, when n=4, we generate both has_Child(<1&2&3> and excludes_lineage_to(test4)) and has_Child(excludes_lineage_to(test4) and <1&2&3>).
In other cases, we generate two logical expressions that a human can identify as identical, but that a computer cannot: for example, when n=4, we generate both has_Child(excludes_lineage_to(has_Child(test4 ^ test1)) and has_Child(test3 ^ test2)) and has_Child(excludes_lineage_to(has_Child(test3 ^ test2)) and has_Child(test4 ^ test1)). Since it doesn't matter which branch is excluded and which is included, these two expressions will always resolve to the same node, and so they are redundant with each other.

At some point, it would be nice to dig into the algorithm and figure out why these redundant logical expressions are being generated. This will probably also allow us to support phyloreferences with more than eight internal specifiers, since the limiting factor is probably the large number of component classes that need to be generated.

The text was updated successfully, but these errors were encountered:

gaurav · 2021-03-19T00:11:50Z

Reducing the number of component classes will have an additional benefit: phyx.js currently defines every phyloreference from a Phyx file as a direct subclass of phyloref:Phyloreference, mainly for simplicity in debugging: this causes all phyloreferences to appear as direct children of phyloref:Phyloreference in Protege, making them easier to find. If we can reduce the number of component classes, this should become much more manageable and would allow us to remove that direct subclass relationship.

gaurav mentioned this issue May 24, 2023

Replace CDAO:has_Ancestor expression with phyloref:has_outside_TU #83

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identify and remove redundant component class generation #90

Identify and remove redundant component class generation #90

gaurav commented Mar 17, 2021

gaurav commented Mar 19, 2021

Identify and remove redundant component class generation #90

Identify and remove redundant component class generation #90

Comments

gaurav commented Mar 17, 2021

gaurav commented Mar 19, 2021