Make selection of arbitrary baselines for DEA deterministic and/or add more rules. #1290

ppavlidis · 2024-11-14T23:35:17Z

For data sets without a "natural" baseline like "control", we pick randomly for the purposes of DEA.

In my review of the code, this seems non-deterministic. To increase the consistency of comparisons, I suggest we at least make sure that reanalysis of the same data set always yields the same baseline. Picking based on something like alphabetic order of the FV Subject value would work (picking by FV ID would also work, but wouldn't be consistent across data sets).

In some cases we might want to add some more rules. I'm also thinking about use cases like biological sex, where male or female might be picked as the baseline - we might as well be consistent. That is easily implemented as a rule.

This would somewhat ease comparisons across data sets by reducing any necessity of ensuring baselines are consistent.

Since biological sex is a common one, we can at least be consistent.
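A minimal sketch of the rule-then-alphabetical idea described above, in Python for brevity. The function name `choose_baseline`, the contents of `CONTROL_TERMS`, and the choice of "female" as the sex baseline are all assumptions made for illustration; the thread does not settle which sex should be the baseline, only that the choice should be consistent.

```python
from typing import Iterable, Optional

# Hypothetical rule tables; real lists would come from curation practice.
CONTROL_TERMS = {"control", "reference"}
SEX_BASELINE = "female"  # arbitrary pick for this sketch, not decided in the thread


def choose_baseline(values: Iterable[str]) -> Optional[str]:
    """Pick a baseline label deterministically from factor value labels."""
    # Normalize labels so that case and stray whitespace do not affect the result.
    by_key = {v.strip().lower(): v for v in values}
    if not by_key:
        return None
    # Rule 1: a "natural" baseline such as "control" wins.
    for term in sorted(CONTROL_TERMS):
        if term in by_key:
            return by_key[term]
    # Rule 2: biological sex always uses the same level as baseline.
    if set(by_key) == {"male", "female"}:
        return by_key[SEX_BASELINE]
    # Fallback: alphabetical order of the normalized label instead of a random pick.
    return by_key[min(by_key)]
```

Because the fallback depends only on the labels and not on iteration order, reanalysis of the same data set gives the same baseline as long as the labels are unchanged.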

neerapatadia · 2024-11-15T01:09:24Z

Some additional scenarios to consider:

Timepoints: ensure that the earlier time point is listed as the baseline condition.

Examples:
0h vs 4h
Week 1 vs Week 4

Disease Staging: Studies that perform comparisons between disease progression stages.

Examples: 
Stage 1 vs Stage 2
Grade 1 vs Grade 2

We would want to make sure that the ordering of successive disease stages is kept consistent (so stage 2 goes before stage 3 and so on). It would also be useful to have a consistent numbering convention for these conditions, as there are cases where Roman numerals or letters (A, B, C, ...) are used instead of Arabic numerals. Personally, I think it would make sense to stick to Arabic numerals.

Developmental Stages: Comparisons across different stages of development that do not include specific time points.

Examples: 
juvenile stage versus adult stage
prime adult stage versus late adult stage

I think it would be best to keep these in logical temporal order, using juvenile as the baseline and adult as the condition, or "early"/"prime" versus "late".
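The disease staging suggestion could be sketched as follows: map each "Stage"/"Grade" label to an integer rank regardless of whether it uses Arabic numerals, Roman numerals, or letters, then sort by rank. All names here (`stage_rank`, `order_stages`) are hypothetical. The sketch assumes that I, V, and X are always Roman numerals rather than letters, and it does not validate malformed numerals.

```python
import re
from typing import List, Optional

ROMAN = {"I": 1, "V": 5, "X": 10}
STAGE_RE = re.compile(r"^\s*(?:stage|grade)\s+(\w+)\s*$", re.IGNORECASE)


def roman_to_int(text: str) -> int:
    """Convert a numeral made only of I, V, X (no validation of malformed input)."""
    values = [ROMAN[ch] for ch in text]
    total = 0
    for i, v in enumerate(values):
        # Subtract when a smaller numeral precedes a larger one (IV = 4).
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total


def stage_rank(label: str) -> Optional[int]:
    """Return the numeric rank of a 'Stage N' / 'Grade N' label, or None."""
    m = STAGE_RE.match(label)
    if m is None:
        return None
    token = m.group(1).upper()
    if token.isascii() and token.isdigit():
        return int(token)
    if all(ch in ROMAN for ch in token):
        return roman_to_int(token)  # assumption: I, V, X are numerals, not letters
    if len(token) == 1 and "A" <= token <= "Z":
        return ord(token) - ord("A") + 1  # A=1, B=2, ...
    return None


def order_stages(labels: List[str]) -> List[str]:
    """Sort labels by rank; the first element would be the baseline."""
    ranked = [(stage_rank(label), label) for label in labels]
    if any(rank is None for rank, _ in ranked):
        raise ValueError("cannot rank every label; leave the ordering to a curator")
    return [label for _, label in sorted(ranked)]
```

Developmental stages such as "juvenile stage" have no numeric component, so they would need an explicit curated ordering rather than this kind of parsing.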
   

arteymix · 2024-11-20T18:51:21Z

May I suggest we just produce an error when the baseline cannot be determined? A curator can always step in and explicitly identify the baseline.

For timepoints, there might be enough consistency to parse the value with regular expressions and perform a numerical comparison.

In the dev branch, I'm using prettytime to parse user-supplied dates on the CLI. That might be applicable too.

ppavlidis · 2024-11-20T19:25:46Z

There would be hundreds of "errors". I don't think that would be helpful: it would just add to curator burden, and there wouldn't be much they could do about it without better software support.

Providing more support for manual ordering in the UI might be useful, though it would also need data model support, as categorical factor values don't have an "order" (other than the baseline). Currently, AFAIK, the UI doesn't even support manually choosing baselines, which the backend does support.

So I'm going to stick by my original suggestion: we can add more hard-coded rules/heuristics for choosing a baseline and an ordering. As you say, we can parse time points like "1 d" (and leverage the category = timepoint), since we have tried to use a consistent format and units, as well as "grade 1", "grade 2", etc.

The first step is to enumerate such cases, do some harmonization of the curation if needed, and then code the rules. I believe there are three predominant types of EFs where this would be most relevant.

select c.CATEGORY, count(c.ID)
from CHARACTERISTIC c
inner join EXPERIMENTAL_FACTOR ef on ef.CATEGORY_FK = c.ID
where c.CATEGORY in ('Timepoint', 'Developmental stage', 'Disease staging')
  and ef.TYPE = 'categorical'
group by c.CATEGORY;

+---------------------+-------------+
| CATEGORY            | count(c.ID) |
+---------------------+-------------+
| developmental stage |         983 |
| disease staging     |         183 |
| timepoint           |        2990 |
+---------------------+-------------+

arteymix · 2024-11-20T20:04:08Z

Looking at top timepoint values:

select distinct VALUE, COUNT(*)
from CHARACTERISTIC
where CATEGORY = 'timepoint'
group by VALUE
order by COUNT(*) desc

Those are relatively easy to parse with something like (\d+(\.\d+)?)\s*(d|h|m|s).
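As a rough illustration of the regex-plus-numeric-comparison idea, here is a Python sketch that converts a timepoint label to seconds and picks the earliest one as the baseline. The function names are hypothetical, and reading "m" as minutes (rather than months) is an assumption; labels that do not match, such as "Week 1", are left unparsed rather than guessed.

```python
import re
from typing import List, Optional

# Same shape as the pattern above, anchored and with the decimal point escaped.
TIMEPOINT_RE = re.compile(r"^\s*(\d+(\.\d+)?)\s*(d|h|m|s)\s*$")
SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}  # assumes "m" = minutes


def timepoint_seconds(value: str) -> Optional[float]:
    """Convert a label like '48 h' to seconds, or None if it does not parse."""
    match = TIMEPOINT_RE.match(value)
    if match is None:
        return None
    return float(match.group(1)) * SECONDS_PER_UNIT[match.group(3)]


def earliest_timepoint(values: List[str]) -> Optional[str]:
    """Return the earliest label, or None if any label cannot be parsed."""
    parsed = [(timepoint_seconds(v), v) for v in values]
    if not parsed or any(seconds is None for seconds, _ in parsed):
        return None
    return min(parsed)[1]
```

Allowing optional whitespace between number and unit means "48h" and "48 h" compare as equal, though harmonizing the curated values would still be cleaner.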

ppavlidis · 2024-11-20T20:13:36Z

Yep, you can see the need for more harmonization there too (e.g. 48h vs 48 h; we'll move discussion of that to the curator workspace).

arteymix · 2024-11-22T00:53:45Z

For determinism, IDs are the best because everything else can essentially be changed in the curator interface.

Sorting by ID might also be something we should do when generating VOs so we can also see it in the UI.
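To make the argument for IDs concrete, here is a small Python sketch with a hypothetical `FactorValue` record (not the actual data model). It shows that an ID-based fallback is unaffected when a curator edits a label, whereas a label-based ordering can change.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FactorValue:
    # Hypothetical stand-in for a factor value; field names are assumptions.
    id: int
    value: str


def fallback_baseline(factor_values: List[FactorValue]) -> FactorValue:
    """Pick the factor value with the lowest ID when no rule applies."""
    return min(factor_values, key=lambda fv: fv.id)


def ordered_for_display(factor_values: List[FactorValue]) -> List[FactorValue]:
    """Sort by ID so the UI shows the same order the analysis would use."""
    return sorted(factor_values, key=lambda fv: fv.id)


# A curator renames "liver" to "adrenal": alphabetical order changes, ID order does not.
before = [FactorValue(12, "liver"), FactorValue(7, "kidney")]
after = [FactorValue(12, "adrenal"), FactorValue(7, "kidney")]
same_baseline = fallback_baseline(before).id == fallback_baseline(after).id
```

The trade-off raised earlier in the thread still applies: IDs are stable within a data set but carry no meaning across data sets.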

ppavlidis added the enhancement label Nov 14, 2024

ppavlidis added a commit that referenced this issue Nov 15, 2024

partially address #1290

728f829

Since biological sex is a common one, we can at least be consistent.
