Make selection of arbitrary baselines for DEA deterministic and/or add more rules. #1290

Open
ppavlidis opened this issue Nov 14, 2024 · 6 comments
Labels
enhancement Enhance the code or user experience

Comments

@ppavlidis
Collaborator

For data sets without a "natural" baseline like "control", we pick randomly for the purposes of DEA.

In my review of the code, this seems non-deterministic. To increase consistency of comparisons, I suggest we at least make sure that reanalysis of the same data set always yields the same baseline. Picking based on something like alphabetical order of the FV Subject value would work (picking by FV ID would also work, but wouldn't be consistent across data sets).
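To illustrate the alphabetical option, here is a minimal Python sketch. It is not the project's code: the `FactorValue` class, its `subject` field and the `pick_baseline` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorValue:
    """Hypothetical stand-in for a factor value (FV); not the project's class."""
    id: int
    subject: str


def pick_baseline(factor_values):
    """Return the factor value whose subject sorts first, ignoring case.

    The key (casefolded subject, raw subject, id) is a total order, so the
    result does not depend on the order in which the values are supplied.
    """
    if not factor_values:
        raise ValueError("no factor values to choose from")
    return min(factor_values, key=lambda fv: (fv.subject.casefold(), fv.subject, fv.id))


fvs = [FactorValue(12, "strain B"), FactorValue(7, "Strain A"), FactorValue(9, "strain C")]
# Shuffling the input must not change the pick.
assert pick_baseline(fvs) == pick_baseline(list(reversed(fvs)))
```

Because the key depends only on the label, two data sets with the same labels get the same baseline, which an ID-based pick would not guarantee.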

In some cases we might want to add some more rules. I'm also thinking about use cases like biological sex, where male or female might be picked as the baseline - we might as well be consistent. That is easily implemented as a rule.
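A rule of that kind could look roughly like the Python sketch below. The rule table and function name are hypothetical, and the thread does not say which sex should be the baseline; "female" is a placeholder chosen only so the example runs.

```python
# Hypothetical rule table: factor category -> label to prefer as baseline.
# ASSUMPTION: "female" is a placeholder; the issue does not state which sex to use.
PREFERRED_BASELINE = {"biological sex": "female"}


def pick_baseline_by_rule(category, labels):
    """Use the hard-coded preference for this category if one of the labels
    matches it (ignoring case); otherwise pick the alphabetically first label."""
    preferred = PREFERRED_BASELINE.get(category.casefold())
    by_folded = {label.casefold(): label for label in sorted(labels)}
    if preferred in by_folded:
        return by_folded[preferred]
    return min(labels, key=lambda s: (s.casefold(), s))


# With the placeholder rule, the label matching "female" is chosen regardless of order.
assert pick_baseline_by_rule("biological sex", ["male", "Female"]) == "Female"
```

Keeping the preferences in a table rather than in branching code would let new categories be added without touching the selection logic.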

This would somewhat ease comparisons across data sets by reducing the need to check that baselines are consistent.

@ppavlidis ppavlidis added the enhancement Enhance the code or user experience label Nov 14, 2024
ppavlidis added a commit that referenced this issue Nov 15, 2024
Since biological sex is a common one, we can at least be consistent.
@neerapatadia

Some additional scenarios to consider:

  1. Timepoints: ensure that the earlier time point is listed as the baseline condition.
     Examples:
     0h vs 4h
     Week 1 vs Week 4

  2. Disease Staging: studies that compare disease progression stages.
     Examples:
     Stage 1 vs Stage 2
     Grade 1 vs Grade 2

     We would want to make sure that the ordering of successive disease stages is kept consistent (so stage 2 goes before stage 3, and so on). It would also be useful to have a consistent number convention for these conditions, as there are cases where Roman numerals or letters (A, B, C, ...) are used instead of Arabic numerals. Personally, I think it would make sense to stick to Arabic numerals.

  3. Developmental Stages: comparisons across different stages of development that do not include specific time points.
     Examples:
     juvenile stage versus adult stage
     prime adult stage versus late adult stage

     I think it would be best to keep this ordered in terms of logical time points, so using juvenile as the baseline and adult as the condition, or “early”/“prime” versus “late”.
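For the disease staging case, converting Roman numerals and letters to Arabic numerals could be sketched as follows in Python. All names here are hypothetical, and the pattern makes assumptions that would need checking against real curated values: only the prefixes "stage" and "grade", Roman numerals built from I, V and X, and letters A to E.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10}

# ASSUMPTION: labels look like "<stage|grade> <number>", nothing else.
STAGE = re.compile(
    r"^\s*(?P<kind>stage|grade)\s+"
    r"(?:(?P<arabic>\d+)|(?P<roman>[ivx]+)|(?P<letter>[a-e]))\s*$",
    re.IGNORECASE,
)


def roman_to_int(text):
    """Convert a lowercase Roman numeral (i, v, x only) using subtractive
    notation. Malformed numerals are not rejected."""
    values = [ROMAN[ch] for ch in text]
    total = 0
    for cur, nxt in zip(values, values[1:] + [0]):
        total += -cur if cur < nxt else cur
    return total


def stage_rank(label):
    """Return (kind, number) for a recognised label, else None."""
    m = STAGE.match(label)
    if m is None:
        return None
    kind = m.group("kind").lower()
    if m.group("arabic"):
        return kind, int(m.group("arabic"))
    if m.group("roman"):
        return kind, roman_to_int(m.group("roman").lower())
    return kind, ord(m.group("letter").lower()) - ord("a") + 1


def normalise(label):
    """Rewrite a recognised label with an Arabic numeral; leave others untouched."""
    rank = stage_rank(label)
    return label if rank is None else f"{rank[0].capitalize()} {rank[1]}"


assert normalise("Stage IV") == "Stage 4"
assert sorted(["Stage III", "stage 1", "Stage II"], key=stage_rank) == ["stage 1", "Stage II", "Stage III"]
```

The same `stage_rank` key gives both the ordering and the baseline (the lowest rank). Developmental stages such as juvenile versus adult carry no number, so they would need an explicit ordered list instead.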


@arteymix
Member

May I suggest we just produce an error when the baseline cannot be determined? A curator can always step in and explicitly identify the baseline.

For timepoint, there might be enough consistency to parse the value with regular expressions and perform a numerical comparison.

In the dev branch, I'm using prettytime to parse user-supplied dates on the CLI. That might be applicable too.

@ppavlidis
Collaborator Author

There would be hundreds of "errors". I don't think that would be helpful; it would just add to curator burden, and there wouldn't be much they could do about it without better software support.

Providing more support for manual ordering in the UI might be useful, though it would also need data model support, as categorical factor values don't have an "order" (other than the baseline). Currently, AFAIK, the UI doesn't even support manually choosing baselines, which the backend does support.

So I'm going to stick by my original suggestion: we can add more hard-coded rules/heuristics for choosing a baseline and an ordering. As you say, we can parse time points like "1 d" (and leverage the category = timepoint), since we have tried to use a consistent format and units, as well as "grade 1", "grade 2", etc.
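One way such hard-coded rules could be organised is a registry keyed by factor category, with an alphabetical fallback instead of an error when no rule applies. The Python sketch below is only an illustration: `ORDERING_RULES`, `grade_key` and `order_levels` are hypothetical names, and only a "grade N" rule is shown; a timepoint parser would be registered the same way.

```python
import re


def grade_key(label):
    """Parse labels like 'grade 2' into the integer 2; None if the label does not match."""
    m = re.fullmatch(r"grade\s+(\d+)", label.strip(), re.IGNORECASE)
    return int(m.group(1)) if m else None


# Hypothetical registry: category name -> function turning a label into a sort key.
ORDERING_RULES = {"disease staging": grade_key}


def order_levels(category, labels):
    """Order labels with the category's rule; fall back to alphabetical order
    when there is no rule or when any label fails to parse."""
    rule = ORDERING_RULES.get(category.casefold())
    if rule is not None:
        keys = {label: rule(label) for label in labels}
        if all(key is not None for key in keys.values()):
            return sorted(labels, key=lambda s: (keys[s], s))
    return sorted(labels, key=lambda s: (s.casefold(), s))


# The first element of the ordering is the baseline.
assert order_levels("Disease staging", ["grade 2", "Grade 10", "grade 1"]) == ["grade 1", "grade 2", "Grade 10"]
```

Falling back for the whole factor when a single label fails to parse avoids mixing numeric and alphabetical ordering within one factor.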

The first step is to enumerate such cases, do some harmonization of the curation if needed, and then code the rules. I believe there are three predominant types of EFs where this would be most relevant.

select c.CATEGORY, count(c.ID)
from CHARACTERISTIC c
inner join EXPERIMENTAL_FACTOR ef on ef.CATEGORY_FK = c.ID
where c.CATEGORY in ('Timepoint', 'Developmental stage', 'Disease staging')
  and ef.TYPE = 'categorical'
group by c.CATEGORY;

+---------------------+-------------+
| CATEGORY            | count(c.ID) |
+---------------------+-------------+
| developmental stage |         983 |
| disease staging     |         183 |
| timepoint           |        2990 |
+---------------------+-------------+

@arteymix
Member

Looking at top timepoint values:

[image: top timepoint values]

select distinct VALUE, COUNT(*)
from CHARACTERISTIC
where CATEGORY = 'timepoint'
group by VALUE
order by COUNT(*) desc

Those are relatively easy to parse with something like (\d+(\.\d+)?)\s*(d|h|m|s).
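A rough Python sketch of that parsing, followed by a numerical comparison, might look like this. It escapes the decimal point so that it matches only a literal ".", and it assumes that "m" means minutes, which the thread does not confirm (it could mean months in some data sets). The function and constant names are hypothetical.

```python
import re

# Number, optional whitespace, then a one-letter unit.
TIMEPOINT = re.compile(r"(\d+(?:\.\d+)?)\s*(d|h|m|s)")

# ASSUMPTION: "m" is minutes, not months.
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timepoint_seconds(label):
    """Convert a label such as '48h' or '1.5 h' to seconds; None if it does not match."""
    m = TIMEPOINT.fullmatch(label.strip())
    if m is None:
        return None
    return float(m.group(1)) * SECONDS_PER_UNIT[m.group(2)]


# Spacing variants compare equal once parsed.
assert timepoint_seconds("48h") == timepoint_seconds("48 h")
# Sorting by the parsed value puts the earliest time point (the baseline) first.
assert sorted(["4h", "0h", "30 m"], key=timepoint_seconds) == ["0h", "30 m", "4h"]
```

Labels that return `None` (for example "Week 1") would need either another rule or a fallback, since `None` cannot be compared with numbers when sorting.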

@ppavlidis
Collaborator Author

Yep, you can see the need for more harmonization there too (e.g. 48h vs 48 h; we'll move discussion of that to the curator workspace).

@arteymix
Member

For determinism, IDs are the best because everything else can essentially be changed in the curator interface.

Sorting by ID might also be something we should do when generating VOs so we can also see it in the UI.
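The point about IDs can be shown with a small Python sketch: the pick survives a label edit made in the curator interface. `FactorValueVO` here is a hypothetical stand-in with only two fields, not the project's value object.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FactorValueVO:
    """Hypothetical value object with just an ID and a label."""
    id: int
    value: str


def sort_by_id(vos):
    """Stable display order: ascending ID."""
    return sorted(vos, key=lambda vo: vo.id)


def baseline_by_id(vos):
    """Deterministic pick: the factor value with the lowest ID."""
    return min(vos, key=lambda vo: vo.id)


vos = [FactorValueVO(31, "48h"), FactorValueVO(30, "72 h")]
before = baseline_by_id(vos)

# A curator harmonises one label; IDs do not change, so neither does the pick.
edited = [replace(vos[0], value="48 h"), vos[1]]
assert baseline_by_id(edited).id == before.id
```

The trade-off raised earlier in the thread still applies: IDs are stable within a data set but carry no meaning across data sets.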
