Make selection of arbitrary baselines for DEA deterministic and/or add more rules. #1290
Comments
Since biological sex is a common one, we can at least be consistent.
Some additional scenarios to consider:
We would want to make sure that the ordering of subsequent disease stages is kept consistent (so stage 2 goes before stage 3 and so on). It would also be useful to have a consistent numbering convention for these conditions, as there are cases where Roman numerals or letters (A, B, C, ...) are used instead of Arabic numerals. Personally, I think it would make sense to just stick to Arabic numerals.
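As a rough illustration of the stage-ordering idea, the sketch below maps "stage"/"grade" labels written with Arabic numerals, Roman numerals, or letters onto integer ranks and sorts by them. The function names and lookup tables are hypothetical, not part of the existing code base, and the tables are deliberately small: a label such as "stage C" or "stage V" is ambiguous between a letter and a Roman numeral, so a real rule set would have to be built from the values that actually occur in curation.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only).
# They are kept disjoint so that no token is both a Roman numeral and a letter.
_ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4}
_LETTER = {"a": 1, "b": 2, "c": 3, "d": 4}

_STAGE_RE = re.compile(r"^\s*(?:stage|grade)\s+([0-9a-z]+)\s*$", re.IGNORECASE)


def stage_rank(value):
    """Return the integer rank of a 'stage N' / 'grade N' label, or None."""
    match = _STAGE_RE.match(value)
    if match is None:
        return None
    token = match.group(1).lower()
    if re.fullmatch(r"[0-9]+", token):
        return int(token)
    if token in _ROMAN:
        return _ROMAN[token]
    return _LETTER.get(token)


def order_stages(values):
    """Sort labels by rank; raise if any label cannot be ranked."""
    ranked = [(stage_rank(v), v) for v in values]
    if any(rank is None for rank, _ in ranked):
        raise ValueError("not every value is a recognizable stage/grade label")
    return [v for _, v in sorted(ranked)]
```

Under this sketch the first element of `order_stages(...)` would serve as the baseline, and any value set that cannot be fully ranked is rejected rather than partially ordered.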
I think it would be best to keep this ordered in terms of logical time points, so using juvenile as the baseline and adult as the condition, or “early”/“prime” versus “late”.
May I suggest we should just produce an error when the baseline cannot be determined? A curator can always step in and explicitly identify the baseline. For timepoint, there might be enough consistency to parse the value with regular expressions and perform a numerical comparison. In the dev branch, I'm using prettytime to parse user-supplied dates on the CLI. That might be applicable too.
There would be hundreds of "errors". I don't think that would be helpful: it would just add to curator burden, and there wouldn't be much they could do about it without better software support. Providing more support for manual ordering in the UI might be useful, though it would also need data model support, as categorical factor values don't have an "order" (other than the baseline). Currently, AFAIK, the UI doesn't even support manually choosing baselines, which the backend does support. So I'm going to stick by my original suggestion: we can add more hard-coded rules/heuristics for choosing a baseline and an ordering. As you say, we can parse time points like "1 d" (and leverage the category = timepoint), since we have tried to use a consistent format and units, as well as "grade 1", "grade 2", etc. The first step is to enumerate such cases, do some harmonization of the curation if needed, and then code the rules. I believe there are three predominant types of EFs where this would be most relevant.
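To make the timepoint rule concrete, here is a minimal sketch of parsing values such as "1 d" or "48h" with a regular expression and comparing them numerically. The unit table and function names are assumptions made for illustration; the units actually used in curation would need to be enumerated first. The pattern tolerates a missing space between number and unit, so unharmonized variants of the same value compare equal.

```python
import re

# Hypothetical unit table, expressed in hours (an assumption, not the curated list).
_UNIT_HOURS = {"h": 1, "d": 24, "w": 168}

_TIME_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([a-z]+)\s*$", re.IGNORECASE)


def timepoint_hours(value):
    """Convert a label like '48 h' or '2d' to hours; return None if unparseable."""
    match = _TIME_RE.match(value)
    if match is None:
        return None
    unit = match.group(2).lower()
    if unit not in _UNIT_HOURS:
        return None
    return float(match.group(1)) * _UNIT_HOURS[unit]


def order_timepoints(values):
    """Return values sorted from earliest to latest, or None if any value fails to parse."""
    keyed = [(timepoint_hours(v), v) for v in values]
    if any(hours is None for hours, _ in keyed):
        return None
    return [v for _, v in sorted(keyed)]
```

Returning `None` instead of raising leaves the caller free to fall back to another rule when a factor mixes parseable and unparseable values.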
Yep, you can see the need for more harmonization there too (e.g. 48h vs 48 h; we'll move discussion of that to the curator workspace).
For determinism, IDs are the best because everything else can essentially be changed in the curator interface. Sorting by ID might also be something we should do when generating VOs so we can also see it in the UI.
For data sets without a "natural" baseline like "control", we pick randomly for the purposes of DEA.
In my review of the code, this seems non-deterministic. To increase the consistency of comparisons, I suggest we at least make sure that reanalysis of the same data set always yields the same baseline. Picking based on something like alphabetic order of the FV Subject value would work (picking by FV ID would also work, but wouldn't be consistent across data sets).
In some cases we might want to add some more rules. I'm also thinking about use cases like biological sex, where male or female might be picked as the baseline - we might as well be consistent. That is easily implemented as a rule.
This would somewhat ease comparisons across data sets by reducing any necessity of ensuring baselines are consistent.
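The proposal could be sketched roughly as follows: try a "natural" baseline first, then a small table of fixed rules (biological sex is the example here), and finally fall back to alphabetical order so that the result never depends on input order. All names are hypothetical, and the choice of "female" as the sex baseline is an arbitrary placeholder; the issue only asks that the choice be consistent, not which value it should be.

```python
# Hypothetical rule tables (assumptions for illustration only).
_NATURAL_BASELINES = ("control", "reference")
# Maps a complete set of factor values to the value that should be the baseline.
# "female" is an arbitrary placeholder choice.
_FIXED_RULES = {frozenset({"female", "male"}): "female"}


def choose_baseline(values):
    """Pick a baseline deterministically from a list of factor value labels."""
    if not values:
        raise ValueError("no factor values to choose from")
    # Sort first so that every later step is independent of input order.
    ordered = sorted(values, key=lambda v: (v.strip().lower(), v))
    by_name = {}
    for value in ordered:
        by_name.setdefault(value.strip().lower(), value)
    # 1. A natural baseline wins if present.
    for name in _NATURAL_BASELINES:
        if name in by_name:
            return by_name[name]
    # 2. Otherwise apply a fixed rule for known value sets.
    preferred = _FIXED_RULES.get(frozenset(by_name))
    if preferred is not None:
        return by_name[preferred]
    # 3. Otherwise fall back to alphabetical order.
    return ordered[0]
```

Because the fallback uses the label rather than the FV ID, two data sets with the same value labels would get the same baseline, which is the cross-data-set consistency described above.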