-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python implementation for StatisticsSurface #669
Python implementation for StatisticsSurface #669
Conversation
Hello @NicolasGensollen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-09-06 10:01:09 UTC |
I have added It could be an argument for investigating an alternative with just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @NicolasGensollen, Thanks for this work ! LGTM. There are a few suggestions in comments.
cc78ac0
to
9b86481
Compare
faee9b0
to
4d5a797
Compare
@NicolasGensollen, I'd suggest to drop the vendored surfstat in a seperate PR. It will make the review process easier. Also it'll isolate the creation from the deletion in case we need to fall back to the previous version for whichever reason. |
Co-authored-by: omar-rifai <[email protected]>
fsaverage_path = freesurfer_home / Path("subjects/fsaverage/surf") | ||
average_surface, average_mesh = _get_average_surface(fsaverage_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, BrainStat provides its own fetchers. Perhaps they could be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss this next week, I'm not really convinced that using their fetchers would be a better option compared to the solution implemented here.
Co-authored-by: Ghislain Vaillant <[email protected]>
Co-authored-by: Ghislain Vaillant <[email protected]>
Co-authored-by: Ghislain Vaillant <[email protected]>
Co-authored-by: Ghislain Vaillant <[email protected]>
Co-authored-by: Ghislain Vaillant <[email protected]>
I fixed the merge conflict and updated the |
Thanks for fixing the conflict @ghisvail |
You did well. It's a weird bug indeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another batch of comments and suggestions. I stopped the numpydoc suggestions after I realized most of them are missing typing informations for parameters and returned values.
Please consider making another pass at them with the style guide open for a refresher.
if not Path(tsv_file).exists(): | ||
raise FileNotFoundError(f"File {tsv_file} does not exist.") | ||
tsv_data = pd.read_csv(tsv_file, sep="\t") | ||
if len(tsv_data.columns) < 2: | ||
raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.") | ||
if tsv_data.columns[0] != TSV_FIRST_COLUMN: | ||
raise ValueError( | ||
f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}." | ||
) | ||
if tsv_data.columns[1] != TSV_SECOND_COLUMN: | ||
raise ValueError( | ||
f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}." | ||
) | ||
return tsv_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the intent of this code could be simplified with something like:
if not Path(tsv_file).exists(): | |
raise FileNotFoundError(f"File {tsv_file} does not exist.") | |
tsv_data = pd.read_csv(tsv_file, sep="\t") | |
if len(tsv_data.columns) < 2: | |
raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.") | |
if tsv_data.columns[0] != TSV_FIRST_COLUMN: | |
raise ValueError( | |
f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}." | |
) | |
if tsv_data.columns[1] != TSV_SECOND_COLUMN: | |
raise ValueError( | |
f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}." | |
) | |
return tsv_data | |
try: | |
return pd.read_tsv(tsv_file, sep="\t").set_index(["participant_id", "session_id"]) | |
except [...]: | |
# Deal with read_tsv and set_index errors. |
I find the intent here clearer: it needs to be a valid dataset read from TSV and contain a set of lines with unique participant_id
and session_id
columns. This is more in line with the Python way of coding, i.e. better ask for forgiveness (try / except) than permission (if / else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, implemented in b225b69
) | ||
|
||
|
||
def _is_categorical(df: pd.DataFrame, column: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree for the responsibility issue. I made some changes in 1a1e9bf to improve this.
That's better indeed 👍
DataFrame we are dealing with here comes from a brutal tsv read with no a priori information
So how do you figure whether a column is categorical or not from a raw string format without additional metadata, then? This is a very good question.
The implementation says anything whose dtype name does not start with "float". Which is a proxy for testing whether a column contains values which are non-float. So any column which dtype is either strings, objects or integers would be considered as categorical. Is it really the right logic, though?
Let's take the following example: if I have an age column successfully parsed as integers, then it's categorical. If it's parsed as floats, then it's not. Date-like columns would be considered categorical too. Also, object
dtype is attributed when the pandas
sniffers fail to detect a precise dtype from a column. Those columns may be categorical or malformed.
Should we decide to keep this logic, then it would be better to rename this filter to is_nonfloat
rather than keep is_categorical
in my opinion. Another option would be to enhance the reader to analyze columns which have been successfully converted as strings after convert_dtype
, compute their histograms, apply a conservative heuristic between the range of different values over the total of lines and automatically convert them to a proper categorical dtype. It would be significantly more work though.
model_term = reduce( | ||
lambda x, y: x * y, [_build_model_term(_, df) for _ in sub_terms] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I meant math.prod
. Sorry for the hasty copy pasting from the previous comment.
For some reason, I thought Numpydoc had converged on the automatic integration of type hints like Napoleon already does... |
66bc1c9
to
b225b69
Compare
Fixes #643
Description
This PR proposes to replace the current implementation of the
StatisticsSurface
pipeline, which relies on the MATLAB SurfStats toolbox, by a pure Python implementation relying on BrainStat.There are several reasons motivating this migration, the main ones being that
SurfStats
isn't maintained anymore and that it is written in MATLAB. The current solution implemented in Clinica is to vendor the MATLAB toolbox with a custom wrapper to enable calling it from Python, which is obviously far from ideal...Test the PR
POC repo
First of all, here is the repo with the proof-of-concept I made before opening this PR: https://github.com/NicolasGensollen/POC_Stat_Pipeline
It is possible to experiment with the code more easily than through Clinica's pipelining architecture.
Run the StatisticsSurface pipeline in pure Python
Obviously, this requires to have Brainstat installed (since it is not a dependency of Clinica yet). This can be done easily with pip:
For now, I made the minimum amount of work to integrate it into Clinica.
So there is still a decent amount of work to do in order to have a clean integration into Clinica.
Nonetheless, it should be possible to run the pipeline (without the plots which are crashing atm for some reason...):
$ clinica run statistics-surface ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/caps/ UnitTest t1-freesurfer group_comparison ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/subjects.tsv group --covariates age --covariates sex -np 1 -wd $HOME/WD
Feel free to try it and take a look at the code.
Feedbacks are welcome as always! 😃