Tutorial: Data Exploration #46

aditya0by0 · 2024-08-26T18:06:48Z

Issue: Create data exploration tutorial #33
Dependency: PR Protein function prediction with GO - Part 2 #57

- jupyter/notebook#7002 - Fix using notebook formatter provided by pycharm professional

sfluegel05

Some additions that I would like to see in the notebook:

- In the introduction, explain the role of the dataset in the training process (it is created automatically when needed).
- Add a few links to the source code (for example, the first mention of XYBaseDataModule should link to the implementation).
- Encodings: refer to the documentation for the encodings.

Changes:

- I would move the explanation of specific input parameters further down so as not to overwhelm a first-time user who just needs the commands for creating a simple dataset. (If the parameter explanations match the docstrings, it might also be sufficient to refer to them.)
- Encodings: don't mention InChI or SMARTS at all, or mention that they are not supported encodings. (Also, technically, SMARTS is not a chemical encoding but an encoding for sets of chemicals; it does not belong in the same category as SMILES or SELFIES.)
- Protein sequences: move to a separate file.

- #46 (review)

aditya0by0 · 2024-10-01T08:42:48Z

I have made the suggested changes. Please review.

sfluegel05 · 2024-10-02T09:38:52Z

The changes we discussed earlier:

Sections 3 and 4 explain different aspects of the same files. I would reorder them in the following way:
- Overview of the 3 preprocessing stages
- For each file:
  - one-line description of content
  - code for loading the file and printing its content (with dynamic file names, e.g. os.path.join(self.processed_dir, self.processed_file_names_dict["data"]), instead of hard-coded paths)
  - detailed description of content
- Section 5: Add a code snippet showing the actual reader output and use it to explain the tokenisation
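
To illustrate the "dynamic file names" point, here is a minimal sketch. The attribute names `processed_dir` and `processed_file_names_dict` are taken from the comment above; everything else is an assumption made for the example: `DemoDataModule` is a hypothetical stand-in rather than the project's class, and the `processed` subfolder, the `data.pkl` file name and the pickle format are placeholders.

```python
import os
import pickle
import tempfile


class DemoDataModule:
    """Hypothetical stand-in for a data module; not the project's class."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    @property
    def processed_dir(self):
        # Assumed layout: processed files live in a "processed" subfolder.
        return os.path.join(self.base_dir, "processed")

    @property
    def processed_file_names_dict(self):
        # Assumed mapping from a logical key to a file name.
        return {"data": "data.pkl"}


def load_processed(module, key="data"):
    """Build the path from the module's attributes and unpickle the file."""
    path = os.path.join(module.processed_dir, module.processed_file_names_dict[key])
    with open(path, "rb") as handle:
        return path, pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    module = DemoDataModule(tmp)
    os.makedirs(module.processed_dir)
    # Write a small placeholder file so the example runs end to end.
    demo_path = os.path.join(module.processed_dir, "data.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump([{"ident": 1, "features": "CCO"}], handle)

    loaded_path, loaded_rows = load_processed(module)
    print(loaded_path)
    print(loaded_rows)
```

Because the path is derived from the module's own attributes, the cell keeps working if the data directory or file name changes, which a hard-coded path would not.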

aditya0by0 · 2024-10-06T13:57:03Z

I have made the suggested changes. Please review.

sfluegel05 · 2024-10-30T16:22:30Z

This should be nearly done. The only addition I would like to have is a code snippet that actually uses the splits.csv file to create a new dataclass.
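
As a rough idea of what such a snippet could look like, the sketch below reads a splits file and partitions a set of samples accordingly. It does not use the project's data classes. The column names `id` and `split`, the split labels and the sample records are all assumptions chosen for illustration, and the file content is inlined so the example runs on its own.

```python
import csv
import io
from collections import defaultdict

# Assumed file layout: one row per sample with "id" and "split" columns.
SPLITS_CSV = """id,split
101,train
102,train
103,validation
104,test
"""


def read_splits(handle):
    """Group sample ids by the split named in each row."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["split"]].append(row["id"])
    return dict(groups)


def apply_splits(samples, groups):
    """Partition a dict of samples (id -> record) according to the groups."""
    return {
        name: [samples[i] for i in ids if i in samples]
        for name, ids in groups.items()
    }


# Hypothetical samples keyed by the same ids as in the splits file.
samples = {"101": "CCO", "102": "CCN", "103": "c1ccccc1", "104": "O=C=O"}
groups = read_splits(io.StringIO(SPLITS_CSV))
partitions = apply_splits(samples, groups)
print(partitions)
```

In the notebook, the inline string would be replaced by the actual splits.csv file, and the resulting partitions would be handed to whatever data class the project provides.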

aditya0by0 · 2024-11-02T10:25:38Z

I have made the requested changes and added a cell that switches to the project's root directory, as suggested. Please review.
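
For readers who have not seen the notebook, a cell of this kind could be sketched as follows. This is not the cell from the PR: the helper `find_project_root` and the marker file `setup.py` are hypothetical, chosen only to show one way of locating a root directory before calling `os.chdir`.

```python
import os
import tempfile


def find_project_root(start, marker="setup.py"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    current = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding the marker.
            raise FileNotFoundError(f"no {marker!r} found above {start!r}")
        current = parent


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    open(os.path.join(root, "setup.py"), "w").close()
    nested = os.path.join(root, "tutorials", "notebooks")
    os.makedirs(nested)
    found = find_project_root(nested)
    print(found == root)

# In a notebook cell one would then call, for example:
# os.chdir(find_project_root(os.getcwd()))
```

Searching for a marker avoids relying on a fixed number of `..` steps, so the cell behaves the same regardless of which subfolder the notebook is started from.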

sfluegel05

Thanks for implementing the changes.

Create data_exploration.ipynb

6bb1a85

aditya0by0 self-assigned this Aug 26, 2024

aditya0by0 linked an issue Aug 26, 2024 that may be closed by this pull request

Create data exploration tutorial #33

Closed

aditya0by0 added 3 commits August 27, 2024 00:04

added information stored in files

830184f

Molecule: Different Encodings

7005a69

add info related to protein dataset

13aa945

aditya0by0 added the documentation label (Improvements or additions to documentation) Aug 27, 2024

aditya0by0 added 2 commits August 27, 2024 12:33

fix - jupyter markdown cells formatting issue

0e4814f

- jupyter/notebook#7002 - Fix using notebook formatter provided by pycharm professional

move to tutorials dir

8539f3b

aditya0by0 requested a review from sfluegel05 August 27, 2024 10:38

sfluegel05 marked this pull request as ready for review September 24, 2024 15:33

sfluegel05 reviewed Sep 24, 2024

minor changes to texts

6b9024b

aditya0by0 marked this pull request as draft September 30, 2024 10:31

aditya0by0 added 6 commits September 30, 2024 16:01

Merge branch 'dev' into tutorial_data_exploration

75cac16

chebi notebook : suggested changes

4fc31da

- #46 (review)

go_notebook: data exploration

587c026

Delete data_exploration.ipynb

71e9888

add info on evidence codes + uniprot.data file + changes

c6b8d50

minor formatting changes

4c55b04

aditya0by0 requested a review from sfluegel05 September 30, 2024 21:56

sfluegel05 marked this pull request as ready for review October 1, 2024 11:52

move commands to the top, restructure section 2

33a5e64

aditya0by0 marked this pull request as draft October 5, 2024 20:07

aditya0by0 added 4 commits October 5, 2024 23:48

re-order section 3 and 4 as per suggestion

242db56

GO: reformat section 3 and 4 as per suggestion

748eebe

Merge branch 'dev' into tutorial_data_exploration

f5260d6

Chebi: reader class explanation

6911d8a

aditya0by0 added 2 commits October 6, 2024 15:51

GO: reader class explanation

6d162c7

chebi: minor change in tokenization and encoding

5c8c185

aditya0by0 and others added 5 commits October 11, 2024 12:52

Merge branch 'dev' into tutorial_data_exploration

5057445

update GO evidence codes info

18e3253

go evidence code minor info change

261e8c1

minor changes to GO notebook

8e91ca7

Merge remote-tracking branch 'origin/tutorial_data_exploration' into tutorial_data_exploration

a69692f

aditya0by0 added 6 commits October 30, 2024 19:42

Merge branch 'dev' into tutorial_data_exploration

d42d622

GO: add cell to change the cwd to project root dir

5082829

chebi: add cell to change the cwd to project root dir

661a78a

GO: use split file to create new data class

eecc96f

GO: fix json parsing

c05b868

chebi: use split file to create new data class

b83e5cd

aditya0by0 added 3 commits November 2, 2024 11:52

go: changes to data as per new code change

b5abb0a

add output for prepare-setup data cell

f659311

reformat with precommit

22e864f

sfluegel05 marked this pull request as ready for review November 4, 2024 13:22

sfluegel05 approved these changes Nov 4, 2024

sfluegel05 merged commit cae7839 into dev Nov 4, 2024
2 checks passed

sfluegel05 deleted the tutorial_data_exploration branch November 4, 2024 13:22
