Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tutorial: Data Exploration #46

Merged
merged 34 commits into from
Nov 4, 2024
Merged

Tutorial: Data Exploration #46

merged 34 commits into from
Nov 4, 2024

Conversation

aditya0by0
Copy link
Collaborator

@aditya0by0 aditya0by0 commented Aug 26, 2024

@aditya0by0 aditya0by0 self-assigned this Aug 26, 2024
@aditya0by0 aditya0by0 linked an issue Aug 26, 2024 that may be closed by this pull request
@aditya0by0 aditya0by0 added the documentation Improvements or additions to documentation label Aug 27, 2024
- jupyter/notebook#7002
- Fix using notebook formatter provided by pycharm professional
@aditya0by0 aditya0by0 requested a review from sfluegel05 August 27, 2024 10:38
@sfluegel05 sfluegel05 marked this pull request as ready for review September 24, 2024 15:33
Copy link
Collaborator

@sfluegel05 sfluegel05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additions that I would like to see in the notebook:

  • In the introduction, explain the role of the dataset in the training process (automatically created when needed)
  • A few links to the source code (for example, the first mention of XYBaseDataModule should link to the implementation)
  • Encodings: refer to documentation for the encodings
    Changes:
  • I would put the explanation of specific input parameters further down in order to not overwhelm a first-time user that just needs the commands for creating a simple dataset (If the parameter explanations match the docstrings, it might also be sufficient to refer to them).
  • Encodings: Don't mention InCHI or SMARTS at all or mention that they are not a supported encoding (also, technically, SMARTS is not a chemical encoding, but an encoding for sets of chemicals - it does not belong in the same category as SMILES or SELFIES)
  • Protein sequences: -> separate file

@aditya0by0 aditya0by0 marked this pull request as draft September 30, 2024 10:31
@aditya0by0
Copy link
Collaborator Author

I have done the suggested changes, Please review.

@sfluegel05 sfluegel05 marked this pull request as ready for review October 1, 2024 11:52
@sfluegel05
Copy link
Collaborator

sfluegel05 commented Oct 2, 2024

The changes we discussed earlier:

  • The sections 3 and 4 explain different aspects of the same files, I would reorder them in the following way:
    • Overview of the 3 preprocessing stages
    • For each file:
      • one-line description of content
      • code for loading file and printing content (with dynamic file names, e.g. os.path.join(self.processed_dir, self.processed_file_names_dict["data"]) instead of hard-coded paths
      • detailed description of content
    • Section 5: Add code snippet showing the actual reader output, use this to explain the tokenisation

@aditya0by0 aditya0by0 marked this pull request as draft October 5, 2024 20:07
@aditya0by0
Copy link
Collaborator Author

I have done the suggested changes, Please review.

The changes we discussed earlies:

  • The sections 3 and 4 explain different aspects of the same files, I would reorder them in the following way:

    • Overview of the 3 preprocessing stages

    • For each file:

      • one-line description of content
      • code for loading file and printing content (with dynamic file names, e.g. os.path.join(self.processed_dir, self.processed_file_names_dict["data"]) instead of hard-coded paths
      • detailed description of content
    • Section 5: Add code snippet showing the actual reader output, use this to explain the tokenisation

@sfluegel05
Copy link
Collaborator

This should be nearly done. The only addition I would like to have is a code snippet that actually uses the splits.csv file to create a new dataclass.

@aditya0by0
Copy link
Collaborator Author

I have done the suggested changes and added a cell to switch to root dir of project as suggested. Please review.

This should be nearly done. The only addition I would like to have is a code snippet that actually uses the splits.csv file to create a new dataclass.

@sfluegel05 sfluegel05 marked this pull request as ready for review November 4, 2024 13:22
Copy link
Collaborator

@sfluegel05 sfluegel05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the changes.

@sfluegel05 sfluegel05 merged commit cae7839 into dev Nov 4, 2024
2 checks passed
@sfluegel05 sfluegel05 deleted the tutorial_data_exploration branch November 4, 2024 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create data exploration tutorial
2 participants