-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JOSS Paper #66
JOSS Paper #66
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start!
joss_paper/paper.md
Outdated
|
||
# Acknowledgements | ||
Thank you to Google for supporting the Google Summer of Code program (GSoC) which provided | ||
financial support for this project. Thank you to Dr. Hugo MacDermott-Opeskin and Dr. Yuxuan Zhuang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are authors, they typically do not show up in ack.
joss_paper/paper.md
Outdated
streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data | ||
Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API. | ||
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like | ||
data from a variety of storage mediums and `Kerchunk`, which extends the capability of `Zarr` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference (or link) for Kerchunk
HPC environment, rendering it unusable by the broader scientific community. | ||
Zarrtraj enables these trajectories to be read directly from cloud storage providers | ||
like AWS, Google Cloud, and Microsoft Azure into MDAnalysis, a popular Python | ||
package for analyzing trajectory data, providing a method to open up access to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make clear why anyone would actually want to access MD trajectories --- what's the motivation for streaming?
I'd also cite efforts like MDDB and the eLife MDverse paper https://elifesciences.org/articles/90061
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And machine learning, of course -- @hmacdope may have suitable citations at hand.
- file-format | ||
- mdanalysis | ||
- zarr | ||
authors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll (1) need to have each author agree on authorship (see JOSS for what that means for them) and (2) need to come up with an order that is acceptable to everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ljwoods2 normal protocol is to write an email to each author asking them if they would like to be an author if they haven't already agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can CC myself and Oliver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And see https://joss.readthedocs.io/en/latest/submitting.html#authorship — as an author you have responsibilities
The authors themselves assume responsibility for deciding who should be credited with co-authorship, and co-authors must always agree to be listed. In addition, co-authors agree to be accountable for all aspects of the work, and to notify JOSS if any retraction or correction of mistakes are needed after publication.
I would also add that authors need to have read the manuscript and approve it (approval review on GitHub) and be willing to contribute to the writing if asked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m happy to be included on the author list :)
which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader` | ||
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis` | ||
and the file storage medium. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would show some explicit Python code to demonstrate what this looks like in practice.
Minimal timing information is also important --- the first question anyone asks is "how slow is this".
Also address any known limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First go through, have a crack at these and ping for re-review.
joss_paper/paper.md
Outdated
# Statement of need | ||
|
||
The computing power in HPC environments has increased to the point where | ||
running simulation algorithms is often no longer the constraint in obtaining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simulation algorithms are the constraint in obtaining trajectories still (they are the only way ) but they are not the constraint in obtaining insight / results.
Something like
The computing power in HPC environments has increased to the point where
running simulation algorithms is often no longer the constraint in obtaining scientific insights from molecular dynamics trajectory data
joss_paper/paper.md
Outdated
|
||
The computing power in HPC environments has increased to the point where | ||
running simulation algorithms is often no longer the constraint in obtaining | ||
molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disk write speed is not an issue, but the ability to process analyse and share large volumes of data is.
joss_paper/paper.md
Outdated
molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and | ||
the ability to share generated data provide new constraints on research in this field. | ||
While exposing download links on the open internet offers one solution this problem, | ||
molecular dynamics trajectories are often massive files which are slow to download and expensive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be specific
on-disk representations of molecular dynamics trajectories often range in size with large datasets up to TBs in scale
[cite]. https://registry.opendata.aws/foldingathome-covid19/, https://www.deshawresearch.com/publications/Accelerating%20Parallel%20Analysis%20of%20Scientific%20Simulation%20Data%20via%20Zazen.pdf to name a few (need to find where that last one was actually published.)
orcid: 0000-0002-3241-1846 | ||
affiliations: 1 | ||
affiliations: | ||
- name: Placeholder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to gather the affiliations of the authors.
joss_paper/paper.md
Outdated
While exposing download links on the open internet offers one solution this problem, | ||
molecular dynamics trajectories are often massive files which are slow to download and expensive | ||
to store at scale, so a solution which could prevent this duplication of storage and unnecessary | ||
download step would be more ideal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download step would be more ideal. | |
download step would provide greater utility for the computational molecular sciences ecosystem. |
Also mention that this encourages FAIR data access: https://www.nature.com/articles/d41586-019-01720-7
joss_paper/paper.md
Outdated
Enter `Zarrtraj`, an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] which enables | ||
streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data | ||
Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API. | ||
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add inspiration from geosciences community and cite the relevant papers
https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- others
joss_paper/paper.md
Outdated
in the `H5MD` format [@H5MD:2014], which builds on top of `HDF5`, and the experimental `ZarrMD` format, | ||
which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader` | ||
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis` | ||
and the file storage medium. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the use of zarr allows efficent slicing and seeking (possibly in parallel). Also are we making use of this parallelism? Otherwise its not different to mounting an S3 store with fsspec
joss_paper/paper.md
Outdated
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis` | ||
and the file storage medium. | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can help write this bit, but now is a chance to be bold! without boasting, what do we see the community to be like with this in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shaping up nicely. Please see comments inline.
at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly | ||
1/5 to 1/10 the speed it can iterate through the same trajectory on disk in XTC format \autoref{fig:benchmark}. | ||
However, it should be noted that this speed is influenced by network latency and that | ||
writing parallelized algorithms can offset this loss of speed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't put out bold claims with evidence. If you can show an example of "parallelized algorithms" then explain how and show data to back up your claim. Otherwise leave it out.
Basic rule of academic writing: When you write a statement you back it up
- by showing your own data
- by citing someone else's work
Otherwise you leave it out or mention it at the very end as possible avenues for future work.
affiliations: 1 | ||
- name: Oliver Beckstein | ||
orcid: 000-0003-1340-0831 | ||
affiliation: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For myself [1,2], Edis [1, 2]:
- name: Department of Physics, Arizona State University, Tempe, Arizona, United States of America
index: 1
- name: Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
index: 2
For yourself: something similar but for engineering and you could also add a present affiliation for SMS.
import zarrtraj | ||
import MDAnalysis as mda | ||
|
||
u = mda.Universe("sample_topology.top", "s3://sample-bucket-name/trajectory.h5md") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just say topology.tpr
— sample is not needed and top is not a topology file format that is used directly. You can also use psf or pdb instead of tpr.
This work builds on the existing `MDAnalysis` `H5MDReader` | ||
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis` | ||
and the file storage medium. `Zarrtraj` was inspired and made possible by similar efforts in the | ||
geosciences community to align data practices with FAIR principles [@PANGEO:2022]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move up instead of a final paragraph. This is more intro/methods. Reserve the last sentences for a bigger statement, such as the "envision" paragraph.
|
||
|
||
# Acknowledgements | ||
Thank you to Dr. Jenna Swarthout Goddard for supporting the GSoC program at MDAnalysis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit colloquial, write it as "We thank xxx for yyy."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming along great.
which extends the capability of `Zarr` by allowing it to read `HDF5` files. | ||
Because it implements the standard `MDAnalysis` trajectory reader API, | ||
`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis | ||
algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly mention ability to read slices somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Citation for split-apply-combine
Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.
Have you tried it with 2.8.0-dev and the parallelized RMSD class?
`Zarrtraj` can stream and write trajectories in the experimental `ZarrMD` | ||
format, which ports the `H5MD` layout to the `Zarr` filetype. | ||
|
||
One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files: | |
Once imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files: |
``` | ||
Initial benchmarks show that `Zarrtraj` can iterate | ||
through an AWS S3 cloud trajectory (load into memory one frame at a time) | ||
at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly mention this was done in serial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See more comments (in addition to the earlier ones).
|
||
Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows | ||
streaming trajectories into analysis software using an established trajectory format. | ||
`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"MDAKit" is not typeset in monospace.
I'd reserve monospace for code and perhaps package names (although I prefer italics for package names.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MDAnalysis is a proper noun (name of the project) so no monospace here and elsewhere.
(I just put up an update to the MDA branding Style Guide to make this clearer MDAnalysis/branding#10 )
Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows | ||
streaming trajectories into analysis software using an established trajectory format. | ||
`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that | ||
enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HDF5 is an abbreviation used as a proper noun and I would not typeset in monospace.
`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that | ||
enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014] | ||
from AWS S3, Google Cloud Buckets, and Azure Blob Storage & Data Lakes without ever downloading them. | ||
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zarr is a proper noun.
which extends the capability of `Zarr` by allowing it to read `HDF5` files. | ||
Because it implements the standard `MDAnalysis` trajectory reader API, | ||
`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis | ||
algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Citation for split-apply-combine
Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.
Have you tried it with 2.8.0-dev and the parallelized RMSD class?
However, it should be noted that this speed is influenced by network latency and that | ||
writing parallelized algorithms can offset this loss of speed. | ||
|
||
![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more details on the test trajectory. Number of particles, number of frames, size (GB).
Is it one from MDAnalysisData, if so provide details (eg doi of location).
Make your paper as reproducible as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the bandwidth of the network?
- file-format | ||
- mdanalysis | ||
- zarr | ||
authors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m happy to be included on the author list :)
affiliation: 1 | ||
- name: Yuxuan Zhuang | ||
orcid: 0000-0003-4390-8556 | ||
affiliations: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me:
[1] Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
[2]D epartments of Molecular and Cellular Physiology and Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA.
However, it should be noted that this speed is influenced by network latency and that | ||
writing parallelized algorithms can offset this loss of speed. | ||
|
||
![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the bandwidth of the network?
Other groups in the field recognize this same need for adherence to | ||
FAIR principles [@FAIR:2019] including the MDDB (Molecular Dynamics Data Bank), an EU-scale | ||
repository for biosimulation data [@MDDB:2024] and MDverse, a prototype search engine | ||
for publicly-available Gromacs simulation data [@MDverse:2024]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all caps: GROMACS
Instead, the ability to process, analyze and share large volumes of data provide | ||
new constraints on research in this field. | ||
|
||
Other groups in the field recognize this same need for adherence to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think MDsrv https://academic.oup.com/nar/article/50/W1/W483/6593534 should also be mentioned here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other papers worth citing:
- Bringing Molecular Dynamics Simulation Data into View https://www.cell.com/trends/biochemical-sciences/fulltext/S0968-0004(19)30137-9
- GPCRmd uncovers the dynamics of the 3D-GPCRome
https://www.nature.com/articles/s41592-020-0884-y - Sharing Data from Molecular Simulations https://pubs.acs.org/doi/full/10.1021/acs.jcim.9b00665?casa_token=Ivo26Sn_xFAAAAAA%3AwJLKekUvaYfOPZyi97LPZR7zir_x9cWlcbEb8UOZN7ZTqeNpmPE96rsLChkot3xPeGUtR4Es42lx7VKFyg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emphasizing that they only provide predefined analysis results or simple geometric features.
analyses of large, conglomerate datasets from different sources, and training | ||
machine learning models without downloading and storing trajectory data. | ||
|
||
# Statement of need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all the contents below belong to Statement of need
; maybe add one or more sections about Examples
, Design Principles
etc.
Accidentally merged this while removing a bug that I found when testing with mda 2.8.0 on another branch, will address all the comments here and open a new PR |
No description provided.