Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JOSS Paper #66

Merged
merged 8 commits into from
Oct 3, 2024
Merged

JOSS Paper #66

merged 8 commits into from
Oct 3, 2024

Conversation

ljwoods2
Copy link
Collaborator

No description provided.

Copy link

codecov bot commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.10%. Comparing base (d57e287) to head (d83a5df).
Report is 53 commits behind head on main.

Additional details and impacted files

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start!


# Acknowledgements
Thank you to Google for supporting the Google Summer of Code program (GSoC) which provided
financial support for this project. Thank you to Dr. Hugo MacDermott-Opeskin and Dr. Yuxuan Zhuang
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are authors, they typically do not show up in ack.

streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data
Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API.
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like
data from a variety of storage mediums and `Kerchunk`, which extends the capability of `Zarr`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference (or link) for Kerchunk

HPC environment, rendering it unusable by the broader scientific community.
Zarrtraj enables these trajectories to be read directly from cloud storage providers
like AWS, Google Cloud, and Microsoft Azure into MDAnalysis, a popular Python
package for analyzing trajectory data, providing a method to open up access to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make clear why anyone would actually want to access MD trajectories --- what's the motivation for streaming?

I'd also cite efforts like MDDB and the eLife MDverse paper https://elifesciences.org/articles/90061

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And machine learning, of course -- @hmacdope may have suitable citations at hand.

- file-format
- mdanalysis
- zarr
authors:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll (1) need to have each author agree on authorship (see JOSS for what that means for them) and (2) need to come up with an order that is acceptable to everyone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ljwoods2 normal protocol is to write an email to each author asking them if they would like to be an author if they haven't already agreed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can CC myself and Oliver

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And see https://joss.readthedocs.io/en/latest/submitting.html#authorship — as an author you have responsibilities

The authors themselves assume responsibility for deciding who should be credited with co-authorship, and co-authors must always agree to be listed. In addition, co-authors agree to be accountable for all aspects of the work, and to notify JOSS if any retraction or correction of mistakes are needed after publication.

I would also add that authors need to have read the manuscript and approve it (approval review on GitHub) and be willing to contribute to the writing if asked.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m happy to be included on the author list :)

which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader`
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
and the file storage medium.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would show some explicit Python code to demonstrate what this looks like in practice.

Minimal timing information is also important --- the first question anyone asks is "how slow is this".

Also address any known limitations.

Copy link
Collaborator

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First go through, have a crack at these and ping for re-review.

# Statement of need

The computing power in HPC environments has increased to the point where
running simulation algorithms is often no longer the constraint in obtaining
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simulation algorithms are the constraint in obtaining trajectories still (they are the only way ) but they are not the constraint in obtaining insight / results.

Something like

The computing power in HPC environments has increased to the point where
running simulation algorithms is often no longer the constraint in obtaining scientific insights from molecular dynamics trajectory data


The computing power in HPC environments has increased to the point where
running simulation algorithms is often no longer the constraint in obtaining
molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disk write speed is not an issue, but the ability to process analyse and share large volumes of data is.

molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and
the ability to share generated data provide new constraints on research in this field.
While exposing download links on the open internet offers one solution this problem,
molecular dynamics trajectories are often massive files which are slow to download and expensive
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be specific

on-disk representations of molecular dynamics trajectories often range in size with large datasets up to TBs in scale

[cite]. https://registry.opendata.aws/foldingathome-covid19/, https://www.deshawresearch.com/publications/Accelerating%20Parallel%20Analysis%20of%20Scientific%20Simulation%20Data%20via%20Zazen.pdf to name a few (need to find where that last one was actually published.)

orcid: 0000-0002-3241-1846
affiliations: 1
affiliations:
- name: Placeholder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to gather the affiliations of the authors.

While exposing download links on the open internet offers one solution this problem,
molecular dynamics trajectories are often massive files which are slow to download and expensive
to store at scale, so a solution which could prevent this duplication of storage and unnecessary
download step would be more ideal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
download step would be more ideal.
download step would provide greater utility for the computational molecular sciences ecosystem.

Also mention that this encourages FAIR data access: https://www.nature.com/articles/d41586-019-01720-7

Enter `Zarrtraj`, an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] which enables
streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data
Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API.
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add inspiration from geosciences community and cite the relevant papers

https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • others

in the `H5MD` format [@H5MD:2014], which builds on top of `HDF5`, and the experimental `ZarrMD` format,
which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader`
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
and the file storage medium.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the use of zarr allows efficent slicing and seeking (possibly in parallel). Also are we making use of this parallelism? Otherwise its not different to mounting an S3 store with fsspec

[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
and the file storage medium.

<!--
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can help write this bit, but now is a chance to be bold! without boasting, what do we see the community to be like with this in place.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shaping up nicely. Please see comments inline.

at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly
1/5 to 1/10 the speed it can iterate through the same trajectory on disk in XTC format \autoref{fig:benchmark}.
However, it should be noted that this speed is influenced by network latency and that
writing parallelized algorithms can offset this loss of speed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't put out bold claims with evidence. If you can show an example of "parallelized algorithms" then explain how and show data to back up your claim. Otherwise leave it out.

Basic rule of academic writing: When you write a statement you back it up

  • by showing your own data
  • by citing someone else's work

Otherwise you leave it out or mention it at the very end as possible avenues for future work.

affiliations: 1
- name: Oliver Beckstein
orcid: 000-0003-1340-0831
affiliation: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For myself [1,2], Edis [1, 2]:

 - name: Department of Physics, Arizona State University, Tempe, Arizona, United States of America
   index: 1
 - name: Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
   index: 2

For yourself: something similar but for engineering and you could also add a present affiliation for SMS.

import zarrtraj
import MDAnalysis as mda

u = mda.Universe("sample_topology.top", "s3://sample-bucket-name/trajectory.h5md")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just say topology.tpr — sample is not needed and top is not a topology file format that is used directly. You can also use psf or pdb instead of tpr.

Comment on lines +109 to +112
This work builds on the existing `MDAnalysis` `H5MDReader`
[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
and the file storage medium. `Zarrtraj` was inspired and made possible by similar efforts in the
geosciences community to align data practices with FAIR principles [@PANGEO:2022].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move up instead of a final paragraph. This is more intro/methods. Reserve the last sentences for a bigger statement, such as the "envision" paragraph.



# Acknowledgements
Thank you to Dr. Jenna Swarthout Goddard for supporting the GSoC program at MDAnalysis.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit colloquial, write it as "We thank xxx for yyy."

joss_paper/paper.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming along great.

which extends the capability of `Zarr` by allowing it to read `HDF5` files.
Because it implements the standard `MDAnalysis` trajectory reader API,
`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis
algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly mention ability to read slices somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citation for split-apply-combine

Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.

Have you tried it with 2.8.0-dev and the parallelized RMSD class?

`Zarrtraj` can stream and write trajectories in the experimental `ZarrMD`
format, which ports the `H5MD` layout to the `Zarr` filetype.

One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:
Once imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:

```
Initial benchmarks show that `Zarrtraj` can iterate
through an AWS S3 cloud trajectory (load into memory one frame at a time)
at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicitly mention this was done in serial.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See more comments (in addition to the earlier ones).


Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows
streaming trajectories into analysis software using an established trajectory format.
`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"MDAKit" is not typeset in monospace.

I'd reserve monospace for code and perhaps package names (although I prefer italics for package names.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MDAnalysis is a proper noun (name of the project) so no monospace here and elsewhere.

(I just put up an update to the MDA branding Style Guide to make this clearer MDAnalysis/branding#10 )

Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows
streaming trajectories into analysis software using an established trajectory format.
`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that
enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDF5 is an abbreviation used as a proper noun and I would not typeset in monospace.

`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that
enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014]
from AWS S3, Google Cloud Buckets, and Azure Blob Storage & Data Lakes without ever downloading them.
This is possible thanks to the `Zarr` [@Zarr:2024] package which allows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zarr is a proper noun.

which extends the capability of `Zarr` by allowing it to read `HDF5` files.
Because it implements the standard `MDAnalysis` trajectory reader API,
`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis
algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citation for split-apply-combine

Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.

Have you tried it with 2.8.0-dev and the parallelized RMSD class?

However, it should be noted that this speed is influenced by network latency and that
writing parallelized algorithms can offset this loss of speed.

![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more details on the test trajectory. Number of particles, number of frames, size (GB).

Is it one from MDAnalysisData, if so provide details (eg doi of location).

Make your paper as reproducible as possible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the bandwidth of the network?

joss_paper/paper.md Outdated Show resolved Hide resolved
- file-format
- mdanalysis
- zarr
authors:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m happy to be included on the author list :)

affiliation: 1
- name: Yuxuan Zhuang
orcid: 0000-0003-4390-8556
affiliations: 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me:
[1] Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
[2]D epartments of Molecular and Cellular Physiology and Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA.

However, it should be noted that this speed is influenced by network latency and that
writing parallelized algorithms can offset this loss of speed.

![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the bandwidth of the network?

Other groups in the field recognize this same need for adherence to
FAIR principles [@FAIR:2019] including the MDDB (Molecular Dynamics Data Bank), an EU-scale
repository for biosimulation data [@MDDB:2024] and MDverse, a prototype search engine
for publicly-available Gromacs simulation data [@MDverse:2024].

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all caps: GROMACS

Instead, the ability to process, analyze and share large volumes of data provide
new constraints on research in this field.

Other groups in the field recognize this same need for adherence to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think MDsrv https://academic.oup.com/nar/article/50/W1/W483/6593534 should also be mentioned here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emphasizing that they only provide predefined analysis results or simple geometric features.

analyses of large, conglomerate datasets from different sources, and training
machine learning models without downloading and storing trajectory data.

# Statement of need

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all the contents below belong to Statement of need; maybe add one or more sections about Examples, Design Principles etc.

@ljwoods2 ljwoods2 merged commit d83a5df into main Oct 3, 2024
24 checks passed
@ljwoods2
Copy link
Collaborator Author

ljwoods2 commented Oct 3, 2024

Accidentally merged this while removing a bug that I found when testing with mda 2.8.0 on another branch, will address all the comments here and open a new PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants