JOSS Paper #66

ljwoods2 · 2024-09-23T02:31:25Z

No description provided.

codecov · 2024-09-23T02:35:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.10%. Comparing base (d57e287) to head (d83a5df).
Report is 53 commits behind head on main.


orbeckst

Good start!

orbeckst · 2024-09-23T04:39:59Z

joss_paper/paper.md

+
+# Acknowledgements
+Thank you to Google for supporting the Google Summer of Code program (GSoC) which provided
+financial support for this project. Thank you to Dr. Hugo MacDermott-Opeskin and Dr. Yuxuan Zhuang 


If they are authors, they typically do not appear in the acknowledgements.

orbeckst · 2024-09-23T04:40:29Z

joss_paper/paper.md

+streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data
+Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API.
+This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like
+data from a variety of storage mediums and `Kerchunk`, which extends the capability of `Zarr`


reference (or link) for Kerchunk

orbeckst · 2024-09-23T04:42:25Z

joss_paper/paper.md

+HPC environment, rendering it unusable by the broader scientific community.
+Zarrtraj enables these trajectories to be read directly from cloud storage providers
+like AWS, Google Cloud, and Microsoft Azure into MDAnalysis, a popular Python 
+package for analyzing trajectory data, providing a method to open up access to


Make clear why anyone would actually want to access MD trajectories --- what's the motivation for streaming?

I'd also cite efforts like MDDB and the eLife MDverse paper https://elifesciences.org/articles/90061

And machine learning, of course -- @hmacdope may have suitable citations at hand.

orbeckst · 2024-09-23T04:45:05Z

joss_paper/paper.md

+  - file-format
+  - mdanalysis
+  - zarr
+authors:


You'll (1) need to have each author agree on authorship (see JOSS for what that means for them) and (2) need to come up with an order that is acceptable to everyone.

@ljwoods2 normal protocol is to write an email to each author asking them if they would like to be an author if they haven't already agreed.

You can CC myself and Oliver

And see https://joss.readthedocs.io/en/latest/submitting.html#authorship — as an author you have responsibilities

The authors themselves assume responsibility for deciding who should be credited with co-authorship, and co-authors must always agree to be listed. In addition, co-authors agree to be accountable for all aspects of the work, and to notify JOSS if any retraction or correction of mistakes are needed after publication.

I would also add that authors need to have read the manuscript and approve it (approval review on GitHub) and be willing to contribute to the writing if asked.

I’m happy to be included on the author list :)

orbeckst · 2024-09-23T04:46:46Z

joss_paper/paper.md

+which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader`
+[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
+and the file storage medium.
+


I would show some explicit Python code to demonstrate what this looks like in practice.

Minimal timing information is also important --- the first question anyone asks is "how slow is this".

Also address any known limitations.
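To make the timing request concrete, here is a minimal sketch of how a frames-per-second number could be collected. It is not taken from the paper or from Zarrtraj: `frames_per_second` and `fake_trajectory` are hypothetical names, and the list stands in for whatever iterable the reader exposes (for an MDAnalysis Universe `u`, that would presumably be `u.trajectory`).

```python
import time
from typing import Iterable


def frames_per_second(frames: Iterable, n_warmup: int = 0) -> float:
    """Iterate once over `frames` and return the observed throughput in frames/s."""
    iterator = iter(frames)
    # Optionally skip a few frames so one-off setup costs (opening the file,
    # the first network request) are not counted in the measurement.
    for _ in range(n_warmup):
        next(iterator, None)
    start = time.perf_counter()
    n_frames = sum(1 for _ in iterator)
    elapsed = time.perf_counter() - start
    return n_frames / elapsed if elapsed > 0 else float("inf")


# Stand-in for a trajectory (assumption): 1000 dummy "frames".
fake_trajectory = [[0.0, 0.0, 0.0]] * 1000
rate = frames_per_second(fake_trajectory)
print(f"{rate:.0f} frames/s")
```

Running the same function against the local file and the cloud URL would give the kind of side-by-side number a reader will ask for first.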

hmacdope

First go through, have a crack at these and ping for re-review.

hmacdope · 2024-09-24T00:48:10Z

joss_paper/paper.md

+# Statement of need
+
+The computing power in HPC environments has increased to the point where
+running simulation algorithms is often no longer the constraint in obtaining


Simulation algorithms are still the constraint in obtaining trajectories (they are the only way), but they are not the constraint in obtaining insight or results.

Something like:

The computing power in HPC environments has increased to the point where running simulation algorithms is often no longer the constraint in obtaining scientific insights from molecular dynamics trajectory data

hmacdope · 2024-09-24T00:49:02Z

joss_paper/paper.md

+
+The computing power in HPC environments has increased to the point where
+running simulation algorithms is often no longer the constraint in obtaining
+molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and


Disk write speed is not an issue, but the ability to process, analyse, and share large volumes of data is.

hmacdope · 2024-09-24T00:57:10Z

joss_paper/paper.md

+molecular dynamics trajectory data for analysis. Instead, the speed of writing to disk and
+the ability to share generated data provide new constraints on research in this field.
+While exposing download links on the open internet offers one solution this problem,
+molecular dynamics trajectories are often massive files which are slow to download and expensive


Be specific, for example:

On-disk representations of molecular dynamics trajectories vary widely in size, with large datasets reaching terabytes in scale [cite].

Candidate citations, to name a few: https://registry.opendata.aws/foldingathome-covid19/ and https://www.deshawresearch.com/publications/Accelerating%20Parallel%20Analysis%20of%20Scientific%20Simulation%20Data%20via%20Zazen.pdf (need to find where the latter was actually published).

hmacdope · 2024-09-24T01:03:39Z

joss_paper/paper.md

+    orcid: 0000-0002-3241-1846
+    affiliations: 1
+affiliations:
+ - name: Placeholder


You need to gather the affiliations of the authors.

hmacdope · 2024-09-24T01:06:04Z

joss_paper/paper.md

+While exposing download links on the open internet offers one solution this problem,
+molecular dynamics trajectories are often massive files which are slow to download and expensive
+to store at scale, so a solution which could prevent this duplication of storage and unnecessary 
+download step would be more ideal.


Suggested change

-download step would be more ideal.
+download step would provide greater utility for the computational molecular sciences ecosystem.

Also mention that this encourages FAIR data access: https://www.nature.com/articles/d41586-019-01720-7

hmacdope · 2024-09-24T01:08:13Z

joss_paper/paper.md

+Enter `Zarrtraj`, an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] which enables 
+streaming these trajectories from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data
+Lakes without ever downloading them using the standard `MDAnalysis` trajectory reader API.
+This is possible thanks to the `Zarr` [@Zarr:2024] package which allows streaming array-like


Add inspiration from geosciences community and cite the relevant papers

https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full

hmacdope · 2024-09-24T01:09:51Z

joss_paper/paper.md

+in the `H5MD` format [@H5MD:2014], which builds on top of `HDF5`, and the experimental `ZarrMD` format,
+which ports `H5MD` to the `Zarr` filetype. This work builds on the existing `MDAnalysis` `H5MDReader`
+[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
+and the file storage medium.


Note that the use of Zarr allows efficient slicing and seeking (possibly in parallel). Also, are we making use of this parallelism? Otherwise it's no different from mounting an S3 store with fsspec.

hmacdope · 2024-09-24T01:15:31Z

joss_paper/paper.md

+[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
+and the file storage medium.
+
+<!-- 


I can help write this bit, but now is a chance to be bold! Without boasting, what do we see the community looking like with this in place?

orbeckst

Shaping up nicely. Please see comments inline.

orbeckst · 2024-09-26T21:14:29Z

joss_paper/paper.md

+at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly 
+1/5 to 1/10 the speed it can iterate through the same trajectory on disk in XTC format \autoref{fig:benchmark}.
+However, it should be noted that this speed is influenced by network latency and that
+writing parallelized algorithms can offset this loss of speed.


I wouldn't put out bold claims without evidence. If you can show an example of "parallelized algorithms" then explain how and show data to back up your claim. Otherwise leave it out.

Basic rule of academic writing: when you write a statement, you back it up

- by showing your own data, or
- by citing someone else's work.

Otherwise you leave it out or mention it at the very end as possible avenues for future work.

orbeckst · 2024-09-26T21:18:44Z

joss_paper/paper.md

+    affiliations: 1
+  - name: Oliver Beckstein
+    orcid: 000-0003-1340-0831
+    affiliation: 1


For myself [1,2], Edis [1, 2]:

- name: Department of Physics, Arizona State University, Tempe, Arizona, United States of America
  index: 1
- name: Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
  index: 2

For yourself: something similar but for engineering and you could also add a present affiliation for SMS.

orbeckst · 2024-09-26T21:19:05Z

joss_paper/paper.md

+import zarrtraj
+import MDAnalysis as mda
+
+u = mda.Universe("sample_topology.top", "s3://sample-bucket-name/trajectory.h5md")


I would just say `topology.tpr`: "sample" is not needed, and `top` is not a topology file format that is used directly. You can also use `psf` or `pdb` instead of `tpr`.

orbeckst · 2024-09-26T21:21:58Z

joss_paper/paper.md

+This work builds on the existing `MDAnalysis` `H5MDReader`
+[@H5MDReader:2021], and similarly uses `NumPy` [@NumPy:2020] as a common interface in-between `MDAnalysis`
+and the file storage medium. `Zarrtraj` was inspired and made possible by similar efforts in the 
+geosciences community to align data practices with FAIR principles [@PANGEO:2022].


Move up instead of a final paragraph. This is more intro/methods. Reserve the last sentences for a bigger statement, such as the "envision" paragraph.

orbeckst · 2024-09-26T21:22:23Z

joss_paper/paper.md

+
+
+# Acknowledgements
+Thank you to Dr. Jenna Swarthout Goddard for supporting the GSoC program at MDAnalysis. 


A bit colloquial, write it as "We thank xxx for yyy."

hmacdope

Coming along great.

hmacdope · 2024-10-01T11:33:40Z

joss_paper/paper.md

+which extends the capability of `Zarr` by allowing it to read `HDF5` files.
+Because it implements the standard `MDAnalysis` trajectory reader API,
+`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis 
+algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format, 


Explicitly mention ability to read slices somewhere.

Citation for split-apply-combine

Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.

Have you tried it with 2.8.0-dev and the parallelized RMSD class?
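For reference, the split-apply-combine pattern over frame slices can be sketched with the standard library alone. This is an illustration of the paradigm under stated assumptions, not Zarrtraj's or MDAnalysis's implementation: `split_frames`, `block_sum`, and `parallel_mean` are hypothetical helpers, the "trajectory" is a plain list with one scalar per frame, and threads are chosen only because remote reads would be I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def split_frames(n_frames: int, n_blocks: int) -> list[slice]:
    """Split: partition frame indices 0..n_frames-1 into contiguous slices."""
    base, extra = divmod(n_frames, n_blocks)
    slices, start = [], 0
    for i in range(n_blocks):
        # The first `extra` blocks take one additional frame each.
        stop = start + base + (1 if i < extra else 0)
        if stop > start:
            slices.append(slice(start, stop))
        start = stop
    return slices


def block_sum(trajectory, frame_slice: slice) -> tuple[float, int]:
    """Apply: reduce one block of frames to a partial result (sum, count)."""
    values = trajectory[frame_slice]
    return sum(values), len(values)


def parallel_mean(trajectory, n_blocks: int = 4) -> float:
    """Combine: merge the per-block partial results into a single mean."""
    slices = split_frames(len(trajectory), n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        partials = list(pool.map(lambda s: block_sum(trajectory, s), slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


# Stand-in "trajectory" (assumption): one scalar observable per frame.
observable = [float(i) for i in range(10)]
print(parallel_mean(observable, n_blocks=3))
```

Each worker only touches its own slice, which is why efficient slicing in the reader matters: without it, every worker would have to read from the first frame.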

hmacdope · 2024-10-01T11:34:00Z

joss_paper/paper.md

+`Zarrtraj` can stream and write trajectories in the experimental `ZarrMD` 
+format, which ports the `H5MD` layout to the `Zarr` filetype.
+
+One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:


Suggested change

-One imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:
+Once imported, `Zarrtraj` allows passing trajectory URLs just like ordinary files:

hmacdope · 2024-10-01T11:34:45Z

joss_paper/paper.md

+```
+Initial benchmarks show that `Zarrtraj` can iterate
+through an AWS S3 cloud trajectory (load into memory one frame at a time)
+at roughly 1/2 or 1/3 the speed it can iterate through the same trajectory from disk and roughly 


Explicitly mention that this was done in serial.

orbeckst

See more comments (in addition to the earlier ones).

orbeckst · 2024-10-01T21:52:42Z

joss_paper/paper.md

+
+Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows 
+streaming trajectories into analysis software using an established trajectory format.
+`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that


"MDAKit" is not typeset in monospace.

I'd reserve monospace for code and perhaps package names (although I prefer italics for package names.)

MDAnalysis is a proper noun (name of the project) so no monospace here and elsewhere.

(I just put up an update to the MDA branding Style Guide to make this clearer: MDAnalysis/branding#10)

orbeckst · 2024-10-01T21:53:53Z

joss_paper/paper.md

+Enter `Zarrtraj`, the first fully-functioning tool to our knowledge that allows 
+streaming trajectories into analysis software using an established trajectory format.
+`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that
+enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014]


HDF5 is an abbreviation used as a proper noun and I would not typeset in monospace.

orbeckst · 2024-10-01T21:54:11Z

joss_paper/paper.md

+`Zarrtraj` is implemented as an `MDAnalysis` [@MDAnalysis:2016] `MDAKit` [@MDAKits:2023] that
+enables streaming MD trajectories in the popular `HDF5`-based H5MD format [@H5MD:2014]
+from AWS S3, Google Cloud Buckets, and Azure Blob Storage & Data Lakes without ever downloading them.
+This is possible thanks to the `Zarr` [@Zarr:2024] package which allows 


Zarr is a proper noun.

orbeckst · 2024-10-01T22:29:15Z

joss_paper/paper.md

+which extends the capability of `Zarr` by allowing it to read `HDF5` files.
+Because it implements the standard `MDAnalysis` trajectory reader API,
+`Zarrtraj` can leverage `Zarr`'s ability to read a file in parallel to perform analysis 
+algorithms in parallel using the "split-apply-combine" paradigm. In addition to the `H5MD` format, 


Citation for split-apply-combine

Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40 (1):1–29, 2011. doi: 10.18637/jss.v040.i01.

Have you tried it with 2.8.0-dev and the parallelized RMSD class?

orbeckst · 2024-10-01T22:31:43Z

joss_paper/paper.md

+However, it should be noted that this speed is influenced by network latency and that
+writing parallelized algorithms can offset this loss of speed.
+
+![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png)


Add more details on the test trajectory. Number of particles, number of frames, size (GB).

Is it one from MDAnalysisData? If so, provide details (e.g. the DOI of its location).

Make your paper as reproducible as possible.

Also the bandwidth of the network?
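The requested details also let readers sanity-check the benchmark themselves. As a rough sketch (the function names and the example numbers below are hypothetical, not the paper's test trajectory), the uncompressed size of the position data and a lower bound on transfer time follow directly from the particle count, frame count, and bandwidth, assuming 32-bit floats and ignoring compression, box and time metadata, and protocol overhead:

```python
def raw_position_bytes(n_atoms: int, n_frames: int, bytes_per_value: int = 4) -> int:
    """Uncompressed size of the position data: 3 coordinates per atom per frame."""
    return n_atoms * n_frames * 3 * bytes_per_value


def min_transfer_seconds(n_bytes: int, bandwidth_mbit_per_s: float) -> float:
    """Lower bound on transfer time at a given bandwidth in megabits per second."""
    return n_bytes * 8 / (bandwidth_mbit_per_s * 1_000_000)


# Hypothetical system (assumption): 100,000 atoms, 10,000 frames, float32 positions.
size = raw_position_bytes(100_000, 10_000)
print(f"{size / 1e9:.1f} GB")
print(f"{min_transfer_seconds(size, 1000):.0f} s at 1 Gbit/s")
```

Reporting these inputs next to the figure makes it clear whether a given streaming rate is limited by the network or by the reader.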

yuxuanzhuang · 2024-10-03T18:43:08Z

joss_paper/paper.md

+  - file-format
+  - mdanalysis
+  - zarr
+authors:


I’m happy to be included on the author list :)

yuxuanzhuang · 2024-10-03T18:55:30Z

joss_paper/paper.md

+    affiliation: 1
+  - name: Yuxuan Zhuang
+    orcid: 0000-0003-4390-8556
+    affiliations: 1


For me:
[1] Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
[2] Departments of Molecular and Cellular Physiology and Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA.

yuxuanzhuang · 2024-10-03T19:04:13Z

joss_paper/paper.md

+However, it should be noted that this speed is influenced by network latency and that
+writing parallelized algorithms can offset this loss of speed.
+
+![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0.\label{fig:benchmark}](benchmark.png)


Also the bandwidth of the network?

yuxuanzhuang · 2024-10-03T19:11:58Z

joss_paper/paper.md

+Other groups in the field recognize this same need for adherence to 
+FAIR principles [@FAIR:2019] including the MDDB (Molecular Dynamics Data Bank), an EU-scale 
+repository for biosimulation data [@MDDB:2024] and MDverse, a prototype search engine 
+for publicly-available Gromacs simulation data [@MDverse:2024].


all caps: GROMACS

yuxuanzhuang · 2024-10-03T19:18:24Z

joss_paper/paper.md

+Instead, the ability to process, analyze and share large volumes of data provide 
+new constraints on research in this field.
+
+Other groups in the field recognize this same need for adherence to 


I think MDsrv https://academic.oup.com/nar/article/50/W1/W483/6593534 should also be mentioned here.

other papers worth citing:

Bringing Molecular Dynamics Simulation Data into View https://www.cell.com/trends/biochemical-sciences/fulltext/S0968-0004(19)30137-9

GPCRmd uncovers the dynamics of the 3D-GPCRome
https://www.nature.com/articles/s41592-020-0884-y

Sharing Data from Molecular Simulations https://pubs.acs.org/doi/full/10.1021/acs.jcim.9b00665?casa_token=Ivo26Sn_xFAAAAAA%3AwJLKekUvaYfOPZyi97LPZR7zir_x9cWlcbEb8UOZN7ZTqeNpmPE96rsLChkot3xPeGUtR4Es42lx7VKFyg

When citing these, emphasize that they only provide predefined analysis results or simple geometric features.

yuxuanzhuang · 2024-10-03T19:26:27Z

joss_paper/paper.md

+analyses of large, conglomerate datasets from different sources, and training
+machine learning models without downloading and storing trajectory data.
+
+# Statement of need


Not all the contents below belong to Statement of need; maybe add one or more sections about Examples, Design Principles etc.

ljwoods2 · 2024-10-03T20:26:42Z

Accidentally merged this while removing a bug that I found when testing with mda 2.8.0 on another branch. I will address all the comments here and open a new PR.

ljwoods2 added 4 commits September 22, 2024 13:14

paper first draft

12089ba

typo

502821c

typo

b7a74f3

spelling

7ff2ef3

orbeckst reviewed Sep 23, 2024

hmacdope requested changes Sep 24, 2024

hmacdope reviewed Sep 24, 2024

ljwoods2 added 3 commits September 26, 2024 12:13

revisions

7c9cd2c

benchmak figures

cce22ad

minor tweaks

60104e4

orbeckst requested changes Sep 26, 2024

hmacdope reviewed Oct 1, 2024

update GSOC acknowledgement in joss_paper/paper.md

d83a5df

orbeckst requested changes Oct 1, 2024

yuxuanzhuang reviewed Oct 3, 2024

ljwoods2 merged commit d83a5df into main Oct 3, 2024
24 checks passed
