We should add a continually updated record of the `examples/second`, `seconds/batch`, and other statistics discussed in #51 to a new file `docs/Benchmarking.md` (or `ComputationalEfficiency.md`, etc.).
AFAIK, neither Kates-Harbeck et al. (2019) nor Svyatkovskiy (2017) discussed single-node or single-GPU computational efficiency, since they focused on the scaling of multi-node parallelism (CUDA-aware MPI).
Given that we have multiple active users of the software distributed across the country (world?), it would be good for collaboration to provide easily accessible performance expectations. The absence of these figures has already caused some confusion when we got access to V100 GPUs on the Princeton Traverse cluster.
We need to establish a benchmark or set of benchmarks for FRNN in order to measure and communicate consistent and useful metrics. For example, we could store measurements from only a single benchmark consisting of 0D and 1D `d3d` signal data with our LSTM architecture on a single GPU/device with `batch_size=256`. A user would then have to extrapolate the `examples/second` figure to the simpler network but longer average pulse lengths on JET when using `jet_data_0d`.
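As a purely illustrative sketch of how those two numbers could be collected (not existing FRNN code; `ThroughputLogger`, the `batch_size=256` default, and the use of Keras callbacks here are assumptions):

```python
# Illustrative sketch only: times each training batch and reports
# sec/batch and examples/second at the end of every epoch.
import time
import numpy as np
from tensorflow import keras  # assumes a tf.keras-style training loop


class ThroughputLogger(keras.callbacks.Callback):
    def __init__(self, batch_size=256):
        super().__init__()
        self.batch_size = batch_size
        self.batch_times = []

    def on_batch_begin(self, batch, logs=None):
        self._t0 = time.perf_counter()

    def on_batch_end(self, batch, logs=None):
        self.batch_times.append(time.perf_counter() - self._t0)

    def on_epoch_end(self, epoch, logs=None):
        sec_per_batch = float(np.median(self.batch_times))
        print(f"epoch {epoch}: {sec_per_batch:.4f} sec/batch, "
              f"{self.batch_size / sec_per_batch:.1f} examples/second")
        self.batch_times.clear()
```

Such a callback could be passed via `model.fit(..., callbacks=[ThroughputLogger(batch_size=256)])`, or the equivalent hook in whichever training entry point we decide to benchmark.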
The `conf.yaml` configuration choices that have first-order effects on performance include (see the sketch after this list):
Network architecture (LSTM vs. TCN vs. Transformer, inclusion of 1D data via convolutional layers, etc.)
Hyperparameters (number of layers, hidden units per layer, LSTM length, batch size, etc.)
Data set: pulse length of shots, number of features per timestep in the input vector, etc.
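A minimal sketch of extracting those fields so they can be stored alongside each benchmark figure; all of the key names below are placeholders and would need to be checked against the actual `conf.yaml` schema:

```python
# Illustrative sketch only; the conf.yaml key names here are hypothetical.
import yaml

# Performance-relevant fields to record next to each benchmark entry.
PERF_KEYS = [
    ("model", "rnn_type"),     # LSTM vs. TCN, etc. (placeholder key)
    ("model", "rnn_size"),     # hidden units per layer (placeholder key)
    ("model", "rnn_layers"),   # number of layers (placeholder key)
    ("model", "length"),       # LSTM length (placeholder key)
    ("training", "batch_size"),
    ("paths", "data"),         # dataset, e.g. d3d vs. jet_data_0d
]


def perf_config(path="conf.yaml"):
    with open(path) as f:
        conf = yaml.safe_load(f)
    record = {}
    for keys in PERF_KEYS:
        node = conf
        for k in keys:
            node = node.get(k) if isinstance(node, dict) else None
        record["/".join(keys)] = node
    return record
```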
Similar to #41, these figures will be useless in the long run unless we store details of their context, including (see the sketch after this list):
SHA-1 of the Git commit of the repository
Conda environment
CUDA, MPI, etc. libraries (Apex?)
Specific hardware details, including computer name, interconnect, specific model of GPU
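A hedged sketch of collecting that context automatically at benchmark time (standard command-line tools only; none of this is existing FRNN code):

```python
# Illustrative sketch: gather the context that should accompany every
# benchmark entry (git SHA, conda environment, CUDA version, GPU, hostname).
import json
import platform
import subprocess


def _run(cmd):
    try:
        return subprocess.check_output(cmd, text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unavailable"


def benchmark_context():
    return {
        "hostname": platform.node(),
        "git_sha": _run(["git", "rev-parse", "HEAD"]),
        "conda_env": _run(["conda", "env", "export"]),
        "cuda_version": _run(["nvcc", "--version"]),
        "gpu": _run(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]),
    }


if __name__ == "__main__":
    print(json.dumps(benchmark_context(), indent=2))
```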
Summary of hardware we have/had/will have access to for computational performance measurements:
K80 (OLCF Titan, ALCF Cooley, Princeton Tiger 1)
P100 (Princeton Tiger 2)
V100 (Princeton Traverse, OLCF Summit)
Intel KNL 7230 (ALCF Theta)
Even when hardware is retired (e.g. OLCF Titan), it would be good to keep those figures for posterity.
Also store MPI scaling metrics as discussed in the above papers?
Track memory usage on the GPU?
`examples/second` and `sec/batch` should already be independent of the pulse length, since the size of an example depends only on the sampling frequency `dt` and the LSTM length (T<sub>RNN</sub> in the Nature paper). But gross throughput statistics such as `seconds/epoch` could be normalized by pulse length.
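To make those relationships explicit, a back-of-the-envelope sketch in which every number is illustrative rather than measured:

```python
# Illustrative arithmetic only; no measured values.
batch_size = 256        # examples per batch
sec_per_batch = 0.10    # hypothetical measurement
examples_per_sec = batch_size / sec_per_batch      # = 2560 examples/second

# seconds/epoch grows with the total amount of data, hence with pulse length,
# so a pulse-length-normalized figure could look like:
sec_per_epoch = 900.0            # hypothetical measurement
mean_pulse_length_s = 5.0        # hypothetical mean shot duration
sec_per_epoch_per_pulse_s = sec_per_epoch / mean_pulse_length_s   # = 180.0
```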
`keras2c`, etc.?
`guarantee_preprocessed.py` runtimes? We should at least quote a single approximate runtime expectation.
Related to #58, #52, and #51.