- Energy Efficient Architecture for Graph Analytics Accelerators
- A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators
- System Simulation with gem5 and SystemC
- GAIL: The Graph Algorithm Iron Law
- Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
- Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics
- Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
- Alleviating Irregularity in Graph Analytics Acceleration: A Hardware/Software Co-Design Approach
- GNN Performance Optimization
- Dissecting the Graphcore IPU Architecture
- Using the Graphcore IPU for Traditional HPC Applications
- Roofline: An Insightful Visual Performance Model for Multicore Architectures
- CUDA: New Features and Beyond
- A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
- BrainTorrent: A Peer-to-Peer Environment for Decentralized Federated Learning
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
- Softshell: Dynamic Scheduling on GPUs
- Gravel: Fine-Grain GPU-Initiated Network Messages
- SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
- Automatic Graph Partitioning for Very Large-scale Deep Learning
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
- Interferences between Communications and Computations in Distributed HPC Systems
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
- Trends in Data Locality Abstractions for HPC Systems
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message-driven parallel framework for extreme-scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
- SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
- GPU-to-CPU Callbacks
- PyTorch: An Imperative Style, High-Performance Deep Learning Library -> Zero technical depth. Please give my time back.
- Automatic Graph Partitioning for Very Large-scale Deep Learning
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
- Analyzing Put/Get APIs for Thread-collaborative Processors
- Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time
- Interferences between Communications and Computations in Distributed HPC Systems
- Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Deep Residual Learning for Image Recognition
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
- Toward a Scalable and Distributed Infrastructure for Deep Learning Applications
- A Data-Centric Optimization Framework for Machine Learning
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Unpublished paper (will update if it's accepted)
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
- A Data-Centric Optimization Framework for Machine Learning
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
- ✅ Moore's Law is ending
- A new golden age for computer architecture
- Abstract machine models and proxy architectures for exascale computing
- ✅ Trends in Data Locality Abstractions for HPC Systems
- ✅ Merrimac: Supercomputing with Streams
- Synergistic Processing in Cell's Multicore Architecture
- Knights Landing: Second-Generation Intel Xeon Phi Product
- ✅ Roofline: an insightful visual performance model for multicore architectures
- ExaSAT: An exascale co-design tool for performance modeling
- hwloc: A generic framework for managing hardware affinities in HPC applications
- Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
- ✅ Benchmarking GPUs to tune dense linear algebra
- ✅ Brook for GPUs: Stream Computing on Graphics Hardware
- OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures
- Productivity and performance using partitioned global address space languages
- Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines
- Chill: A Framework for High Level Loop Transformations
- Pluto: A Practical and Automatic Polyhedral Program Optimization System
- Cilk: An Efficient Multithreaded Runtime System
- StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
- Legion: expressing locality and independence with logical regions
- Charm++: A Portable Concurrent Object Oriented System Based on C++
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message-driven parallel framework for extreme-scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot