Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds dynamic shared mem allocation to cuda kernels #413

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 26 additions & 1 deletion build.rs
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
use std::process::Command;

fn main() {
let cores = num_cpus::get();
let tpcl2 = (cores as f64).log2().floor() as u32;
Expand All @@ -6,6 +8,7 @@ fn main() {
println!("cargo:rerun-if-changed=src/hvm.c");
println!("cargo:rerun-if-changed=src/run.cu");
println!("cargo:rerun-if-changed=src/hvm.cu");
println!("cargo:rerun-if-changed=src/get_shared_mem.cu");
println!("cargo:rustc-link-arg=-rdynamic");

match cc::Build::new()
Expand All @@ -23,17 +26,39 @@ fn main() {
}

// Builds hvm.cu
if std::process::Command::new("nvcc").arg("--version").stdout(std::process::Stdio::null()).stderr(std::process::Stdio::null()).status().is_ok() {
if Command::new("nvcc").arg("--version").stdout(std::process::Stdio::null()).stderr(std::process::Stdio::null()).status().is_ok() {
if let Ok(cuda_path) = std::env::var("CUDA_HOME") {
println!("cargo:rustc-link-search=native={}/lib64", cuda_path);
} else {
println!("cargo:rustc-link-search=native=/usr/local/cuda/lib64");
}

// Compile get_shared_mem.cu
let shared_mem_value = Command::new("nvcc")
.args(&["src/get_shared_mem.cu", "-o", "get_shared_mem"])
.output()
.and_then(|_| Command::new("./get_shared_mem").output())
.ok()
.and_then(|output| if output.status.success() {
Some(String::from_utf8_lossy(&output.stdout).trim().to_string())
} else {
None
})
.unwrap_or_else(|| {
println!("cargo:warning=\x1b[1m\x1b[31mWARNING: Failed to get shared memory size. Using default value.\x1b[0m");
"0x2000".to_string()
});

// Clean up temporary executable
let _ = std::fs::remove_file("get_shared_mem");

println!("cargo:warning=\x1b[1m\x1b[33mShared memory size set to: {}\x1b[0m", shared_mem_value);

cc::Build::new()
.cuda(true)
.file("src/run.cu")
.define("IO", None)
.define("HVM_SHARED_MEM", Some(shared_mem_value.as_str()))
.flag("-diag-suppress=177") // variable was declared but never referenced
.flag("-diag-suppress=550") // variable was set but never used
.flag("-diag-suppress=20039") // a __host__ function redeclared with __device__, hence treated as a __host__ __device__ function
Expand Down
24 changes: 24 additions & 0 deletions src/get_shared_mem.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
#include <cuda_runtime.h>
#include <cstdio>

int main() {
int device = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);

size_t sharedMemPerBlock = prop.sharedMemPerBlock;
int maxSharedMemPerBlockOptin;
cudaDeviceGetAttribute(&maxSharedMemPerBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

size_t maxSharedMem = (sharedMemPerBlock > (size_t)maxSharedMemPerBlockOptin) ? sharedMemPerBlock : (size_t)maxSharedMemPerBlockOptin;

// Subtract 3KB (3072 bytes) from the max shared memory as is allocated somewhere else
maxSharedMem -= 3072;
Comment on lines +15 to +16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you compute this? I think this is from the automatic variables in some of the kernels. I think overall this is a fine approach for now but if we change the 3KB alloc in the runtime this will have to change too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sum of the alloc size of some local shared arrays is roughly ~3KB, i know that this is not the perfect way to do this, but it works for now.


// Calculate the hex value
unsigned int hexValue = (unsigned int)(maxSharedMem / 12);

printf("0x%X", hexValue);

return 0;
}
9 changes: 7 additions & 2 deletions src/hvm.cu
Original file line number Diff line number Diff line change
Expand Up @@ -126,9 +126,14 @@ struct RBag {
Pair lo_buf[RLEN];
};

// Default value for shared memory (96KB)
#ifndef HVM_SHARED_MEM
#define HVM_SHARED_MEM 0x2000
#endif

// Local Net
const u32 L_NODE_LEN = 0x2000;
const u32 L_VARS_LEN = 0x2000;
const u32 L_NODE_LEN = HVM_SHARED_MEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't there places in the code that expect L_NODE_LEN to have this exact value?

If you change it I think we'll start to get either memory leaks, out of bounds access or non local memory use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if this number is smaller then default, the programs compiled by bend won't run.

If it's smaller the performance will also tank incredibly fast. (see the issue where someone halved this number and got worse performamce than in the cpu)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't there places in the code that expect L_NODE_LEN to have this exact value?

i wouldn't say that it expects to have this value, but rather that it expects to have the numbers of a device with 8.9 of compute capability, but then again, this is not a general optimization of the CUDA version, this is just to make the allocation of shared mem on local net dynamic, so that users can install the hvm without having to manually change it on the hvm.cu, the rest of the code is still the same, made with the numbers of a 4090 in mind. We can slowly improve the code to make it device based, instead of hard-coded to the specs of a 4090.

Also, if this number is smaller then default, the programs compiled by bend won't run.

can you show me an example of what was ran and what was the device/numbers used when this happened?

If it's smaller the performance will also tank incredibly fast. (see the issue where someone halved this number and got worse performamce than in the cpu)

i mean, that's of course, if you give it less memory, it will have less memory, besides the fact that the rest of the code is 'optimized' for a 4090.

we can like, whenever someone installs it using a GPU with <99KB of max shared mem per block, we give a warning saying something along the lines of:
"HVM is currently optimized to run on devices with >=96KB of max shared mem, please be aware that your GPU performance will be reduced dramatically"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show me an example of what was ran and what was the device/numbers used when this happened?

Also, if this number is smaller then default, the programs compiled by bend won't run.

can you show me an example of what was ran and what was the device/numbers used when this happened?

The number of nodes in the local buffer determines the maximum size of hvm definitions. If we decrease L_NODE_LEN then Bend also needs to generate programs with smaller maximum definition size.

const u32 L_VARS_LEN = HVM_SHARED_MEM;
struct LNet {
Pair node_buf[L_NODE_LEN];
Port vars_buf[L_VARS_LEN];
Expand Down
Loading