Mpi base implementation #31
Conversation
Cuda staging
Not a good work-division strategy; there are some weird remnants of C++ and an inconsistent naming style.
```diff
+	n_gpus=$(shell nvidia-smi --query-gpu=name --format=csv,noheader | wc -l); \
+	mpirun --oversubscribe -np $$n_gpus ./speed_gpu ./weights_and_biases.txt ./tensors 100000

 test: build
-	./speed_gpu ./weights_and_biases.txt ./tensors 1000000
+	n_gpus=$(shell nvidia-smi --query-gpu=name --format=csv,noheader | wc -l); \
+	mpirun --oversubscribe -np $$n_gpus ./speed_gpu ./weights_and_biases.txt ./tensors 1000000
```
We do not need to oversubscribe on the final build.
```diff
@@ -1,6 +1,7 @@
 #include "matrix.cuh"
 #include <dirent.h>
+#include <iostream>
```
Why is there C++ in here?
```diff
@@ -156,11 +157,23 @@ __global__ void infer(float* d_inputs, int* d_results, matrix** d_weights, matri
 }

 int main(int argc, char* argv[]) {
+    MPI_Init(&argc, &argv);
+    int TotalProcess, ProcessId;
```
Capitalisation doesn't follow the code style.
```diff
+    int local_input_count = input_count / TotalProcess + (ProcessId < (input_count % TotalProcess) ? 1 : 0);
+    int start_idx = ProcessId * (input_count / TotalProcess) + std::min(ProcessId, input_count % TotalProcess);
```
Why is there weird capitalised naming and C++ functions?
```diff
 int counter = 0;
 while ((entry = readdir(dir)) != NULL) {
     if (entry->d_type == DT_REG) {
         strcpy(file_num_str, entry->d_name);
         file_num_str[strlen(entry->d_name) - 7] = '\0';
         file_num = atoi(entry->d_name);
-        strcpy(file_name, directory_path);
-        strcat(file_name, "/");
-        strcat(file_name, entry->d_name);
-        read_tensor((float*)&inputs[(file_num - 1) * 225], file_name);
+        if (file_num >= start_idx + 1 && file_num <= start_idx + local_input_count) {
+            strcpy(file_name, directory_path);
+            strcat(file_name, "/");
+            strcat(file_name, entry->d_name);
+            read_tensor(&inputs[counter * 225], file_name);
+            counter++;
+        }
```
Do not divide the inputs between GPUs; divide the inferences.
```diff
-for (int i = 0; i < input_count; i++) {
+for (int i = 0; i < local_input_count; i++) {
     infer<<<BLOCKS, THREADS_PER_BLOCK>>>(d_inputs, d_results, d_weights, d_biases, it_num, i);
-    err = cudaGetLastError();
-    if (err != cudaSuccess) {
-        printf("CUDA error: %s\n", cudaGetErrorString(err));
-    }
+    CUDA_CHECK(cudaGetLastError());
```
Rearrange to divide more healthily
```diff
+    printf("Process %d - Total: %lu us\n", ProcessId,
+           (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
```
We also want timing from the root process for how long the whole run takes, not just per-process times.
Not tested with an actual 8-GPU setup; feel free to open a PR with changes.