Mpi base implementation #31
Conversation
Cuda staging
Not a good work-division strategy; there are some weird remnants of C++ and an inconsistent naming style.
```diff
+	n_gpus=$(shell nvidia-smi --query-gpu=name --format=csv,noheader | wc -l); \
+	mpirun --oversubscribe -np $$n_gpus ./speed_gpu ./weights_and_biases.txt ./tensors 100000

 test: build
-	./speed_gpu ./weights_and_biases.txt ./tensors 1000000
+	n_gpus=$(shell nvidia-smi --query-gpu=name --format=csv,noheader | wc -l); \
+	mpirun --oversubscribe -np $$n_gpus ./speed_gpu ./weights_and_biases.txt ./tensors 1000000
```
We do not need to oversubscribe on the final build.
```diff
@@ -1,6 +1,7 @@
 #include "matrix.cuh"
 #include <dirent.h>
+#include <iostream>
```
Why is there C++ in here?
```diff
@@ -156,11 +157,23 @@ __global__ void infer(float* d_inputs, int* d_results, matrix** d_weights, matri
 }

 int main(int argc, char* argv[]) {
+    MPI_Init(&argc, &argv);
+    int TotalProcess, ProcessId;
```
Capitalisation doesn't follow the code style.
```diff
+    int local_input_count = input_count / TotalProcess + (ProcessId < (input_count % TotalProcess) ? 1 : 0);
+    int start_idx = ProcessId * (input_count / TotalProcess) + std::min(ProcessId, input_count % TotalProcess);
```
Why is there weird capitalised naming and C++ functions?
```diff
 int counter = 0;
 while ((entry = readdir(dir)) != NULL) {
     if (entry->d_type == DT_REG) {
         strcpy(file_num_str, entry->d_name);
         file_num_str[strlen(entry->d_name) - 7] = '\0';
         file_num = atoi(entry->d_name);
-        strcpy(file_name, directory_path);
-        strcat(file_name, "/");
-        strcat(file_name, entry->d_name);
-        read_tensor((float*)&inputs[(file_num - 1) * 225], file_name);
+        if (file_num >= start_idx + 1 && file_num <= start_idx + local_input_count) {
+            strcpy(file_name, directory_path);
+            strcat(file_name, "/");
+            strcat(file_name, entry->d_name);
+            read_tensor(&inputs[counter * 225], file_name);
+            counter++;
+        }
```
Do not divide the inputs between GPUs; divide the inferences.
```diff
-for (int i = 0; i < input_count; i++) {
+for (int i = 0; i < local_input_count; i++) {
     infer<<<BLOCKS, THREADS_PER_BLOCK>>>(d_inputs, d_results, d_weights, d_biases, it_num, i);
-    err = cudaGetLastError();
-    if (err != cudaSuccess) {
-        printf("CUDA error: %s\n", cudaGetErrorString(err));
-    }
+    CUDA_CHECK(cudaGetLastError());
```
Rearrange to divide more healthily
```diff
+    printf("Process %d - Total: %lu us\n", ProcessId,
+           (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
```
We also want timing from the root process for how long the whole run takes, not just per-process times.
Not tested with an actual 8-GPU setup; feel free to open a PR with changes.