As #273 explains, migrating the main goroutine from one OS thread to another causes many threads to be created and a large memory footprint.
In fact, the problem can also be reproduced by simulating the situation in C++. As a result, it may degrade performance in an online inference setting. Of course, we can always set OMP_NUM_THREADS to 1 to avoid the problem.
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include "torch/torch.h"

namespace nn = torch::nn;
using namespace std::chrono_literals;  // for the literal `ms`

std::mutex mu;

int main(int argc, char* argv[]) {
  // Derive the program name from argv[0] so we can grep for it in `ps -T`.
  std::string argv0 = argv[0];
  if (auto pos = argv0.rfind('/'); pos != std::string::npos) {
    argv0 = argv0.substr(pos + 1);
  }
  std::stringstream thread_count_command;
  thread_count_command << "ps -T | grep " << argv0 << " | wc -l";
  std::cout << "Thread count command: " << thread_count_command.str() << std::endl;
  std::cout << std::string(20, '-') << std::endl;

  std::vector<std::thread> pool;
  auto model = nn::Conv2d(nn::Conv2dOptions(3, 64, 1).stride(1).bias(false));
  auto total = std::thread::hardware_concurrency();
  if (argc > 1) total = std::atoi(argv[1]);
  for (int i = 0; i < total; ++i) {
    pool.push_back(std::thread([&, i] {
      int step = 0;
      while (true) {
        step += 1;
        {
          std::lock_guard<std::mutex> lock(mu);
          std::cout << "Thread " << i << " (" << std::this_thread::get_id()
                    << "), step " << step << std::endl;
          std::cout << "#Threads before `forward`:" << std::endl;
          auto _ = system(thread_count_command.str().c_str());
          std::vector<torch::Tensor> data;
          while (data.size() < 32) data.push_back(torch::rand({3, 599, 599}));
          auto output = model->forward(torch::stack(data));
          std::cout << "#Threads after `forward`:" << std::endl;
          _ = system(thread_count_command.str().c_str());
          std::cout << std::string(20, '-') << std::endl;
        }
        std::this_thread::sleep_for(10ms);  // Yield to other threads
      }
    }));
  }
  for (auto& t : pool) t.join();
}
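The thread-count pipeline that the program invokes via `system` can also be run by hand to watch a process's thread count from outside; a minimal sketch, assuming the binary is named `main` (the name is an assumption):

```shell
# Count the kernel threads of processes whose name matches "main".
# `ps -T` lists one row per thread; `wc -l` counts the matching rows.
ps -T | grep main | wc -l
```

Note that `ps -T` only shows threads of processes attached to the current terminal; `ps -eT` covers all processes.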
Without the expected output of the above program, I am not sure I understand what it is supposed to reveal.
On my iMac with a quad-core Intel i5, I built and ran this program. The main function created 4 threads as expected, and there were always 6 threads in total. I am not sure whether 6 counts as "a lot of threads"?
I re-ran the program with OMP_NUM_THREADS set to 1; the result was the same: the main function created 4 threads and the process had 6 threads in total.
Then I set both OMP_NUM_THREADS and MKL_NUM_THREADS to 1, and the result was the same again.
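For reference, the re-runs above can be reproduced by setting the environment variables per invocation; a minimal sketch, assuming the binary is named `main` and takes the thread count as its first argument (both are assumptions):

```shell
# Re-run with OpenMP pinned to a single thread.
OMP_NUM_THREADS=1 ./main 4

# Re-run with both OpenMP and MKL pinned to a single thread.
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./main 4
```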
The steps to build and run the above program include:
Compile under the gotorch/cgotorch directory.
A typical output of the program on a Docker container with 6 cores:
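The exact build command is not shown here; a rough sketch of compiling such a program directly against an unpacked libtorch, where the libtorch path and library names are assumptions rather than the repo's actual build script:

```shell
# Assumed libtorch location; adjust to your install.
LIBTORCH=/path/to/libtorch
g++ -std=c++17 main.cc -o main \
    -I"$LIBTORCH/include" -I"$LIBTORCH/include/torch/csrc/api/include" \
    -L"$LIBTORCH/lib" -ltorch -ltorch_cpu -lc10 \
    -Wl,-rpath,"$LIBTORCH/lib" -pthread
```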