A minimal C++ example to reproduce the problem in #273 #331

shendiaomo opened this issue Sep 15, 2020 · 2 comments

shendiaomo commented Sep 15, 2020

As #273 explains, migrating the main goroutine from one thread to another causes many threads to be created and a large footprint.

In fact, the problem can also be reproduced in C++ by simulating that situation, so it may degrade performance in an online inference setting as well. As a workaround, we can always set OMP_NUM_THREADS to 1 to avoid the problem.

#include <stdlib.h>
#include <chrono>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include "torch/torch.h"


namespace nn = torch::nn;

using namespace std::chrono_literals;  // for the literal `ms`

std::mutex mu;  // serializes the output and the `forward` calls across threads

int main(int argc, char* argv[]) {
  std::string argv0 = argv[0];
  if (auto pos = argv0.rfind('/'); pos != std::string::npos) {
    argv0 = argv0.substr(pos + 1);
  }
  std::stringstream thread_count_command;
  thread_count_command << "ps -T|grep " << argv0 <<"| wc -l";
  std::cout << "Thread count command: " << thread_count_command.str() << std::endl;
  std::cout << std::string(20, '-') << std::endl;

  std::vector<std::thread> pool;
  auto model = nn::Conv2d(nn::Conv2dOptions(3, 64, 1).stride(1).bias(false));

  // Use a signed count so the loop below does not mix signed and unsigned.
  int total = std::thread::hardware_concurrency();
  if (argc > 1) total = std::atoi(argv[1]);

  for (int i = 0; i < total; ++i) {
    pool.push_back(std::thread([&, i] {
      int step = 0;
      while (true) {
        step += 1;
        {
          std::lock_guard<std::mutex> lock(mu);
          std::cout << "Thread "<< i << "(" << std::this_thread::get_id()
                    << "), step " << step << std::endl;
          std::cout << "#Threads before `forward`:" << std::endl;
          auto _ = system(thread_count_command.str().c_str());
          std::vector<torch::Tensor> data;
          while (data.size() < 32) data.push_back(torch::rand({3, 599, 599}));
          auto output = model->forward(torch::stack(data));
          std::cout << "#Threads after `forward`:" << std::endl;
          _ = system(thread_count_command.str().c_str());
          std::cout << std::string(20, '-') << std::endl;
        }
        std::this_thread::sleep_for(10ms); // Yield to another thread
      }
    }));
  }
  for (auto& t: pool) t.join();
}
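
As a side note, the thread count can also be read from inside the process instead of shelling out to `ps`. The following is only a sketch under the assumption that the program runs on Linux, where `/proc/self/task` has one entry per thread; `count_threads` is a hypothetical helper name and is not part of the program above.

```cpp
#include <dirent.h>

// Hypothetical helper (not from the original program): counts the threads of
// the current process by listing /proc/self/task, which on Linux contains one
// directory per thread. Returns -1 if the directory cannot be opened, for
// example on systems without /proc such as macOS.
int count_threads() {
  DIR* dir = opendir("/proc/self/task");
  if (dir == nullptr) return -1;
  int n = 0;
  while (const dirent* entry = readdir(dir)) {
    // Skip the "." and ".." entries; every other entry is named after a
    // thread id.
    if (entry->d_name[0] != '.') ++n;
  }
  closedir(dir);
  return n;
}
```

With such a helper, the two `system(...)` calls could be replaced by `std::cout << count_threads() << std::endl;`, which avoids spawning a shell on every step.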

Compile under the gotorch/cgotorch directory:

g++ -std=c++17 -I .. -I libtorch/include -I libtorch/include/torch/csrc/api/include -L linux/libtorch/lib many_threads.cpp  -O  -Wl,-rpath,libtorch/lib -lc10 -ltorch -ltorch_cpu -pthread

Typical output of the program in a Docker container with 6 cores:

Thread count command: ps -T|grep a.out| wc -l
--------------------
Thread 0(140561573615360), step 1
#Threads before `forward`:
7
#Threads after `forward`:
12
--------------------
Thread 1(140561565222656), step 1
#Threads before `forward`:
12
#Threads after `forward`:
17
--------------------
Thread 3(140561548437248), step 1
#Threads before `forward`:
17
#Threads after `forward`:
22
--------------------
Thread 4(140561540044544), step 1
#Threads before `forward`:
22
#Threads after `forward`:
27
--------------------
Thread 2(140561556829952), step 1
#Threads before `forward`:
27
#Threads after `forward`:
32
--------------------
Thread 5(140561461802752), step 1
#Threads before `forward`:
32
#Threads after `forward`:
37
--------------------
Thread 0(140561573615360), step 2
#Threads before `forward`:
37
#Threads after `forward`:
37
--------------------

wangkuiyi commented Sep 15, 2020

Without the expected output from the above program, I am not sure if I understand what it reveals.

I built and ran this program on my iMac with a quad-core Intel i5. The main function created 4 threads as expected, and there were always 6 threads in total. I am not sure whether 6 counts as "a lot of threads".

I re-ran the program with OMP_NUM_THREADS set to 1, and the result was the same: the main function created 4 threads and the process had 6 threads in total.

Then I set both OMP_NUM_THREADS and MKL_NUM_THREADS to 1, and the result was the same again.

The steps to build and run the above program include:

  1. Copy and paste it to /tmp/a.cc.
  2. cp -r $GOPATH/src/github.com/wangkuiyi/gotorch/cgotorch/libtorch /tmp/
  3. Run make with the following Makefile:

a : a.cc
	${CXX} -std=c++14 \
	-I .. \
	-I libtorch/include \
	-I libtorch/include/torch/csrc/api/include \
	-L libtorch/lib \
	-fPIC \
	$< \
	-o $@ \
	-Wl,-rpath,libtorch/lib \
	-lc10 -ltorch -ltorch_cpu \
	-D_GLIBCXX_USE_CXX11_ABI=1

shendiaomo commented

This problem is most likely caused by the function lazy_init_num_threads introduced in https://github.com/pytorch/pytorch/pull/37461/files#diff-7678d6e1a6fd4451bb1c23d73b3240a0R38-R45
This function is called by parallel_for and parallel_reduce, which in turn are used by aten/src/ATen/native/ConvolutionMM2d.cpp and many other ops.
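
To check whether the growth comes from the threading runtime rather than from libtorch itself, a reduced experiment without libtorch might help. The sketch below assumes Linux (for `/proc/self/task`) and a build with `-fopenmp -pthread`; without `-fopenmp` the pragma is ignored and no extra threads are expected. The names `count_threads` and `threads_gained_per_caller` are hypothetical. It also assumes, without having verified it against libtorch, that the mechanism is an OpenMP runtime giving each calling thread its own team of workers.

```cpp
#include <dirent.h>

#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical Linux-only helper: counts the threads of this process by
// listing /proc/self/task. Returns -1 if the directory cannot be opened.
int count_threads() {
  DIR* dir = opendir("/proc/self/task");
  if (dir == nullptr) return -1;
  int n = 0;
  while (const dirent* entry = readdir(dir)) {
    if (entry->d_name[0] != '.') ++n;  // skip "." and ".."
  }
  closedir(dir);
  return n;
}

// Starts `callers` long-lived std::threads. One at a time, each of them enters
// an OpenMP parallel region and records how many threads the process gained
// while that region ran.
std::vector<int> threads_gained_per_caller(int callers) {
  std::mutex mu;
  std::condition_variable cv;
  bool go = false;       // set once every caller thread has been created
  bool release = false;  // set once every caller has recorded its result
  std::vector<int> gained;
  std::vector<std::thread> pool;

  for (int i = 0; i < callers; ++i) {
    pool.emplace_back([&] {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return go; });
      const int before = count_threads();
      std::atomic<int> hits{0};
#pragma omp parallel
      { hits.fetch_add(1); }
      gained.push_back(count_threads() - before);
      cv.notify_all();
      // Stay alive so that any workers owned by this caller are kept around.
      cv.wait(lock, [&] { return release; });
    });
  }
  {
    std::unique_lock<std::mutex> lock(mu);
    go = true;
    cv.notify_all();
    cv.wait(lock, [&] {
      return gained.size() == static_cast<std::size_t>(callers);
    });
    release = true;
  }
  cv.notify_all();
  for (auto& t : pool) t.join();
  return gained;
}
```

Calling `threads_gained_per_caller(6)` and printing the result shows how many threads each caller added. If every caller adds roughly the same number, that would be consistent with the steps of 5 in the output above (7, 12, 17, ...), but this is a guess to be confirmed, not a measured result.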
