No Speedup in distributed gmres? #1732

marco-butz · 2024-11-25T14:27:11Z

marco-butz
Nov 25, 2024

Hi everyone,

I am using ginkgo to solve large sparse complex linear systems with a system Matrix A of dimensions up to about 14400000x14400000. On a single Nvidia A100 solving the system takes about ten minutes. I employ ginkgo in an iterative optimization setting, where performance is a critical factor. Each Node where I run my computations on is equipped with up to four (some eight) Nvidia A100. Therefore, I want to use the distributed solving feature of ginkgo.

I tried to adapt the "The distributed-solver program" from the documentation. However, I do not see any speedup. Even more, my implementation with MPI is slower the more MPI processes I use.

Performance vs. number of A100s employed:

I checked through nvidia-smi, that the problem indeed seems to be distributed over multiple GPUs. The memory allocated on each A100 in case of two processes is about half of the memory allocated in the single-gpu case, which also makes sense to me. Only slightly above 50% of each GPU is used though. A single process uses 100%. The solution is correct, independent of how many GPUs are involved.

Below is my code, does anyone see any obvious mistakes? Loading the matrices is a bit of a mess because I employ the fix mentioned in #1731 . The correct functionality of this approach is confirmed by a ginkgo program not employing any distributed features.

Any help would be very much appreciated.

Best regards,
Marco

#include <iostream>

#include <chrono>
#include <fstream>
#include <ginkgo/ginkgo.hpp>
#include <fast_matrix_market/app/Eigen.hpp>

int main(int argc, char* argv[])
{
    const gko::experimental::mpi::environment env(argc, argv);

    using GlobalIndexType = gko::int32;
    using LocalIndexType = gko::int32;
    using ValueType = std::complex<double>;
    using RealValueType = gko::remove_complex<ValueType>;
    using vec = gko::matrix::Dense<ValueType>;
    using real_vec = gko::matrix::Dense<RealValueType>;
    using dist_vec = gko::experimental::distributed::Vector<ValueType>;
    using dist_mtx = gko::experimental::distributed::Matrix<ValueType, LocalIndexType, GlobalIndexType>;
    using part_type = gko::experimental::distributed::Partition<LocalIndexType, GlobalIndexType>;
    using solver = gko::solver::Gmres<ValueType>;

    const char* fileNameA = argv[1];
    const char* fileNameB = argv[2];
    const char* fileNameX0 = argv[3];
    const char* fileNameResult = argv[4];
    auto max_iters = std::stoi(argv[5]);
    auto tolerance = std::stof(argv[6]);
    bool verbose = std::string(argv[7]) == "true";

    const auto comm = gko::experimental::mpi::communicator(MPI_COMM_WORLD);
    const auto rank = comm.rank();

    int device_id = gko::experimental::mpi::map_rank_to_device_id(MPI_COMM_WORLD, gko::CudaExecutor::get_num_devices());

    if (verbose) {
      std::cout << "device_id: " << device_id << " reporting in" << std::endl;
      std::cout << "comm size: " << comm.size() << std::endl;
    }

    auto gpu = gko::CudaExecutor::create(device_id, gko::OmpExecutor::create());

    auto start_A_eigen = gko::experimental::mpi::get_walltime();
    Eigen::SparseMatrix<std::complex<double>, Eigen::RowMajor> A_eigen;
    std::ifstream A_stream(fileNameA);
    fast_matrix_market::read_matrix_market_eigen(A_stream, A_eigen);

    Eigen::VectorXcd b_eigen;
    Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> matB;
    std::ifstream b_stream(fileNameB);
    fast_matrix_market::read_matrix_market_eigen_dense(b_stream, matB);
    b_eigen = matB.col(0);

    Eigen::VectorXcd x_eigen;
    Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> matX;
    std::ifstream x0_stream(fileNameX0);
    fast_matrix_market::read_matrix_market_eigen_dense(x0_stream, matX);
    x_eigen = matX.col(0);

    auto end_A_eigen = gko::experimental::mpi::get_walltime();
    if (verbose && comm.rank() == 0) {
        std::cout << "Reading matrices took " << end_A_eigen - start_A_eigen << " seconds." << std::endl;
    }

    auto row_ptrs = static_cast<GlobalIndexType*>(A_eigen.outerIndexPtr());
    auto col_idxs = static_cast<GlobalIndexType*>(A_eigen.innerIndexPtr());
    auto values = static_cast<ValueType*>(A_eigen.valuePtr());

    auto row_ptrs_view = gko::array<GlobalIndexType>::view(gpu->get_master(), A_eigen.outerSize() + 1, row_ptrs);
    auto col_idxs_view = gko::array<GlobalIndexType>::view(gpu->get_master(), A_eigen.nonZeros(), col_idxs);
    auto values_view = gko::array<ValueType>::view(gpu->get_master(), A_eigen.nonZeros(), values);

    auto gko_dim = gko::dim<2>{static_cast<gko::size_type>(A_eigen.rows()),
                               static_cast<gko::size_type>(A_eigen.cols())};

    auto A_loaded = gko::share(gko::matrix::Csr<ValueType, GlobalIndexType>::create(
         gpu->get_master(), gko_dim, std::move(values_view), std::move(col_idxs_view),
         std::move(row_ptrs_view)));

    auto values_b = static_cast<ValueType*>(b_eigen.data());
    auto values_view_b = gko::array<ValueType>::view(gpu->get_master(), b_eigen.rows(), values_b);
    auto gko_dim_b = gko::dim<2>{static_cast<gko::size_type>(b_eigen.rows()), 1};
    auto b_loaded = gko::matrix::Dense<ValueType>::create(gpu->get_master(), gko_dim_b, std::move(values_view_b), 1);

    auto values_x = static_cast<ValueType*>(x_eigen.data());
    auto values_view_x = gko::array<ValueType>::view(gpu->get_master(), x_eigen.rows(), values_x);
    auto gko_dim_x = gko::dim<2>{static_cast<gko::size_type>(x_eigen.rows()), 1};
    auto x_loaded = gko::matrix::Dense<ValueType>::create(gpu->get_master(), gko_dim_x, std::move(values_view_x), 1);

    auto start_distribution = gko::experimental::mpi::get_walltime();

    auto partition = gko::share(part_type::build_from_global_size_uniform(
      gpu->get_master(),
      comm.size(),
      static_cast<GlobalIndexType>(b_eigen.rows())));

    comm.synchronize();

    auto A_host = gko::share(dist_mtx::create(gpu->get_master(), comm));
    auto x_host = dist_vec::create(gpu->get_master(), comm);
    auto b_host = dist_vec::create(gpu->get_master(), comm);

    gko::matrix_data<ValueType, GlobalIndexType> A_mat_data;
    A_loaded->write(A_mat_data);
    gko::matrix_data<ValueType, GlobalIndexType> b_mat_data;
    b_loaded->write(b_mat_data);
    gko::matrix_data<ValueType, GlobalIndexType> x_mat_data;
    x_loaded->write(x_mat_data);
    A_host->read_distributed(A_mat_data, partition);
    b_host->read_distributed(b_mat_data, partition);
    x_host->read_distributed(x_mat_data, partition);

    auto A = gko::share(dist_mtx::create(gpu, comm));
    auto x = dist_vec::create(gpu, comm);
    auto b = dist_vec::create(gpu, comm);
    A->copy_from(A_host);
    b->copy_from(b_host);
    x->copy_from(x_host);

    comm.synchronize();

    auto end_distribution = gko::experimental::mpi::get_walltime();
    if (verbose && comm.rank() == 0) {
        std::cout << "Setting up distributed solving took " << end_distribution - start_distribution << " seconds." << std::endl;
    }

    auto start_solve = gko::experimental::mpi::get_walltime();

    std::shared_ptr<const gko::log::Convergence<ValueType>> logger =
        gko::log::Convergence<ValueType>::create();

    auto solver_obj =
        solver::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(max_iters),
                gko::stop::ResidualNorm<>::build().with_reduction_factor(tolerance))
            .with_krylov_dim(30)
            .on(gpu);

    solver_obj->add_logger(logger);
    solver_obj->generate(A)->apply(b, x);

    comm.synchronize();

    auto res_norm = gko::clone(gpu->get_master(),
                               gko::as<real_vec>(logger->get_residual_norm()));

    auto end_solve = gko::experimental::mpi::get_walltime();
    if (verbose && rank == 0) {
      std::cout << "Solving took " << end_solve - start_solve << " seconds." << std::endl;
      std::cout << "Final Res norm: " << res_norm->at(0, 0)
               << "\nIteration count: " << logger->get_num_iterations()
               << std::endl;
    }

    if (rank == 0) {
      auto x_master = gko::clone(gpu->get_master(), x->get_local_vector());

      auto start_write_result = gko::experimental::mpi::get_walltime();

      auto output_file = std::ofstream(fileNameResult);
      gko::write(output_file, gko::as<vec>(x_master.get()), gko::layout_type::array);
      output_file.close();

      auto end_write_result = gko::experimental::mpi::get_walltime();
      if (verbose && rank == 0) {
          std::cout << "Writing result took " << end_write_result - start_write_result << " seconds." << std::endl;
      }
    }
}

MarcelKoch · 2024-11-25T15:29:51Z

MarcelKoch
Nov 25, 2024
Maintainer

Looking at your code, I can't see any obvious issues. To check that there is nothing wrong with your system, you could try using our benchmarks, which are part of our repository. You have to build the benchmark solver_distributed_dcomplex (this is also the cmake target). This benchmark also allows you to read in your matrix file, but that might not be the best idea due to the issue you described previously.
You can run the benchmark with a synthesized matrix by providing the input:

[
        {
                "size": 1400000,
                "stencil": "27pt",
                "comm_pattern": "stencil",
                "optimal": { "spmv": "csr-csr" }
        }
]

If you save this as in.json, you can run the benchmark as:

mpirun -n <n> ./solver_distributed_dcomplex --solvers=gmres --gmres_restart=30 --max_iters=100 --detailed=false < in.json

If this gives normal speedup behavior, then I would guess that the performance issues are due to matrix partition. Maybe something more sophisticated like metis or scotch is necessary to reduce the communication overhead.

Also, how is your MPI configured? Does it support communication with device pointers? If so you can set during cmake -DGINKGO_HAVE_GPU_AWARE_MPI=ON, which will remove copying the communication buffers between the host and device.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No Speedup in distributed gmres? #1732

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

No Speedup in distributed gmres? #1732

marco-butz Nov 25, 2024

Replies: 1 comment

MarcelKoch Nov 25, 2024 Maintainer

marco-butz
Nov 25, 2024

MarcelKoch
Nov 25, 2024
Maintainer