Thread-safe event bus
sudo apt-get install python3
pip3 install conan
cmake --preset={unix-release/unix-dev}
cmake --build {Release/Debug}
cmake --preset={windows-release-x64/windows-dev-x64}
cd {Release/Debug}
msbuild.exe HFT.sln /p:Configuration={Release/Debug} /p:Platform=x64
CPU : 12th Gen Intel(R) Core(TM) i7-1265U
Queue size 1000 with 64-byte blocks:

Million operations per second | Number of threads |
---|---|
16.5 | 1 |
11.5 | 2 |
9.5 | 4 |
7.5 | 8 |
cmake --preset unix-profile
cmake --build Profile
./scripts/valgrind.sh Profile/BenchMark
./BenchMark
ls gmon.out
gprof Profile/BenchMark gmon.out > profile.txt
valgrind --tool=callgrind ./BenchMark
kcachegrind profile.callgrind
I have added batched read/write: memory is acquired in batches, then filled and sent, to reduce contention between writer threads. In the acquire function:
+ static constexpr int BATCH = 5;
+ static thread_local int index = BATCH;
+ static thread_local uint8_t *data = nullptr;
+
+ if (index == BATCH) [[unlikely]] {
+     data = static_cast<uint8_t *>(superqueue::dequeue<superqueue::SyncType::MULTI_THREAD,
+                                   superqueue::Behavior::FIXED>(mempool->pool, BATCH));
+     if (data == nullptr) [[unlikely]]
+         return nullptr;
+     else
+         index = 0;
+ }
+
+ return data - (index++ * BLOCK);
After this change, however, there is no improvement, because the current bottleneck is on the single-threaded consumer side. I will try multi-consumer mode.
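The batching idea can be shown in isolation with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `DemoPool` and its `reserve` function are hypothetical stand-ins, not the superqueue API, and the blocks of a batch are assumed to sit at increasing addresses, so the sketch advances through the batch with `+`. The real pool's layout may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real memory pool: a flat buffer of fixed-size
// blocks handed out through a shared atomic cursor. Not the superqueue API.
struct DemoPool {
    static constexpr std::size_t BLOCK = 64;  // block size in bytes
    std::vector<std::uint8_t> storage;        // backing memory for all blocks
    std::atomic<std::size_t> next{0};         // index of the next free block
    std::size_t capacity;                     // total number of blocks

    explicit DemoPool(std::size_t blocks)
        : storage(blocks * BLOCK), capacity(blocks) {}

    // Reserve `count` contiguous blocks by advancing the shared cursor.
    // Returns nullptr when fewer than `count` blocks remain.
    std::uint8_t *reserve(std::size_t count) {
        assert(count > 0);
        std::size_t first = next.load(std::memory_order_relaxed);
        do {
            if (first + count > capacity)
                return nullptr;
        } while (!next.compare_exchange_weak(first, first + count,
                                             std::memory_order_relaxed));
        return storage.data() + first * BLOCK;
    }
};

// Per-thread batching: the shared cursor is touched once every BATCH calls,
// the remaining calls only read and update thread-local state.
inline std::uint8_t *acquire(DemoPool &pool) {
    static constexpr int BATCH = 5;
    static thread_local int index = BATCH;  // BATCH means "batch exhausted"
    static thread_local std::uint8_t *data = nullptr;

    if (index == BATCH) {
        data = pool.reserve(BATCH);
        if (data == nullptr)
            return nullptr;  // pool exhausted; index stays at BATCH
        index = 0;
    }
    return data + (index++ * DemoPool::BLOCK);
}
```

Because the thread-local state lives inside `acquire`, this sketch only makes sense with a single pool per process. It reduces traffic on the writers' shared cursor and does nothing for the consumer side, which is consistent with the result reported above.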
As mentioned in section 1, the problem is in the reader thread. According to the profiler, a considerable amount of time is spent in the vtable lookup for the events' process function. Since the events have only one level of polymorphism when calling process, I expected the compiler to optimize the virtual call away, but that expectation was wrong. So we will try static polymorphism, or pass a static function and use plain structs as events.
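One possible shape for the "static function plus plain struct" variant is sketched below. All names (`EventHeader`, `AddEvent`, `ResetEvent`, `make_event`, `consume`) are hypothetical and not taken from this repository, and the sketch has not been benchmarked. Each event is a plain struct whose first member holds a pointer to a static process function, so the consumer makes one indirect call without going through a vtable.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Every event starts with this header. `process` points at a static function,
// so dispatch is a single indirect call with no vtable involved.
struct EventHeader {
    void (*process)(const EventHeader *self, std::int64_t &state);
};

// Plain-struct event that adds `amount` to the consumer's state.
struct AddEvent {
    EventHeader header;
    std::int64_t amount;

    static void process(const EventHeader *self, std::int64_t &state) {
        // Valid because `header` is the first member of a standard-layout struct.
        state += reinterpret_cast<const AddEvent *>(self)->amount;
    }
};

// Plain-struct event that resets the consumer's state to zero.
struct ResetEvent {
    EventHeader header;

    static void process(const EventHeader *, std::int64_t &state) { state = 0; }
};

static_assert(std::is_standard_layout<AddEvent>::value,
              "the header cast relies on standard layout");
static_assert(std::is_trivially_copyable<AddEvent>::value,
              "events should be copyable as raw bytes into fixed-size blocks");

// Build an event with its header already pointing at the right static function.
template <typename Event, typename... Args>
Event make_event(Args... args) {
    return Event{EventHeader{&Event::process}, args...};
}

// Consumer side: only the header is needed to dispatch.
inline void consume(const EventHeader *event, std::int64_t &state) {
    event->process(event, state);
}
```

A producer would create an event with `make_event<AddEvent>(std::int64_t{5})` and the reader thread would call `consume(&event.header, state)`. The call is still indirect; whether it is measurably cheaper than the virtual call has to be confirmed with the profiler.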