Simplify core 2 #13

Draft · wants to merge 33 commits into base: master

Conversation

yilongli (Member)

Made some more improvements atop #11.

yilongli and others added 30 commits October 6, 2020 01:56
…ation

Details:
- Disable DpdkDriver in CMakeLists.txt temporarily
- Remove unused method Driver::Packet::getMaxPayloadSize()
- Remove param `driver` in handleXXXPacket
- Change Driver::Packet to become a POD struct
- Change Driver::Packet::{address,priority} into params in Driver::sendPacket
- Remove the opaque Driver::Address
- Use IP packets as the common interface between transport and driver
- Extend Homa packet headers to include L4 src/dst port numbers
- Use SocketAddress (i.e., ip + port) as opposed to Driver::Address to identify
  the src/dst address of a message
- Reenable DPDK in CMakeLists.txt
- Initialize an ARP table at driver startup using the content of /proc/net/arp
- Select eth port via the symbolic name of the network interface (e.g., eno1d1)
  (the current implementation uses ioctl to obtain the IP and MAC addresses of
  a network interface; see the sketch after this list)
- Add a system test for DPDK driver: test/dpdk_test.cc
- Change IpAddress from typedef to a POD type to provide some type safety
- Add a third argument to Driver::receivePackets() to hold source addresses
  of ingress packets when the method returns
- Eliminate Driver::Packet (use Homa::PacketSpec instead)
- Move L4 header fields sport/dport into header prefix
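
The ioctl-based interface lookup mentioned in the list above can be sketched roughly as follows (Linux-specific; error handling omitted, and this is an illustration rather than the actual DpdkDriver code):

    // Query the IPv4 and MAC addresses of a network interface by name,
    // e.g. "eno1d1", using SIOCGIFADDR and SIOCGIFHWADDR.
    #include <cstring>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void queryInterface(const char* ifname, in_addr* ip, unsigned char mac[6])
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        ifreq ifr = {};
        std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        ioctl(fd, SIOCGIFADDR, &ifr);    // IPv4 address of the interface
        *ip = reinterpret_cast<sockaddr_in*>(&ifr.ifr_addr)->sin_addr;

        ioctl(fd, SIOCGIFHWADDR, &ifr);  // hardware (MAC) address
        std::memcpy(mac, ifr.ifr_hwaddr.sa_data, 6);

        close(fd);
    }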
Timeouts are now checked incrementally, with one bucket checked
every poll iteration. Previously, timeout checking was more
concentrated: checking was triggered less frequently, but all
buckets were checked once it was triggered. The previous design
allowed a poll iteration to complete quickly when no timeout
checking was triggered, but significantly increased poll()
execution time when checking did occur, causing large, periodic
latency spikes when using the transport. The new design slightly
increases the minimum poll() execution time but spreads the work
of checking timeouts more evenly over time, reducing latency
spikes.
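
A rough sketch of the incremental scheme described above (names like TimeoutBucket and NUM_BUCKETS are illustrative, not the actual implementation):

    // Each call to poll() examines only the next bucket in round-robin
    // order, so the cost of timeout checking is amortized across poll
    // iterations instead of being paid all at once.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct TimeoutBucket { /* list of timeouts hashed to this bucket */ };

    class IncrementalTimeoutChecker {
      public:
        void poll(uint64_t now)
        {
            checkBucket(buckets[nextBucket], now);  // one bucket per poll
            nextBucket = (nextBucket + 1) % NUM_BUCKETS;
        }

      private:
        static const std::size_t NUM_BUCKETS = 256;
        void checkBucket(TimeoutBucket&, uint64_t) { /* walk the bucket's timeout list (not shown) */ }
        std::array<TimeoutBucket, NUM_BUCKETS> buckets;
        std::size_t nextBucket = 0;
    };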
Homa will now ensure that an OutMessage is fully sent before it is
cleaned up if an application releases the message after calling
send(). Previously, released messages that were still in the process
of being sent were cancelled and immediately cleaned up. Cancelling
a message now requires an explicit call to cancel().
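
A minimal model of the new lifetime rule (illustrative only, not the actual Homa types): a released OutMessage stays alive until every packet has been transmitted, and only an explicit cancel() allows earlier reclamation.

    struct OutMessageState {
        bool releasedByApp = false;  // application dropped its handle
        bool cancelled = false;      // application called cancel()
        int packetsLeftToSend = 0;   // packets not yet transmitted

        // The Sender may reclaim the message only when the application has
        // released it AND it is either fully sent or explicitly cancelled.
        bool readyForCleanup() const
        {
            return releasedByApp && (cancelled || packetsLeftToSend == 0);
        }
    };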
Make the Perf::Counters::Stat::add(T val) operation thread-unsafe.
This significantly reduces the overhead of updating a Stat and is
acceptable because add() is only called on thread-local Stat
instances.
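
The idea, sketched with illustrative names: each thread updates only its own thread_local Stat, so add() can be a plain, non-atomic increment; any cross-thread synchronization belongs in the separate aggregation path.

    #include <cstdint>

    struct Stat {
        uint64_t value = 0;
        void add(uint64_t val) { value += val; }  // no atomics, no locks
    };

    thread_local Stat perThreadCounter;  // one Stat per thread

    void onEvent() { perThreadCounter.add(1); }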
Previously, the number of messages completed per call to trySend()
was limited by a std::array allocated on the stack. This was done
to reduce allocation overhead. New Perf microbenchmarks show that
reserving capacity in a std::vector is likely efficient enough and
would allow trySend() to seamlessly handle an unlimited number of
message completions per call.
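
A sketch of the change (illustrative, not the actual trySend() code): completed messages are collected in a std::vector whose capacity is reserved up front, avoiding per-call allocations in the common case while placing no hard limit on the number of completions.

    #include <vector>

    struct Message;  // opaque here

    void collectCompletions(std::vector<Message*>& completed)
    {
        completed.clear();
        completed.reserve(64);  // illustrative capacity hint; reserve() is a
                                // no-op once the capacity is already there
        // While sending, push each message that finishes transmission:
        //     completed.push_back(message);
        // The vector grows automatically if more than 64 messages complete.
    }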
The Sender will now skip pinging a message if the message still has
granted but unsent packets (i.e., the message is waiting on the
sender itself rather than on the receiver).
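
An illustrative version of the check (field names are assumptions): pinging asks the receiver about a message it may have lost state for, so it is skipped while the bottleneck is still on the sender's side.

    struct OutboundMessage {
        int packetsGranted = 0;  // packets the receiver has granted so far
        int packetsSent = 0;     // packets actually transmitted
    };

    bool shouldPing(const OutboundMessage& m)
    {
        // Granted-but-unsent packets mean the sender, not the receiver, is
        // the reason the message is stalled, so skip the ping.
        return m.packetsSent >= m.packetsGranted;
    }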
Will be used in future integration with non-polling interface.
Will be used by a future non-polling interface.
yilongli requested a review from cstlee October 28, 2020 05:33
cstlee (Member) left a comment

The clean up, especially of the interfaces, looks pretty good; I don't have any issues there. My main concern is with the locking.

Overall, it seems like this PR reduces the size/scope of the bucket and queue/scheduler mutex critical sections in most places. Unfortunately, doing this seems to have introduced potential race conditions and TOCTOU bugs. Perhaps I missed some critical factors that rule out these issues, but the locking did not make it obvious that the code is race-free. I would argue that using a locking style more like monitors with the larger critical sections makes it easier to convince ourselves that the logic is thread-safe. What was the motivation for using fine-grain critical sections? Can we achieve your goals and still have obviously thread-safe code? Let's find a time to discuss.

@@ -179,7 +177,6 @@ Driver::Packet*
 DpdkDriver::Impl::allocPacket()
 {
     DpdkDriver::Impl::Packet* packet = nullptr;
-    SpinLock::Lock lock(packetLock);
     static const int MBUF_ALLOC_LIMIT = NB_MBUF - NB_MBUF_RESERVED;
     if (mbufsOutstanding < MBUF_ALLOC_LIMIT) {
cstlee (Member)

mbufsOutstanding is now unprotected. Have you double checked to make sure there aren't synchronization and TOCTOU issues? At the very least, it should probably be declared atomic.

yilongli (Member, Author)

I decided to make the thread-safety of ObjectPool configurable via a template argument, which is generally useful anyway. Since I am lazy, I am going to revert all changes on DpdkDriver.
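
A sketch of what a compile-time thread-safety switch could look like (the template parameter and enterCS()/exitCS() helpers are assumptions about the shape of the change, not the actual ObjectPool code):

    #include <mutex>
    #include <new>
    #include <utility>

    template <typename T, bool ThreadSafe = true>
    class ObjectPoolSketch {
      public:
        template <typename... Args>
        T* construct(Args&&... args)
        {
            enterCS();
            void* backing = allocateBacking();  // pool-internal allocation
            exitCS();
            return new (backing) T(std::forward<Args>(args)...);
        }

      private:
        // When ThreadSafe is false, these calls compile down to nothing.
        void enterCS() { if (ThreadSafe) mutex.lock(); }
        void exitCS() { if (ThreadSafe) mutex.unlock(); }
        void* allocateBacking() { return ::operator new(sizeof(T)); }  // simplified
        std::mutex mutex;  // stand-in for Homa's SpinLock
    };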

src/Drivers/DPDK/DpdkDriverImpl.cc (Outdated)
src/ObjectPool.h
src/Timeout.h (Outdated)
 {
     list.remove(&timeout->node);
-    timeout->expirationCycleTime =
-        PerfUtils::Cycles::rdtsc() + timeoutIntervalCycles;
+    timeout->expirationCycleTime = now + timeoutIntervalCycles;
cstlee (Member)

Using a caller-provided now is problematic. There is no guarantee that the caller will provide monotonically increasing now values, which would break the timer list's ordering invariant.

yilongli (Member, Author)

Got it; reverted the change. By the way, this method won't be called very often anyway once my changes to the timeout mechanism are merged.

src/Sender.h (Outdated)
src/Receiver.cc (Outdated)
    {
        SpinLock::Lock lock_bucket(bucket->mutex);
        message = bucket->findMessage(id, lock_bucket);
    }
cstlee (Member)

Message might be invalid after bucket mutex release.

yilongli (Member, Author)

Extended the critical section to the entire method.

src/Receiver.h (Outdated)
src/Receiver.cc (Outdated)
    // Things that must be true (sanity check)
    assert(id == message->id);
    assert(message->driver == driver);
    // Copy the payload into the message buffer.
cstlee (Member)

Why copy the data out while processing the packet instead of when the application calls receive()? I can think of potential pros and cons but I'm wondering what you are considering.

yilongli (Member, Author)

If you want to be able to release the packet buffer back to the driver when this method returns, you have to do the copy here. In general, it's going to be faster to do the copy while the packet data is still hot in cache; this also helps reduce the number of RX buffers in use in the driver.
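
A rough sketch of the trade-off described here (helper names are stand-ins, not the actual Receiver code):

    #include <cstddef>
    #include <cstring>

    struct PacketBuf { const char* payload; std::size_t length; };

    // Stand-in for handing the RX buffer back to the driver (hypothetical).
    void returnBufferToDriver(PacketBuf* packet);

    void handleDataPacket(PacketBuf* packet, char* messageBuffer,
                          std::size_t offset)
    {
        // Copy while the payload is likely still warm in cache...
        std::memcpy(messageBuffer + offset, packet->payload, packet->length);
        // ...so the buffer can go back to the driver immediately, keeping
        // the number of RX buffers in use small. Deferring the copy to
        // receive() would instead pin the buffer until the application
        // asks for the data.
        returnBufferToDriver(packet);
    }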

src/Receiver.cc Outdated
    message->numResendTimeouts++;
    if (message->numResendTimeouts >= MESSAGE_TIMEOUT_INTERVALS) {
        // Message timed out before being fully received; drop the message.
        messageAllocator.destroy(message);
cstlee (Member)

Does this deadlock on ~Message()? The bucket->mutex is held and also used in the destructor, correct?

yilongli (Member, Author)

Yeah, good catch. I now use a UniqueLock and release the lock before calling dropMessage (the alternative is to duplicate functionality from dropMessage, like the original code did). I also revived dropMessage because it feels like information leakage to detach a message from the transport inside its destructor.
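
A minimal sketch of the fix (types simplified; Homa's SpinLock replaced by std::mutex for illustration):

    #include <mutex>

    struct Bucket { std::mutex mutex; };
    struct Message { /* ... */ };

    // Detaches the message from the transport; internally re-acquires the
    // bucket mutex, so it must not be called while that mutex is held.
    void dropMessage(Message* message);

    void checkResendTimeout(Bucket* bucket, Message* message)
    {
        std::unique_lock<std::mutex> lock(bucket->mutex);
        // ... decide that the message has timed out ...
        lock.unlock();          // release before re-entering bucket locking
        dropMessage(message);   // safe: no self-deadlock on the bucket mutex
    }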

src/Receiver.cc (Outdated)
yilongli (Member, Author)

> What was the motivation for using fine-grain critical sections? Can we achieve your goals and still have obviously thread-safe code?

There are three advantages to the fine-grained critical section approach. First, the code is more explicit about what needs to be protected. Second, we can flip the locking-order constraint to be queueMutex before bucket mutex and make the code in trySend() simpler. Third, it becomes easier to change the code to make sure no locks are held when calling into the driver. Obviously, I missed the point that the original code was also using the bucket mutex to achieve safe memory reclamation. I decided not to try the reference-counting approach, so I have now reverted to the original design (at least for the Sender; the problems in the Receiver seem rather easy to fix).
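
For illustration, the flipped constraint amounts to a fixed acquisition order, queue mutex first and bucket mutex second, on every code path (a sketch with simplified types, not the actual Sender code):

    #include <mutex>

    struct Bucket { std::mutex mutex; };
    std::mutex queueMutex;

    void trySendSketch(Bucket* bucket)
    {
        // Fixed lock order: queueMutex before any bucket mutex. Every path
        // that needs both must follow the same order to avoid deadlock.
        std::lock_guard<std::mutex> queueLock(queueMutex);
        std::lock_guard<std::mutex> bucketLock(bucket->mutex);
        // ... walk the send queue and update per-message bucket state ...
    }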

cstlee (Member) left a comment

Looks good. Mostly small comments. The possible remaining discussion topics are:

  • John's comments about the copying interface.
  • Test coverage
  • C++ version
  • Next steps

@@ -99,38 +102,66 @@ class ObjectPool {
     template <typename... Args>
     T* construct(Args&&... args)
     {
-        void* backing = NULL;
+        void* backing = nullptr;
+        enterCS();
cstlee (Member)

I like this idea of making the locking optional. These methods are small so it doesn't matter all that much, but I'm wondering if there is an easy RAII version of locking and unlocking; they tend to be less error-prone.
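
One possible RAII shape (a sketch, not the actual ObjectPool code): a guard that locks in the thread-safe configuration and compiles to nothing otherwise.

    #include <mutex>

    template <bool ThreadSafe>
    class OptionalLockGuard;

    // Thread-safe flavor: behaves like std::lock_guard.
    template <>
    class OptionalLockGuard<true> {
      public:
        explicit OptionalLockGuard(std::mutex& m) : mutex(m) { mutex.lock(); }
        ~OptionalLockGuard() { mutex.unlock(); }
      private:
        std::mutex& mutex;
    };

    // Thread-unsafe flavor: a no-op.
    template <>
    class OptionalLockGuard<false> {
      public:
        explicit OptionalLockGuard(std::mutex&) {}
    };

    // Usage inside construct():
    //     OptionalLockGuard<ThreadSafe> guard(mutex);
    //     void* backing = allocateBacking();
    //     // guard unlocks automatically at scope exit, even on exceptions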

        // A new Message needs to be entered into the scheduler.
        SpinLock::Lock lock_scheduler(schedulerMutex);
        schedule(message, lock_scheduler);
    } else if (message->scheduled) {
cstlee (Member)

Line 148 already says if (message->scheduled). One of these lines is probably wrong. Does this patch include test coverage? If so, a unit test is likely missing to test this case.


    // Deliver the message to the user of the transport.
    SpinLock::Lock lock_received_messages(receivedMessages.mutex);
    receivedMessages.queue.push_back(&message->receivedMessageNode);
cstlee (Member)

Can we safely access the message's members without the bucket_lock? If so, it should be documented as to why.

@@ -785,30 +599,27 @@ Receiver::trySendGrants()
  * Reminder to hold the Receiver::schedulerMutex during this call.
  */
 void
-Receiver::schedule(Receiver::Message* message, const SpinLock::Lock& lock)
+Receiver::schedule(Receiver::Message* message,
+                   [[maybe_unused]] SpinLock::Lock& lock)
cstlee (Member)

What version of C++/gcc are you assuming? The project currently checks for C++11.
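
For reference, [[maybe_unused]] is a C++17 attribute; under C++11 the same intent can be expressed by leaving the parameter unnamed or voiding it (a sketch, not the actual change):

    // C++11-compatible ways to mark a lock parameter that only documents a
    // locking requirement:
    void schedule(Receiver::Message* message,
                  const SpinLock::Lock& /* lock */);  // unnamed parameter

    // or, keeping the name and silencing the warning inside the body:
    //     (void)lock;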
