Separate trace collection from trace decoding. #99

vext01 · 2022-10-14T13:48:50Z

We are about to add our own trace decoder (as an alternative to libipt) to hwtracer. In order to do this we need the ability to have separate notions of trace collection and trace decoding.

In other words, instead of having a PerfPT backend, we now have a PerfTraceCollector and a LibIPTTraceDecoder.

(There was also something called a Dummy backend, which I saw no utility to, and have thus removed)

This is a large change, that will be hard to review I'm afraid.

In doing this work I've realised how crusty and unidiomatic some of the hwtracer code is. Regardless I've tried hard to resist the urge to uncrustify unrelated parts of code. I will fix some of this stuff in follow up PRs.

yk companion: ykjit/yk#580

src/collect/mod.rs

src/collect/perf/mod.rs

src/decode/libipt/mod.rs

src/decode/mod.rs

ltratt · 2022-10-14T14:35:44Z

I've tried to comment on things that I think are either new-ish or are easily changed. Nothing major.

ltratt · 2022-10-14T15:05:57Z

Just one comment left!

ltratt · 2022-10-14T15:17:48Z

Please squash.

vext01 · 2022-10-14T15:19:19Z

splat

ltratt · 2022-10-14T15:19:34Z

bors r+

99: Separate trace collection from trace decoding. r=ltratt a=vext01 Co-authored-by: Edd Barrett <[email protected]>

bors · 2022-10-14T16:22:09Z

Build failed:

buildbot/yk-buildbot-build-script-hw

vext01 · 2022-10-17T14:04:31Z

Just had a quick look at this.

We are maxing out the trace buffer, which is currently set (by default) to 1024 pages (a page is 4096 bytes by default on a typical linux system, so that's about 4MiB, by my calculations).

The test in question does in indeed test a "long" trace: it calls SystemTime::now().elapsed().unwrap().subsec_nanos() in a loop 3000 times.

Shall we try bumping the default trace buffer size?

ltratt · 2022-10-17T14:05:07Z

Yes, 4MiB is very small. Just make it 64MiB (at least for testing purposes) and be done with it IMHO.

vext01 · 2022-10-17T14:11:18Z

Something like that? (untested)

ltratt · 2022-10-17T14:12:48Z

Actually, are we sure pages are a fixed size? I don't think that's true anymore, at least on Linux. We probably have to read the page size value from somewhere to calculate this.

vext01 · 2022-10-17T14:24:01Z

Page size is configurable. That's what was implied by "typical Linux system" in the commit message.

We can use the (surprisingly well-named) getpagesize() if you like.

ltratt · 2022-10-17T14:25:33Z

We can use the (surprisingly well-named) getpagesize() if you like.

Yes we should. There's no point not doing so as we'll simply forget and confuse our future selves!

vext01 · 2022-10-18T13:35:19Z

a0ba640 takes into account variable page sizes.

978aa7f fixes the intermittent test failures we spoke about in person (the commit message addresses some of the questions we had).

In doing the latter I've found a nasty bug, which I will raise an issue to explain. We should deal with this in a separate PR.

vext01 · 2022-10-18T15:20:31Z

Also pushed af4f168 as a result of in-person conversation.

ltratt · 2022-10-19T09:03:02Z

src/collect/mod.rs

@@ -25,6 +25,10 @@ const PERF_DFLT_INITIAL_TRACE_BUFSIZE: size_t = 1024 * 1024; // 1MiB
 /// The interface offered by all trace collectors.
 pub trait TraceCollector: Send + Sync {
    /// Return a `ThreadTraceCollector` for tracing the current thread.
+    ///
+    /// Note that an instance of `ThreadTraceCollector` is not thread local, and an attempt to


I find this really difficult to parse. Can we simplify it to "You must not start two TraceCollectors in the same thread. Doing so is is undefined behaviour". Also I think this function (or the trait?) should be unsafe.

You must not start two TraceCollectors in the same thread

But that's not strictly true. You can start two different ThreadTraceCollectors on the same thread as long as their "tracing sessions" don't overlap (in terms of time).

unsafe

Yep, forgot about that one!

I was thinking yesterday: why does a user facing ThreadTraceCollector even exist? If we had TraceCollector::start_thread_collector(), then this issue (I think) disappears? There can we a hidden thread-local ThreadTraceCollector and the problem described in #101 I think disappears too.

I don't know the internal API, so I'm happy with whatever change you suggest (but I do want the comments to be understandable by those of us whose English is not as good as yours :)).

I'm actually OK with making a slightly stronger restriction than what the (current) API allows, so how about something really simple, like:

You may only call thread_tracer() once per thread. Calling it more than once per-thread will result in undefined behavior.

This will force the user to only ever have one hanging around.

you may only obtain a thread collector once (but the collector is reusable).

overly restricts the user, but is very easy to explain. In other words, there are cases where it is safe to obtain > 1 thread collector, but if we tell the user they must obtain only one, it rules all unsafe cases out.

The test suite cannot abide by these rules though.

A thread may obtain multiple ThreadTraceCollectors but must only collect a trace with one at a time.

precisely describes when it is safe to obtain >1 thread collector.

I hope that makes sense?

[The more I see the confusion, the more I want to fix the API]

If we can fix the API without difficulty, that's probably more productive than trying to work out how to encode a complex condition in prose!

I don't think it will take that long to fix, but I would like to do it as a fresh PR if that's OK.

Are you OK with putting a placeholder comment here for now if I promise that my next PR fixes the API?

We could use:

A thread may obtain multiple ThreadTraceCollectors but must only collect a trace with one at a time.

Even if it's a little weak.

(Actually I already have a boring docstring branch to go in after this one -- if that one could go in next, I could save time resolving conflicts later)

ltratt · 2022-10-19T09:06:34Z

src/collect/perf/collect.c

@@ -66,7 +66,7 @@
 #define SYSFS_PT_TYPE   "/sys/bus/event_source/devices/intel_pt/type"
 #define MAX_PT_TYPE_STR 8

-#define MAX_OPEN_PERF_TRIES  20000
+#define MAX_OPEN_PERF_TRIES  50000


I would suggest that adding a pthread_yield or sched_yield (or even just "sleep 100ms" or similar!) wherever this loop is will be more robust: it will make it much more likely that the other thread acquiring the lock will succeed before we try again.

We can do, although I thought yielding was kind of frowned upon? (don't ask me why, I have a vague recollection of reading something on a mailing list)

(btw, there's already a sleep:

hwtracer/src/collect/perf/collect.c

Line 70 in 978aa7f

#define OPEN_PERF_WAIT_NSECS 1000 * 30

)

Yielding doesn't always do what people expect (e.g. in spinlocks).

1000 * 30 nanoseconds is 1/33333th of a second -- under such circumstances sleep might not even do anything! I'm fine with sleeping, but let's make it least 1/100th of a second.

Fine by me. I don't remember where these numbers even came from TBH.

I went digging. Here's the discussion on the "first version" of # of iterations and how long to sleep:
#43 (comment)

We have bumped the numbers on a couple of occasions since though.

Not sure if that changes anything? Shall we still use 1/100th of a sec?

Personally I'd use at least 1/100th of a second, perhaps as much as 1/10th. In the grand scheme of things, small delays like that don't really slow down the test suite to any noticeable degree.

vext01 · 2022-10-19T13:58:18Z

I've pushed what I propose we do for this PR.

For clarity, next PRs:

1. easy docstring/tidying PR (work already done)
1. fix API

vext01 · 2022-10-20T08:56:37Z

I have a branch with a preliminary API redesign now. Let's move this one forward?

ltratt · 2022-10-20T08:57:52Z

Please squash.

vext01 · 2022-10-20T09:25:20Z

splat.

ltratt · 2022-10-20T09:40:37Z

bors r+

99: Separate trace collection from trace decoding. r=ltratt a=vext01 Co-authored-by: Edd Barrett <[email protected]>

bors · 2022-10-20T09:46:12Z

Build failed:

buildbot/yk-buildbot-build-script-hw

vext01 · 2022-10-20T09:48:16Z

  ┌─ unicode-ident 1.0.5 (registry+https://github.com/rust-lang/crates.io-index):4:13
  │
4 │ license = "(MIT OR Apache-2.0) AND Unicode-DFS-2016"
  │            -^^^----^^^^^^^^^^------^^^^^^^^^^^^^^^^
  │            ││      │               │
  │            ││      │               rejected: not explicitly allowed
  │            ││      accepted: license is explicitly allowed
  │            │accepted: license is explicitly allowed
  │            license expression retrieved via Cargo.toml `license`
  │

vext01 · 2022-10-20T09:57:19Z

OK, so strum has pulled in this unicode license and the ISC license.

with the unicode one says:

 either

    (a) this copyright and permission notice appear with all copies of the Data Files or Software, or
    (b) this copyright and permission notice appear in associated Documentation.

Are we a "copy of the software"?

ltratt · 2022-10-20T09:58:41Z

IMHO this license is incredibly unclear (I've encountered it in another project) but the only plausible thing we can do is to allow it in cargo deny and hope that we've done the right thing.

vext01 · 2022-10-20T10:06:31Z

Or eliminate strum from our code. Which would be a shame...

vext01 · 2022-10-20T10:19:55Z

OK with those fixes?

ltratt · 2022-10-20T10:29:05Z

Please squash.

We are about to add our own trace decoder (as an alternative to libipt) to hwtracer. In order to do this we need the ability to have separate notions of trace collection and trace decoding. In other words, instead of having a `PerfPT backend`, we now have a `PerfTraceCollector` and a `LibIPTTraceDecoder`. (There was also something called a `Dummy` backend, which I saw no utility to, and have thus removed) This is a large change, that will be hard to review I'm afraid. In doing this work I've realised how crusty and unidiomatic some of the hwtracer code is. Regardless I've tried hard to resist the urge to uncrustify unrelated parts of code. I will fix some of this stuff in follow up PRs.

This makes the trace buffer 64MiB, adjusts the sleep interval between attempts to open perf and ups the number of attempts before giving up.

ykjit#101 We will need to fix this API in the near future.

vext01 · 2022-10-20T10:31:27Z

splat.

ltratt · 2022-10-20T10:32:00Z

bors r+

bors · 2022-10-20T10:39:03Z

Build succeeded:

buildbot/yk-buildbot-build-script-hw

vext01 assigned ltratt Oct 14, 2022

vext01 mentioned this pull request Oct 14, 2022

Adapt to the new hwtracer design. ykjit/yk#580

Merged