
Read SP task crash dumps over the network #1921

Open · wants to merge 14 commits into master
Conversation

@mkeeter (Collaborator) commented Nov 18, 2024

Hubris counterpart to oxidecomputer/management-gateway-service#316, allowing us to read task dumps over the network via MGS.

These MGS messages, unlike dump-agent, require more intelligence on the Hubris side. The external API is just "start read" and "continue read"; Hubris is responsible for knowing how humpty writes data into RAM.

We allow up to four simultaneous readers, distinguished by caller-provided keys. If there are more than four simultaneous readers, then (1) what the heck, and (2) some readers will get an error indicating that their key is no longer valid.
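
As a rough sketch of that bookkeeping (the `Key`, `DumpReader`, `DumpState`, and `claim_slot` names and the key shape are made up for illustration, not the PR's actual types), the SP can keep a small fixed table of reader slots and evict the oldest entry when a fifth key shows up:

```rust
// Hedged sketch of the "up to four readers, keyed by caller-provided keys"
// idea; names and key shape are assumptions, not the PR's actual API.
const MAX_DUMP_CLIENTS: usize = 4;

#[derive(Copy, Clone, PartialEq, Eq)]
struct Key([u8; 16]);

#[derive(Copy, Clone)]
struct DumpReader {
    key: Key,
    age: u32, // monotonic counter used to pick an eviction victim
}

struct DumpState {
    clients: [Option<DumpReader>; MAX_DUMP_CLIENTS],
    counter: u32,
}

impl DumpState {
    /// Claim (or refresh) a slot for `key`. If all slots are busy, the oldest
    /// reader is evicted; its next "continue read" then fails because its key
    /// no longer matches any slot.
    fn claim_slot(&mut self, key: Key) -> usize {
        self.counter += 1;
        let slot = self
            .clients
            .iter()
            .position(|c| c.map(|c| c.key) == Some(key))
            .or_else(|| self.clients.iter().position(|c| c.is_none()))
            .unwrap_or_else(|| {
                // All slots busy: evict the least recently used reader.
                self.clients
                    .iter()
                    .enumerate()
                    .min_by_key(|(_, c)| c.map(|c| c.age).unwrap_or(0))
                    .map(|(i, _)| i)
                    .unwrap()
            });
        self.clients[slot] = Some(DumpReader { key, age: self.counter });
        slot
    }
}
```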

There are also minor tweaks to dump-agent and jefe implementations:

  • dump-agent learns how to read dump data into a lease (instead of returning a 256-byte value)
  • jefe caches the most recent dump area → memory address lookup, so we don't always need to start from the beginning of the linked list. This isn't useful if we have multiple readers, but makes the single-reader case less Accidentally Quadratic.
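
A rough sketch of that caching idea (the types and the `next` closure below are stand-ins, not the actual jefe code): remember the last resolved (index, address) pair and only walk forward from it when the requested index is at or past the cached one.

```rust
// Hedged sketch of caching the most recent dump-area lookup; `CachedArea`,
// `DumpAreas`, and the `next` closure are hypothetical stand-ins.
#[derive(Copy, Clone)]
struct CachedArea {
    index: u8,    // dump-area index
    address: u32, // memory address of that area's header
}

struct DumpAreas {
    head: u32,                 // address of the first dump-area header
    cache: Option<CachedArea>, // most recently resolved (index, address) pair
}

impl DumpAreas {
    /// Resolve `index` to a header address. `next` follows an area's `next`
    /// pointer and returns `None` at the end of the list.
    fn lookup(
        &mut self,
        index: u8,
        next: impl Fn(u32) -> Option<u32>,
    ) -> Option<u32> {
        // Start from the cached entry when it doesn't overshoot the target;
        // otherwise fall back to the head of the list.
        let (mut at, mut addr) = match self.cache {
            Some(c) if c.index <= index => (c.index, c.address),
            _ => (0, self.head),
        };
        // Walk forward one link at a time until we reach `index`.
        while at < index {
            addr = next(addr)?;
            at += 1;
        }
        self.cache = Some(CachedArea { index, address: addr });
        Some(addr)
    }
}
```

With this, a single reader streaming areas 0, 1, 2, … pays one link per request, while a reader that jumps backwards (or a second interleaved reader) simply falls back to the full walk, which matches the "less Accidentally Quadratic" framing above.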

@labbott (Collaborator) left a comment:

I'm a little bit iffy about failure modes for more than 4 clients. It seems like the only result should be bad data on the client end? A sufficiently malicious client would potentially only cause a DoS?

app/gimlet/base.toml (resolved)
@@ -30,6 +30,7 @@ enum Trace {
    GetDumpArea(u8),
    Base(u32),
    GetDumpAreaFailed(humpty::DumpError<()>),
    GetDumpAreaFailed2(Result<DumpArea, DumpAgentError>),

Collaborator:

Perhaps a different name? DumpAreaAgentFailure?

mkeeter (Author):

Oops, this was for debugging; removed

    offset: u32,
}

const MAX_DUMP_CLIENTS: usize = 4;

Member:

why 4? is this to allow two MGSes, plus...two more MGSes?

mkeeter (Author):

Mostly vibes, but I was thinking something like "max of two MGSes plus two tech port connections".

Comment on lines 44 to 80
for index in 0.. {
    // XXX accidentally quadratic!
    let data = self
        .agent
        .read_dump(index, 0)
        .map_err(|_e| SpError::Dump(DumpError::BadArea))?;
    let header = humpty::DumpAreaHeader::read_from(
        &data[..size_of::<humpty::DumpAreaHeader>()],
    )
    .unwrap_lite();
    if header.contents == humpty::DumpContents::SingleTask.into()
        && header.nsegments > 0
    {
        count += 1;
    }
    if header.next == 0 {
        break;
    }
}

Member:

I wonder whether it would make sense to have the dump agent just have a counter of the number of single-task dumps, so that we could just ask it for that, instead of probing every dump area? Looking at the dump-agent implementation, that might be somewhat difficult, as reinitialize_dump_from would have to read the reinitialized memory in order to reason about how many single-task dumps are being discarded...maybe this is the better approach?

}

// Move along to the next area if we're at the end
if state.offset + size_of::<humpty::DumpSegmentData>() as u32

Member:

considered obnoxious, feel free to ignore: size_of::<humpty::DumpSegmentData>() and size_of::<humpty::DumpAreaHeader>() are a little bit hard to tell apart in this function, because they're both long-ish expressions and the type parameter to size_of starts with humpty::Dump and my eyes glazed over at that point. I might consider doing something like

const HEADER_SIZE: u32 = size_of::<humpty::DumpAreaHeader>() as u32;
const SEGMENT_DATA_SIZE: u32 = size_of::<humpty::DumpSegmentData>() as u32;

and then referring to them by those constants elsewhere, so that it's more obvious that they're different things? but maybe this is just a "me thing", in which case you're welcome to ignore me.

mkeeter (Author):

Sure, done in 1ddd388

    #[cfg(feature = "dump")]
    dump_areas: u32,

    /// Cache of most recently checked dump area

Member:

might be nice to mention what this is used for?

mkeeter (Author):

Yup, done

task/jefe/src/main.rs (outdated, resolved)
if index == prev.index {
    // Easy case: we've already looked up this area
    Ok(prev)
} else if let Some(offset) = index.checked_sub(prev.index) {

Member:

I was going to suggest that this expression might be better represented as a match, since the two default cases could be combined, but...I think you can't write a match guard like

Some(prev) if let Some(offset) = index.checked_sub(prev.index) =>

which is a bit of a bummer. I suppose you could instead write

Some(prev) if index >= prev.index =>

and then do the subtraction in the body of the match arm, but using checked_sub both as the check and the actual subtraction feels much nicer, and the case where the checked_sub returns None would still have to be handled despite never actually happening, which feels quite sad...
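
For illustration only (the `Cached` type and both functions here are made up, not the jefe code), the two shapes being weighed look roughly like this: the if/else-if chain lets `checked_sub` serve as both the check and the subtraction, while the match form only allows a plain boolean guard (no `if let` guards on stable Rust) and so repeats the arithmetic inside the arm.

```rust
// Hedged illustration with hypothetical types.
#[derive(Copy, Clone)]
struct Cached {
    index: u8,
    address: u32,
}

// The if/else-if chain: `checked_sub` is both the test and the subtraction.
fn offset_if_else(prev: Option<Cached>, index: u8) -> Option<u8> {
    if let Some(prev) = prev {
        if index == prev.index {
            Some(0) // easy case: same area as last time
        } else if let Some(offset) = index.checked_sub(prev.index) {
            Some(offset) // walk `offset` areas forward from the cache
        } else {
            None // cache is past the requested index; restart from the head
        }
    } else {
        None
    }
}

// The match form: the guard must be a plain boolean, so the subtraction is
// repeated inside the arm instead of reusing `checked_sub` as the test.
fn offset_match(prev: Option<Cached>, index: u8) -> Option<u8> {
    match prev {
        Some(prev) if index >= prev.index => Some(index - prev.index),
        _ => None,
    }
}
```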

@mkeeter (Author) commented Nov 20, 2024

> I'm a little bit iffy about failure modes for more than 4 clients. It seems like the only result should be bad data on the client end? A sufficiently malicious client would potentially only cause a DoS?

Yeah, that's correct – if I open 5x terminals and run

faux-mgs --log-level=debug --interface en0 dump read --index 0

in all of them, the first one stops with Error: Error response from SP: dump: invalid key
