
Generate charts for Fio benchmark #3549

Merged · 5 commits · Oct 3, 2023

Conversation

MVarshini (Contributor)

PBENCH-1214

Generate charts for Fio benchmark

@MVarshini added the Dashboard label on Sep 7, 2023
@MVarshini self-assigned this on Sep 7, 2023
@dbutenhof (Member) left a comment


You didn't update the pquisby version in server/requirements.txt, which I'd expected you to. What pquisby version are you using?

Aside from that, rather than making specific source comments, I pulled the branch and tried it, and what I'm seeing confuses me:

When I run with the current server lock (pquisby==0.0.17), visualization doesn't work at all for fio: I get errors like Quisby processing failure. Exception: list index out of range on either of the functional test fio runs and the new one we generated the other day.

On the other hand if I change to pquisby==0.0.25, which is the latest, one of the two "canned" functional test FIO runs (fio_rw_2018.02.01T22.40.57) "works" (although the display is a bit odd), but the other canned FIO run and the new one we generated the other day both fail with Quisby processing failure. Exception: not enough values to unpack (expected 2, got 0)

Are you using a different Pquisby version???

And with the one fio dataset that does work, I get
[chart screenshot]

Obviously it's not very "interesting", but that's OK. However, what does it mean that the caption says <>_d-<>_j-<>_iod? Is that intentional, or a formatting issue? (Or missing data?)

@MVarshini (Contributor, Author)

@dbutenhof I was pointing to the latest version of pquisby and didn't check in that file.

Since the disk-job value is unknown, <>_d-<>_j-<>_iod is left empty in the response.

@dbutenhof (Member)

> @dbutenhof I was pointing to the latest version of pquisby and didn't check in that file.

Hmm. So 0.0.25 is working for you? Because it wasn't for me, on two of the three fio datasets, which makes me nervous. Was there something else in your test environment that didn't get onto the PR branch, aside from the requirements.txt change? 😦

> Since the disk-job value is unknown, <>_d-<>_j-<>_iod is left empty in the response.

That's rather ugly, but it sounds like maybe that's some of the extra information Soumya was hoping to get from the server? (Although I'm not sure we know how to find it.)

@webbnh (Member) left a comment


I don't think we should leave the pquisby dependency floating. (I'm not going to block the merge on that basis, but I'm not approving, either....)

Also, I have a concern about the behavior of getChartValues() when the benchmark isn't suitable for visualization...where do we guard against that misbehavior?

Other than those, I have a few minor things for your consideration.

[Several review comments on dashboard/src/actions/comparisonActions.js, since resolved]

Comment on lines 158 to 161
}, Object.create(null));

for (const [key, value] of Object.entries(result)) {
  console.log(key);

  const map = {};
(Member)


Noting (new) line 158, should we be using Object.create(null) at line 161 instead of {}?
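
(For illustration only, not code from this PR: a minimal sketch of why a prototype-less object can be safer as a lookup map when the keys come from data.)

```js
// Minimal sketch: a plain {} inherits from Object.prototype, so keys such
// as "constructor" or "toString" appear to exist before anything has been
// assigned; an object created with Object.create(null) has no prototype,
// so only the keys we actually set are present.
const plain = {};
const bare = Object.create(null);

console.log("constructor" in plain); // true (inherited from the prototype)
console.log("constructor" in bare);  // false

bare.toString = "some data-driven key"; // safe: nothing inherited is shadowed
console.log(Object.keys(bare));         // [ 'toString' ]
```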

Comment on lines +187 to +189
export const toggleCompareSwitch = () => ({
  type: TYPES.TOGGLE_COMPARE_SWITCH,
});
(Member)


If you omit the trailing comma on line 188, I think this will all fit on one line.
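
(Purely as an illustration of that suggestion, the collapsed form would look like the line below; whether Prettier actually keeps it on one line depends on its configured print width.)

```js
export const toggleCompareSwitch = () => ({ type: TYPES.TOGGLE_COMPARE_SWITCH });
```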

-pquisby==0.0.17
+pquisby
(Member)


Are we sure that we want to let this "float"?

Given recent experiences, I think we should peg this at a "known good" version.

(Member)


I wondered at Webb's duplicate of my comment, but when I just started re-reviewing today, I realized that I never actually published that earlier comment. Oops!

So I'll just add here that pquisby is now at 0.0.27, vs 0.0.25 when I wrote that earlier comment, and I have no idea what's changed. Does it still work? I really do think we should be deliberately managing this dependency rather than letting it float.

[Review comment on dashboard/src/actions/comparisonActions.js, since resolved]
@sousinha97

sousinha97 commented Sep 8, 2023

@dbutenhof, in order to analyse Fio benchmark data, knowing the following is apparently important (I am not an expert on FIO):

  1. Number of disks used
  2. IODepth
  3. Numjobs

These are the fields we are adding to the graphs; if it's not possible to fetch this data, we can remove them from the graph. (It would have been great if we were able to extract this data, though, as it saves the user the effort of going back to their runs and manually checking the values of these fields.) I would love to know your views on this; accordingly we can plan and modify the graphs.

I guess these values are provided when running the fio command; if not, default values are used.

@dbutenhof (Member)

dbutenhof commented Sep 8, 2023

> @dbutenhof, in order to analyse Fio benchmark data, knowing the following is apparently important (I am not an expert on FIO):

We're definitely not FIO experts, either. We've pieced together how to run it in trivial configurations, but the data in the result.csv that Quisby consumes was determined years ago as part of a complicated Pbench Agent post-processing step we've barely touched.

>   1. Number of disks used

I assume we could figure out how to extract this information from the more detailed benchmark logs ... but I'm not even sure if it's global configuration or private to each iteration. (Or, for that matter, whether the Pbench "iteration" concept, which is how the Agent organizes the raw data, directly corresponds to the fio "job" configuration...)

Each fio "job" seems to be targeted to a specific filesystem path, which would define the disk used. Since the result.csv lists each Pbench Agent "iteration" (which I think equates directly to a fio "job" in the standard benchmark wrapper, but don't quote me on that), I suspect that each job is a single disk.

But a lot of these constraints are due to the design of the pbench-fio wrapper script. Right now, Quisby relies on the post-processed result.csv file which means we can only visualize/compare the output of pbench-fio. We also have people who run fio directly, for example, using pbench-user-benchmark, often to take advantage of the massive flexibility of the fio command configuration that's not supported by the Pbench Agent wrapper. Right now we have no way to visualize/compare those runs as we don't even know it's "fio" and we don't have the agent post-processing.

>   2. IODepth

This appears to be a global configuration setting on the fio job file, and we can extract it from a fio-generated JSON output file or from the raw text job description file.

>   3. Numjobs

As I mentioned above, I'm not entirely certain how the pbench-fio wrapper maps the fio "job" concept into Pbench Agent "iterations". I suspect (without much proof at this point) that it's one-to-one. However, a fio configuration file can apparently define multiple "jobs". We can read the list from the input file or fio's output to get it.
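
(A rough sketch of that idea, not code from this PR: it assumes fio was run with --output-format=json and that the output exposes a "global options" object and a "jobs" array, both of which may vary by fio version; the file name is hypothetical.)

```js
// Rough sketch only: pull iodepth and a job count out of a fio JSON output
// file. The "global options" / "jobs" layout is an assumption about fio's
// JSON format, and "fio-result.json" is a hypothetical local path.
const fs = require("fs");

const fioOut = JSON.parse(fs.readFileSync("fio-result.json", "utf8"));

const iodepth = (fioOut["global options"] || {}).iodepth || "unknown";
// The number of entries in "jobs" may not equal numjobs (e.g. when
// group_reporting collapses them), so treat this as an estimate.
const jobCount = Array.isArray(fioOut.jobs) ? fioOut.jobs.length : 0;

console.log(`iodepth=${iodepth}, jobs=${jobCount}`);
```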

> These are the fields we are adding to the graphs; if it's not possible to fetch this data, we can remove them from the graph. (It would have been great if we were able to extract this data, though, as it saves the user the effort of going back to their runs and manually checking the values of these fields.) I would love to know your views on this; accordingly we can plan and modify the graphs.

> I guess these values are provided when running the fio command; if not, default values are used.

We can figure this all out, but first we need to figure out the relative importance of figuring it out compared to the other stuff we need to do. 😄

FYI: I've summarized this at PBENCH-1274.

@webbnh (Member)

webbnh commented Sep 8, 2023

Just poking around a randomly selected FIO result, I found the generate-benchmark-summary.cmd file at the top level, which contains

/opt/pbench-agent/bench-scripts/postprocess/generate-benchmark-summary "fio" "--block-sizes=4,1024 --iodepth=8 --numjobs=10 --ramptime=10 --runtime=60 --samples=5 --targets=/fio --test-types=read,write --clients=192.168.122.211" "/var/lib/pbench-agent/fio__2023.09.06T12.05.46"

It looks like the values for iodepth and numjobs are there, and, extending Dave's suggestion above, I expect that the number of disks can be gleaned from the targets value there.

If we really want to dig, we have the fio.job files for each iteration, each of which contains a global section with the iodepth and a job-<dev> section for each device (which we can count to get the number of devices) each of which contains the number of jobs.

However, if (for the moment) we want to restrict ourselves to the information that we already have, I think that the Pbench Dashboard can glean the number of disks from the Pbench Server metadata, dataset.metalog.iterations/<iteration>.dev (i.e., in the metadata.log file, under each [iterations/...] section, there is a dev entry which contains the same value as the targets in the command above). And, I think Quisby can deduce the number of jobs by counting the iops_sec:client_hostname:* columns in the result.csv file. However, I don't see any way to get the I/O depth from the information that we already have on hand.

[I've added this information in a reply to PBENCH-1274.]
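
(As a rough illustration of the column-counting idea above, not code from this PR: the result.csv path and exact header layout are assumptions.)

```js
// Rough sketch only: estimate the number of jobs by counting the
// iops_sec:client_hostname:* columns in a result.csv header, as suggested
// above. "result.csv" is a hypothetical local path for illustration.
const fs = require("fs");

const headerLine = fs.readFileSync("result.csv", "utf8").split("\n")[0];
const jobColumns = headerLine
  .split(",")
  .filter((col) => col.trim().startsWith("iops_sec:client_hostname:"));

console.log(`estimated job count: ${jobColumns.length}`);
```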

@sousinha97

sousinha97 commented Sep 11, 2023

@dbutenhof @webbnh thanks for your input. Yes, this process requires a bit of discussion; we can surely look into other major issues first and keep this in the backlog as an optimisation task. I will contact Varshini and remove this feature for now.

@dbutenhof (Member)

No match for argument: rsyslog-mmjsonparse
Error: Unable to find a match: rsyslog-mmjsonparse

That's distressing. I'm not sure whether this is a transient failure in accessing repos or a real change in the dependencies. I'm going to re-trigger the build to see if it happens again.

@dbutenhof (Member) left a comment


I think Webb has some remaining good suggestions, but I don't believe any of them are critical.

> Also, I have a concern about the behavior of getChartValues() when the benchmark isn't suitable for visualization... where do we guard against that misbehavior?

The server API returns a useful message in this case, which the dashboard displays:

[screenshot of the error message]

I think that's fine.

I'm still slightly concerned about allowing pquisby to float, although the latest 0.0.27 seems to work.

On one hand I'd like to give Webb another chance to re-review, but I'm also tempted to merge it this week and deploy onto staging at least by the ops review on Thursday. I'll have to wrestle with that one; but for now I'm going to approve.

(Incidentally, I also think it's intriguing that Jenkins re-ran the CI with my container build fix even though Varshini's commits weren't rebased on top of that.)

@@ -12,7 +12,7 @@ flask-restful>=0.3.9
 flask-sqlalchemy
 gunicorn
 humanize
-pquisby==0.0.17
+pquisby
(Member)


While this is OK, given the inconsistency we've seen in pquisby PyPi packages, I think I'd prefer to see pquisby==0.0.25 if that's what we're supporting for uperf and fio visualization. We can change the version later as necessary.


@dbutenhof merged commit cc22946 into distributed-system-analysis:main on Oct 3, 2023
3 checks passed
Labels: Dashboard (Of and relating to the Dashboard GUI)
Project status: Done
4 participants