Added pipeline analysis for OP Admin Dashboard #42

Open
TeachMeTW wants to merge 4 commits into master from Add-Pipeline-Analysis

Conversation

TeachMeTW

Add Pipeline Analysis for OP Admin Dashboard

Description

This PR introduces pipeline analysis notebooks for the OP Admin Dashboard. Two versions are included to comply with repository guidelines on data privacy and output handling:

  1. Pipeline Analysis with Masked Outputs

    • File: pipeline_analysis_with_output.ipynb
    • Description: Contains masked outputs to protect sensitive information. Suitable for public viewing and aggregate analyses.
  2. Pipeline Analysis without Outputs

    • File: pipeline_analysis_no_output.ipynb
    • Description: All outputs have been cleared to ensure no sensitive or individual-specific data is exposed. Suitable for individual analyses.

Changes

  • Added pipeline_analysis_with_output.ipynb with masked outputs.
  • Added pipeline_analysis_no_output.ipynb with outputs cleared.

@shankari
Contributor

@TeachMeTW why do you only have 8 entries?

@TeachMeTW
Author

TeachMeTW commented Oct 31, 2024

@shankari What do you mean by 8 entries?
For me it shows:
Total documents in Stage_timeseries with metadata.key 'stats/pipeline_time': 215954

Oh, I see now. On the aggregate, for August 20, 2023 (Row 1), there were 8 entries.

What is the expected count of entries?
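A quick way to sanity-check per-day counts is to bucket the timing documents by calendar day. This is a minimal pure-Python sketch; the document shape and the `"ts"` field name are assumptions, not the actual Stage_timeseries schema.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical documents mimicking Stage_timeseries entries with
# metadata.key == 'stats/pipeline_time'; the "ts" field name is an
# assumption for illustration.
docs = [
    {"metadata": {"key": "stats/pipeline_time"}, "data": {"ts": 1692489600}},  # 2023-08-20
    {"metadata": {"key": "stats/pipeline_time"}, "data": {"ts": 1692493200}},  # 2023-08-20
    {"metadata": {"key": "stats/pipeline_time"}, "data": {"ts": 1692576000}},  # 2023-08-21
]

# Count entries per UTC calendar day; a surprisingly low count for one
# day is a red flag that the query was truncated somewhere upstream.
per_day = Counter(
    datetime.fromtimestamp(d["data"]["ts"], tz=timezone.utc).date().isoformat()
    for d in docs
    if d["metadata"]["key"] == "stats/pipeline_time"
)
print(per_day)  # Counter({'2023-08-20': 2, '2023-08-21': 1})
```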

@shankari
Contributor

shankari commented Oct 31, 2024

More seriously, in cell 13 we see only 8 entries, which makes me think that the rest of the stats (all the visualizations) are based on 8 entries.
Also, it is not clear what the units on the y-axis are.

@TeachMeTW
Author

@shankari I figured it out -- I had accidentally put a limit on the query. It should be fixed now, and the results look much better. I also added axis labels.
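The bug class here is worth spelling out: a leftover result-set limit (e.g. a pymongo-style `cursor.limit(8)` or a list slice) silently caps the query, so every downstream statistic is computed over just those few documents. A toy sketch, not the notebook's actual query code:

```python
# Stand-in for the full stats/pipeline_time result set (215954 docs).
docs = list(range(215954))

def fetch(docs, limit=None):
    # limit=None (or 0, matching pymongo's limit(0) convention) means
    # "no limit"; any positive value truncates the results.
    return docs[:limit] if limit else docs

assert len(fetch(docs, limit=8)) == 8    # the accidental cap
assert len(fetch(docs)) == 215954        # after removing the limit
```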

@shankari
Contributor

shankari commented Oct 31, 2024

Are you sure the reading values are milliseconds? Where did you get that from?
Also, I don't believe your later graphs (e.g. cell 51) - I am 99% sure from ad-hoc investigations that there is no way that section segmentation takes more time than trip segmentation. Maybe the average is masking the outliers? No, cell 50 shows that trip_segmentation is the new bottleneck.
I would drop output_gen since it has already been removed from the pipeline and is no longer a target for optimization:
e-mission/e-mission-server@fac1cb2

@TeachMeTW
Author

Are you sure the reading values are milliseconds? Where did you get that from? Also, I don't believe your later graphs - I am 99% sure from ad-hoc investigations that there is no way that section segmentation takes more time than trip segmentation.

I am suspicious of it being milliseconds as well -- when I discussed this with Jack, we came to the conclusion that reading is in ms and ts is in seconds. I agree that seconds make a lot more sense based on my personal testing experience.

@TeachMeTW
Author

@shankari It is indeed ms; the average is just skewed. As for the graphs, they were right; the labels were just shifted due to their orientation. Trip segmentation does indeed take more time than section segmentation.

@shankari
Contributor

shankari commented Nov 1, 2024

It is indeed ms

  1. How did you verify this?
  2. Not sure how the table from cell 95 and the table from cell 89 are related - where are the 1000+ (secs/ms) entries from cell 89 in cell 95?

@TeachMeTW
Author

It is indeed ms

  1. How did you verify this?
  2. Not sure how the table from cell 95 and the table from cell 89 are related - where are the 1000+ (secs/ms) entries from cell 89 in cell 95?

  1. I re-verified it by checking the pipeline intake stage and saw I was incorrect; it is seconds. The confusion stems from the fact that the dashboard readings are in ms.

  2. They are aggregated and averaged; the 210k entries collapse to 150 unique users with averaged values.
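The aggregation step described above (many per-run entries collapsing to one averaged row per user) can be sketched as a pandas groupby. The column names here are assumptions for illustration, not the notebook's actual schema:

```python
import pandas as pd

# Hypothetical per-run timing entries; in the real data, ~210k rows
# reduce to ~150 unique users after averaging.
entries = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "stage":   ["TRIP_SEGMENTATION"] * 5,
    "reading": [10.0, 20.0, 30.0, 40.0, 50.0],  # seconds
})

# One averaged row per (user, stage): the large raw readings from one
# table end up folded into per-user means in the other.
per_user = (
    entries.groupby(["user_id", "stage"], as_index=False)["reading"]
    .mean()
)
print(per_user)  # u1 -> 15.0, u2 -> 40.0
```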

@shankari
Contributor

shankari commented Nov 1, 2024

I re-verified it by checking the pipeline intake stage and saw I was incorrect; it is seconds. The confusion stems from the fact that the dashboard readings are in ms.

When we clean up the dashboard readings after this report (e.g. rename etc) we should also convert the readings to seconds for consistency.

For the record, why are the dashboard readings in ms instead of seconds?
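The cleanup suggested above amounts to a divide-by-1000 at the point where dashboard readings are recorded or read back. A minimal sketch; the function and field names are assumptions, not existing code:

```python
# Hypothetical helper: normalize dashboard timing readings (ms) to
# seconds so they use the same unit as the pipeline-time stats.
def ms_to_seconds(reading_ms: float) -> float:
    return reading_ms / 1000.0

assert ms_to_seconds(1500.0) == 1.5   # 1500 ms -> 1.5 s
assert ms_to_seconds(0.0) == 0.0
```

Doing the conversion once, at write time, avoids every downstream consumer having to remember which unit a given stat uses.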

@TeachMeTW force-pushed the Add-Pipeline-Analysis branch from ff10cf1 to dd2af1f on December 7, 2024