Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add precheck scoring functionality #356

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

Gregory-Pereira
Copy link
Collaborator

@Gregory-Pereira Gregory-Pereira commented May 16, 2024

draft implementation of scoring answers based on LAB: LARGE-SCALE ALIGNMENT FOR CHATBOTS.
/cc @mingxzhao @nerdalert @vishnoianil

@lhawthorn
Copy link
Member

@Gregory-Pereira Isn't BAM an IBM internal only service? If so, I am concerned that you may be leaking company confidential details in this PR.

I would also note that separating out founding participant corporate matters from open source project community operations is quite difficult when the a community is very young - InstructLab is our 13 week old puppy as of next Monday, May 20, 2024. However, it is not best practice nor highly awesome for IBM or Red Hat - or any company for that matter - to mention internal only matters as part of our operations in this open source community space.

We InstructLab participants was to be both highly awesome and excellent stewards of this community so everyone feels welcomed and encouraged to join us as participants and leaders.

Perhaps we could have this discussion elsewhere or reconsider how we are framing the PR text in the future.

Yours in kindness, LH

@Gregory-Pereira
Copy link
Collaborator Author

Apologies @lhawthorn, still getting used to this new paradigm. Thank you for your kind and swift response. Hopefully not much is impacted, as I only referenced the service name and did not provide any details about what it is or how it works or how to use it.

@russellb
Copy link
Member

I'm not sure what was here originally, but it's no secret that this bot uses model endpoint APIs hosted on IBM infrastructure.

@nerdalert nerdalert requested a review from mingxzhao May 17, 2024 19:46
Signed-off-by: greg pereira <[email protected]>
this change allows us to use different model names for the precheckEndpoint and precheckScoringEndpoint

Signed-off-by: greg pereira <[email protected]>
@Gregory-Pereira Gregory-Pereira force-pushed the allow-precheck-scoring-endpoint branch from ca19ae6 to e20843c Compare May 18, 2024 01:15
@Gregory-Pereira Gregory-Pereira marked this pull request as ready for review May 18, 2024 01:32
@Gregory-Pereira
Copy link
Collaborator Author

RFR cc @nerdalert @vishnoianil

mingxzhao added 2 commits May 17, 2024 21:29
update scoring prompt

Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
@vishnoianil
Copy link
Member

vishnoianil commented May 18, 2024

I am wondering if we should expose precheck-score command to the user? I am assuming that both precheck and precheck scoring endpoints both are separate endpoints to hit. @Gregory-Pereira @mingxzhao

@mingxzhao
Copy link
Collaborator

I believe the end goal is to tie the precheck score directly to the original precheck. With the current scoring process it is still more expensive than precheck so if precheck is not open to users precheck score likely wouldn't be.

@mingxzhao
Copy link
Collaborator

Theoretically precheck and precheck score hit endpoints that are equally expensive

@Gregory-Pereira
Copy link
Collaborator Author

I agree with Ming's stance on this. A middle ground that I would find acceptable (putting aside the implementation details for now), would be allowing users to select an individual human answer - endpoint answer pairing and score that. It feels a bit heavy handed to give a user access to running the batch run of all seed data scoring.

@Gregory-Pereira Gregory-Pereira force-pushed the allow-precheck-scoring-endpoint branch from 2506bfa to ac9db5c Compare May 19, 2024 00:50
mingxzhao added 5 commits May 18, 2024 19:02
Swap question target location

Signed-off-by: Mingxuan Zhao <[email protected]>
typo

Signed-off-by: Mingxuan Zhao <[email protected]>
prompt update


Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
lint error

Signed-off-by: Mingxuan Zhao <[email protected]>
@Gregory-Pereira Gregory-Pereira changed the title WIP: add precheck scoring functionality Add precheck scoring functionality May 19, 2024
@Gregory-Pereira
Copy link
Collaborator Author

Current status for provenance: Functionality there. At this point were prompt engineering for performance.

Copy link
Member

@vishnoianil vishnoianil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good to me. But I think I am bit skeptical about the overall user experience, so let's narrow that down. For example, "@instructlab-bot precheck" will give you scoring in the yaml output if the "scoring" endpoint is set, otherwise it won't. It's two different output for the same "precheck" command based on the different backend configuration, which the user might not be aware of (because it's the backend configuration that we admin configure). So It seems like we are kind of intermingling two separate stuff here.

Although we can always set the precheck-endpoing and scoring-endpoint both, but in this case it will always give you scoring. which sounds bit better, but if that's the case, than i think we should just make that assumption that precheck will always give a score and make sure that precheck-endpoing and scoring-endpoint both must be configured for the backend.

Other option is to add a new "@instructlab-bot precheck-score" command, which gives user scoring as well. I think both of these options probably is less confusing options. wdyt? happy to hear any other ideas as well.

ui/apiserver/apiserver.go Show resolved Hide resolved
@@ -118,6 +121,7 @@ func init() {
generateCmd.Flags().StringVarP(&WorkDir, "work-dir", "w", "", "Directory to work in")
generateCmd.Flags().StringVarP(&VenvDir, "venv-dir", "v", "", "The virtual environment directory")
generateCmd.Flags().StringVarP(&PreCheckEndpointURL, "precheck-endpoint-url", "e", "http://localhost:8000/v1", "Endpoint hosting the model API. Default, it assumes the model is served locally.")
generateCmd.Flags().StringVarP(&PreCheckScoringEndpointURL, "precheck-scoring-endpoint-url", "", PreCheckEndpointURL, "Endpoint hosting the model API that will be scoring the output of precheck against the answers supplied in the PR. Default, it assumes the model is the same as precheck model and is served locally.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question for my own understanding- if user doesn't provide the ScoringEndpoint, and it endup using the default PrecheckEndpoint, who does the scoring ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You nailed it. A model will do the scoring, whether or not the same model as precheck or a different one. Note, by scoring we do not mean anything used in training, just a summary of if the model was close to the human answer for triager conenience.

for i := 0; i < len(precheckPRAnswers); i++ {
err, promptTemplate := generatePrecheckScoringPrompt(precheckPRAnswers[i], precheckEndpointAnswers[i], precheckPRQuestions[i])
if err != nil {
w.logger.Errorf("Failed to generate a prompt for precheck scorring: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scorring -> scoring

@@ -378,6 +472,12 @@ func (w *Worker) runPrecheck(lab, outputDir, modelName string) error {
w.logger.Error("Question not found or not a string")
continue
}
answer, ok := example["answer"].(string)
if !ok {
w.logger.Error("Question not found or not a string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix the log ?

@@ -450,7 +553,8 @@ func (w *Worker) runPrecheck(lab, outputDir, modelName string) error {
time.Sleep(1 * time.Second)
}
}
return nil
return nil, precheckPRAnswers, precheckEndpointAnswers, precheckPRQuestions
// return nil, precheckPRAnswers, precheckPRQuestions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// return nil, precheckPRAnswers, precheckPRQuestions
// return nil, precheckPRAnswers, precheckPRQuestions

@vishnoianil
Copy link
Member

vishnoianil commented May 20, 2024

I agree with Ming's stance on this. A middle ground that I would find acceptable (putting aside the implementation details for now), would be allowing users to select an individual human answer - endpoint answer pairing and score that. It feels a bit heavy handed to give a user access to running the batch run of all seed data scoring.

i understand the cost aspect. But I think the user experience is bit confusing the way currently it's implemented. I added some more details about it in the comment above. Now the question about single pair vs batch scoring, i think we just need to see how user can actually do that from github comment. Overall the intention is to keep end to end gobot UX simple and less confusing.

mingxzhao added 3 commits May 20, 2024 13:44
Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
Signed-off-by: Mingxuan Zhao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants