
Propose adding a file format for custom pipelines #109

Merged: markmc merged 1 commit into instructlab:main from sdg-flow-yaml on Jul 12, 2024

Conversation

markmc (Contributor) commented Jul 2, 2024

Related to instructlab/instructlab#1546

See instructlab/sdg#86 for an implementation

@markmc markmc marked this pull request as draft July 2, 2024 17:20
russellb (Member) commented Jul 3, 2024

For the spellcheck job to be happy ...

```
$ cat << EOF >> .spellcheck-en-custom.txt
Params
TODO
configs
freeform
iters
merlinite
mixtral
num
src
templating
yaml
EOF

$ make spellcheck-sort
```

@gabe-l-hart left a comment:

Hi Mark! Russell pointed me at this draft since I'm plugged into various flavors of "how do we extend SDG" on the IBM Research side. I like the direction a lot and just have a few ideas/thoughts.

markmc (Contributor, Author) commented Jul 5, 2024

> For the spellcheck job to be happy ...

Fixed in commit 590cc8e, thank you.

@markmc markmc force-pushed the sdg-flow-yaml branch 3 times, most recently from bf880d2 to ac6ca80, July 5, 2024 12:03
@markmc markmc marked this pull request as ready for review July 5, 2024 16:26
russellb pushed a commit to russellb/sdg that referenced this pull request Jul 7, 2024
See instructlab/dev-docs#109

In order to support custom flows, add a YAML based file format.

However, to make the default flows easier to reason about and develop,
also convert them to the YAML file format.

This changes the top-level API from:

```
mmlu_block_configs = MMLUBenchFlow().render()
knowledge_block_configs = SynthKnowledgeFlow().render()
knowledge_pipe = Pipeline(ctx, mmlu_block_configs + knowledge_block_configs)
```

to:

```
knowledge_pipe = pipeline.Pipeline(
    ctx, [pipeline.MMLU_BENCH_FLOW, pipeline.SYNTH_KNOWLEDGE_FLOW]
)
```

Signed-off-by: Mark McLoughlin <[email protected]>
@hickeyma (Member) left a comment:

Thanks @markmc for pushing this proposal. I like the direction. Small suggestions included.

@leseb (Contributor) left a comment:

Thanks for the proposal, a few comments inline.

@markmc markmc requested a review from leseb July 8, 2024 20:09
@leseb (Contributor) left a comment:

Thanks!

@hickeyma (Member) left a comment:

LGTM, thanks @markmc, nice proposal

markmc (Contributor, Author) commented Jul 9, 2024

Thanks for the reviews.

Please also see instructlab/sdg#61 - I'd like to use this file format change as an opportunity to further simplify by removing the "Flow" noun from the API.

The implication for this dev doc is that I would change it from talking about defining "custom flows" to "custom pipelines" ... the doc already uses these terms interchangeably, which just highlights the confusion.

@russellb (Member) left a comment:

looks great - thank you!

We will add an API that instantiates a `Pipeline` object from a list of files:

```python
pipeline = Pipeline.from_flows(ctx, ['mycustomflow.yaml'])
```
Member:

Why do we want .from_flows instead of .from_flow?
I can see you intend to support chaining a list of flows here, but I don't think this is where such a thing should appear.
Plus, this deviates from what is currently supported (i.e. initializing a pipeline from a flow).
Discussion around this is beyond the scope of this proposal (this is a proposal for adding a file format).

Contributor Author:

It comes from the observation that within the library (in `generate_data()`) we are creating a pipeline that chains together MMLUBenchFlow and SynthKnowledgeFlow.

Member:

Can you point me to the code? I only see code where an SDG chains a list of pipelines, not a pipeline that chains a list of flows.

Contributor Author:

I'm proposing below to merge MMLUBenchFlow into SynthKnowledgeFlow so that a pipeline is a directory with freeform_skills.yaml, grounded_skills.yaml, and knowledge.yaml

Member:

responded below

Contributor Author:

The current API allows SDG([pipeline1, pipeline2]) or Pipeline(block_configs + block_configs), and they are identical. I've been using the latter because "SDG" isn't a noun (it doesn't convey what it does) and because, as PEP 20 says, "There should be one-- and preferably only one --obvious way to do it."

This all becomes moot if we merge MMLUBenchFlow and SynthKnowledgeFlow into a knowledge.yaml.

Member:

The semantics between SDG or SyntheticDataGeneration and Pipeline are probably worth another discussion (I think you already have a PR on that). But here I'm only trying to say: let's decouple that kind of change from the YAML file support.
Btw, feel free to resolve this comment.

Member:

> This all becomes moot if we merge MMLUBenchFlow and SynthKnowledgeFlow into a knowledge.yaml

I'm interpreting this as we don't actually need support for a list of files under the current proposal of a single knowledge.yaml, so I can adjust this to be from_file() that loads a single yaml, since that's sufficient for the immediate requirements. We could always add an additional from_files() later if it comes up.

Let me know if I got that wrong.
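
To make that concrete, here is a minimal sketch of what a single-file loader could look like (purely illustrative; the YAML structure and the `block_configs` handling are assumptions, not the final API):

```python
# Hypothetical sketch of Pipeline.from_file(); not the final API.
import yaml


class Pipeline:
    def __init__(self, ctx, block_configs):
        self.ctx = ctx
        self.block_configs = block_configs

    @classmethod
    def from_file(cls, ctx, pipeline_yaml):
        # Assumes the YAML file contains a list of block configs.
        with open(pipeline_yaml, encoding="utf-8") as f:
            block_configs = yaml.safe_load(f)
        return cls(ctx, block_configs)
```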

Member:

(or from_flow() ... or from_yaml() ... I typoed it, but the point is that singular is good enough for the moment)

Member:

@xukai92 please take a look at the latest version and see if you're good with the changes to this section.


Use the built-in `full` pipeline.

> `ilab data generate --pipeline full --pipeline-extend path/to/config.yaml`
Member:

Again, to repeat my previous comment: I don't think it makes sense to have a way to chain workflows like this, and this is very much out of scope for this design PR.

Contributor Author:

Some good discussion on that here

Somehow I didn't update the doc with the latest proposal there - chaining would be done by specifying --pipeline multiple times

But, personally, I'd be happy to omit that feature for now until there's a strong use case for it

Member:

This is all intended to support the downstream desire to ship custom pipeline content that would run after “full”. If we don’t expose extending in some way, we would ship a copy of “full” plus more from somewhere else and have to try to keep them in sync when fixes or improvements come up in one place or the other.

Both work. The chaining thing lets us define stuff once. Indeed it adds some complexity to the CLI interface, though.

Either way we have to be clear on how we intend to deal with that use case. Happy to hear other alternatives!

Contributor Author:

> This is all intended to support the downstream desire to ship custom pipeline content that would run after “full”.

Wow, I did not realize that - I thought the downstream pipeline was standalone, with no dependence on an upstream pipeline. My bad. I'll capture that in the doc. (Unless Kai corrects us on this point, of course)

Member:

We could make it that way. But the input to the custom parts is the output of “full” … so it’s this or we copy stuff or something else 🤷‍♂️

Member:

I would like an ack from @xukai92 on:

> We believe the most common case for custom pipelines is for them to extend the full pipeline

I asked @shivchander about this on Slack and got a confirmation. However, it's slightly more complicated than I thought!

In one case, we want to run custom configuration after the full pipeline (skills). In another, we want to run custom configuration before the full pipeline (knowledge). I missed this important ordering detail on the knowledge side.

We've got to factor this into the design somehow.

Member:

Trying to reply to some comments from @markmc:

> I would like an ack from @xukai92 on:
> We believe the most common case for custom pipelines is for them to extend the full pipeline

I would say "reusing blocks" is the more expected common use case, not extending the (full) pipeline.
Ideally the flexibility we want is to reuse the blocks as nodes in a DAG.
It happens that the extension from 1.5 to 2.0 is like appending a list of blocks to another list of blocks (which looks like we are chaining two pipelines; preview: putting existing + some new blocks in a DAG with some loops is what we will work on for something like 3.0).
But I don't think we need to nail this generic design down here as it's not a GA blocker.

> We could make it that way. But the input to the custom parts is the output of “full” … so it’s this or we copy stuff or something else 🤷‍♂️
> I don't think "copy stuff" is acceptable (I know you don't either) - every change to the full pipeline would need to be replicated to any custom pipeline

Please see mine and Russell's comments above.

And my feeling about copying stuff here is pretty mixed, as the reusability/customization at the pipeline level is sort of limited.

Contributor Author:

> In one case, we want to run custom configuration after the full pipeline (skills). In another, we want to run custom configuration before the full pipeline (knowledge). I missed this important ordering detail on the knowledge side.

Awesome to get this clarified 👍

Contributor Author:

> But I don't think we need to nail this generic design down here as it's not a GA blocker.

Great. In this kind of scenario, I'm looking out for ways our decisions now might prevent us from adding support for more elaborate chaining like that in the future. And it feels like something that can be added in a future release, when we have a better understanding of exactly what we need.

> And my feeling about copying stuff here is pretty mixed, as the reusability/customization at the pipeline level is sort of limited.

"Copying stuff" is always an option, of course. But if "chain two pipelines together" works for our first release, then supporting that will make the whole thing much more maintainable, since we don't have to copy entire chains of blocks.

Member:

@markmc had some good ideas on how we could expand ImportBlock to load specific blocks from another config instead of only allowing importing an entire config. It sounds like we have a pretty good path to add that in the near future if it makes sense.
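
To sketch the idea (heavily hedged: the real ImportBlock schema in instructlab/sdg may differ, and the `import_path` and `blocks` keys here are assumptions for illustration only):

```python
# Hypothetical expansion of ImportBlock-style entries in a block config list.
# The key names ("type", "import_path", "blocks", "name") are illustrative.
import yaml


def expand_imports(block_configs):
    expanded = []
    for block in block_configs:
        if block.get("type") == "ImportBlock":
            with open(block["import_path"], encoding="utf-8") as f:
                imported = yaml.safe_load(f)
            # Optionally pull in only specific named blocks rather than
            # the entire imported config.
            wanted = block.get("blocks")
            if wanted is not None:
                imported = [b for b in imported if b.get("name") in wanted]
            expanded.extend(expand_imports(imported))
        else:
            expanded.append(block)
    return expanded
```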


> `ilab data generate`

Use the default pipeline, `simple`.
Member:

I think this default should be achieved as `ilab data generate --pipeline PATH_TO_THE_SIMPLE_WORKFLOW_YAML` for the purpose of unification.
We can provide aliases like simple or full for the built-in YAML files to make users' lives easy.

Contributor Author:

Yeah, simple and full should be thought of as aliases

But it is a bit more complex, and it had occurred to me that we'll need to get into this as part of the CLI integration.

Currently, the aliases look more like this:

```yaml
full:
    knowledge:
      - MMLU_BENCH_FLOW
      - SYNTH_KNOWLEDGE_FLOW
    freeform_skills:
      - SYNTH_FREEFORM_SKILLS_FLOW
    grounded_skills:
      - SYNTH_GROUNDED_SKILLS_FLOW
simple:
    knowledge:
      - SIMPLE_KNOWLEDGE_FLOW
    freeform_skills:
      - SIMPLE_FREEFORM_SKILLS_FLOW
    grounded_skills:
      - SIMPLE_GROUNDED_SKILLS_FLOW
```

So we need to be able to express a (knowledge, freeform_skills, grounded_skills) tuple really ...
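As a strawman, that alias table could be expressed directly in code along the lines of the sketch below (the dict layout mirrors the mapping above; everything else is illustrative, not a proposed API):

```python
# Hypothetical alias table mirroring the mapping above; names are illustrative.
PIPELINE_ALIASES = {
    "full": {
        "knowledge": ["MMLU_BENCH_FLOW", "SYNTH_KNOWLEDGE_FLOW"],
        "freeform_skills": ["SYNTH_FREEFORM_SKILLS_FLOW"],
        "grounded_skills": ["SYNTH_GROUNDED_SKILLS_FLOW"],
    },
    "simple": {
        "knowledge": ["SIMPLE_KNOWLEDGE_FLOW"],
        "freeform_skills": ["SIMPLE_FREEFORM_SKILLS_FLOW"],
        "grounded_skills": ["SIMPLE_GROUNDED_SKILLS_FLOW"],
    },
}


def resolve_alias(name):
    """Return the (knowledge, freeform_skills, grounded_skills) tuple for an alias."""
    flows = PIPELINE_ALIASES[name]
    return flows["knowledge"], flows["freeform_skills"], flows["grounded_skills"]
```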

markmc (Contributor, Author) commented Jul 11, 2024

Ok, I think this issue warrants a significant change to the design

In order to specify a pipeline, you need to specify a chain of blocks for knowledge, freeform_skills, and grounded_skills

I think we can see from the POC that keeping the chains for knowledge, freeform_skills, and grounded_skills in separate files aids clarity

So I think we need something like this

```
src/
  instructlab/
    sdg/
      pipelines/
        simple/
          knowledge.yaml
          freeform_skills.yaml
          grounded_skills.yaml
        full/
          knowledge.yaml   # also contains the current contents of mmlu_bench.yaml
          freeform_skills.yaml
          grounded_skills.yaml
```

and a custom pipeline would be a directory with these 3 YAML files
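
A hypothetical helper for validating and loading such a directory might look like this (the function name and error handling are illustrative, not part of the proposal):

```python
# Hypothetical loader for a custom pipeline directory; illustrative only.
import os

EXPECTED_FILES = ("knowledge.yaml", "freeform_skills.yaml", "grounded_skills.yaml")


def load_pipeline_dir(path):
    """Map each pipeline name to its YAML path, checking all three files exist."""
    pipelines = {}
    for filename in EXPECTED_FILES:
        full_path = os.path.join(path, filename)
        if not os.path.isfile(full_path):
            raise ValueError(f"pipeline directory {path!r} is missing {filename}")
        pipelines[filename.removesuffix(".yaml")] = full_path
    return pipelines
```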

Member:

This could confuse what a pipeline is ... I would rather say there are actually 3 pipelines under simple/full, and we route them differently in our particular frontend. These semantics should not be enforced.

Contributor Author:

Right, "full" or "simple" is a set of pipelines. I agree with that.

But likewise, when using custom pipelines, you should be providing a set of custom pipelines.

Contributor Author:

On Slack, we talked about adding the idea of a PipelineRouter, which I guess would have an interface like this:

```python
class PipelineRouter:
    def __init__(self, ...):
        ...

    def generate(self, dataset) -> Dataset:
        ...
```

(hmm, that's the same interface as a Pipeline, which suggests a base class that Pipeline and Router inherit from)

and there would be two variants:

```python
class DirectRouter(Router):
    def __init__(self, pipeline_yaml):
        self.pipeline = load_pipeline(pipeline_yaml)

    def generate(self, dataset):
        return self.pipeline.generate(dataset)
```

and the use case for this would be if you are developing a pipeline against a known data set?

But the more common use case is covered by this variant that understands how our taxonomy repo is structured:

```python
class TaxonomyRouter(Router):
    def __init__(self, freeform_skills_pipeline_yaml, grounded_skills_pipeline_yaml, knowledge_pipeline_yaml):
        self.freeform_skills_pipeline = load_pipeline(freeform_skills_pipeline_yaml)
        ...

    def generate(self, samples):
        # Route on the shape of the first sample: knowledge samples carry a
        # document, grounded skills carry a seed context, otherwise freeform.
        if samples[0].get("document"):
            pipeline = self.knowledge_pipeline
        elif samples[0].get("seed_context"):
            pipeline = self.grounded_skills_pipeline
        else:
            pipeline = self.freeform_skills_pipeline
        return pipeline.generate(samples)
```

"full" and "simple" would be aliases for a set of yaml files

(the above is just a sketch, it would evolve quite a bit as you hack on it)
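
For illustration, driving the TaxonomyRouter sketch might look like this (the paths and the `samples` variable are hypothetical):

```python
# Hypothetical usage of the TaxonomyRouter sketch above; paths are illustrative.
router = TaxonomyRouter(
    "pipelines/full/freeform_skills.yaml",
    "pipelines/full/grounded_skills.yaml",
    "pipelines/full/knowledge.yaml",
)
generated = router.generate(samples)  # routes based on the shape of samples[0]
```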

Contributor Author:

And the point of all of this is that "full" isn't a pipeline; instead, it's some logic that understands the taxonomy repo and looks at the data set before picking from one of 3 pipelines.

Member:

OK, I think this is making sense to me.

From the ilab CLI perspective, it only cares about the taxonomy variant. It can still just talk about a "pipeline" as a user concept, but the target is either:

  • an alias that sets up a TaxonomyRouter with 3 built-in pipelines
  • a path to a directory that has 3 pipeline yamls in it, and we initialize the TaxonomyRouter with that

altered syntax is over here: https://github.com/instructlab/dev-docs/pull/109/files#r1674565115

Member:

Yeah, the class definitions and methods themselves make sense to me as well.
Abstraction-wise, one thing I've been wondering, though, is whether routing is something Pipeline should be responsible for, or whether a higher-level concept should route pipelines (in our case, it could be the SDG).
Also, per Mark's words: routing sounds like "an object that has a set of pipelines, and knows which pipeline to use for different input datasets", which also doesn't sound like another pipeline, but rather something that conveys the details of the actual synthetic data generation process (a bunch of pipelines applied to a set of files from the taxonomy, in our case).
PS: open to dealing with this later, as the design above already unblocks us I believe.

Member:

Out of all of this, I think the really important details are:

  • ilab data generate --pipeline can take a path to a directory. I updated the doc text to reflect that part already.

We expect the contents of that directory to have 3 specific files: knowledge.yaml, grounded_skills.yaml, and freeform_skills.yaml. I will add that in a moment.

There are also some really good points about how we can implement this in terms of abstractions. I am comfortable leaving that part out of the doc and seeing how it shakes out in the implementation. I have a feeling the final details will get sorted as we see what it looks like in practice, knowing that it's inspired by our discussion here. Let me know if anyone feels differently.

russellb added a commit to russellb/instructlab-dev-docs that referenced this pull request Jul 10, 2024
The design proposal in instructlab#109, and the corresponding implementation in
instructlab/sdg#86, raised the importance of
clearly defining how a custom pipeline that requires a model with
custom adapters would be configured. This document explores that
topic.

It's possible this should just become a subsection of instructlab#109.

Signed-off-by: Russell Bryant <[email protected]>
markmc (Contributor, Author) commented Jul 12, 2024

I've made one small change since @xukai92's approval - changing from "flow params" to "pipeline context". I'm pretty sure @xukai92 acked that idea already.

And given the existing approvals from @leseb, @hickeyma, and the collaboration between @russellb and me on this ... I think we're in good shape to merge.

I don't think it's useful to include all the iterations of this design in git log:

```
d367f7b (HEAD -> sdg-flow-yaml, markmc/sdg-flow-yaml) Change FlowParams to PipelineContext
d524534 Describe file and directory structure
6c1b7f2 Simplify CLI integration proposal
4871552 Propose new block type: ImportBlock
282d764 Rename from_flow() to from_file()
0ea86cb Change Pipeline.from_flows() to .from_flow()
1bf99b5 Pacify linter and spellchecker
7f24fdc Update the CLI UX for pipeline chaining
6457751 Update the model serving section with the latest thinking
4425f8d Explain why YAML was not initially included in the library API design
f7663a5 Update to account for actual Pipeline.from_flows() API
f0ebc8e Make spellcheck happy
c8a3626 Address the use case distinguishing between multiple models
e6e5ee1 Add --list-pipelines suggestion from Seb
0011e66 Replace num_iters with num_instructions_to_generate
0db6e40 Remove extra whitespace
6f9ddad Remove some extra verbosity from format
dc58502 Tweak title
45deb44 Propose adding a file format for custom Flows
```

Referring back to this PR is probably much more useful if you ever want to understand how the proposal evolved.

So I'm going to squash these commits and go ahead and merge.

Related to instructlab/instructlab#1546

Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Kai Xu <[email protected]>
Co-authored-by: Martin Hickey <[email protected]>
Co-authored-by: Oindrilla Chatterjee <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Shiv <[email protected]>
Co-authored-by: Sébastien Han <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
@markmc markmc changed the title Propose adding a file format for custom Flows Propose adding a file format for custom pipelines Jul 12, 2024
@markmc markmc merged commit 44e7482 into instructlab:main Jul 12, 2024
4 checks passed
markmc added a commit to markmc/sdg that referenced this pull request Jul 12, 2024
See instructlab/dev-docs#109

In order to support custom pipelines, add a YAML based file format.

However, to make the default pipelines easier to reason about and develop,
also convert them to the YAML file format.

This changes the top-level API from:

```
mmlu_block_configs = MMLUBenchFlow().get_flow()
knowledge_block_configs = SynthKnowledgeFlow().get_flow()
knowledge_pipe = Pipeline(ctx, mmlu_block_configs + knowledge_block_configs)
```

to:

```
knowledge_pipe = Pipeline.from_flows(
    ctx, [pipeline.MMLU_BENCH_FLOW, pipeline.SYNTH_KNOWLEDGE_FLOW]
)
```

Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Kai Xu <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>