Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle regex-to-FSM conversion in Rust #10

Closed
6 tasks
Tracked by #9
brandonwillard opened this issue Aug 21, 2024 · 4 comments · Fixed by #125
Closed
6 tasks
Tracked by #9

Handle regex-to-FSM conversion in Rust #10

brandonwillard opened this issue Aug 21, 2024 · 4 comments · Fixed by #125
Assignees
Labels
TGI Related to the integration with `text-generation-inference`

Comments

@brandonwillard
Copy link
Member

brandonwillard commented Aug 21, 2024

We need to perform the conversion of a regex string into a usable FSM in Rust. This involves all the functionality covered by interegular and the functions in outlines_core.fsm.regex.

Regarding the interegular functionality, let's start by finding a suitable Rust crate replacement. regex-automata appears to be the most promising so far.

To determine suitability, we need to

  • make sure that the regex features used by outlines are supported, and
  • make sure that the FSM objects provide the required information and walk functionality.
    • Are the state labels deterministic (i.e. the same for a given regex)?
      If not, we'll need to port make_deterministic_fsm.
    • Determine more requirements.

If we can't find a suitable existing crate, then we can write our own implementations/extensions, but ideally with a limited scope (e.g. use an existing crate for FSMs, but handle parts of the regex-to-FSM conversion ourselves).

As mentioned above, we also need to port the functions in outlines_core.fsm.regex. Many of those functions revolve around the conversion of an existing string-based FSM into one that transitions on unicode bytes. Depending on the crate we use, some of those functions may be unnecessary (e.g. see the comment about deterministic labels above).

Also, there's at least one interface-related question:

  • What can we do about outlines.generate.fsm?
    This module offers dispatch functions that take interegular.fsm.FSM objects. Can they be changed to take regex strings instead?
    @rlouf @lapp0
@brandonwillard brandonwillard added the TGI Related to the integration with `text-generation-inference` label Aug 21, 2024
@brandonwillard brandonwillard changed the title Handle regex-to-DFA conversion in Rust. Handle regex-to-FSM conversion in Rust. Aug 21, 2024
@drbh
Copy link
Collaborator

drbh commented Aug 21, 2024

regex-automata looks like a solid choice assuming it meets the criteria.

Prior to outlines-core I started hacking together a port of interegular that I've started to add here #8, additionally I have a portion of the regex to fsm parsing written that I can refactor and add to that branch as well. These changes would enable parsing to regex and returning the fsm from .to_fsm.

Not sure if this is the best approach (especially long term) however it may be a feasible path to a working mvp, maybe we can assess regex-automata in parallel?

@brandonwillard
Copy link
Member Author

brandonwillard commented Aug 21, 2024

Not sure if this is the best approach (especially long term) however it may be a feasible path to a working mvp, maybe we can assess regex-automata in parallel?

Our main concern is the expansion in scope and costs that come with ports of dependencies. A Rust port of interegular is an entirely distinct project; one that almost immediately justifies its own repo, crate, contributors, etc. We're willing to go down that route if necessary, but we should first be clear on exactly why the existing crates won't work for us and why we can't/shouldn't attempt to extend those.

In other words, since the overhead for porting a dependency like this is relatively high, we're safer prioritizing attempts at using existing crates so that we can justify a port sooner.

We could incrementally move forward—mostly in parallel—with the conversion of outlines_core.fsm.regex under some minimal assumptions on (and possibly changes to) the interegular FSM interface we're currently using. This would allow us to continue using interegular, evaluate external crates, and make progress in a way that works for both a port of interegular and adoption of an external crate, since we'll probably need to port some of those outlines_core.fsm.regex functions in both cases.

For example, we could almost certainly convert regex-automata's DFAs into an intermediate type that the ported outlines_core.fsm.regex functions use. The work you're doing in #8 might already serve to define and implement such a type, and provide a starting point for conversion of the outlines_core.fsm.regex functions.

N.B. BetterFSM was supposed to be the intermediate type mentioned above, so—no matter what—we should remove/replace that during the conversion to Rust.

@brandonwillard
Copy link
Member Author

Looking into regex-automata, it wasn't immediately clear to me how one could obtain transitions maps like we use in outlines. This discussion implies that they can be obtained by walking all the states and transitions.

@brandonwillard
Copy link
Member Author

brandonwillard commented Sep 10, 2024

Just updating...

regex-automata builds DFAs that are substantially different than the ones we used in outlines, and, while converting the DFAs is possible, it's also rather involved. The conversion(s) would also require additional steps and logic not present in outlines, as well as handling/guardrails for special features supported by regex-automata and not outlines. It's hard to tell upfront whether or not all this extra compatibility work will degrade performance, too.

That said, we're going forward with @drbh's port in #8.

@brandonwillard brandonwillard changed the title Handle regex-to-FSM conversion in Rust. Handle regex-to-FSM conversion in Rust Oct 9, 2024
@torymur torymur mentioned this issue Dec 20, 2024
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TGI Related to the integration with `text-generation-inference`
Projects
None yet
3 participants