Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Considering a roadmap for 1.0, or alternatively considering deprecating in favor of the "globset" crate #59

Open
bstrie opened this issue Apr 24, 2017 · 21 comments

Comments

@bstrie
Copy link

bstrie commented Apr 24, 2017

Per the original rust-lang-nursery RFC ( https://github.com/rust-lang/rfcs/blob/master/text/1242-rust-lang-crates.md ):

If the library team accepts a crate into the nursery, they are indicating an interest in ultimately advertising the crate as "a core part of Rust", and in maintaining the crate permanently. During the 0.X series in the nursery, the original crate author maintains control of the crate, approving PRs and so on, but the library subteam and broader community is expected to participate.
Eventually, a nursery crate will either fail (and move to rust-lang-deprecated) or reach a point where a 1.0 release would be appropriate.
If, on the other hand, a library reaches the 1.0 point, it is ready to be promoted into rust-lang proper. To do so, an RFC must be written outlining the motivation for the crate, the reasons that community ownership are important, and delving into the API design and its rationale design.

I'm not sure who the "original crate author" of glob is, as I'm guessing this crate was split out from the stdlib during the great pre-1.0 purge, but it doesn't appear that there's any real champion trying to advance this crate through the stabilization process.

I'm interested in starting a discussion as to the future of this crate. Sadly I can't find the original petition arguing why this crate belongs in the nursery in the first place, so I'm not even sure how to begin evaluating the aspirations of this crate that keep it from being 1.0 today.

Let's focus on answering three concrete questions:

  1. Does glob matching deserve to be "a core part of Rust", per RFC 1242? If not, then this repository should be moved to rust-lang-deprecated. But if so, then:
  2. Is this implementation of glob matching superior to other existing solutions that could be promoted to "official" crates (in particular, globset ( https://crates.io/crates/globset ) appears to be promising). If not, we should deprecate glob and consider promoting the alternative solution to rust-lang-nursery. But if so, then:
  3. Does anyone care enough about this crate to push it over the finish line? If so, then they should produce a prospective 1.0 RFC. But if not, then I think it sets a bad precedent to allow a crate to languish in the nursery forever, and this crate should be deprecated until such time as it finds a champion.

I don't consider myself an expert in either crate, but the biggest potential pitfall that I see when comparing glob to globset is that the latter relies on the complete regex crate, and transitively all of regex's dependencies. In contrast, glob is small self-contained, but appears to be less featureful than globset, and I'm unsure if it handles both Unicode and non-Unicode as well as the regex crate. Some microbenchmarks would be interesting to see as well.

Tagging @BurntSushi, as the author of globset and who's familiar with the rust-lang-nursery process. And tagging @alexcrichton, who I'm guessing doesn't particularly care about this crate but who's the only active point of contact listed for this crate on crates.io. :P

@BurntSushi
Copy link
Member

OK, I'm just going throw a bunch of stuff against the wall and see what sticks.

The primary motivation behind globset is in the name itself: I need a way to match many (possibly hundreds) of globs against a single file path very quickly. The naive solution of iterating over each glob and checking if it matches simply doesn't fly (the time complexity itself is not good enough). For that reason, and because RegexSet is a thing, it made a lot of sense to do the "convert the glob to a regex" dance. However, it doesn't stop there. It turns out that the regex engine doesn't optimize common globs as well as it could, so I ended up developing a set of matching strategies that make glob matching even faster. I dare say that most globs don't touch the regex engine at all. But when they do, they're still reasonably fast.

For that reason alone, you might consider globset to be relatively niche. It's important and the performance matters, but perhaps it's not common to need that kind of performance at that kind of scale.

With that said, it turns out that if you're implementing glob matching for a set of globs, it's also pretty easy to implement straight-forward single-glob matching as well. So globset provides that. I also made sure to use most (all?) of the test suite from the glob crate inside of globset (and then some), so the functionality should be roughly equivalent.

Tangentially, the glob crate can't match file paths that don't fit into a &str, and this turns out to be a real problem that people face in practice (there are some issues on the ripgrep repo for it, IIRC). I'm not familiar enough with the implementation in glob to know how easy or hard it is to fix that, but from a brief glance, it looks like the matcher pretty much assumes valid UTF-8 in that it's based on char. So it would probably require some sizeable attention to fix that.

I personally don't have the bandwidth to fix the glob crate any time soon. The globset crate is something that must exist for ripgrep to be as fast as it is, so regardless of what we do with glob, I think it's somewhat safe to say that globset will continue to work. However, here are some caveats:

  1. Since globset uses regex, as @bstrie points out, using it requires pulling in regex. regex has (perhaps deservedly) gotten a reputation for being a big dep because it takes a long time to compile.
  2. Since it uses regex, there is some overhead associated with compiling the glob since it requires compiling a regex. It's probably more expensive to compile a regex than it is to compile a glob in the glob crate, although I haven't benchmarked it.
  3. In my incomplete experiments, the matching performance of globset for a single glob is roughly on par with the glob crate, and tends to get better as the paths get longer.
  4. globset doesn't (yet) have an analogous glob function that returns an iterator over paths that match.
  5. globset doesn't support the require_literal_leading_dot option, mostly because it seemed a little tricky to implement, but I honestly haven't given it much thought.

4 and 5 seem like things that can be fixed. 1 and 2 are things that are probably just facts of life for the globset crate. 3 doesn't seem like a blocker.

So I think there's a key trade off here... The globset crate brings significantly more implementation complexity and larger compile times, but also fixes some of shortcomings of the glob crate as it is.

@bstrie
Copy link
Author

bstrie commented Apr 25, 2017

@BurntSushi, Thanks for the feedback!

I personally don't have the bandwidth to fix the glob crate any time soon.

Indeed, I absolutely did not want to give the impression that I wanted to saddle you with any more work. :P Ideally I'd like our crate-elevation process to distribute the work among interested contributors by promising long-term relevance and the chance to contribute to a strategic dependency of the Rust ecosystem. It's just that to bootstrap that process we need somebody who gives a care, and I don't know that we have any such person for glob (and no document arguing why this might be strategically important to begin with, to inspire someone to step up).

@alexcrichton
Copy link
Member

I would personally be on board transitioning community mindshare to globset. The size of the dependency is a downside, but it's not unique to globset (applies to many other crates as well). I think we've got various mitigation strategies here as well such as global caching, incremental compilation, etc.

If we'd like to retain the name glob it seems like we could "merge repos and bump major version" or somehow just transfer ownership of the name as well. I also don't mind just "deprecating" glob, in favor of globset.

Overall I don't have too many opinions, unfortunately. There's been very little activity on this repo over time, although I don't know if that's an indicator of "it's good enough" or "it's not used".

@Keats
Copy link

Keats commented Apr 28, 2017

I was just looking for how to do *.{foo, bar} with glob so I would personally welcome using globset.
I don't mind the compilation time/extra cost if I can get a more powerful globbing mechanism. If it is an issue, the previous versions are still on crates.io.

I think the glob name is better than globset.

@Keats
Copy link

Keats commented Aug 23, 2017

Any update on that? Should a post be created on the internals/users forum?

@BurntSushi
Copy link
Member

@Keats I think everything I said above is still true. Notably, the globset crate is still missing a couple of pieces of functionality that exist in the glob crate.

@KodrAus
Copy link
Contributor

KodrAus commented Apr 9, 2018

There's been some activity in the last few months, BurntSushi/ripgrep#765 has lead to the emergence of the globwalk crate to provide a glob function over the globset API.

Perhaps once that's had time to bake we could look at deprecating glob in favour of globset + globwalk?

cc @Gilnaa

@rivy
Copy link

rivy commented Sep 1, 2018

It looks like globwalk, just after v0.2.0, moved away from globset to using ignore in mid-April 2018 (see Gilnaa/globwalk@551ca14). That, unfortunately, makes the discussion even more confusing.

Unfortunately, by globwalk moving to using ignore, windows users are generally being left out. The new combination doesn't support case-INsensitivity (which is the windows default).

It'd be nice to have a full featured, portable, well-tested library for globbing. I'm currently using wild, which depends on the older, semi-deprecated glob, to add globbing for windows CLI apps (see uutils/coreutils#1281). Unfortunately, there are some rough edges. And, informed by this discussion, I can't find a clear path forward to contribute to fixing the issues.

So anyway, my current vote would be to move forward in the globset / globwalk direction.

If someone starts pushing forward in a single direction, I'm happy to help with Windows testing (my coding skills in the rust environment are too naive for anything else right now).

@BurntSushi
Copy link
Member

@rivy The ignore crate supports case insensitive globbing: https://docs.rs/ignore/0.4.3/ignore/overrides/struct.OverrideBuilder.html#method.case_insensitive

@Gilnaa
Copy link

Gilnaa commented Sep 2, 2018

I'll make sure this option is available through the GlobWalk API. Is it reasonable for this to be on by default on windows, or is it too surprising.

@BurntSushi
Copy link
Member

@Gilnaa I'm not sure about the default on Windows. I think you'll probably want to look at what other glob walkers do. Worst case, start conservative (case sensitive with opt-in case insensitive), and move to case insensitive by default if you get a lot of feedback asking for it.

@rivy
Copy link

rivy commented Sep 21, 2018

@Gilnaa , from what I've seen / used, most libraries default to case sensitive matching.

However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk.

So, IMO, as long as there's an easy way to include/use the library with case-sensitivity set OFF, and an easy way to toggle case-sensitivity, I'd leave the basic/normal default to case-sensitivity being ON.

Thanks for working on it!

@Gilnaa
Copy link

Gilnaa commented Sep 21, 2018

@rivy Sorry for keeping quiet, I've added support a few weeks ago
Gilnaa/globwalk@53fafd9

@Keats
Copy link

Keats commented Sep 21, 2018

However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk.

That would mean every person using globwalk needs to be aware of that, which isn't going to be the case. This default should be in globwalk directly and applications should be able to override it if necessary imo.

@KodrAus
Copy link
Contributor

KodrAus commented Sep 21, 2018

As one example, I believe .NET's Directory.EnumerateFiles API will use a platform-specific default. So it's case-insensitive on Windows and case-sensitive on Linux.

In my opinion, the current always case insensitive by default everywhere is totally fine. Different usecases can call for different case sensitivity irrespective of the platform, so I usually find myself explicitly needing one way or the other anyway.

@BurntSushi
Copy link
Member

BurntSushi commented Sep 5, 2019

OK, so my plan at the moment---assuming all goes according to plan---is to rewrite globset. In particular, it will continue to support the use case of matching a set of globs, but I plan to make the dependency on regex optional. As in, a default "slow" matcher will be provided, but one can enable a perf feature (or similar) to bring regex in for faster glob matching. I also plan to:

  • Include an efficient directory traversal function (which is something globset currently lacks). I think this will also be a crate feature, but one that is enabled by default, as it will likely pull in a dependency on walkdir. My intent is that this will be the only dependency incurred by default.
  • Review and (hopefully) fix all outstanding issues with both glob and globset.
  • Ensure that a sufficient number of features are available. i.e., At least all of the features from glob and globset today, as well as probably all of the features made available by libc's glob API (and also look at fnmatch).
  • Give globset its own repository separate from ripgrep as a proper stand-alone repository, along with a complete API refresh.
  • Generally support the glob syntax provided by globset, and lift some restrictions. i.e., Support nested brace syntax.

One possible alternative to the above approach that I'd be willing to do is to take over glob itself, do what I said above, but for glob, not globset, and then deprecate globset. This way, we'd have one globbing library that everyone can centralize around. However, it would mean a total and complete break from the current glob crate as it exists today. My sense, though, is that folks will be fine with this because the crate is pretty much limping along right now. (And that's only thanks to @KodrAus's generous maintenance effort. <3)

cc'ing some of the library team for their thoughts @alexcrichton @KodrAus @dtolnay @SimonSapin @sfackler

@alexcrichton
Copy link
Member

That all sounds fantastic to me, and I think it's fine to move glob to 1.0 with a breaking change

@BurntSushi
Copy link
Member

I'd probably do a 0.4 release first if that's okay, in order to make sure whatever API I come up with is tested a bit before committing.

@KodrAus
Copy link
Contributor

KodrAus commented Sep 6, 2019

One possible alternative to the above approach that I'd be willing to do is to take over glob itself, do what I said above, but for glob, not globset, and then deprecate globset.

Personally, I like the idea of making the most of the glob name that we've already got here so an a fan of this path 👍 If we went this way would we want to work in this repository? If so maybe we could create a 0.3.x branch for the current source and revamp master? I'd be keen to get on board with stuff like supporting docs for migrating from the current glob to wherever we land if you think it'd be helpful!

@BurntSushi
Copy link
Member

I think that sounds reasonableish. My plan is to just do what I said above, and then dump it on to master in this repo and put out a release shortly thereafter. At that point, we can create a branch for the old 0.3.x release, although I'm not sure how much utility that serves. I'm not sure it's worth the maintenance time and effort to continue supporting 0.3.x.

I'd be keen to get on board with stuff like supporting docs for migrating from the current glob to wherever we land if you think it'd be helpful!

Sounds good, but hopefully we shouldn't need too much here. It's going to be a complete break from the current API, so folks will probably just want to treat it as if it's a new crate and re-familiarize themselves with the new API. We can of course provide some common examples of how to translate old code. Although I mostly expect the top-level glob function to remain!

@pablosichert
Copy link

Hey @BurntSushi, thanks a lot for the globset/walkdir crates and laying out a plan to move forward with the glob crate.

I wonder if you had time work on the rewrite of globset and could give an update on the progress? If you are already at a point where more incremental changes can be made, I would be happy to pick up outstanding tasks.

For Vector, we would like to add a glob-based file walker and contribute to upstream projects where suitable/necessary. Since the situation seems to be spread out between the glob, globset and globwalk crates, it would be nice to know where it would be best to direct our efforts.

Judging from BurntSushi/ripgrep#765 (comment), Gilnaa/globwalk#28 (comment) and this very issue it seems like converging all functionality into globset and replacing glob with that would still be the optimal course of action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants