-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: show df.count(), len(df.schema())
when displaying interactive = True
tables
#10231
Comments
For an arbitrary query, computing the number of rows isn't cheap. For some data sources, like text files, computing the number of rows can be extremely expensive. Polars can only you show you this information if you're using its Adding the number of columns to the repr seems reasonable. |
In interactive mode though I’m not really concerned about performance, right? |
Why wouldn't performance be relevant in interactive mode? |
I figure because in interactive mode I'm using exploring the data and I'm printing the dataframe often. After exploring and building the sequence of expressions, I move the code to a package into a function or method in a class and only then I really care about performance. In interactive mode, I'm usually cleaning data or validating data and seeing the number of rows can speed up the process for me. For example, when working with timeseries data, seeing that there are It also gives me a sense for how long a computation might take or how long it might take to write the data to disk. I'm not sure if this directly applies here, but when I've used I also am assuming that people working with CSV files don't have too many rows that calculating rows becomes an issue. You definitely have a much better sense of how the performance varies across different backends of ibis, so I'd obviously defer to your judgement. That said, an alternative to printing the shape would be printing the first Or alternatively, maybe just the printing to show the number of rows in backends that do support getting the number of rows in a fast manner? |
I agree with @kdheepak . Maybe this could become another mode |
I am +1 on showing the row count by default in interactive mode. If your query is slow, then it's easy, just add a .head() call after it, and you are back to the current performance. I think we could make this discoverable as well: If a computing the row count takes <5 seconds, we just show the row count in the repr. If it takes more, then we still show the row count, but emit a warning eg "computing the row count was slow. You may be able to avoid this by only showing the first few rows using table.head()". I suggest a warning, and not just an addendum in the repr, because consider: I make a preview call, and it is taking forever, so I just cancel the whole operation, and therefore I never actually see the final suggestion. The suggestion should get shown at ~5 seconds, regardless of if the query completes. Alternatively, this could be what the .preview() method is for. If you don't want the row count, you call that. I do have sympathy for new users though. I would know how to mitigate a slow preview, but I'm not sure new users would, and we need to remove the sharp edges for them.
I don't think we should assume this, this is not true in the data I am working with.
-1 on this too, I would rather just have the first N rows and see the row count. OK, now I'm going much broader than just this PR, but maybe useful to think about as context: I think it's good to note that interactive mode already can be slow. I think in particular this comes up if there is some streaming-blocking-operation in the pipeline, like a sort, groupby, or window function. In these cases, previewing is gonna be slow either way, regardless of if we execute the row count or not. So this suggested change is only affecting the performance in certain circumstances. Both the issue I say above, and the suggested change here, really present the same story: I am a user in interactive mode, and my previews are slow. What can ibis do to make this experience better? Perhaps if an interactive query takes more than ~10 seconds we emit a warning or print or log or add a note in the repr which links to a troubleshooting page in the docs? In this doc we could suggest the solutions of .cache() or .head(), and point out the pros/cons of each. |
Perhaps it wasn't clear from what I said before, but computing |
Counting the duration of the query would almost certainly require spinning up a thread to compute the duration of the count query, which also adds more complexity. Perhaps the question is: why is the additional complexity, maintenance burden, discussion and consideration of edge cases required to show the row count worth taking on? |
The first and last rows of a table in most of our backends will be completely arbitrary, because a table in a SQL database has no built in notion of row order, so this isn't feasible. |
What do you think about having a user opt into printing row counts? e.g. ibis.options.show_row_count_in_interactive_mode = True # todo: choose better name Even a naive non-performant solution would be sufficient in this case. |
I would love to get an opt-in way to automatically show row counts - I get that it shouldn't be the default but I'm using Ibis quite a bit with smallish tables and not being able to quickly glance at lengths makes it much harder to work with than other dataframe libraries. Adding this feature would be drastically improve the package as an alternative to pandas or dplyr in R for things like interactive debugging. |
Not entirely opposed to an option for this but it requires someone else to do the work, and I'll review the PR/help get it over the finish line. A trade-off to consider here is that this feature is likely to generate some new bug reports about performance, so there'll be extra work to do fielding those reports. |
I have an idea on how this API could work, to unify it with .preview()
@cpcloud any thoughts on this API? Might be confusing, I can make a PR to make it concrete |
@kdheepak @contang0 @lboller-pwbm you all chimed in so far, we would love your feedback on #10518 |
Looks very good, I have nothing to add apart from what @lboller-pwbm already mentioned. The row count should ideally be visible without needing to scroll horizontally if a table is wide, i.e. it should be left-justified. |
By the way, sometimes I wish the backend name was also somehow indicated when inspecting a table. Sometimes I work with multiple backends in the same project and I have no idea whether I can join two tables before I try. I know it's probably a very niche problem, but something to keep in mind if more people have the same issue. |
Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?
|
@cpcloud I think that PR is ready for your thoughts whenever you get the chance, thank you! |
One of the major reasons I like using ibis is that API is cleaner than pandas / polars etc. I can usually write a lot less code and get the same things done. Most of the data I work with on a day to day basis is in the "few thousand rows" magnitude and I use the duckdb backend. I expect I'll have to scale in the future on at least one of my projects, and I'm hoping ibis will make that trivial by changing the backend. |
I am more in the fifth bucket, currently - but I'm more hesitant to make the default something that would slow down interactive mode a lot at scale. I think an option like the one suggested in #10231 (comment) would be good and/or letting the user set an environment variable to toggle between a fast or detailed version of interactive mode as the default depending on the project.
|
@lboller-pwbm I've wanted this too. I started #10523 for any discussion of that. |
The way I access my data is usually similar to 1. I tend to start with a connection to Impala, and persist data to memory (duckdb) as soon as it's reasonably small. This is why seeing backend name would also be useful for me. Sometimes I get lost at which step I persisted the data to a memtable.
|
Is your feature request related to a problem?
When interactively exploring data, it's always an additional step to get the "shape" of the data. This is what the display looks like right now:
What is the motivation behind your request?
I find myself typing the following often:
Describe the solution you'd like
I'd like for the default display to print the shape of the dataframe (along with the name of the table too would be nice). Polars prints the shape of the data as the first line and it is very convenient:
What version of ibis are you running?
version 9.5.0
What backend(s) are you using, if any?
DuckDB
Code of Conduct
The text was updated successfully, but these errors were encountered: