feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

kdheepak · 2024-09-26T03:17:19Z

Is your feature request related to a problem?

When interactively exploring data, it's always an additional step to get the "shape" of the data. This is what the display looks like right now:

What is the motivation behind your request?

I find myself typing the following often:

print(df.count(), len(df.schema()))

Describe the solution you'd like

I'd like for the default display to print the shape of the dataframe (along with the name of the table too would be nice). Polars prints the shape of the data as the first line and it is very convenient:

What version of ibis are you running?

version 9.5.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

I agree to follow this project's Code of Conduct

cpcloud · 2024-09-26T09:28:19Z

For an arbitrary query, computing the number of rows isn't cheap. For some data sources, like text files, computing the number of rows can be extremely expensive.

Polars can only you show you this information if you're using its DataFrame structure, and not LazyFrames.

Adding the number of columns to the repr seems reasonable.

kdheepak · 2024-09-26T12:45:33Z

In interactive mode though I’m not really concerned about performance, right?

cpcloud · 2024-09-26T12:52:33Z

Why wouldn't performance be relevant in interactive mode?

kdheepak · 2024-09-26T18:40:05Z

I figure because in interactive mode I'm using exploring the data and I'm printing the dataframe often. After exploring and building the sequence of expressions, I move the code to a package into a function or method in a class and only then I really care about performance.

In interactive mode, I'm usually cleaning data or validating data and seeing the number of rows can speed up the process for me. For example, when working with timeseries data, seeing that there are 8760 rows tells me I have hourly data for 365 days of the year. Seeing 8784 would tell me it is a leap year or 8736 would tell me a day is missing. Not having it printed out by default means I have to explicitly look for these issues.

It also gives me a sense for how long a computation might take or how long it might take to write the data to disk. I'm not sure if this directly applies here, but when I've used pandas in the past if I have a few thousand rows, I can use df.apply with a python lambda and get away a hacky data cleaning workflow. If I have a few million rows, then I know have to look for a more performant alternative.

I also am assuming that people working with CSV files don't have too many rows that calculating rows becomes an issue. You definitely have a much better sense of how the performance varies across different backends of ibis, so I'd obviously defer to your judgement.

That said, an alternative to printing the shape would be printing the first N/2 rows and last N/2 rows, instead of printing the first N rows as it does right now (if I'm reading the source correctly here). Usually there's an id column or a timestamp column that provides the information I'm after.

Or alternatively, maybe just the printing to show the number of rows in backends that do support getting the number of rows in a fast manner?

contang0 · 2024-10-02T08:50:49Z

I agree with @kdheepak . Maybe this could become another mode interactive_detailed or something, that would give you both the row and column counts.

NickCrews · 2024-10-05T17:23:20Z

I am +1 on showing the row count by default in interactive mode. If your query is slow, then it's easy, just add a .head() call after it, and you are back to the current performance. I think we could make this discoverable as well: If a computing the row count takes <5 seconds, we just show the row count in the repr. If it takes more, then we still show the row count, but emit a warning eg "computing the row count was slow. You may be able to avoid this by only showing the first few rows using table.head()". I suggest a warning, and not just an addendum in the repr, because consider: I make a preview call, and it is taking forever, so I just cancel the whole operation, and therefore I never actually see the final suggestion. The suggestion should get shown at ~5 seconds, regardless of if the query completes.

Alternatively, this could be what the .preview() method is for. If you don't want the row count, you call that.

I do have sympathy for new users though. I would know how to mitigate a slow preview, but I'm not sure new users would, and we need to remove the sharp edges for them.

Usually there's an id column or a timestamp column that provides the information I'm after.

I don't think we should assume this, this is not true in the data I am working with.

first N/2 rows and last N/2 rows

-1 on this too, I would rather just have the first N rows and see the row count.

OK, now I'm going much broader than just this PR, but maybe useful to think about as context:

I think it's good to note that interactive mode already can be slow. I think in particular this comes up if there is some streaming-blocking-operation in the pipeline, like a sort, groupby, or window function. In these cases, previewing is gonna be slow either way, regardless of if we execute the row count or not. So this suggested change is only affecting the performance in certain circumstances.

Both the issue I say above, and the suggested change here, really present the same story: I am a user in interactive mode, and my previews are slow. What can ibis do to make this experience better? Perhaps if an interactive query takes more than ~10 seconds we emit a warning or print or log or add a note in the repr which links to a troubleshooting page in the docs? In this doc we could suggest the solutions of .cache() or .head(), and point out the pros/cons of each.

cpcloud · 2024-10-05T17:40:36Z

Perhaps it wasn't clear from what I said before, but computing count(*) of query and compute select * from query limit 11 in the worst case will require computing query twice. That doesn't seem like an acceptable trade-off.

cpcloud · 2024-10-05T17:43:17Z

Counting the duration of the query would almost certainly require spinning up a thread to compute the duration of the count query, which also adds more complexity.

Perhaps the question is: why is the additional complexity, maintenance burden, discussion and consideration of edge cases required to show the row count worth taking on?

cpcloud · 2024-10-06T12:56:21Z

first N/2 rows and last N/2 rows

The first and last rows of a table in most of our backends will be completely arbitrary, because a table in a SQL database has no built in notion of row order, so this isn't feasible.

kdheepak · 2024-10-06T20:18:37Z

What do you think about having a user opt into printing row counts? e.g.

ibis.options.show_row_count_in_interactive_mode = True # todo: choose better name

Even a naive non-performant solution would be sufficient in this case.

lboller-pwbm · 2024-10-16T16:55:17Z

I would love to get an opt-in way to automatically show row counts - I get that it shouldn't be the default but I'm using Ibis quite a bit with smallish tables and not being able to quickly glance at lengths makes it much harder to work with than other dataframe libraries. Adding this feature would be drastically improve the package as an alternative to pandas or dplyr in R for things like interactive debugging.

cpcloud · 2024-11-02T12:18:12Z

Not entirely opposed to an option for this but it requires someone else to do the work, and I'll review the PR/help get it over the finish line.

A trade-off to consider here is that this feature is likely to generate some new bug reports about performance, so there'll be extra work to do fielding those reports.

NickCrews · 2024-11-07T18:37:39Z

I have an idea on how this API could work, to unify it with .preview()

add a .repr attribute to all Tables and Columns. By default it is None. Or, it could be a instance of ibis.options.repr (maybe I don't have that class name right), that holds info such as max_rows, show_count, etc
inside Table.__repr__, we first try to use this per-instance config. If None, we fall back to the ibis-wide one.
when users call .preview(), it returns a new Table instance with a new version of the .repr value. Currently, .preview() returns a rich table. So this is a breaking change, but I think a good one.
I think we should actually make .repr be private at first, and only make public later if needed.
I'm not sure if when you eg .select or .filter or get a column from a table, if this .repr config should be inherited, or revert back to None
performance concerns with attaching another object to all Tables and Column objects??

@cpcloud any thoughts on this API? Might be confusing, I can make a PR to make it concrete

NickCrews · 2024-11-21T16:08:58Z

@kdheepak @contang0 @lboller-pwbm you all chimed in so far, we would love your feedback on #10518

contang0 · 2024-11-22T09:01:54Z

Looks very good, I have nothing to add apart from what @lboller-pwbm already mentioned. The row count should ideally be visible without needing to scroll horizontally if a table is wide, i.e. it should be left-justified.

contang0 · 2024-11-22T09:26:26Z

By the way, sometimes I wish the backend name was also somehow indicated when inspecting a table. Sometimes I work with multiple backends in the same project and I have no idea whether I can join two tables before I try. I know it's probably a very niche problem, but something to keep in mind if more people have the same issue.

NickCrews · 2024-11-22T15:34:15Z

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
ibis.duckdb.connect().read_parquet("t.pq") (I expect~0 difference)
ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive computation() (I expect a difference, size depends on semantics of the function)

NickCrews · 2024-11-22T15:35:06Z

@cpcloud I think that PR is ready for your thoughts whenever you get the chance, thank you!

kdheepak · 2024-11-22T15:55:39Z

One of the major reasons I like using ibis is that API is cleaner than pandas / polars etc. I can usually write a lot less code and get the same things done. Most of the data I work with on a day to day basis is in the "few thousand rows" magnitude and I use the duckdb backend. I expect I'll have to scale in the future on at least one of my projects, and I'm hoping ibis will make that trivial by changing the backend.

lboller-pwbm · 2024-11-22T16:10:24Z

I am more in the fifth bucket, currently - but I'm more hesitant to make the default something that would slow down interactive mode a lot at scale. I think an option like the one suggested in #10231 (comment) would be good and/or letting the user set an environment variable to toggle between a fast or detailed version of interactive mode as the default depending on the project.

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)

ibis.duckdb.connect().read_parquet("t.pq") (I expect~0 difference)

ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)

ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)

ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive computation() (I expect a difference, size depends on semantics of the function)

NickCrews · 2024-11-22T18:34:39Z

environment variable

@lboller-pwbm I've wanted this too. I started #10523 for any discussion of that.

contang0 · 2024-11-22T22:40:27Z

The way I access my data is usually similar to 1. I tend to start with a connection to Impala, and persist data to memory (duckdb) as soon as it's reasonably small. This is why seeing backend name would also be useful for me. Sometimes I get lost at which step I persisted the data to a memtable.

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

ibis.duckdb.connect("mydb.db").table("my_table") (I expect 0 difference)
ibis.duckdb.connect().read_parquet("t.pq") (I expect0 difference)
ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive computation() (I expect a difference, size depends on semantics of the function)

kdheepak added the feature Features or general enhancements label Sep 26, 2024

github-project-automation bot added this to Ibis planning and roadmap Sep 26, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Sep 26, 2024

cpcloud added the ux User experience related issues label Sep 26, 2024

NickCrews linked a pull request Nov 21, 2024 that will close this issue

feat(repr): add show_count option to interactive repr #10518

Open

NickCrews mentioned this issue Nov 22, 2024

feat: set config in environment vars/config files #10523

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

kdheepak commented Sep 26, 2024

cpcloud commented Sep 26, 2024

kdheepak commented Sep 26, 2024

cpcloud commented Sep 26, 2024

kdheepak commented Sep 26, 2024 •

edited

Loading

contang0 commented Oct 2, 2024

NickCrews commented Oct 5, 2024 •

edited

Loading

cpcloud commented Oct 5, 2024

cpcloud commented Oct 5, 2024 •

edited

Loading

cpcloud commented Oct 6, 2024

kdheepak commented Oct 6, 2024

lboller-pwbm commented Oct 16, 2024

cpcloud commented Nov 2, 2024

NickCrews commented Nov 7, 2024 •

edited

Loading

NickCrews commented Nov 21, 2024 •

edited

Loading

contang0 commented Nov 22, 2024

contang0 commented Nov 22, 2024

NickCrews commented Nov 22, 2024

NickCrews commented Nov 22, 2024

kdheepak commented Nov 22, 2024

lboller-pwbm commented Nov 22, 2024

NickCrews commented Nov 22, 2024

contang0 commented Nov 22, 2024

feat: show df.count(), len(df.schema()) when displaying interactive = True tables #10231

feat: show df.count(), len(df.schema()) when displaying interactive = True tables #10231

Comments

kdheepak commented Sep 26, 2024

Is your feature request related to a problem?

What is the motivation behind your request?

Describe the solution you'd like

What version of ibis are you running?

What backend(s) are you using, if any?

Code of Conduct

cpcloud commented Sep 26, 2024

kdheepak commented Sep 26, 2024

cpcloud commented Sep 26, 2024

kdheepak commented Sep 26, 2024 • edited Loading

contang0 commented Oct 2, 2024

NickCrews commented Oct 5, 2024 • edited Loading

cpcloud commented Oct 5, 2024

cpcloud commented Oct 5, 2024 • edited Loading

cpcloud commented Oct 6, 2024

kdheepak commented Oct 6, 2024

lboller-pwbm commented Oct 16, 2024

cpcloud commented Nov 2, 2024

NickCrews commented Nov 7, 2024 • edited Loading

NickCrews commented Nov 21, 2024 • edited Loading

contang0 commented Nov 22, 2024

contang0 commented Nov 22, 2024

NickCrews commented Nov 22, 2024

NickCrews commented Nov 22, 2024

kdheepak commented Nov 22, 2024

lboller-pwbm commented Nov 22, 2024

NickCrews commented Nov 22, 2024

contang0 commented Nov 22, 2024

feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

kdheepak commented Sep 26, 2024 •

edited

Loading

NickCrews commented Oct 5, 2024 •

edited

Loading

cpcloud commented Oct 5, 2024 •

edited

Loading

NickCrews commented Nov 7, 2024 •

edited

Loading

NickCrews commented Nov 21, 2024 •

edited

Loading