Dumb Question: What is the Mojo way of using/replacing pandas/dataframes? #1446

cjohnson318 · 2023-12-09T02:14:35Z

I use pandas primarily to extract data, store it at runtime, and filter/fetch individual values or subsets. Sometimes I generate new columns, but I can live without that feature. What should I use in Mojo so that I can accelerate my existing workflows? I imagine that just importing pandas is not going to speed anything up. Is there a "struct-based dataframe design pattern" floating around that I've never heard of? Is recreating a dataframe-based workflow in Mojo a multi-person, multi-project effort?

rarebreed · 2023-12-09T03:15:17Z

Right now, there are no dataframe libraries for mojo, since mojo is still in its infancy.

You could import pandas, polars, duckdb, and/or ibis, and probably just wrap all your Python code inside Mojo. However, since Mojo will just be calling into libpython under the hood, there shouldn't be any speedup. As a side note, I would recommend looking at polars, or duckdb+ibis, if you need a speedup over pandas. Even for 5 GB and 50 GB data sets, polars and duckdb are almost an order of magnitude faster than Spark in the H2O.ai benchmark.

Eventually, with a lot of work, mojo should excel if a group of people ever write a mojo-arrow lib. Since arrow is columnar, mojo would be great at processing the data in parallel chunks (a column chunk could be wrapped inside a multidimensional SIMD object for example). For purely mathematical columns, GPU acceleration will also help. So I see a very bright future for data processing and analytics query engines written in mojo (indeed, this is my main area of interest, as I work in data engineering).
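To make the columnar idea a bit more concrete, here is a toy sketch in plain Python (not Mojo, and not Arrow). Everything in it is made up for illustration: `ColumnTable`, `CHUNK`, and the method names are hypothetical, and the fixed-width chunks only stand in for what a SIMD vector would hold. It just shows the shape of the pattern: one contiguous typed array per column, processed chunk by chunk.

```python
from array import array
from dataclasses import dataclass, field

CHUNK = 4  # hypothetical chunk width, standing in for a SIMD vector width


@dataclass
class ColumnTable:
    """Toy columnar table: each column is one contiguous array of doubles."""

    columns: dict[str, array] = field(default_factory=dict)

    def add_column(self, name: str, values) -> None:
        # "d" stores the values as C doubles in a single contiguous buffer.
        self.columns[name] = array("d", values)

    def chunks(self, name: str, width: int = CHUNK):
        # Yield fixed-width slices of one column; the last one may be shorter.
        col = self.columns[name]
        for start in range(0, len(col), width):
            yield col[start:start + width]

    def column_sum(self, name: str) -> float:
        # Each chunk is reduced on its own, then the partial sums are combined.
        # The per-chunk step is the part a SIMD or parallel engine would speed up.
        return sum(sum(chunk) for chunk in self.chunks(name))

    def filter_gt(self, name: str, threshold: float) -> list[int]:
        # Return the row indices where the column value exceeds the threshold.
        return [i for i, v in enumerate(self.columns[name]) if v > threshold]


table = ColumnTable()
table.add_column("price", [1.5, 2.5, 3.0, 4.0, 5.0])
total = table.column_sum("price")
rows = table.filter_gt("price", 2.75)
```

This runs sequentially, so it is no faster than a plain loop. The point is only that a column-per-array layout lets each chunk be handled independently, which is what a Mojo implementation could exploit.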

Who knows, maybe someone could write a ray.io competitor in mojo too, especially if there's already some thought on an Actor library.

cjohnson318 · 2023-12-09T04:13:58Z

Thanks for your explanation! That's kind of what I was afraid of: "eventually, with a lot of work..." I'll check out Ibis.

Side note: I've tried incorporating polars, but part of my workflow involves geopandas, and there's not a lot of coverage for that functionality in polars.

rarebreed · 2023-12-09T05:12:36Z

> Thanks for your explanation! That's kind of what I was afraid of: "eventually, with a lot of work..." I'll check out Ibis.
>
> Side note: I've tried incorporating polars, but part of my workflow involves geopandas, and there's not a lot of coverage for that functionality in polars.

Well, hopefully mojo will get there one day :)

Ibis is like a universal dataframe library: you write against the Ibis dataframe API, and it interops with pyspark, duckdb (which is SQL only), polars, pandas, or other backends that only have SQL. It will probably slow you down a little. But duckdb with Ibis is nice if you don't like SQL or need to interop with other code.

If polars' to_pandas doesn't help, then duckdb probably won't help either. DuckDB has a DuckDBPyRelation type that can convert to pandas and polars dataframes or raw Arrow tables.
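For reference, a rough sketch of those conversions, assuming duckdb, pandas, polars, and pyarrow are all installed. The helper name `relation_to_frames` is made up for this example. The imports live inside the function so that the libraries are only needed when it is actually called.

```python
def relation_to_frames(query: str):
    """Run `query` in DuckDB and return the result in three formats.

    Hypothetical helper for illustration; it needs duckdb plus pandas,
    polars, and pyarrow available at call time.
    """
    import duckdb  # third-party, imported lazily

    rel = duckdb.sql(query)                # a DuckDBPyRelation (lazy)
    pandas_df = rel.df()                   # pandas DataFrame
    polars_df = rel.pl()                   # polars DataFrame
    arrow_table = rel.fetch_arrow_table()  # pyarrow Table
    return pandas_df, polars_df, arrow_table
```

Something like `relation_to_frames("SELECT 1 AS a")` should hand back the same one-row result three ways. For the geopandas case you would still go through the pandas frame, so this doesn't add anything polars' to_pandas doesn't already give you.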
