Dumb Question: What is the Mojo way of using/replacing pandas/dataframes? #1446
Replies: 3 comments
-
Right now there are no dataframe libraries for Mojo, since the language is still in its infancy. You could import pandas, polars, duckdb, and/or ibis and wrap all your Python code inside Mojo. However, since Mojo would just be calling into libpython under the hood, there shouldn't be any speedup.

As a side note, I would recommend looking at polars, or duckdb + ibis, if you need a speedup over pandas. Even for 5 GB and 50 GB data sets, polars and duckdb are almost an order of magnitude faster than Spark in the H2O.ai benchmark.

Eventually, with a lot of work, Mojo should excel if a group of people ever writes a mojo-arrow lib. Since Arrow is columnar, Mojo would be great at processing the data in parallel chunks (a column chunk could be wrapped inside a multidimensional SIMD object, for example). For purely mathematical columns, GPU acceleration will also help. So I see a very bright future for data processing and analytics query engines written in Mojo (indeed, this is my main area of interest, as I work in data engineering). Who knows, maybe someone could write a ray.io competitor in Mojo too, especially if there's already some thought on an Actor library.
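To make the "columnar data processed in chunks" idea a bit more concrete, here is a minimal sketch in plain Python (not Mojo, and not Arrow itself). Everything in it is an assumption for illustration: the helper names `chunk_bounds`, `chunk_sum`, and `parallel_column_sum` are hypothetical, and the columns are plain `array.array` buffers standing in for Arrow column chunks. CPython threads will not actually speed this up because of the GIL; the point is only the shape of the computation, where each chunk is reduced independently and the partial results are combined.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) index pairs that together cover range(n_rows)."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def chunk_sum(column, start, stop):
    """Reduce one contiguous slice of a single column."""
    return sum(column[start:stop])


def parallel_column_sum(column, chunk_size=4, workers=2):
    """Sum a column by reducing fixed-size chunks independently, then combining."""
    bounds = list(chunk_bounds(len(column), chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: chunk_sum(column, *b), bounds)
        return sum(partials)


# A "table" stored column by column: each column is one contiguous typed buffer.
# The values are made up for the example.
table = {
    "price": array("d", [1.5, 2.0, 3.5, 4.0, 5.0, 6.0, 7.5, 8.0, 9.0, 10.5]),
    "qty": array("q", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
}

total_price = parallel_column_sum(table["price"])
total_qty = parallel_column_sum(table["qty"])
```

Because every chunk touches only one column and one contiguous range, the per-chunk work is exactly the kind of loop a SIMD-aware language could vectorize, which is the advantage hoped for from a future Mojo + Arrow library.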
-
Thanks for your explanation! That's kind of what I was afraid of: "eventually, with a lot of work..." I'll check out Ibis. Side note: I've tried incorporating polars, but part of my workflow involves geopandas, and there's not a lot of coverage for that functionality in polars.
-
Well, hopefully Mojo will get there one day :) Ibis is like a universal dataframe library: you write against the Ibis dataframe API, and it interops with pyspark, duckdb (which is SQL only), polars, pandas, or other backends that only have SQL. It will probably slow you down a little bit, but duckdb with Ibis is nice if you don't like SQL or need to interop with other code. If polars
-
I use pandas primarily to extract data, store it at runtime, and filter/fetch individual values or subsets. Sometimes I generate new columns, but I can live without that feature. What should I use in Mojo so that I can accelerate my existing workflows? I imagine that just importing pandas is not going to speed anything up. Is there a "struct-based dataframe design pattern" floating around that I've never heard of? Is recreating a dataframe-based workflow in Mojo a multi-person, multi-project effort?
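For readers wondering what a "struct-based dataframe" might look like for the operations listed above (store, filter, fetch a value, derive a column), here is one possible shape, sketched in Python rather than Mojo. It is a struct-of-arrays layout: one list per column held inside a single record type. The class `ColumnFrame` and its methods are hypothetical names invented for this sketch, not an existing library or an established Mojo pattern, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnFrame:
    """Hypothetical struct-of-arrays table: one Python list per column."""

    columns: dict = field(default_factory=dict)

    def __len__(self):
        # Row count is the length of any column (0 for an empty frame).
        return len(next(iter(self.columns.values()), []))

    def add_column(self, name, values):
        """Store a new column, checking that it matches the row count."""
        values = list(values)
        if self.columns and len(values) != len(self):
            raise ValueError("column length does not match row count")
        self.columns[name] = values

    def value(self, name, row):
        """Fetch a single cell by column name and row index."""
        return self.columns[name][row]

    def filter(self, name, predicate):
        """Return a new frame with the rows where predicate(column value) is true."""
        keep = [i for i, v in enumerate(self.columns[name]) if predicate(v)]
        return ColumnFrame(
            {k: [col[i] for i in keep] for k, col in self.columns.items()}
        )

    def derive(self, name, func, *sources):
        """Add a column computed row by row from existing source columns."""
        rows = zip(*(self.columns[s] for s in sources))
        self.add_column(name, (func(*vals) for vals in rows))


frame = ColumnFrame()
frame.add_column("name", ["a", "b", "c"])
frame.add_column("score", [3, 8, 6])
frame.derive("passed", lambda s: s > 5, "score")
high = frame.filter("score", lambda s: s > 5)
```

Keeping each column as its own contiguous sequence is what would let a typed, compiled version operate on whole columns at a time; the Python version here only shows the interface, not any speedup.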