-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support more DML statements #209
Comments
So the idea then would be to create a "SQLFrame Table" object for each engine that you could then call operations like update and merge on? It is a cool idea and would expand the functionality of SQLFrame. Is this something you are interested in contributing? |
Yes, |
Hi, |
Yeah this is how I was thinking about it. Want to take a pass at it but just target DuckDB at first? The goal would be to get aligned on the interface of the class and how the DataFrame and Table API work together. I tend to prototype things on DuckDB because of the speed/convenience of it. |
Hi, I'll start working on it with DuckDB and Databricks (This is the platform I use for work) and I'll try to get an API aligned to what you have right now. For now, my idea is to create a Table class which inherits from DataFrame with the new functionallity. |
Ok that sounds interesting. So the thinking is that although it now returns a table it will maintain API compatibility by converting to a DataFrame if the following method call is not part of the table API. I'm also open to creating new API endpoints for this (like |
Hi, I'm working on this on my fork. Could you take a look? Creating a completely separate API is also possible. |
This is how it is working: When you use the function df_employee = session.read.table("employee")
print(type(df_employee))
When you use any other function (except for alias), it gets converted to a dataframe. print(type(df_employee.select("*")))
The new functions work something like this: # Lazy operation
update = df_employee.update(
_set={"age": df_employee["age"] + 1},
where=df_employee["id"] == 1,
)
# Execute the update
update.execute() delete # Lazy operation
delete= df_employee.delete(
where=df_employee["age"] > 100,
)
# Execute the delete
delete.execute() merge new_df = df_employee.where(F.col("age") > 40)
# Lazy operation
merge = df_employee.merge(
new_df,
(df_employee["id"] == new_df["id"]) & F.lit(True),
[
WhenMatched(condition=(df_employee["id"] == new_df["id"])).update(
set_={"age": df_employee["age"] + new_df["age"] + new_df["age"] + 1}
),
WhenNotMatched().insert(
values={
"id": new_df["id"],
"fname": new_df["fname"],
"lname": F.col("lname"),
"age": new_df["age"] + 1,
"store_id": F.col("store_id"),
"active": F.lit(True),
}
),
WhenNotMatchedBySource().delete(),
],
)
# Execute the merge
merge.execute() |
Very cool. I really like how it maintains compatibility with the PySpark API while also creating a clean interface for doing these table operations. Nice job! 👍 I looked over your fork and overall it looks great. Will certainly dig more into the details once you are ready to create a PR. One thing I did notice is that it currently doesn't have any tests. Is that something you plan on doing before you submit the PR? |
Hi, Yes, adding some tests is something I plan to do. |
Hi,
Right now we could say that sqlframe only support the INSERT statement
I know that spark by itself does not support this kind of statements but I think they would be really usefull.
Also I don't think that supporting that would be out of scope for a library like this because at the end the idea of sqlframe is to run using a SQL Database as the backend and these are usual operations done in a Database/Datawarehouse.
Libraries like delta-lake and Snowpark support these kind of operations so they could be a good starting point to define a good API for this.
The text was updated successfully, but these errors were encountered: