-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Duplicates Order by a column #22
Comments
I would like to contribute to this. Let me know if I can take up this issue |
This seems like a useful addition. @brayanjuls - are you cool with this? I think the sorting should be optional though and we should make sure it only happens if the user wants it (cause we shouldn't have them incur the additional sort cost). |
For this function, there is a lot of uses cases, but i have a question the role of this function is to give possibility to remove rows from result request or directly from table itself ? |
@ilyasse05 - this is to remove from directly the table itself ;) |
@Jegan7 - I just chatted with @brayanjuls on this one. He's going to think about this and ping you with a suggested implementation. After he pings you, it should be easy for you to write the code. We're looking forward to collaborating with you!! |
I think it's not secure to do it like that, the more secure will do it on resulat with select and after do a merge. |
@ilyasse05 - All the functions that remove duplicates in this library were created with the idea of performing the action in the table and not to work as an in-memory transformation. If you have use cases where you would like to have only the transformation, please open an issue with what you need and we would be happy to brainstorm its implementation. |
@Jegan7 @MrPowers - This is a good use case. I think to implement it we just need to add .desc to the current orderBy that is used in the row_number window function. Let me just think a little bit more about how we can make it a parameter and I will ping you back with my final suggestions. link to the function that we need to modify: https://github.com/MrPowers/jodie/blob/9614cce474e0253a1c8876075f223dcee99735f0/src/main/scala/mrpowers/jodie/DeltaHelpers.scala#L64 |
@Jegan7 - I finally had the time to think about it, I think you can achieve this uses case in the following manner, assuming input table is the following:
you can call the function in this way removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sorDirection = SortDirection.DESCENDING) Note that I added a new parameter called sort direction, that's an enum from the spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR. |
Thanks Bryan
I will look into it
Thanks
Jegan
…________________________________
From: Brayan Jules ***@***.***>
Sent: Saturday, January 28, 2023 6:01:26 PM
To: MrPowers/jodie ***@***.***>
Cc: Jegan7 ***@***.***>; Mention ***@***.***>
Subject: Re: [MrPowers/jodie] Remove Duplicates Order by a column (Issue #22)
@Jegan7<https://github.com/Jegan7> - I finally had the time to think about it, I think you can achieve this uses case in the following manner, assuming input table is the following:
+----+---------+------+-----------+
| id|firstname| lastname| timestamp|
+----+---------+------+-----------+
| 1| Benito| Jackson| t1| # duplicate
| 2| Maria| Willis| t1|
| 3| Jose| Travolta| t1| # duplicate
| 4| Benito| Jackson| t2| # duplicate
| 5| Jose| Travolta| t2| # duplicate
| 6| Maria| Pitt| t1|
| 9| Benito| Jackson| t3| # duplicate
+----+---------+------+------------+
you can call the function in this way
removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sorDirection = SortDirection.DESCENDING)
Note that I added a new parameter called sort direction, that's an enum from the spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR.
—
Reply to this email directly, view it on GitHub<#22 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/APDZ3ROD77KAGZNQKXQZO7DWUWQMNANCNFSM6AAAAAAUDH3UFE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@Jegan7 Just checking, do you need help to advance on this issue? |
Hi @brayanjuls , in your above comment why do you set primarykey to timestamp? Isn't it the order by column? |
@jsubm5 - my proposal is to use the primarykey as an ordering column as well because currently, we use it to sort in ascending order and to identify duplicates. The primarykey should be a column that differentiates(is unique on each duplication set) the duplicated records that's why I use the timestamp there. The change should be to add something like this |
@brayanjuls ,the sort column can be any column that user pass it in the function (it preferably should be a timestamp column but if there is no timestamp then they can choose any other column from the dataset) and we should let them pass the sort order as well |
@jsubm5 - Indeed the primaryKey could be any other column. The timestamp colum that I mention above is just an example of how the function could be used after the feature is implemented. |
@brayanjuls should we let the user sort by multiple columns and sort order for each columns? |
@jsubm5 - to achieve that we should support composed primaryKey in this function. I think it would be good to firstly implement this using a single primary key and afterward open a new issue to implement the composed primary key feature which would also allow to order by multiple columns. |
Need to modify Remove Duplicates function to remove duplicates from delta table/parquet file and keep latest record (sort by timestamp column)
The text was updated successfully, but these errors were encountered: