Remove Duplicates Order by a column #22

Open
jsubm5 opened this issue Jan 22, 2023 · 17 comments

@jsubm5

jsubm5 commented Jan 22, 2023

Need to modify the Remove Duplicates function to remove duplicates from a Delta table/Parquet file and keep the latest record (sorted by a timestamp column).
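For context, a minimal sketch of the requested behavior using plain Spark and the Delta Lake API, independent of this library's removeDuplicateRecords implementation; the table path and the id/firstname/lastname/timestamp column names are illustrative only:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().getOrCreate()

// Rank the rows inside each duplicate group, newest timestamp first.
val deltaTable = DeltaTable.forPath(spark, "/path/to/table")
val window = Window
  .partitionBy("firstname", "lastname")
  .orderBy(col("timestamp").desc)

// Every row except the most recent one per group is an old duplicate.
val oldDuplicates = deltaTable.toDF
  .withColumn("rn", row_number().over(window))
  .filter(col("rn") > 1)
  .select("id")

// Delete the old duplicates from the table itself, not just from a DataFrame.
deltaTable
  .as("t")
  .merge(oldDuplicates.as("d"), "t.id = d.id")
  .whenMatched()
  .delete()
  .execute()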

@jsubm5
Author

jsubm5 commented Jan 22, 2023

I would like to contribute to this. Let me know if I can take up this issue

@MrPowers
Collaborator

This seems like a useful addition. @brayanjuls - are you cool with this?

I think the sorting should be optional though and we should make sure it only happens if the user wants it (cause we shouldn't have them incur the additional sort cost).

@ilyasse05

There are a lot of use cases for this function, but I have a question: is the role of this function to remove rows from the query result, or directly from the table itself?

@MrPowers
Collaborator

@ilyasse05 - this is to remove them directly from the table itself ;)

@MrPowers
Collaborator

@Jegan7 - I just chatted with @brayanjuls on this one. He's going to think about this and ping you with a suggested implementation. After he pings you, it should be easy for you to write the code. We're looking forward to collaborating with you!!

@ilyasse05

I think it's not safe to do it like that; the safer approach would be to do it on the result with a select and then do a merge.

@brayanjuls
Collaborator

brayanjuls commented Jan 23, 2023

@ilyasse05 - All the functions that remove duplicates in this library were created with the idea of performing the action on the table itself, not working as an in-memory transformation. If you have use cases where you would like to have only the transformation, please open an issue describing what you need and we would be happy to brainstorm its implementation.

@brayanjuls
Collaborator

brayanjuls commented Jan 24, 2023

@Jegan7 @MrPowers - This is a good use case. I think to implement it we just need to add .desc to the current orderBy that is used in the row_number window function. Let me think a little bit more about how we can make it a parameter and I will ping you back with my final suggestions. Link to the function we need to modify: https://github.com/MrPowers/jodie/blob/9614cce474e0253a1c8876075f223dcee99735f0/src/main/scala/mrpowers/jodie/DeltaHelpers.scala#L64
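To make that suggestion concrete, a rough sketch of the one-line difference; the primaryKey/duplicateColumns values below are placeholders mirroring the function's parameters, not the actual library code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Placeholder parameter values for illustration.
val primaryKey = "timestamp"
val duplicateColumns = Seq("firstname", "lastname")

// Current behavior: ascending order on the primary key.
val ascendingWindow = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(col(primaryKey))

// Suggested behavior: descending order, so row_number() == 1 is the latest record.
val descendingWindow = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(col(primaryKey).desc)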

@brayanjuls
Collaborator

@Jegan7 - I finally had the time to think about it. I think you can achieve this use case in the following manner, assuming the input table is the following:

+---+---------+--------+---------+
| id|firstname|lastname|timestamp|
+---+---------+--------+---------+
|  1|   Benito| Jackson|       t1|   # duplicate
|  2|    Maria|  Willis|       t1|
|  3|     Jose|Travolta|       t1|   # duplicate
|  4|   Benito| Jackson|       t2|   # duplicate
|  5|     Jose|Travolta|       t2|   # duplicate
|  6|    Maria|    Pitt|       t1|
|  9|   Benito| Jackson|       t3|   # duplicate
+---+---------+--------+---------+

You can call the function in this way:

removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sortDirection = SortDirection.DESCENDING)

Note that I added a new parameter called sortDirection, whose type is the SortDirection enum from the Spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR.
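If I read that call correctly (this expected output is an inference, not part of the original comment), only the most recent row per (firstname, lastname) pair would survive, so the table would end up as:

+---+---------+--------+---------+
| id|firstname|lastname|timestamp|
+---+---------+--------+---------+
|  2|    Maria|  Willis|       t1|
|  5|     Jose|Travolta|       t2|
|  6|    Maria|    Pitt|       t1|
|  9|   Benito| Jackson|       t3|
+---+---------+--------+---------+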

@jsubm5
Author

jsubm5 commented Jan 30, 2023 via email

@brayanjuls
Collaborator

@Jegan7 Just checking, do you need any help to move this issue forward?

@jsubm5
Author

jsubm5 commented Mar 16, 2023

Hi @brayanjuls, in your comment above, why do you set primaryKey to timestamp? Isn't that the order-by column?

@brayanjuls
Collaborator

@jsubm5 - my proposal is to use the primaryKey as an ordering column as well, because currently we use it both to sort in ascending order and to identify duplicates. The primaryKey should be a column that differentiates the duplicated records (is unique within each duplication set); that's why I use the timestamp there.

The change should be to add something like new Column(SortOrder(col(primaryKey).expr, sort)) to the orderBy that currently exists in that function. Let me know your thoughts.
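A self-contained sketch of what that orderBy change could look like (the parameter values below are illustrative; in the real function they would arrive as arguments):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Descending, SortDirection, SortOrder}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Illustrative values; in the real function these would be parameters.
val primaryKey: String = "timestamp"
val duplicateColumns: Seq[String] = Seq("firstname", "lastname")
val sort: SortDirection = Descending

// Wrap the catalyst SortOrder expression in a Column so it can be used in orderBy.
val orderingColumn = new Column(SortOrder(col(primaryKey).expr, sort))

val window = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(orderingColumn)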

@jsubm5
Author

jsubm5 commented Mar 24, 2023

@brayanjuls, the sort column can be any column that the user passes to the function (preferably a timestamp column, but if there is no timestamp they can choose any other column from the dataset), and we should let them pass the sort order as well.

@brayanjuls
Collaborator

@jsubm5 - Indeed, the primaryKey could be any other column. The timestamp column I mentioned above is just an example of how the function could be used after the feature is implemented.

@jsubm5
Author

jsubm5 commented Apr 1, 2023

@brayanjuls should we let the user sort by multiple columns, with a sort order for each column?

@brayanjuls
Collaborator

@jsubm5 - to achieve that we would need to support a composite primaryKey in this function. I think it would be good to first implement this using a single primary key and afterward open a new issue to implement the composite primary key feature, which would also allow ordering by multiple columns.
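For reference, a hypothetical sketch of what ordering by a composite key could look like in that follow-up issue (the column names and per-column sort directions are made up for illustration):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Ascending, Descending, SortDirection, SortOrder}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Hypothetical composite key: each column paired with its own sort direction.
val compositeKey: Seq[(String, SortDirection)] =
  Seq("event_date" -> Descending, "event_time" -> Descending, "id" -> Ascending)

val orderingColumns: Seq[Column] =
  compositeKey.map { case (name, direction) =>
    new Column(SortOrder(col(name).expr, direction))
  }

val window = Window
  .partitionBy("firstname", "lastname")
  .orderBy(orderingColumns: _*)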
