Remove Duplicates Order by a column #22

Open
jsubm5 opened this issue Jan 22, 2023 · 17 comments

@jsubm5

jsubm5 commented Jan 22, 2023

Need to modify the Remove Duplicates function to remove duplicates from a Delta table/Parquet file and keep the latest record (sorted by a timestamp column).
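For context, a minimal sketch of the requested behavior using plain Spark and the Delta Lake API, independent of this library's removeDuplicateRecords implementation; the table path and the id/firstname/lastname/timestamp column names are illustrative only:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().getOrCreate()

// Rank the rows inside each duplicate group, newest timestamp first.
val deltaTable = DeltaTable.forPath(spark, "/path/to/table")
val window = Window
  .partitionBy("firstname", "lastname")
  .orderBy(col("timestamp").desc)

// Every row except the most recent one per group is an old duplicate.
val oldDuplicates = deltaTable.toDF
  .withColumn("rn", row_number().over(window))
  .filter(col("rn") > 1)
  .select("id")

// Delete the old duplicates from the table itself, not just from a DataFrame.
deltaTable
  .as("t")
  .merge(oldDuplicates.as("d"), "t.id = d.id")
  .whenMatched()
  .delete()
  .execute()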

@jsubm5
Author

jsubm5 commented Jan 22, 2023

I would like to contribute to this. Let me know if I can take up this issue

@MrPowers
Collaborator

This seems like a useful addition. @brayanjuls - are you cool with this?

I think the sorting should be optional though and we should make sure it only happens if the user wants it (cause we shouldn't have them incur the additional sort cost).

@ilyasse05

There are a lot of use cases for this function, but I have a question: is the role of this function to remove rows from the query result, or directly from the table itself?

@MrPowers
Collaborator

@ilyasse05 - this is to remove them directly from the table itself ;)

@MrPowers
Collaborator

@Jegan7 - I just chatted with @brayanjuls on this one. He's going to think about this and ping you with a suggested implementation. After he pings you, it should be easy for you to write the code. We're looking forward to collaborating with you!!

@ilyasse05

I think it's not safe to do it like that; the safer approach would be to do it on the result with a select and then do a merge.

@brayanjuls
Collaborator

brayanjuls commented Jan 23, 2023

@ilyasse05 - All the functions that remove duplicates in this library were created with the idea of performing the action on the table itself, not working as an in-memory transformation. If you have use cases where you would like to have only the transformation, please open an issue describing what you need and we would be happy to brainstorm its implementation.

@brayanjuls
Collaborator

brayanjuls commented Jan 24, 2023

@Jegan7 @MrPowers - This is a good use case. I think to implement it we just need to add .desc to the current orderBy that is used in the row_number window function. Let me think a little bit more about how we can make it a parameter and I will ping you back with my final suggestions. Link to the function we need to modify: https://github.com/MrPowers/jodie/blob/9614cce474e0253a1c8876075f223dcee99735f0/src/main/scala/mrpowers/jodie/DeltaHelpers.scala#L64
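To make that suggestion concrete, a rough sketch of the one-line difference; the primaryKey/duplicateColumns values below are placeholders mirroring the function's parameters, not the actual library code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Placeholder parameter values for illustration.
val primaryKey = "timestamp"
val duplicateColumns = Seq("firstname", "lastname")

// Current behavior: ascending order on the primary key.
val ascendingWindow = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(col(primaryKey))

// Suggested behavior: descending order, so row_number() == 1 is the latest record.
val descendingWindow = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(col(primaryKey).desc)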

@brayanjuls
Collaborator

@Jegan7 - I finally had the time to think about it. I think you can achieve this use case in the following manner, assuming the input table is the following:

+---+---------+--------+---------+
| id|firstname|lastname|timestamp|
+---+---------+--------+---------+
|  1|   Benito| Jackson|       t1|   # duplicate
|  2|    Maria|  Willis|       t1|
|  3|     Jose|Travolta|       t1|   # duplicate
|  4|   Benito| Jackson|       t2|   # duplicate
|  5|     Jose|Travolta|       t2|   # duplicate
|  6|    Maria|    Pitt|       t1|
|  9|   Benito| Jackson|       t3|   # duplicate
+---+---------+--------+---------+

You can call the function in this way:

removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sortDirection = SortDirection.DESCENDING)

Note that I added a new parameter called sortDirection, whose type is the SortDirection enum from the Spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR.
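If I read that call correctly (this expected output is an inference, not part of the original comment), only the most recent row per (firstname, lastname) pair would survive, so the table would end up as:

+---+---------+--------+---------+
| id|firstname|lastname|timestamp|
+---+---------+--------+---------+
|  2|    Maria|  Willis|       t1|
|  5|     Jose|Travolta|       t2|
|  6|    Maria|    Pitt|       t1|
|  9|   Benito| Jackson|       t3|
+---+---------+--------+---------+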

@jsubm5
Author

jsubm5 commented Jan 30, 2023 via email

@brayanjuls
Collaborator

@Jegan7 Just checking, do you need any help to move this issue forward?

@jsubm5
Author

jsubm5 commented Mar 16, 2023

Hi @brayanjuls, in your comment above, why do you set primaryKey to timestamp? Isn't that the order-by column?

@brayanjuls
Collaborator

@jsubm5 - my proposal is to use the primaryKey as an ordering column as well, because currently we use it both to sort in ascending order and to identify duplicates. The primaryKey should be a column that differentiates the duplicated records (is unique within each duplication set); that's why I use the timestamp there.

The change should be to add something like new Column(SortOrder(col(primaryKey).expr, sort)) to the orderBy that currently exists in that function. Let me know your thoughts.
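A self-contained sketch of what that orderBy change could look like (the parameter values below are illustrative; in the real function they would arrive as arguments):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Descending, SortDirection, SortOrder}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Illustrative values; in the real function these would be parameters.
val primaryKey: String = "timestamp"
val duplicateColumns: Seq[String] = Seq("firstname", "lastname")
val sort: SortDirection = Descending

// Wrap the catalyst SortOrder expression in a Column so it can be used in orderBy.
val orderingColumn = new Column(SortOrder(col(primaryKey).expr, sort))

val window = Window
  .partitionBy(duplicateColumns.map(col): _*)
  .orderBy(orderingColumn)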

@jsubm5
Author

jsubm5 commented Mar 24, 2023

@brayanjuls, the sort column can be any column that the user passes to the function (preferably a timestamp column, but if there is no timestamp they can choose any other column from the dataset), and we should let them pass the sort order as well.

@brayanjuls
Collaborator

@jsubm5 - Indeed, the primaryKey could be any other column. The timestamp column I mentioned above is just an example of how the function could be used after the feature is implemented.

@jsubm5
Author

jsubm5 commented Apr 1, 2023

@brayanjuls should we let the user sort by multiple columns, with a sort order for each column?

@brayanjuls
Collaborator

@jsubm5 - to achieve that we would need to support a composite primaryKey in this function. I think it would be good to first implement this using a single primary key and afterward open a new issue to implement the composite primary key feature, which would also allow ordering by multiple columns.
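For reference, a hypothetical sketch of what ordering by a composite key could look like in that follow-up issue (the column names and per-column sort directions are made up for illustration):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Ascending, Descending, SortDirection, SortOrder}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Hypothetical composite key: each column paired with its own sort direction.
val compositeKey: Seq[(String, SortDirection)] =
  Seq("event_date" -> Descending, "event_time" -> Descending, "id" -> Ascending)

val orderingColumns: Seq[Column] =
  compositeKey.map { case (name, direction) =>
    new Column(SortOrder(col(name).expr, direction))
  }

val window = Window
  .partitionBy("firstname", "lastname")
  .orderBy(orderingColumns: _*)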
