Rewrite table scripts to rely solely on deduplicated data (for re-imported data) #196

TNRiley · 2024-07-31T13:04:48Z

The following tables currently rely on the original uploaded citation file data as well as the post-deduplication data. Relying on the original file data is a problem because users cannot recreate any of the tables from exported CiteSource data when they re-import it.

Changing these tables to be able to rely on only the deduplicated data would allow users to retain a single .ris or .csv rather than all of the initial raw citation files.

This should be achievable because the exported data retains duplicate_id, record_ids, cite_source, and cite_label (cite_string is not currently exported but is easily added).
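As a rough illustration of the idea (a sketch, not code from the package), per-source counts could be rebuilt from the deduplicated rows alone. The data frame below is made up, and the sketch assumes that cite_source stores the sources of a merged record as a comma-separated string; the object names are hypothetical.

```r
# Hypothetical re-imported export: one row per unique (deduplicated) record.
# Assumption: sources of merged duplicates are comma-separated in cite_source.
unique_citations <- data.frame(
  duplicate_id = c("1", "2", "3"),
  record_ids   = c("1, 4", "2", "3, 5, 6"),
  cite_source  = c("wos, scopus", "wos", "scopus, pubmed, wos"),
  stringsAsFactors = FALSE
)

# Split the combined field into one source name per element.
sources_per_record <- strsplit(unique_citations$cite_source, ",\\s*")

# Distinct records per source (a record counts once for each source listing it).
records_by_source <- table(unlist(sources_per_record))

# Records listed by exactly one source are unique to that source.
is_unique <- lengths(sources_per_record) == 1
unique_by_source <- table(factor(unlist(sources_per_record[is_unique]),
                                 levels = names(records_by_source)))

counts <- data.frame(
  source  = names(records_by_source),
  records = as.integer(records_by_source),
  unique  = as.integer(unique_by_source),
  stringsAsFactors = FALSE
)
print(counts)
```

Note that these are post-deduplication counts: duplicates within a single source have already been collapsed, so raw upload totals would need another field such as record_ids.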

TNRiley · 2024-08-06T19:51:57Z

@LukasWallrich I've created two new functions that should be able to replace the record_counts function and the record_counts_table function. I based them on re-import of a .csv. Can you review this? It works in all the various vignettes and other test libraries I've tried. I'm thinking that I'd like to keep these functions separate and give them names that reflect processing re-imported data.

https://gist.github.com/TNRiley/b48fc14f719c00d9030af9705bd0898a

The more difficult ones will be calculate_record_counts and the tables that use its output...

LukasWallrich · 2024-08-07T19:12:10Z

Happy to review this - but for context, why are you proposing to keep both?

If we can do everything with unique_citations, then that would seem to make for an easier user interface?

If this is about not changing functions, then let's introduce new ones and deprecate old ones ... but not have two parallel sets?

TNRiley · 2024-08-08T10:28:10Z

I was thinking we'd keep them both until everything is fully tested and then deprecate the old ones.

LukasWallrich · 2024-08-08T11:01:30Z

Ok, that makes sense - will review your functions & then let's think of what to call the new functions. Currently, our interface is not particularly consistent between verbs (dedup_citations()) and nouns (record_counts()) - so I would suggest shifting towards verbs where we can as we introduce new functions?

LukasWallrich · 2024-08-08T21:19:29Z

Just looked at yours, and it looks good (though you might want to try to stick to one style a bit more - when using dplyr, it makes sense to use _join functions rather than merge).
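To make the style point concrete, here is a small sketch with made-up tables (the object and column names are hypothetical, not taken from the gist). It assumes dplyr is installed and only contrasts the two approaches.

```r
library(dplyr)

# Hypothetical tables for illustration only.
source_counts <- data.frame(
  source  = c("wos", "scopus", "pubmed"),
  records = c(3L, 2L, 1L),
  stringsAsFactors = FALSE
)
unique_counts <- data.frame(
  source   = "wos",
  n_unique = 1L,
  stringsAsFactors = FALSE
)

# Base R: merge() sorts the result by the key column by default.
merged_base <- merge(source_counts, unique_counts, by = "source", all.x = TRUE)

# dplyr: left_join() keeps the row order of the left table and fits in a pipe.
merged_dplyr <- source_counts %>%
  left_join(unique_counts, by = "source") %>%
  mutate(n_unique = coalesce(n_unique, 0L))  # sources without a match get 0
```

Both calls return the same rows here; the difference is the row order and that the join stays in the same pipe as the other dplyr verbs.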

[EDIT: I take back my earlier suggestion to separate unique_citations back into citations in a helper function - that does not work with missing data in any of the original cite_ columns. You seem to be making great progress, I will be happy to review the code and think about how to streamline some of it when you are through.]

TNRiley · 2024-08-09T12:25:43Z

I've gone through each of the new functions and applied changes as needed to maintain a consistent style. I've also added comments and placeholder examples. I'm going to drop the new functions into a new script in the R directory. I also do not think that we need to keep the "citation_summary_table". That function was the first one I created and isn't as useful now. We can deprecate it when we do the others.

TNRiley · 2024-08-09T13:49:16Z

Screwed up a little on the script and will need to try again. I had to add the .data prefix to the columns because I was getting some issues with them, and I forgot that is what I had done previously. I need to figure out how to test these against the R CMD check before bringing them into GitHub. @LukasWallrich is it as simple as adding the .R file locally and then running devtools::check()?
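For context, a minimal sketch of the .data prefix in a dplyr pipeline. The helper name count_by_source is hypothetical and is not one of the new CiteSource functions; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical helper for illustration; not a CiteSource function.
count_by_source <- function(citations) {
  citations %>%
    # .data$cite_source marks cite_source as a column of the data, so
    # R CMD check does not report it as an undefined global variable.
    group_by(.data$cite_source) %>%
    summarise(n = n(), .groups = "drop")
}

# Made-up input data.
example_citations <- data.frame(
  cite_source = c("wos", "wos", "scopus"),
  stringsAsFactors = FALSE
)
result <- count_by_source(example_citations)
print(result)
```

Inside a package, the pronoun itself usually also needs to be imported (for example with `@importFrom rlang .data` in the roxygen block) before running devtools::check() locally.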

LukasWallrich · 2024-08-09T19:08:32Z

Yes, that should do it


TNRiley · 2024-08-14T13:23:19Z

I believe the functions are good to go now. The only issue I was running into was with the @example data for the new create_detailed_records_table. I spent many hours trying to resolve it, no matter what I tried, but it didn't fail after my push.

Closing this now as it is complete. If there are any issues or changes with individual new functions, we can add them in a separate issue. Once these are fully integrated into the vignettes, we can deprecate the old functions and change the version.

Examples should be changed in the future in relation to issue #200.

TNRiley added the Enhancement (New feature or request) and Output (Data output) labels Jul 31, 2024

TNRiley self-assigned this Jul 31, 2024

TNRiley mentioned this issue Aug 6, 2024

Clarify instruction for manual dedup #195

Closed

TNRiley closed this as completed Aug 14, 2024
