Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite table scripts to rely soley on deduplicated data (for re-imported data) #196

Closed
6 of 7 tasks
TNRiley opened this issue Jul 31, 2024 · 9 comments
Closed
6 of 7 tasks
Assignees
Labels
Enhancement New feature or request Output Data output

Comments

@TNRiley
Copy link
Collaborator

TNRiley commented Jul 31, 2024

The following tables currently rely on the original uploaded citation file data as well as the post-deduplication data. Reliance on the original file data is problematic due to the fact that users will not be able to recreate any of the tables from exported CiteSource data when re-importing.

Changing these tables to be able to rely on only the deduplicated data would allow users to retain a single .ris or .csv rather than all of the initial raw citation files.

  • record_counts
  • record_counts_table
  • calculate_record_counts
  • record_summary_table
  • citation_summary_table
  • calculate_phase_counts
  • precision_sensitivity_table

This should be able to be accomplished as the exported data retains the duplicate_id, record_ids, cite_source, cite_label, and (cite_string - not currently exported but easily added)

@TNRiley TNRiley added Enhancement New feature or request Output Data output labels Jul 31, 2024
@TNRiley TNRiley self-assigned this Jul 31, 2024
@TNRiley
Copy link
Collaborator Author

TNRiley commented Aug 6, 2024

@LukasWallrich I've created two new functions that should be able to replace the record_counts function and the record_counts_table function. I based them on re-import of a .csv -- Can you review this? It works in all the various vignettes and other test libries I've tried. I'm thinking that I'd like to keep these functions separate and name them something according to processing re-imported data.

https://gist.github.com/TNRiley/b48fc14f719c00d9030af9705bd0898a

the more difficult ones will be the calculate_record_counts and the tables that use the output of that data...

@LukasWallrich
Copy link
Collaborator

Happy to review this - but for context, why are you proposing to keep both?

If we can do everything with unique_citations, then that would seem to make for an easier user interface?

If this is about not changing functions, then let's introduce new ones and deprecate old ones ... but not have two parallel sets?

@TNRiley
Copy link
Collaborator Author

TNRiley commented Aug 8, 2024

I was thinking we'd keep them both until everything is fully tested and then deprecate the old ones.

@LukasWallrich
Copy link
Collaborator

Ok, that makes sense - will review your functions & then let's think of what to call the new functions. Currently, our interface is not particularly consistent between verbs (dedup_citations()) and nouns (record_counts()) - so I would suggest shifting towards verbs where we can as we introduce new functions?

@LukasWallrich
Copy link
Collaborator

LukasWallrich commented Aug 8, 2024

Just looked at yours, and it looks good (though you might want to try to stick to one style a bit more - when using dplyr, it makes sense to use _join functions rather than merge).

[EDIT: I take back my earlier suggestion to separate unique_citations back into citations in a helper function - that does not work with missing data in any of the original cite_ columns. You seem to be making great progress, I will be happy to review the code and think about how to streamline some of it when you are through.]

@TNRiley
Copy link
Collaborator Author

TNRiley commented Aug 9, 2024

I've gone through each of the new functions and applied changes as needed to maintain a consistent style. I've also added comments and placeholder examples. I'm going to drop the new functions into a new script in the R directory. I also do not think that we need to keep the "citation_summary_table". That function was the first one I created and isn't as useful now. we can depricate it when we do the others.

@TNRiley
Copy link
Collaborator Author

TNRiley commented Aug 9, 2024

screwed up a little on the script and will need to try again. Had to add the .data prefix for the columns as I was receiving some issues on those and forgot that is what I had done previously. I need to figure out how to test these in relation to the CMD check before bringing them into github. @LukasWallrich is it as simple as adding the .R locally and then running the devetools::check() ?

@LukasWallrich
Copy link
Collaborator

LukasWallrich commented Aug 9, 2024 via email

@TNRiley
Copy link
Collaborator Author

TNRiley commented Aug 14, 2024

I believe the functions are good to go now. The only issue I was running into was with the @example data for the new create_detailed_records_table - no matter what I spent many hours trying to resolve it, but didn't fail after my push.

closing this now as this is complete, if there are any issues or changes with individual new functions we can add in a separate issue. Once these are fully integrated into the vignettes we can deprecate the old functions and change the version.

examples should be changed in the future in relation to issue #200

@TNRiley TNRiley closed this as completed Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or request Output Data output
Projects
None yet
Development

No branches or pull requests

2 participants