-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align argument name for citation tibble across functions #190
Comments
I put together a script to pull all the functions and arguments so that we can see them. We have a mix it seems between x, df, citations, unique_citations, data, unique_data, .... I think cleaning things up is a good idea, but I'm not sure what the best practices are here. Functions like record_counts and calculate_phase_counts take the output of the dedup_citations (which is normally given the element name "unique_citations") and compare it to the original citations data from read_citations (which is normally given the element name "citations". Obviously, the outputs could be named anything and it might be better to give these functions more flexibility, though I don't see these table functions and others being utilized outside of CiteSource. For me, Citations and unique_citations give me an understanding of the data being fed into a function. I like to see the named data inputs as it helps me navigate and keep a clear understanding of the inner workings, but I understand that it limits the function by requiring the data to be named accordingly which could be problematic if the functions are used outside of CiteSource.
|
This is super-helpful, thank you! Could you put that function into a gist? I need to do something similar for a different package, and this table is great. Given that we do keep on using the full citations beyond the dedup stage, I like your suggestion to use different names for the original and deduped version. Should we maybe go for To be clear: My comment on the use of functions was not about named vs unnamed arguments, but about changes to argument names. People might want to include reproducible CiteSource analyses in their analysis code, and if they decided to call |
@LukasWallrich Here is that gist <script src="https://gist.github.com/TNRiley/69feec62be17c7aa49266832e505b735.js"></script> I'm not 100% sure I understand the argument name part here. I think changes to names could be helpful, but I do have a ton of scripts for projects that would need to be changed. I agree that some of our function names are a bit odd. We should discuss this next time we tag up. I'm not sure if we're too far down the line to change function names at this point. I've also found that within the unique_citations the deuplicate_id and record_ids & duplicate_id are confusing. deuplicate_id is really a unique identifier and the record_ids is just a list of the duplicate_id where overlap occurred. I'm not experienced enough to know how much work would need to go into changing any of this at this point. The other option is just really good documentation? |
I'd say it's not yet too late to change argument names (there, we could also keep the old names as deprecated aliases that continue to work for a while, but give messages asking users to change to the new version) - it might be too late to change function names but we can always add aliases in special cases (for instance, duplicate_id and record_ids are I think I made a minor change in my suggestion to ASySD that means that record_ids are concatenations of all the original record_id fields, while they got sometimes mixed with (former) duplicate_ids during manual deduplication in the previous code - so they would never be derived from duplicate_ids per se, and should not be explained as such ... |
This makes sense to me now, thanks Lukas. I agree with you that it's not too late to change argument names. I've gone through the functions and created a list where the functions rely on the deduplicated data and what the arguments are named. Here are the various argument names we use in functions where the result of dedup_ciations are called. compare_sources (unique_data) export_bib (citations) this is definitely in need of change! |
Here are the various argument names we use in functions where the raw citation data is called (currently called citations) record_counts (citations) |
Another item related to this. Both the reimport_csv and reimport_ris explicitly name the imported data "citations". IF we go ahead with the renaming, I think that these should be changed to name the reimported data unique_citations. |
This is purely internal - we can rename it, but the user will never see that. In the example, we already assign the result to unique_citations. |
I'm pretty sure I've got everything renamed. they are now raw_citations and unique_citations. |
Currently, in some functions, the argument taking the citations tibble is called
unique_citations
(e.g.,compare_sources()
) while is is simply namedcitations
in others (e.g.,citation_summary_table()
). In order to have a clean user interface, we should probably align this.Given that the entire package works on deduplicated citations, I would personally suggest to call it
citations
(and document that the argument refers to deduplicated citations) - but happy to call itunique_citations
as well. Usually, it won't be named, so it does not make a huge difference - but should still be cleaned up before release as changes to argument names can break existing code if users name the arguments ...@TNRiley (and anyone else) - which argument name would you prefer? Also, do we have other recurring arguments across functions with different names? I currently don't think so, but that may also be worth checking.
The text was updated successfully, but these errors were encountered: