dedup_citations returns cite_source and cite_label twice #182

Open
LukasWallrich opened this issue Jul 5, 2024 · 4 comments
Labels: Internal (Internal functionality), Output (Data output)

Comments

@LukasWallrich
Collaborator

LukasWallrich commented Jul 5, 2024

@kaitlynhair Currently, dedup_citations() returns cite_source and cite_label a second time as source and label, which duplicates them in the output (and causes confusion when source and label originally contained something else).

Reproduce:

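library(CiteSource)
refs <- read_citations("tests/shinytest/test_1.ris")
deduped <- dedup_citations(refs)
names(deduped)  # cite_source and cite_label show up alongside source and label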
@TNRiley
Collaborator

TNRiley commented Jul 9, 2024

We talked about this in the last meeting. I'm not sure whether these extra columns (source, label) are used by functions down the line. If they are, we should probably change those functions to use cite_source and cite_label instead, to keep the data clean.

If for some reason ASySD relies on these columns, I think we can simply remove them once they are no longer needed for processing the data (as long as re-importing the data still works).
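To make the "remove them afterwards" idea concrete, here is a minimal base R sketch. It does not use CiteSource or ASySD code: the data frame is an invented stand-in for the output of dedup_citations(), and drop_mirrored_cols() is a hypothetical helper name. It only drops source and label when they exactly mirror cite_source and cite_label, so anything meaningful in those columns would be kept.

```r
# Toy stand-in for the data frame returned by dedup_citations();
# all values are invented for illustration only.
deduped <- data.frame(
  title       = c("Paper A", "Paper B"),
  cite_source = c("wos, scopus", "scopus"),
  cite_label  = c("search", "search"),
  source      = c("wos, scopus", "scopus"),
  label       = c("search", "search"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop `source`/`label` only when they exactly mirror
# the matching cite_* column, so meaningful original values are never discarded.
drop_mirrored_cols <- function(df) {
  pairs <- c(source = "cite_source", label = "cite_label")
  for (plain in names(pairs)) {
    cite <- pairs[[plain]]
    both_present <- all(c(plain, cite) %in% names(df))
    if (both_present && identical(df[[plain]], df[[cite]])) {
      df[[plain]] <- NULL
    }
  }
  df
}

cleaned <- drop_mirrored_cols(deduped)
names(cleaned)
```

Whether a step like this is safe for re-importing exported data would still need to be checked against the real import code.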

@TNRiley TNRiley added Output Data output Internal Internal functionality labels Jul 9, 2024
@TNRiley
Collaborator

TNRiley commented Aug 2, 2024

I looked at all the functions that are used down the line, and none of them use source or label. My guess is that source and label are required by ASySD. We should be able to remove these columns at some point in our data processing, after calling the dedup functions in ASySD. I'm not sure whether we can do this now or should wait until ASySD is on CRAN (or at least submitted).

@kaitlynhair thoughts on this?

@LukasWallrich
Collaborator Author

We currently turn cite_source into source and cite_label into label before calling ASySD, and then copy the results back afterwards (in our dedup_citations()). I think that dates from a time when ASySD did not support merging other fields. Instead, I believe we can call ASySD with extra_merge_field = our three cite_fields, and leave the label and source columns alone (in case they contain meaningful information)?
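A toy illustration of why leaving source and label alone matters, in base R only. The data and the shape of the "prepare" step are assumptions based on the description above, not the actual package code: copying cite_source into source discards whatever source held before.

```r
# Toy input with invented values: `source` holds something meaningful
# (here, a journal name), separate from CiteSource's own `cite_source`.
refs <- data.frame(
  title       = c("Paper A", "Paper B"),
  source      = c("Journal X", "Journal Y"),
  cite_source = c("wos", "scopus"),
  stringsAsFactors = FALSE
)

# Assumed shape of the current pre-processing step: copy cite_source into
# source so the dedup step can merge it. This is a guess, not the package code.
prepared <- refs
prepared$source <- prepared$cite_source

# Values that were in the original `source` but are gone after the copy.
lost <- setdiff(refs$source, prepared$source)
lost
```

If the cite_ columns can be passed to ASySD as extra fields to merge, this copy step (and the loss it causes) would no longer be needed.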

@TNRiley
Collaborator

TNRiley commented Aug 5, 2024

I need a better understanding of the extra_merge_field argument functionality in ASySD...

However, the source and label columns are not used in CiteSource after deduplication, so I don't think we need to keep them.

Since the information in source and label is duplicated from cite_source and cite_label, even if source and label are used down the line, we could easily point to cite_source and cite_label instead (again, I'm almost 100% sure that source and label are not being used after dedup anyway).
