Manual Deduplication Error #192
Comments
Does not replicate on other vignette files that I tested.
I can replicate the issue - trying to figure it out.
This is an ASySD issue: in this case, ASySD loses some of the record_ids while merging the metadata. In the output below, the number of sources should equal the number of record_ids, yet there are 6 sources but only 4 record_ids, so the later step fails. Weirdly, it only happens when cite_string is included in extra_merge_fields. I will have a look at the ASySD code and aim to propose a fix.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ASySD)
extra_pair <- tibble::tribble(
~author1, ~author2, ~author, ~title1, ~title2, ~title, ~abstract1, ~abstract2, ~abstract, ~year1, ~year2, ~year, ~number1, ~number2, ~number, ~pages1, ~pages2, ~pages, ~volume1, ~volume2, ~volume, ~journal1, ~journal2, ~journal, ~isbn, ~isbn1, ~isbn2, ~doi1, ~doi2, ~doi, ~record_id1, ~record_id2, ~label1, ~label2, ~source1, ~source2, ~cite_string1, ~cite_string2, ~duplicate_id.x, ~duplicate_id.y, ~match, ~min_id, ~max_id,
"Ferguson, MC and Waite, JM and Curtice, C and Clarke, JT and Harrison, J", "Ferguson, MC and Curtice, C and Harrison, J", 0.85898, "Biologically Important Areas for Cetaceans Within US Waters - Aleutian Islands and Bering Sea Region", "Biologically Important Areas for Cetaceans Within US Waters - Gulf of Alaska Region", 0.92987, "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for bowhead, fin, gray, North Pacific right, and humpback whales and belugas in U.S. waters of the Aleutian Islands and Bering Sea. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; and whaling data, including catch and sighting locations and stomach contents. The geographic extent of the BIAs in this region ranged from approximately 1,200 to 373,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for all species; (2) detailed information on the migration routes and timing of all species; and (3) cetacean distribution, density, and behavior in U.S. Bering Sea waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.", "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for fin, gray, North Pacific right, and humpback whales, and belugas in U.S. waters of the Gulf of Alaska. BIAs are delineated for feeding, migratory corridors, and small and resident populations. 
Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; whaling data, including catch and sighting locations and stomach contents; prey studies; and anecdotal information from fishermen. The geographic extent of the BIAs in this region ranged from approximately 900 to 177,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for fin, gray, and North Pacific right whales; (2) detailed information on the migration routes of all species; (3) detailed information on the migratory timing of all species except humpback whales; and (4) cetacean distribution, density, and behavior in U.S. Gulf of Alaska waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.", 0.88146, "2015", "2015", 1, "1", "1", 1, "79", "65", 0, "41", "41", 1, "AQUATIC MAMMALS", "AQUATIC MAMMALS", 1, 1, "0167-5427", "0167-5427", "10.1578/AM.41.1.2015.79", "10.1578/AM.41.1.2015.65", 0.96522, "1036", "1030", "unknown", "unknown", "test_3", "test_3", "unknown", "unknown", "1036", "1030", FALSE, "1030", "1036"
)
auto_dedup <- tibble::tribble(
~duplicate_id, ~record_ids, ~cite_source, ~cite_label, ~cite_string, ~author, ~title, ~year, ~abstract, ~doi, ~volume, ~source, ~issue, ~issn, ~start_page, ~file.name, ~file.size, ~file.type, ~file.datapath, ~label, ~isbn, ~journal, ~pages, ~number, ~ID,
"1030", "1030, 1427, 2008", "test_3, test_2, test_1", "unknown, unknown, unknown", "unknown, unknown, unknown", "Ferguson, MC and Curtice, C and Harrison, J", "Biologically Important Areas for Cetaceans Within US Waters - Gulf of Alaska Region", "2015", "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for fin, gray, North Pacific right, and humpback whales, and belugas in U.S. waters of the Gulf of Alaska. BIAs are delineated for feeding, migratory corridors, and small and resident populations. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; whaling data, including catch and sighting locations and stomach contents; prey studies; and anecdotal information from fishermen. The geographic extent of the BIAs in this region ranged from approximately 900 to 177,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for fin, gray, and North Pacific right whales; (2) detailed information on the migration routes of all species; (3) detailed information on the migratory timing of all species except humpback whales; and (4) cetacean distribution, density, and behavior in U.S. Gulf of Alaska waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.", "10.1578/AM.41.1.2015.65", "41", "test_3, test_2, test_1", "1", "0167-5427", "65", "test_3.ris", "1343376", "", "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZgvaqg/0cf145f36dd8a17833ba383c/0.ris", "unknown, unknown, unknown", "0167-5427", "AQUATIC MAMMALS", "65", "1", "",
"1036", "1036, 1436, 2020", "test_3, test_2, test_1", "unknown, unknown, unknown", "unknown, unknown, unknown", "Ferguson, MC and Waite, JM and Curtice, C and Clarke, JT and Harrison, J", "Biologically Important Areas for Cetaceans Within US Waters - Aleutian Islands and Bering Sea Region", "2015", "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for bowhead, fin, gray, North Pacific right, and humpback whales and belugas in U.S. waters of the Aleutian Islands and Bering Sea. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; and whaling data, including catch and sighting locations and stomach contents. The geographic extent of the BIAs in this region ranged from approximately 1,200 to 373,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for all species; (2) detailed information on the migration routes and timing of all species; and (3) cetacean distribution, density, and behavior in U.S. Bering Sea waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.", "10.1578/AM.41.1.2015.79", "41", "test_3, test_2, test_1", "1", "0167-5427", "79", "test_3.ris", "1343376", "", "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZgvaqg/0cf145f36dd8a17833ba383c/0.ris", "unknown, unknown, unknown", "0167-5427", "AQUATIC MAMMALS", "79", "1", ""
)
ASySD::dedup_citations_add_manual(auto_dedup, additional_pairs = extra_pair, extra_merge_fields = "cite_string") %>%
glimpse()
#> Joining with `by = join_by(record_id)`
#> Rows: 1
#> Columns: 25
#> $ duplicate_id <chr> "1030"
#> $ record_ids <chr> "1030, 1427, 2008, 1036"
#> $ cite_source <chr> "test_3, test_2, test_1"
#> $ cite_label <chr> "unknown, unknown, unknown"
#> $ cite_string <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ author <chr> "Ferguson, MC and Curtice, C and Harrison, J"
#> $ title <chr> "Biologically Important Areas for Cetaceans Within US Wa…
#> $ year <chr> "2015"
#> $ abstract <chr> "We integrated existing published and unpublished inform…
#> $ doi <chr> "10.1578/AM.41.1.2015.65"
#> $ volume <chr> "41"
#> $ source <chr> "test_3, test_2, test_1, test_3, test_2, test_1"
#> $ issue <chr> "1"
#> $ issn <chr> "0167-5427"
#> $ start_page <chr> "65"
#> $ file.name <chr> "test_3.ris"
#> $ file.size <chr> "1343376"
#> $ file.type <chr> ""
#> $ file.datapath <chr> "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZ…
#> $ label <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ isbn <chr> "0167-5427"
#> $ journal <chr> "AQUATIC MAMMALS"
#> $ pages <chr> "65"
#> $ number <chr> "1"
#> $ ID <chr> ""
ASySD::dedup_citations_add_manual(auto_dedup, additional_pairs = extra_pair) %>%
glimpse()
#> Joining with `by = join_by(record_id)`
#> Rows: 1
#> Columns: 25
#> $ duplicate_id <chr> "1030"
#> $ record_ids <chr> "1030, 1427, 2008, 1036, 1436, 2020"
#> $ cite_source <chr> "test_3, test_2, test_1"
#> $ cite_label <chr> "unknown, unknown, unknown"
#> $ cite_string <chr> "unknown, unknown, unknown"
#> $ author <chr> "Ferguson, MC and Curtice, C and Harrison, J"
#> $ title <chr> "Biologically Important Areas for Cetaceans Within US Wa…
#> $ year <chr> "2015"
#> $ abstract <chr> "We integrated existing published and unpublished inform…
#> $ doi <chr> "10.1578/AM.41.1.2015.65"
#> $ volume <chr> "41"
#> $ source <chr> "test_3, test_2, test_1, test_3, test_2, test_1"
#> $ issue <chr> "1"
#> $ issn <chr> "0167-5427"
#> $ start_page <chr> "65"
#> $ file.name <chr> "test_3.ris"
#> $ file.size <chr> "1343376"
#> $ file.type <chr> ""
#> $ file.datapath <chr> "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZ…
#> $ label <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ isbn <chr> "0167-5427"
#> $ journal <chr> "AQUATIC MAMMALS"
#> $ pages <chr> "65"
#> $ number <chr> "1"
#> $ ID <chr> ""
Created on 2024-07-30 with reprex v2.0.2
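The mismatch described above (6 sources but only 4 record ids) can be detected with a small consistency check. The following is only a sketch in base R: `count_entries()` is a hypothetical helper, not part of ASySD or CiteSource, and it assumes the fields are always separated by `", "` as in the output above.

```r
# Hypothetical helper (not part of ASySD or CiteSource): count the entries
# in a comma-separated field, e.g. "1030, 1427" -> 2.
count_entries <- function(x, sep = ", ") {
  lengths(strsplit(x, sep, fixed = TRUE))
}

# Values copied from the first glimpse() output above.
merged <- data.frame(
  duplicate_id = "1030",
  record_ids = "1030, 1427, 2008, 1036",
  source = "test_3, test_2, test_1, test_3, test_2, test_1",
  stringsAsFactors = FALSE
)

n_ids <- count_entries(merged$record_ids)
n_sources <- count_entries(merged$source)

# Rows where the two counts differ are the ones expected to fail downstream.
mismatched <- merged[n_ids != n_sources, "duplicate_id"]
print(data.frame(duplicate_id = merged$duplicate_id, n_ids, n_sources))
```

Running a check like this on the merged data before plotting would flag the affected duplicate_id instead of failing later with a recycling error.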
Just to be clear: in ASySD, this is not a
This should be fixed in camaradesuk/ASySD#44, once Kaitlyn merges it.
@kaitlynhair just a ping, I know you'll want to review.
Thanks @LukasWallrich and @TNRiley
Excellent. Should CiteSource depend on the updated version of ASySD? Did you bump up the version number, @kaitlynhair? Also, would you mind adding me as a contributor in the ASySD DESCRIPTION? I didn't want to just do that in the PR and it isn't essential, but it would be appreciated :)
Looks like it is now 0.3.5 with the changes.
Yep, I bumped the version. @LukasWallrich of course!! Should have done this sooner, sorry.
Running into this error when manually deduplicating the duplicate record in our shinytest data. It appears when trying to view any plot/table:
Error: In row 84, can't recycle input of size 4 to size 6.
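As a rough illustration of why unequal field lengths produce this kind of error, here is a base R analogue. It assumes the failing step expands the comma-separated fields into one row per record, which is a guess about the mechanism rather than a confirmed description of the CiteSource code. The values are taken from the merged record shown in the comments, and base R words its error differently from the message above.

```r
# Split the two comma-separated fields of the merged record into vectors.
record_ids <- strsplit("1030, 1427, 2008, 1036", ", ", fixed = TRUE)[[1]]
sources <- strsplit(
  "test_3, test_2, test_1, test_3, test_2, test_1", ", ",
  fixed = TRUE
)[[1]]

# Pairing a vector of 4 ids with a vector of 6 sources cannot work: 6 is not
# a multiple of 4, so the shorter vector cannot be recycled to fit.
result <- tryCatch(
  data.frame(record_id = record_ids, source = sources),
  error = function(e) conditionMessage(e)
)
print(result)
```

The call fails because neither vector can be stretched to match the other, which is the same situation the "can't recycle input of size 4 to size 6" message reports.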