Manual Deduplication Error #192

Closed
TNRiley opened this issue Jul 30, 2024 · 10 comments

@TNRiley
Collaborator

TNRiley commented Jul 30, 2024

I'm running into this error after manually deduplicating the duplicate record in our shinytest data. It appears when trying to view any plot or table:

Error: In row 84, can't recycle input of size 4 to size 6.
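
For readers unfamiliar with this message, here is a minimal sketch of one way a "can't recycle" error can arise. It assumes the tidyr package is installed and that comma-separated columns are split in parallel with `tidyr::separate_rows()`. That is an assumption made for illustration: the issue does not say which call raises the error in the app. The values are copied from the reprex output later in this thread.

```r
# Assumption: tidyr is installed. The issue does not confirm that the app
# uses separate_rows(); this only illustrates the class of failure.
df <- data.frame(
  record_ids = "1030, 1427, 2008, 1036",                        # 4 entries
  source = "test_3, test_2, test_1, test_3, test_2, test_1"     # 6 entries
)

# Splitting both columns in parallel requires the same number of entries
# per row; tryCatch() captures the error message instead of stopping.
result <- tryCatch(
  tidyr::separate_rows(df, record_ids, source, sep = ", "),
  error = function(e) conditionMessage(e)
)

print(result)
```

If the two columns held the same number of entries, the call would instead return one row per entry.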

TNRiley added the Bug and Shiny Application labels on Jul 30, 2024
@TNRiley
Collaborator Author

TNRiley commented Jul 30, 2024

Does not replicate on other vignette files that I tested.

@LukasWallrich
Collaborator

I can replicate the issue - trying to figure it out

@LukasWallrich
Collaborator

This is an ASySD issue: in this case, ASySD loses some of the record_ids while merging the metadata. In the output below, the number of sources should equal the number of record_ids, yet there are 6 sources but only 4 record_ids, which then causes the failure. Weirdly, it only happens when cite_string is included. I will have a look at the ASySD code and aim to propose a fix.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ASySD)

extra_pair <- tibble::tribble(
  ~author1,                                                                    ~author2,                                       ~author,  ~title1,                                                                                                 ~title2,                                                                                ~title,   ~abstract1,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ~abstract2,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ~abstract, ~year1,  ~year2,  ~year, ~number1, ~number2, ~number, ~pages1, ~pages2, ~pages, ~volume1, ~volume2, ~volume, ~journal1,          ~journal2,          ~journal, ~isbn, ~isbn1,       ~isbn2,       ~doi1,                      ~doi2,                      ~doi,     ~record_id1, ~record_id2, ~label1,    ~label2,    ~source1,  ~source2,  ~cite_string1, ~cite_string2, ~duplicate_id.x, ~duplicate_id.y, ~match, ~min_id, ~max_id, 
  "Ferguson, MC and Waite, JM and Curtice, C and Clarke, JT and Harrison, J",  "Ferguson, MC and Curtice, C and Harrison, J",  0.85898,  "Biologically Important Areas for Cetaceans Within US Waters - Aleutian Islands and Bering Sea Region",  "Biologically Important Areas for Cetaceans Within US Waters - Gulf of Alaska Region",  0.92987,  "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for bowhead, fin, gray, North Pacific right, and humpback whales and belugas in U.S. waters of the Aleutian Islands and Bering Sea. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; and whaling data, including catch and sighting locations and stomach contents. The geographic extent of the BIAs in this region ranged from approximately 1,200 to 373,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for all species; (2) detailed information on the migration routes and timing of all species; and (3) cetacean distribution, density, and behavior in U.S. Bering Sea waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.",  "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for fin, gray, North Pacific right, and humpback whales, and belugas in U.S. waters of the Gulf of Alaska. BIAs are delineated for feeding, migratory corridors, and small and resident populations. 
Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; whaling data, including catch and sighting locations and stomach contents; prey studies; and anecdotal information from fishermen. The geographic extent of the BIAs in this region ranged from approximately 900 to 177,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for fin, gray, and North Pacific right whales; (2) detailed information on the migration routes of all species; (3) detailed information on the migratory timing of all species except humpback whales; and (4) cetacean distribution, density, and behavior in U.S. Gulf of Alaska waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.",  0.88146,   "2015",  "2015",  1,     "1",      "1",      1,       "79",    "65",    0,      "41",     "41",     1,       "AQUATIC MAMMALS",  "AQUATIC MAMMALS",  1,        1,     "0167-5427",  "0167-5427",  "10.1578/AM.41.1.2015.79",  "10.1578/AM.41.1.2015.65",  0.96522,  "1036",      "1030",      "unknown",  "unknown",  "test_3",  "test_3",  "unknown",     "unknown",     "1036",          "1030",          FALSE,  "1030",  "1036"
)

auto_dedup <- tibble::tribble(
  ~duplicate_id, ~record_ids,         ~cite_source,              ~cite_label,                  ~cite_string,                 ~author,                                                                     ~title,                                                                                                  ~year,   ~abstract,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ~doi,                       ~volume, ~source,                   ~issue, ~issn,        ~start_page, ~file.name,    ~file.size, ~file.type, ~file.datapath,                                                                                 ~label,                       ~isbn,        ~journal,           ~pages, ~number, ~ID, 
  "1030",        "1030, 1427, 2008",  "test_3, test_2, test_1",  "unknown, unknown, unknown",  "unknown, unknown, unknown",  "Ferguson, MC and Curtice, C and Harrison, J",                               "Biologically Important Areas for Cetaceans Within US Waters - Gulf of Alaska Region",                   "2015",  "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for fin, gray, North Pacific right, and humpback whales, and belugas in U.S. waters of the Gulf of Alaska. BIAs are delineated for feeding, migratory corridors, and small and resident populations. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; whaling data, including catch and sighting locations and stomach contents; prey studies; and anecdotal information from fishermen. The geographic extent of the BIAs in this region ranged from approximately 900 to 177,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for fin, gray, and North Pacific right whales; (2) detailed information on the migration routes of all species; (3) detailed information on the migratory timing of all species except humpback whales; and (4) cetacean distribution, density, and behavior in U.S. Gulf of Alaska waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.",  "10.1578/AM.41.1.2015.65",  "41",    "test_3, test_2, test_1",  "1",    "0167-5427",  "65",        "test_3.ris",  "1343376",  "",         "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZgvaqg/0cf145f36dd8a17833ba383c/0.ris",  "unknown, unknown, unknown",  "0167-5427",  "AQUATIC MAMMALS",  "65",   "1",     "", 
  "1036",        "1036, 1436, 2020",  "test_3, test_2, test_1",  "unknown, unknown, unknown",  "unknown, unknown, unknown",  "Ferguson, MC and Waite, JM and Curtice, C and Clarke, JT and Harrison, J",  "Biologically Important Areas for Cetaceans Within US Waters - Aleutian Islands and Bering Sea Region",  "2015",  "We integrated existing published and unpublished information to delineate Biologically Important Areas (BIAs) for bowhead, fin, gray, North Pacific right, and humpback whales and belugas in U.S. waters of the Aleutian Islands and Bering Sea. Supporting evidence for these BIAs came from aerial-, land-, and vessel-based surveys; satellite-tagging data; passive acoustic monitoring; traditional ecological knowledge; photo-and genetic-identification data; and whaling data, including catch and sighting locations and stomach contents. The geographic extent of the BIAs in this region ranged from approximately 1,200 to 373,000 km(2). Information gaps identified during this assessment include (1) reproductive areas for all species; (2) detailed information on the migration routes and timing of all species; and (3) cetacean distribution, density, and behavior in U.S. Bering Sea waters off the continental shelf. To maintain their utility, these BIAs should be re-evaluated and revised, if necessary, as new information becomes available.",                                                                                                                                                                                                                                    "10.1578/AM.41.1.2015.79",  "41",    "test_3, test_2, test_1",  "1",    "0167-5427",  "79",        "test_3.ris",  "1343376",  "",         "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZgvaqg/0cf145f36dd8a17833ba383c/0.ris",  "unknown, unknown, unknown",  "0167-5427",  "AQUATIC MAMMALS",  "79",   "1",     ""
)

ASySD::dedup_citations_add_manual(auto_dedup, additional_pairs = extra_pair, extra_merge_fields = "cite_string") %>% 
  glimpse()
#> Joining with `by = join_by(record_id)`
#> Rows: 1
#> Columns: 25
#> $ duplicate_id  <chr> "1030"
#> $ record_ids    <chr> "1030, 1427, 2008, 1036"
#> $ cite_source   <chr> "test_3, test_2, test_1"
#> $ cite_label    <chr> "unknown, unknown, unknown"
#> $ cite_string   <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ author        <chr> "Ferguson, MC and Curtice, C and Harrison, J"
#> $ title         <chr> "Biologically Important Areas for Cetaceans Within US Wa…
#> $ year          <chr> "2015"
#> $ abstract      <chr> "We integrated existing published and unpublished inform…
#> $ doi           <chr> "10.1578/AM.41.1.2015.65"
#> $ volume        <chr> "41"
#> $ source        <chr> "test_3, test_2, test_1, test_3, test_2, test_1"
#> $ issue         <chr> "1"
#> $ issn          <chr> "0167-5427"
#> $ start_page    <chr> "65"
#> $ file.name     <chr> "test_3.ris"
#> $ file.size     <chr> "1343376"
#> $ file.type     <chr> ""
#> $ file.datapath <chr> "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZ…
#> $ label         <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ isbn          <chr> "0167-5427"
#> $ journal       <chr> "AQUATIC MAMMALS"
#> $ pages         <chr> "65"
#> $ number        <chr> "1"
#> $ ID            <chr> ""

ASySD::dedup_citations_add_manual(auto_dedup, additional_pairs = extra_pair) %>% 
  glimpse()
#> Joining with `by = join_by(record_id)`
#> Rows: 1
#> Columns: 25
#> $ duplicate_id  <chr> "1030"
#> $ record_ids    <chr> "1030, 1427, 2008, 1036, 1436, 2020"
#> $ cite_source   <chr> "test_3, test_2, test_1"
#> $ cite_label    <chr> "unknown, unknown, unknown"
#> $ cite_string   <chr> "unknown, unknown, unknown"
#> $ author        <chr> "Ferguson, MC and Curtice, C and Harrison, J"
#> $ title         <chr> "Biologically Important Areas for Cetaceans Within US Wa…
#> $ year          <chr> "2015"
#> $ abstract      <chr> "We integrated existing published and unpublished inform…
#> $ doi           <chr> "10.1578/AM.41.1.2015.65"
#> $ volume        <chr> "41"
#> $ source        <chr> "test_3, test_2, test_1, test_3, test_2, test_1"
#> $ issue         <chr> "1"
#> $ issn          <chr> "0167-5427"
#> $ start_page    <chr> "65"
#> $ file.name     <chr> "test_3.ris"
#> $ file.size     <chr> "1343376"
#> $ file.type     <chr> ""
#> $ file.datapath <chr> "/var/folders/xk/g0cqx1hs53z_txqsyq74jzcc0000gn/T//RtmpZ…
#> $ label         <chr> "unknown, unknown, unknown, unknown, unknown, unknown"
#> $ isbn          <chr> "0167-5427"
#> $ journal       <chr> "AQUATIC MAMMALS"
#> $ pages         <chr> "65"
#> $ number        <chr> "1"
#> $ ID            <chr> ""

Created on 2024-07-30 with reprex v2.0.2
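
The mismatch in the first output above can also be checked mechanically. The following is a minimal base R sketch, assuming fields are joined with ", " as in the `glimpse()` output. `count_entries` is a hypothetical helper written for this illustration and is not part of ASySD or CiteSource.

```r
# Hypothetical helper: count the ", "-separated entries in each element
count_entries <- function(x) lengths(strsplit(x, ", ", fixed = TRUE))

# Values copied from the first glimpse() output above
merged <- data.frame(
  record_ids = "1030, 1427, 2008, 1036",
  source = "test_3, test_2, test_1, test_3, test_2, test_1",
  stringsAsFactors = FALSE
)

n_ids <- count_entries(merged$record_ids)
n_sources <- count_entries(merged$source)

# Rows where the counts disagree cannot be expanded to one row per record
mismatched <- which(n_ids != n_sources)
print(data.frame(row = mismatched, n_ids = n_ids[mismatched],
                 n_sources = n_sources[mismatched]))
```

Running a check like this on the merged data before any per-record expansion would point at the offending row directly, rather than surfacing later as a recycling error.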

@LukasWallrich
Collaborator

Just to be clear: in ASySD, this is not a cite_string issue, but an issue with merging any extra field. Using extra_merge_fields = "doi" fails likewise, so this needs to be resolved there even if we drop cite_string.

@LukasWallrich
Collaborator

This should be fixed in camaradesuk/ASySD#44, once Kaitlyn merges it.

TNRiley added the External factors label and removed the Bug label on Jul 31, 2024
@TNRiley
Collaborator Author

TNRiley commented Aug 14, 2024

@kaitlynhair just a ping, I know you'll want to review.

@kaitlynhair
Collaborator

Thanks @LukasWallrich and @TNRiley
Now merged!!

@LukasWallrich
Collaborator

Excellent. Should CiteSource depend on the updated version of ASySD? Did you bump the version number, @kaitlynhair?

Also, would you mind adding me as a contributor in the ASySD DESCRIPTION? I didn't want to just do that in the PR, and it isn't essential, but it would be appreciated :)

@TNRiley
Collaborator Author

TNRiley commented Aug 23, 2024

Looks like it is now 0.3.5 with the changes.
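
On the dependency question, one option is a runtime guard on the installed ASySD version. This is a minimal sketch only: `has_min_version` is a hypothetical helper, and 0.3.5 is taken from the comment above as the version that includes the fix.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or later
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Warn the user rather than fail later with a recycling error
if (!has_min_version("ASySD", "0.3.5")) {
  message("ASySD >= 0.3.5 is needed for the manual deduplication fix.")
}
```

The declarative equivalent would be a versioned entry such as `ASySD (>= 0.3.5)` in whichever dependency field of the DESCRIPTION file already lists ASySD.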

@kaitlynhair
Collaborator

Yep I bumped the version.

@LukasWallrich of course!! Should have done this sooner, sorry.
Now added.
