gsub error 'unable to translate...to a wide string' #42

Open
wilkox opened this issue Oct 10, 2023 · 3 comments


wilkox commented Oct 10, 2023

Running read_bibliography() on a UTF-8 encoded file produces an error (see example file Cochrane.txt):

library(revtools)

system2("file", c("Cochrane.txt", "-I"), stdout = TRUE)
#> [1] "Cochrane.txt: text/plain; charset=utf-8"
read_bibliography("Cochrane.txt")
#> Warning in gsub("<[[:alnum:]]{2}>", "", z): unable to translate 'AB - Three
#> hundred healthy adults, permanently residing and contacting (a contact subject)
#> with a household patient with confirmed COVID^aEUR<90>19 (primary patient), or
#> who stayed in close long protected contact with a person who consequently
#> become...' to a wide string
#> Error in gsub("<[[:alnum:]]{2}>", "", z): input string 14 is invalid

Created on 2023-10-10 with reprex v2.0.2

This seems to arise from the gsub("<[[:alnum:]]{2}>", "", z) call shown in the error above. I think it's because the encoding for z is set to 'latin1' even though the file is UTF-8, and since R 4.3.0 'Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)'.
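One way to check this diagnosis is to look for strings that are declared as latin1 but whose bytes are actually valid UTF-8. The following is a minimal sketch in base R only; the vector z here is a made-up stand-in for the lines read inside read_bibliography(), and the wrong declaration is simulated by hand rather than taken from the revtools source.

```r
# Stand-in for the lines read from the file (assumption: not the real data)
z <- c("ascii only", "COVID\u201019")

# Simulate the suspected problem: UTF-8 bytes declared as latin1
Encoding(z) <- "latin1"

# Pure-ASCII strings never carry a latin1 mark, and genuine latin1 text with
# high bytes is almost never valid UTF-8, so this combination is suspicious
mislabelled <- which(Encoding(z) == "latin1" & validUTF8(z))

# Re-declare (not convert) the affected strings as UTF-8
Encoding(z[mislabelled]) <- "UTF-8"

# The regular expression from the error message can now be applied
cleaned <- gsub("<[[:alnum:]]{2}>", "", z)
```

Note that Encoding<- only changes the declared encoding and leaves the bytes untouched, which is what is wanted if the file really is UTF-8.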

A workaround is to convert the file into latin1 encoding first:

library(revtools)

utf8tolatin1 <- function(infile, outfile) {
  content <- readLines(infile, encoding = "UTF-8")
  latin1 <- iconv(content, from = "UTF-8", to = "latin1")
  writeLines(latin1, outfile)
}

utf8tolatin1("Cochrane.txt", "Cochrane-latin1.txt")

system2("file", c("Cochrane-latin1.txt", "-I"), stdout = TRUE)
#> [1] "Cochrane-latin1.txt: text/plain; charset=us-ascii"
read_bibliography("Cochrane-latin1.txt")
#>                   label type   accession       author
#> 1 NCT04907877_2021_http JOUR CN-02278011 NCT04907877,
#>                                                                title
#> 1 Bifido- and Lactobacilli in Symptomatic Adult COVID-19 Outpatients
#>                                       journal year                     keywords
#> 1 https://clinicaltrials.gov/show/NCT04907877 2021 Respiratory Tract Infections
#>   abstract
#> 1 Three hundred healthy adults, permanently residing and contacting (a contact subject) with a household patient with confirmed COVID-19 (primary patient), or who stayed in close long protected contact with a person who consequently become SARS-CoV-2 positive, will be screened for the study. When the contact subject meets enrollment criteria, he/she will be randomized to take an investigational product (probiotic, test dietary supplement, TDS), a mixture of lactobacilli and bifidobacteria or placebo 1 time a day before breakfast. During screening period, he/she will also keep Screening and Compliance Diary for screening of COVID-19 symptoms and confirming TDS intake. Duration of the screening period (Days 0-X) will depend on the health status of a contact person. If the contact remains asymptomatic, duration of probiotic intake will be 30 days. After this period, subject will be excluded from the study . If the contact develops symptoms, he/she will call family physician, request a referral, and visit a local center to make PCR test of the nasal swab for SARS-CoV-2. While result of PCR test are being available (Days 0-2), the patient will continue taking TDS and start keeping Respiratory Illness Diary. If the result of the PCR test is negative the patient will be withdrawn from the study. If the result is positive, he/she will continue participation and be visited by the nurse (Nurse Visit 1, Days 3-5), who supplies the patient with TDS in amount enough to complete 28-day intake period and takes blood for anti-SARS-CoV-2 IgG. 
During 28-day period of TDS intake, the patient will keep Respiratory Illness Diary (the Diary is designed for evaluation of the COVID-19 course and assessment of TDS reduces clinical manifestation of COVID-19), the investigator/family physician updated with health status, and the physician will make weekly phone calls to assess patient health status, indications for hospitalization, treatment, checking TDS intake and Respiratory Illness Diary. In the case of patient hospitalization, patient is withdrawn from the study, and will be requested to provide a reference from Medical Records after hospital discharge. During Nurse Visit 2 (Days 28-35), after finishing TDS intake, the nurse will collect Respiratory Illness Diary, empty vials with TDS, take blood for anti-SARS-CoV-2 IgG test. The test is necessary to evaluate if TDS intake improves post-COVID-19 immunity on short-term perspective. At the end of the 2nd visit, the nurse will give enveloped Post-COVID-19 Questionnaire to be completed in 3 months. In 3 months, investigator/family physician will call to the patient and remind to return a completed Post-COVID-19 Questionnaire. Post-COVID-19 Questionnaire will help to see if active TDS reduces presentation of Post-COVID-19 syndrome. In 6 months, the study nurse will perform Nurse Visit 3 and draw blood for the anti-SARS-CoV-2 IgG. The test is necessary to evaluate if TDS intake improves post-COVID-19 immunity on long term perspective.
#>                                                                            url
#> 1 https://www.cochranelibrary.com/central/doi/10.1002/central/CN-02278011/full
#>                  c3                    m3
#> 1 CTgov NCT04907877 Trial registry record

Created on 2023-10-10 with reprex v2.0.2
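A caveat with this workaround: latin1 cannot represent every UTF-8 character, and on many platforms iconv() returns NA for a whole line when it meets one, which writeLines() then writes out as the text "NA". Below is a hedged sketch of a variant that passes a substitution string to iconv(). The function name utf8tolatin1_sub and the default "?" are illustrative choices, not part of revtools, and the exact replacement behaviour depends on the platform's iconv implementation (some transliterate instead of substituting).

```r
# Hypothetical variant of utf8tolatin1(); name and default `sub` are assumptions
utf8tolatin1_sub <- function(infile, outfile, sub = "?") {
  content <- readLines(infile, encoding = "UTF-8", warn = FALSE)
  # With a non-NA `sub`, non-convertible bytes are replaced by `sub`
  # instead of turning the whole line into NA
  latin1 <- iconv(content, from = "UTF-8", to = "latin1", sub = sub)
  # useBytes = TRUE writes the converted bytes without re-encoding them
  writeLines(latin1, outfile, useBytes = TRUE)
  invisible(latin1)
}

# Usage with a throwaway file containing U+2010 (a hyphen outside latin1)
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines("AB - confirmed COVID\u201019", infile, useBytes = TRUE)
converted <- utf8tolatin1_sub(infile, outfile)
```

Because the substitution is applied per byte, a single multi-byte character may be replaced by several copies of the substitution string.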


sy-olesya commented Jan 16, 2024

David! Thanks a lot for your help!
Unfortunately, this doesn't work for files from PubMed (.nbib). Could you help with those as well?
vk.txt


wilkox commented Jan 31, 2024

@sy-olesya Adding useBytes = TRUE to writeLines() seems to fix this particular problem. However, there is then another, apparently unrelated error (I had to truncate the input file as it couldn't fit the whole thing in memory):

library(revtools)

system2("file", c("~/tmp/vk.txt", "-I"), stdout = TRUE)
#> [1] "/Users/wilkox/tmp/vk.txt: text/plain; charset=utf-8"

utf8tolatin1 <- function(infile, outfile) {
  content <- readLines(infile, encoding = "UTF-8")
  latin1 <- iconv(content, from = "UTF-8", to = "latin1")
  writeLines(latin1, outfile, useBytes = TRUE)
}

utf8tolatin1("~/tmp/vk.txt", "~/tmp/vk-latin1.txt")

system2("file", c("~/tmp/vk-latin1.txt", "-I"), stdout = TRUE)
#> [1] "/Users/wilkox/tmp/vk-latin1.txt: text/plain; charset=iso-8859-1"
bib <- read_bibliography("~/tmp/vk-latin1.txt")
#> Error in names(x_final) <- unlist(lapply(x_final, function(a) {: 'names' attribute [254] must be the same length as the vector [43]

Created on 2024-01-31 with reprex v2.1.0

I had a poke around and I think it's not parsing the nbib file correctly. You might want to open a separate issue about this if you are still having trouble.


vivekrmk commented Jul 9, 2024

I was getting the error: Error in gsub("[[:space:]]+", " ", x) : input string 12 is invalid
Setting the encoding argument of readLines() to "latin1" fixed the issue for me; I did not receive the error again.
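For reference, a small sketch of what that argument does, assuming the input file really is Latin-1 encoded. The file here is a throwaway created for the example. The encoding argument of readLines() does not convert anything; it only declares how the bytes of non-ASCII lines should be interpreted.

```r
# Write the Latin-1 bytes for "café" plus a newline to a temporary file
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x0A)), path)

# Declare the lines as latin1 while reading; the bytes are left unchanged
x <- readLines(path, encoding = "latin1")
declared <- Encoding(x)

# With a correct declaration the regular expression functions accept the input
squeezed <- gsub("[[:space:]]+", " ", x)

# Convert to UTF-8 explicitly if later code expects it
x_utf8 <- enc2utf8(x)
```

If the file is actually UTF-8, declaring it as latin1 would recreate the mismatch described at the top of this issue, so it may be worth checking the file's encoding first.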
