Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Throw exception when non-utf8 characters are in a data.frame #16

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Sep 22, 2023

The duckdb engine assumes all strings are valid utf-8. In the R-client, we forgot to check if strings were in fact utf8. Here we check them when the scanning the df when we register it.

Closes #12.

@codecov-commenter
Copy link

codecov-commenter commented Sep 22, 2023

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@krlmlr krlmlr force-pushed the unicode_strings_fix branch from 253e2d7 to 0a48fd4 Compare November 8, 2023 12:58
@krlmlr
Copy link
Collaborator

krlmlr commented Nov 8, 2023

Thanks. Are there any other locations where we need to care about this? Thinking mainly about parameter binding via dbBind() or dbSendQuery(params = ...) .

@krlmlr krlmlr force-pushed the unicode_strings_fix branch from 0a48fd4 to 2286bda Compare March 11, 2024 04:36
@krlmlr
Copy link
Collaborator

krlmlr commented Mar 11, 2024

Coming back to this.

What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check?

Example for constructing a broken string:

bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk"

Created on 2024-03-11 with reprex v2.1.0

@krlmlr krlmlr changed the title throw exception when non-utf8 characters are in a data.frame feat: Throw exception when non-utf8 characters are in a data.frame Mar 11, 2024
@krlmlr
Copy link
Collaborator

krlmlr commented Mar 21, 2024

There's a faster way:

x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8"

Created on 2024-03-21 with reprex v2.1.0

This shows different results when ran line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown".

@krlmlr krlmlr added this to the 0.10.1 milestone Mar 21, 2024
@Tmonster
Copy link
Contributor Author

I agree, I think the fast check first is better. Otherwise we have to copy the whole string. Will work on this now

@Tmonster
Copy link
Contributor Author

Tmonster commented Mar 26, 2024

I actually wonder now if the issue is in the function enc2utf8. In register.R we encode the values of the data frame in the function encode_values(). enc2utf8 is applied to all character columns in the data frame that is passed. Playing around with the enc2utf8 function, there seem to be some inconsistencies when the encoding of the passed string is unknown.

> enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello"

Now I just check that Encoding(x) is valid for every value of a varcher column. This seems like a lot of overhead though, so open to other ideas.

EDIT: changes for readability

krlmlr added a commit that referenced this pull request Nov 24, 2024
krlmlr added a commit that referenced this pull request Nov 24, 2024
krlmlr added a commit that referenced this pull request Dec 8, 2024
krlmlr added a commit that referenced this pull request Dec 8, 2024
@krlmlr krlmlr removed this from the 0.10.1 milestone Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
3 participants