Question

Using R, how can I detect whether a string includes a Unicode character?

If a string includes a Unicode character, I would like to detect it and replace it (because RMarkdown fails when creating PDF output that includes Unicode characters). For example, the variable inclusion below includes the Unicode character U+2265 ("\U2265"):

inclusion <- "Include patients ≥ 18 years of age"

To detect that specific code I can use

str_detect(inclusion, "\U2265")

and the result is TRUE.

Is there a way to detect ANY Unicode character in a string, not just one specific character?

Furthermore, I'd love to find a function that replaces any Unicode characters with non-Unicode characters that convey the same meaning. For example, str_replace(inclusion, "\U2265", ">=") creates the result: "Include patients >= 18 years of age"


Solution


You can match against [^\\x00-\\x7F] to detect non-ASCII characters, use stringi::stri_enc_isascii, or use utf8ToInt and test whether any code point is larger than 127.

inclusion <- c("Include patients ≥ 18 years of age", "a")

grepl("[^\\x00-\\x7F]", inclusion, perl=TRUE)
#[1]  TRUE FALSE

stringr::str_detect(inclusion, "[^\\x00-\\x7F]")
#[1]  TRUE FALSE

!stringi::stri_enc_isascii(inclusion)
#[1]  TRUE FALSE

lapply(inclusion, \(x) any(utf8ToInt(x) > 127))
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] FALSE

And you can replace them using e.g. gsub, stringr::str_replace_all, iconv (thanks to @MrFlick for the comment), stringi::stri_enc_toascii, or textclean::replace_non_ascii (thanks to @Quarto for the comment).

gsub("[^\\x00-\\x7F]", "?", inclusion, perl=TRUE)
#[1] "Include patients ? 18 years of age" "a"                                 

stringr::str_replace_all(inclusion, "[^\\x00-\\x7F]", "?") 
#[1] "Include patients ? 18 years of age" "a"                                 

iconv(inclusion, "UTF-8", "ASCII", sub="?")
#[1] "Include patients ??? 18 years of age"
#[2] "a"                                   

stringi::stri_enc_toascii(inclusion)
#[1] "Include patients \032 18 years of age"
#[2] "a"                                    

textclean::replace_non_ascii(inclusion, "?")
#[1] "Include patients ??? 18 years of age"
#[2] "a"                                   

To be more specific when exchanging the non-ASCII characters, you can create a lookup table.

s <- c("x ≥ y", "x = y", "x ≤ y")

i <- gregexpr("[^\\x00-\\x7F]", s, perl=TRUE)  # Match non-ASCII
m <- regmatches(s, i)                          # Extract them
unique(unlist(m))
#[1] "≥" "≤"

# Create lookup table
u <- read.table(text="
≥ >=
≤ <=
")

# Exchange them
regmatches(s, i) <- lapply(m, \(x) u[[2]][match(x, u[[1]])])
s
#[1] "x >= y" "x = y"  "x <= y"

(I'm aware that in the case of the question, solving the issue within RMarkdown is the real fix. The above is just for cases where someone needs to translate to ASCII.)
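For the RMarkdown side of the problem, a common fix is to render the PDF with a Unicode-aware LaTeX engine. A minimal sketch, not from the original answer (the file name "report.Rmd" is a placeholder):

# Render an Rmd to PDF with xelatex instead of pdflatex; "report.Rmd" is hypothetical.
# Equivalently, set the engine in the document's YAML header:
#   output:
#     pdf_document:
#       latex_engine: xelatex
rmarkdown::render(
  "report.Rmd",
  output_format = rmarkdown::pdf_document(latex_engine = "xelatex")
)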

2024-07-16
GKi

Solution


The {constructive} package might help: you'll see all characters unambiguously, because non-ASCII characters are printed by default in the "\U{xxxx}" form. This way you won't be confused by homoglyphs such as the different types of spaces.

library(constructive)
construct("Include patients ≥ 18 years of age")
#> "Include patients \U{2265} 18 years of age"

You can adjust this behavior to your taste using the unicode_representation argument.

x <- "¿Cómo está tu día hoy? 🌞 ≥ 🌦"
construct(x, unicode_representation = "ascii") # default
#> "\U{BF}C\U{F3}mo est\U{E1} tu d\U{ED}a hoy? \U{1F31E} \U{2265} \U{1F326}"

Note that the Latin characters above are formatted as "\U{xx}" (only 2 hex digits needed, covering "\U{80}" to "\U{FF}").

construct(x, unicode_representation = "latin")
#> "¿Cómo está tu día hoy? \U{1F31E} \U{2265} \U{1F326}"
construct(x, unicode_representation = "character")
#> "¿Cómo está tu día hoy? \U{1F31E} ≥ \U{1F326}"
construct(x, unicode_representation = "unicode")
#> "¿Cómo está tu día hoy? 🌞 ≥ 🌦"

Created on 2024-07-16 with reprex v2.1.0

2024-07-16
moodymudskipper