Question

R: Efficient way to str_replace_all without recursively replacing conflicting substitutions?

Hello,

The problem
First, let me try to illustrate the problem. Assume I want to apply the following cipher to encode the string, "abc".

library(tidyverse)

cipher <- tibble(
  byte = c(128:153, 160:185, 246:255) %>% as.hexmode() %>% str_to_upper(),
  char = c(LETTERS, letters, 0:9)
)

"abc" %>% str_replace_all(set_names(cipher$byte, cipher$char))
# [1] "AFFCAFFDAFFE"

The result I'd like is "A0A1A2", not "AFFCAFFDAFFE". It looks like 0 in the first substitution A0 is replaced with its own substitution, which is FF, and so on. This is what I mean by recursive replacement of conflicting substitutions.

Related info
I've read this post. I've also read this issue. I've also looked into the vectorize_all argument of the stri_replace_all* function.

Working (but inefficient) solution
The only way I've managed successfully to make multiple string substitutions with replacement values that would otherwise be conflicting, is to split each string character for character, then make the substitutions, and finally paste it all back together. Like so:

library(tidyverse)

c("abc", "123") %>%
  map_chr(\(string) {
    str_split_1(string, "") %>%
      map_chr(\(char) {
        str_replace_all(char, set_names(cipher$byte, paste0("^", cipher$char, "$")))
      }) %>% paste(collapse = "")
  })
# [1] "A0A1A2" "F7F8F9"

Unfortunately, this way of encoding the strings takes a long time (on my 2020 Intel Macbook Pro at least) for large vectors. I prefer working within the tidyverse, but at this stage I'd consider other methods too.

 3  90  3
1 Jan 1970

Solution

 2

As you are implementing a cipher, I would avoid regex in favour of the base function chartr():

chartr translates each character in x that is specified in old to the corresponding character specified in new. Ranges are supported in the specifications, but character classes and repeated characters are not.

The slight wrinkle in this plan is that chartr() only does one-for-one character substitution. You want to replace "a" with "A0" and so on. However, all your two character substitutions can be expressed in hex as one Unicode character.

We can write a function to convert each of your one character substitutions to a single Unicode character so we can use it in chartr(), and then back again. For example, "A0" becomes intToUtf8("0xA0"), which is space, which is then fed to chartr() as a one character replacement for "a" and then at the end is converted back to "A0":

do_cipher <- function(x, old = cipher$char, new = cipher$byte) {
    chartr(
        paste0(old, collapse = ""),
        intToUtf8(paste0("0x", new)),
        x
    ) |>
        vapply(
            \(x)
            x |>
                utf8ToInt() |>
                as.hexmode() |>
                paste0(collapse = "") |>
                toupper(),
            character(1)
        ) |>
        setNames(x)
}

This returns the desired output:

c("abc", "123") |> do_cipher()
#      abc      123
# "A0A1A2" "F7F8F9"
2024-07-21
SamR

Solution

 1

Embarrassingly, I discovered a pretty elegant solution to this problem shortly after posting it. I wonder why this didn't pop up in my initial googling...

The solution is simply:

library(tidyverse)

cipher <- tibble(
  byte = c(128:153, 160:185, 246:255) %>% as.hexmode() %>% str_to_upper(),
  char = c(LETTERS, letters, 0:9)
)

c("abc", "123") %>% str_replace_all(".", ~ set_names(cipher$byte, cipher$char)[.x])
# [1] "A0A1A2" "F7F8F9"

This single line of code is over 50 times faster than the original way I was achieving this. SamR's base R solution is even faster, by a factor of over 100.

Aside, here is a potentially useful extension of the tidyverse solution that also incorporates multi-character string substitutions.

library(tidyverse)

cipher <- tibble(
  byte = c(20:22, 128:153, 160:185, 246:255) %>% as.hexmode() %>% str_to_upper(),
  char = c("<multi>", "<char>", "<strings>", LETTERS, letters, 0:9)
)

c("<multi>", "<strings>") %>% str_replace_all("(<.+?>)|.", ~ set_names(cipher$byte, cipher$char)[.x])
# [1] "14" "16"
2024-07-21
rinkjames