Question

How do I combine text names within an ordered transcription of dialogue?

Say I have this data:

df <- data.frame(x = c("Tom: I like cheese.", 
                       "Tom: Cheese is good.", 
                       "Tom: Muenster is my favorite.", 
                       "Bob: No, I like Cheddar.", 
                       "Tom: You're wrong. I think cheddar is only good on burgers.", 
                       "Gina: But what about American on burgers?", 
                       "Gina: That's better.", 
                       "Bob: Yeah, I agree with Gina.", 
                       "Bob: American is better on burgers. Cheddar is for grating on nachos."))

I want to turn it into this data:

df <- data.frame(x = c("Tom: I like cheese. Cheese is good. Muenster is my favorite.", 
                       "Bob: No, I like Cheddar.", 
                       "Tom: You're wrong. I think cheddar is only good on burgers.", 
                       "Gina: But what about American on burgers? That's better.", 
                       "Bob: Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos."))

Basically, I want to drop the name and colon from any line whose speaker is the same as on the previous line, and append the remaining text to that previous line.

I am struggling to do this in a way that doesn't group all of the "Tom:" and "Gina:" lines together and keep only the first instance of each name. When a speaker comes back after someone else has spoken, that should start a new group.



Solution

We can use tidyr to split each line into the speaker and what they say, then use dplyr to combine consecutive runs of the same speaker. For example:

library(dplyr)  # consecutive_id() and .by need dplyr >= 1.1.0

df |> 
  tidyr::separate_wider_delim(x, ": ", names=c("speaker", "words")) |>
  mutate(instance = consecutive_id(speaker)) |>
  summarize(speaker = first(speaker), text=paste(words, collapse=" "), .by=instance)

returns

  instance speaker text                                                                                 
     <int> <chr>   <chr>                                                                                
1        1 Tom     I like cheese. Cheese is good. Muenster is my favorite.                              
2        2 Bob     No, I like Cheddar.                                                                  
3        3 Tom     You're wrong. I think cheddar is only good on burgers.                               
4        4 Gina    But what about American on burgers? That's better.                                   
5        5 Bob     Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on na…
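
To see why the two separate Tom runs are not merged, it may help to look at the run ids on their own. This is only a small sketch: the `speaker` vector is typed in by hand from the question's data, and the `rle()` lines are an assumed base R equivalent of `consecutive_id()` for a plain character vector without missing values.

```r
library(dplyr)

# Speakers in transcript order, copied by hand from the question's data
speaker <- c("Tom", "Tom", "Tom", "Bob", "Tom", "Gina", "Gina", "Bob", "Bob")

# consecutive_id() moves to the next id each time the value differs
# from the previous element, so a returning speaker gets a new id
run_id <- consecutive_id(speaker)
run_id
#> [1] 1 1 1 2 3 4 4 5 5

# Base R sketch of the same idea using run-length encoding
r <- rle(speaker)
run_id_base <- rep(seq_along(r$values), r$lengths)
all(run_id == run_id_base)
```

Grouping by `run_id` rather than by `speaker` is what keeps Tom's first three lines apart from his later reply.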

2024-07-19

MrFlick

Solution

library(stringr)
library(dplyr)

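df |>
  # number each run of consecutive lines from the same speaker
  group_by(line = consecutive_id(str_extract(x, "^\\w+"))) |>
  # join each run into one string, then drop every repeated " Name:" prefix
  reframe(x = str_remove_all(str_flatten(x, " "), "\\s\\w+:"))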

Output

  line x                                                                                             
 <int> <chr>                                                                                         
     1 Tom: I like cheese. Cheese is good. Muenster is my favorite.                                  
     2 Bob: No, I like Cheddar.                                                                      
     3 Tom: You're wrong. I think cheddar is only good on burgers.                                   
     4 Gina: But what about American on burgers? That's better.                                      
     5 Bob: Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos.
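
One caveat: the pattern "\\s\\w+:" removes any " word:" in the joined string, so it would also strip a word followed by a colon inside the spoken text. If that can happen in your data, a possible variation is to remove only the leading name from each line before joining. The sketch below assumes a made-up two-line transcript (`df2`) and hypothetical helper columns `speaker`, `words` and `run`; none of these come from the question.

```r
library(stringr)
library(dplyr)

# Hypothetical transcript where the spoken text itself contains " this:"
df2 <- data.frame(x = c("Tom: I said this: cheese is good.",
                        "Tom: Muenster is my favorite."))

out <- df2 |>
  mutate(speaker = str_extract(x, "^\\w+"),          # leading name
         words   = str_remove(x, "^\\w+:\\s*"),      # strip only the leading "Name: "
         run     = consecutive_id(speaker)) |>       # id per run of the same speaker
  summarize(x = str_c(first(speaker), ": ", str_flatten(words, " ")), .by = run)

out$x
#> [1] "Tom: I said this: cheese is good. Muenster is my favorite."
```

Because the pattern is anchored with "^" and applied per line, the " this:" inside the sentence is left alone.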

2024-07-19

LMc

Solution

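Using data.table, split on ": ", group by rleid, then paste the text back together per group:

library(data.table)
setDT(df)  # convert df to a data.table by reference so that := works

df[, c("name", "text") := tstrsplit(x, ": ", fixed = TRUE) 
   ][, .(text = paste(text, collapse = " ")), by = .(name, rleid(name))
     ][, -2]  # drop the helper rleid column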

#      name                                                                                  text
#    <char>                                                                                 <char>
# 1:    Tom                                 I like cheese. Cheese is good. Muenster is my favorite.
# 2:    Bob                                                                     No, I like Cheddar.
# 3:    Tom                                   You're wrong. I think cheddar is only good on burgers.
# 4:   Gina                                       But what about American on burgers? That's better.
# 5:    Bob Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos.

2024-07-19

zx8754

Solution

Here is a base R option (note that sort_by() requires R >= 4.4.0):

with(
    sort_by(
        aggregate(
            V2 ~ nr + V1,
            transform(
                read.delim(
                    text = df$x,
                    sep = ":",
                    head = FALSE,
                    strip.white = TRUE
                ),
                nr = with(rle(V1), rep(seq_along(values), lengths))
            ),
            paste0,
            collapse = " "
        ), ~nr
    ),
    data.frame(
        x = paste0(V1, ": ", V2)
    )
)

which gives

                                                                                               x
1                                   Tom: I like cheese. Cheese is good. Muenster is my favorite.
2                                                                       Bob: No, I like Cheddar.
3                                    Tom: You're wrong. I think cheddar is only good on burgers.
4                                       Gina: But what about American on burgers? That's better.
5 Bob: Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos.
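
Since `sort_by()` was only added in R 4.4.0, a version that avoids it may be useful on older installations. The following is a rough sketch of the same run-length idea using `sub()`, `rle()` and `tapply()`; the variable names (`speaker`, `words`, `run`, `merged`, `res`) are made up for illustration, and the input is a shortened subset of the question's lines.

```r
# A subset of the question's lines, kept in transcript order
x <- c("Tom: I like cheese.",
       "Tom: Cheese is good.",
       "Bob: No, I like Cheddar.",
       "Tom: You're wrong. I think cheddar is only good on burgers.")

speaker <- sub(":.*$", "", x)          # everything before the first colon
words   <- sub("^[^:]*:\\s*", "", x)   # everything after the first colon

# One id per run of consecutive lines from the same speaker
r   <- rle(speaker)
run <- rep(seq_along(r$values), r$lengths)

# Paste the text within each run, then put the speaker name back in front
merged <- tapply(words, run, paste, collapse = " ")
res <- data.frame(x = paste0(r$values, ": ", merged))
res
```

`tapply()` returns the groups in increasing order of `run`, which matches the order of `r$values`, so no extra sorting step is needed here.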

2024-07-19

ThomasIsCoding