Question

How to determine an overlapping sequence of words between two texts

In one of our digital assignments, I asked my students to read an article and write a few things they learned from it. Students were told to write in their own words, and I also had reason to believe that copying and pasting a block of text, or all of it, was disabled. But I was very wrong. I received over 9,000 text entries, many of which looked like they were copied and pasted directly from the article. Some had differences in punctuation and capitalization, but I cannot imagine the students literally sat there and typed most of the article out.

I have read through many of the students' assignments and tried to identify features that distinguish a copied-and-pasted entry from an honest one, hoping that some R function could help me detect them. However, I have not been successful. To demonstrate, here is an example that I made up. The passages are often long, between 300 and 800 words, and I wonder if there is a relatively easy way to identify the common block of words that overlaps between two texts.

text_1 <- "She grew up in the United States. Her father was..."
text_2 <- "I learned that she grew up in the united states.Her father was ..."

Desired Outcome: "she grew up in the united states. Her father was ..."

The desired outcome should print the sequence of words that overlaps between the two strings; differences in capitalization or spacing should not matter.

Thank you for reading and for any expertise you can share.


Solution


Using the data from @Bastián Olea Herrera:

library(tm)
library(slam)

text <- list("she grew up in the united states.Her father was",
             "She grew up in the United States. Her father was",
             "I learned that she grew up in the united states.Her father was",
             "The main character was born in the USA, his father being",
             "My favourite animals are raccoons, they are so silly and cute",
             "I didn't understand this assignment so I'm just answering gibberish",
             "she grew up in the united states.Her father was"
)

# replace literal periods with spaces so "states.Her" splits into
# two tokens, then build a term-document matrix
tdm <- VectorSource(sapply(text, \(x) gsub(".", " ", x, fixed = TRUE), USE.NAMES = FALSE)) |>
  SimpleCorpus() |>
  TermDocumentMatrix(
    control = list(tolower = TRUE,
                   removePunctuation = TRUE,
                   stopwords = TRUE))

# cosine similarity: dot products of the document columns divided by
# the product of their Euclidean norms
cs <- crossprod_simple_triplet_matrix(tdm) / sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2)))
cs
#     Docs
# Docs         1         2         3         4 5 6         7
#    1 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000
#    2 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000
#    3 0.8944272 0.8944272 1.0000000 0.2000000 0 0 0.8944272
#    4 0.2236068 0.2236068 0.2000000 1.0000000 0 0 0.2236068
#    5 0.0000000 0.0000000 0.0000000 0.0000000 1 0 0.0000000
#    6 0.0000000 0.0000000 0.0000000 0.0000000 0 1 0.0000000
#    7 1.0000000 1.0000000 0.8944272 0.2236068 0 0 1.0000000

This is just an example that you could build on; tm has a lot more functionality. The idea is that you can build a term-document matrix and use it to compute a similarity score between documents. The score computed here is cosine similarity, but there are many others.

If you read the documentation (?TermDocumentMatrix), you will see that you can do things like inverse document frequency weighting, which gives more weight to uncommon words.
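
For example, here is a sketch of the same pipeline with tf-idf weighting swapped in (weightTfIdf is tm's term frequency-inverse document frequency weighting; everything else is unchanged):

# same corpus, but weight terms by tf-idf so that rare words count more
tdm_tfidf <- VectorSource(sapply(text, \(x) gsub(".", " ", x, fixed = TRUE),
                                 USE.NAMES = FALSE)) |>
  SimpleCorpus() |>
  TermDocumentMatrix(
    control = list(tolower = TRUE,
                   removePunctuation = TRUE,
                   stopwords = TRUE,
                   weighting = weightTfIdf))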

The first column of the output compares the first text to all of the texts, the second column compares the second text to all of the texts, and so forth. The diagonal is always one because there each text is compared to itself. As you can see from the first column, the second, third, and seventh texts are all quite similar to the first.
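
For instance, here is a minimal sketch that flags highly similar pairs (the 0.85 cutoff is an arbitrary choice for illustration, not a recommendation):

# flag document pairs above a similarity cutoff; tune the cutoff for your data
flagged <- which(cs > 0.85 & upper.tri(cs), arr.ind = TRUE)

With the matrix above, this flags the pairs (1,2), (1,3), (2,3), (1,7), (2,7), and (3,7).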


Alternatively, you can extract the longest common substring like so (using the text list from above). This compares the first element (your base text/digital assignment) to the remaining texts (the student input):

library(PTXQC)
library(textclean)

standardize <- function(x) {
  x |>
    tolower() |>
    replace_contraction() |>                        # e.g. "didn't" -> "did not"
    gsub("[[:punct:]]", " ", x = _, perl = TRUE) |> # punctuation -> space
    replace_white()                                 # collapse excess white space
}

std_text <- standardize(text)

lapply(std_text[-1], \(x) LCSn(c(std_text[[1]], x)))
# [[1]]
# [1] "she grew up in the united states her father was"
# 
# [[2]]
# [1] "she grew up in the united states her father was"
# 
# [[3]]
# [1] " in the u"
# 
# [[4]]
# [1] " the"
# 
# [[5]]
# [1] " un"
# 
# [[6]]
# [1] "she grew up in the united states her father was"

First, a little text cleaning is done to standardize the text. Punctuation is replaced with a space to address the missing space in your text_2 ("states.Her"). This may introduce excess white space, but that is resolved with replace_white().

LCSn() has a min_LCS_length argument you can set to ignore minimally overlapping text.
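
For example, with the std_text from above, a cutoff of 15 characters should drop the short incidental overlaps (" in the u", " the", " un") and return empty strings for them:

# overlaps shorter than 15 characters are ignored
lapply(std_text[-1], \(x) LCSn(c(std_text[[1]], x), min_LCS_length = 15))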

Note: PTXQC and textclean have a fair number of dependencies.

2024-07-23
LMc

Solution


This is not quite what you asked for, but you can use the {stringdist} package to evaluate the "distance" between two texts, generally interpreted as the number of characters you would have to modify in one string for it to become equal to the reference string. So "friend" and "friendly" would have a distance of 2.
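
A quick check of that example with the default method:

library(stringdist)
stringdist("friend", "friendly")
#> [1] 2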

This way you could check which texts have fewer differences compared to the reference text, possibly meaning that they were copied straight from the source material.

# https://github.com/markvanderloo/stringdist
install.packages('stringdist')

library(stringdist)

base_text <- "she grew up in the united states.Her father was"

text_1 <- "She grew up in the United States. Her father was"
text_2 <- "I learned that she grew up in the united states.Her father was"
text_3 <- "The main character was born in the USA, his father being"
text_4 <- "My favourite animals are raccoons, they are so silly and cute"
text_5 <- "I didn't understand this assignment so I'm just answering gibberish"
text_6 <- "she grew up in the united states.Her father was"

test_texts <- c(text_1, text_2, text_3, text_4, text_5, text_6)

# calculate string distance using default method
distances <- stringdist(base_text, test_texts)

# texts that are only x or fewer edits away from the original text
possible_copied_texts <- test_texts[distances <= 25]

possible_copied_texts

#[1] "She grew up in the United States. Her father was"              
#[2] "I learned that she grew up in the united states.Her father was"
#[3] "she grew up in the united states.Her father was"        

If this method does not work for your use case, you can use stringdist with the longest common substring method (method = 'lcs'), which is defined as the "longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact." This way we can find whether longer texts have a pasted text inside them, even if it is slightly modified:

library(stringdist)

base_text_2 <- "this sentence means plagiarism therefore something bad will occur"

text_7 <- "random string with no words from the base text"
text_8 <- "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"
text_9 <- "this pretty long sentence does in fact mean that I have not plagiarized any text, instead I'm writing all by myself"
text_10 <- "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
text_11 <- "totally normal text"
text_12 <- "this sentence means plagiarism therefore something bad will occur"
text_13 <- "this sentence does not mean plagiarism and therefore something bad not will occur"
# here, strings 8, 10, and 12 contain the base text in them, and string 13 contains a slightly modified version of the base text which would still be plagiarism

# create a vector with the strings
test_texts_2 <- c(text_7, 
                  text_8, 
                  text_9, 
                  text_10,
                  text_11,
                  text_12,
                  text_13)

# but we will also add filler text before and after every string, so that they become longer
filler <- "lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt"
test_texts_3 <- paste(filler, test_texts_2, filler)

# perform string distance calculation with the longest common substring method
distances_lcs <- stringdist(base_text_2, test_texts_3, method = "lcs")

# subtract the distances from the length of every string, then subtract the length of the base text, so that strings containing the base text score zero
distance_lcs_results <- nchar(test_texts_3) - distances_lcs - nchar(base_text_2)

# strings with a value of 0 contain the exact base text
distance_lcs_results
#> [1] -38   0 -24   0 -44   0  -2

# subset the vector so that we can confirm that the strings that contain the text were detected
test_texts_2[distance_lcs_results == 0]
#> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"                                                                                     
#> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
#> [3] "this sentence means plagiarism therefore something bad will occur"

# but we can also get close matches: strings containing text that is not identical, but similar, to the base text
test_texts_2[abs(distance_lcs_results) < 20]
#> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"                                                                                     
#> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
#> [3] "this sentence means plagiarism therefore something bad will occur"                                                                                                                                            
#> [4] "this sentence does not mean plagiarism and therefore something bad not will occur"

You could use both methods (or more!) to create a score variable, and then make a decision based on multiple plagiarism metrics.
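
As a minimal sketch of that idea (reusing base_text and test_texts from the first code block; all three methods are standard stringdist options, and which ones you combine is up to you):

# combine several distance metrics into one data frame;
# lower values mean a text is closer to the base text
scores <- data.frame(
  text = test_texts,
  osa  = stringdist(base_text, test_texts),                  # default edit distance
  lcs  = stringdist(base_text, test_texts, method = "lcs"),  # longest common substring
  jw   = stringdist(base_text, test_texts, method = "jw")    # Jaro-Winkler, 0 to 1
)
scores[order(scores$osa), ]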

Created on 2024-07-24 with reprex v2.1.0

2024-07-23
Bastián Olea Herrera

Solution


Here's an approach to identify contiguous words in common using tidytext, looking for verbatim copying of phrases. I strip out punctuation (at least .,;?!) and make n-grams (phrases of length n) to match between the two sources.

In this case I look for contiguous phrases of 7-10 words. For large data, a wide range could become computationally and memory intensive. A short phrase length will capture more false positives from common phrases, while a long phrase length could lead to false negatives (not catching plagiarism), so this will take some adjusting for the context.

library(tidyverse); library(tidytext)
tokenize <- function(str_vec, from = 7, to = 10) {
  data.frame(text = str_vec) |>
    mutate(text = text |> 
             str_replace_all("[.,;?!] ", " ") |>
             str_replace_all("[.,;?!]",  " ")) |>
    tidytext::unnest_ngrams(phrase, text, n = to, n_min = from,  to_lower = TRUE) |>
    mutate(length = str_count(phrase, "\\w+"))
}

inner_join(
  text_1 |> tokenize() |> mutate(src = "text_1"),
  text_2 |> tokenize() |> mutate(src = "text_2"),
  join_by(phrase, length)
) |>
  arrange(-length)
  

Result

                                            phrase length  src.x  src.y
1  she grew up in the united states her father was     10 text_1 text_2
2      she grew up in the united states her father      9 text_1 text_2
3      grew up in the united states her father was      9 text_1 text_2
4             she grew up in the united states her      8 text_1 text_2
5          grew up in the united states her father      8 text_1 text_2
6           up in the united states her father was      8 text_1 text_2
7                 she grew up in the united states      7 text_1 text_2
8                 grew up in the united states her      7 text_1 text_2
9               up in the united states her father      7 text_1 text_2
10             in the united states her father was      7 text_1 text_2
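
If you only want the single longest shared phrase rather than every overlapping n-gram, a small follow-up using dplyr's slice_max() keeps just the top match:

inner_join(
  text_1 |> tokenize() |> mutate(src = "text_1"),
  text_2 |> tokenize() |> mutate(src = "text_2"),
  join_by(phrase, length)
) |>
  slice_max(length)
#                                            phrase length  src.x  src.y
# 1 she grew up in the united states her father was     10 text_1 text_2
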
2024-07-24
Jon Spring