This is not quite what you asked for, but you can use the {stringdist} package to evaluate the "distance" between two texts, generally interpreted as the number of characters you would have to change in one string to make it equal to the reference string. So "friend" and "friendly" have a distance of 2.
This way you can check which texts have fewer differences from the reference text, which may mean that they were copied directly from the source material.
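To make the idea of distance concrete, here is a small sketch using the package's default method (optimal string alignment, `"osa"`). The "Friend" comparison and the `tolower()` step are illustrative additions, not part of the original question.

```r
library(stringdist)

# two characters ("l" and "y") must be added to turn "friend" into "friendly"
d_insert <- stringdist("friend", "friendly")
d_insert
#> [1] 2

# identical strings are 0 edits apart
d_same <- stringdist("friend", "friend")

# the comparison is case-sensitive: "Friend" vs "friend" costs one substitution
d_case <- stringdist("Friend", "friend")

# lower-casing both sides first removes that difference (optional choice)
d_case_folded <- stringdist(tolower("Friend"), tolower("friend"))
```

Whether to fold case or strip punctuation before comparing depends on how strictly you want to treat small edits.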
# https://github.com/markvanderloo/stringdist
install.packages('stringdist')
library(stringdist)
base_text <- "she grew up in the united states.Her father was"
text_1 <- "She grew up in the United States. Her father was"
text_2 <- "I learned that she grew up in the united states.Her father was"
text_3 <- "The main character was born in the USA, his father being"
text_4 <- "My favourite animals are raccoons, they are so silly and cute"
text_5 <- "I didn't understand this assignment so I'm just answering gibberish"
text_6 <- "she grew up in the united states.Her father was"
test_texts <- c(text_1, text_2, text_3, text_4, text_5, text_6)
# calculate string distance using default method
distances <- stringdist(base_text, test_texts)
# texts that are 25 or fewer edits away from the original text
possible_copied_texts <- test_texts[distances <= 25]
possible_copied_texts
#[1] "She grew up in the United States. Her father was"
#[2] "I learned that she grew up in the united states.Her father was"
#[3] "she grew up in the united states.Her father was"
If this method does not work for your use case, you can use stringdist with the longest common substring method (method='lcs'), which the package documentation defines as the "longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact." Elsewhere this is usually called the longest common subsequence, since the paired characters do not have to be adjacent. This way we can find out whether longer texts contain a pasted text, even if it is slightly modified:
library(stringdist)
base_text_2 <- "this sentence means plagiarism therefore something bad will occur"
text_7 <- "random string with no words from the base text"
text_8 <- "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"
text_9 <- "this pretty long sentence does in fact mean that I have not plagiarized any text, instead I'm writing all by myself"
text_10 <- "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
text_11 <- "totally normal text"
text_12 <- "this sentence means plagiarism therefore something bad will occur"
text_13 <- "this sentence does not mean plagiarism and therefore something bad not will occur"
# here, strings 8, 10, and 12 contain the base text in them, and string 13 contains a slightly modified version of the base text which would still be plagiarism
# create a vector with the strings
test_texts_2 <- c(text_7, text_8, text_9, text_10, text_11, text_12, text_13)
# but we will also add filler text before and after every string, so that they become longer
filler <- "lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt"
test_texts_3 <- paste(filler, test_texts_2, filler)
# perform string distance calculation with the longest common substring method
distances_lcs <- stringdist(base_text_2, test_texts_3, method = "lcs")
# subtract the distance from the length of each string, then subtract the length of the base text, so that strings containing the base text score zero
distance_lcs_results <- nchar(test_texts_3) - distances_lcs - nchar(base_text_2)
# a value of 0 means the whole base text was found in the string
distance_lcs_results
#> [1] -38 0 -24 0 -44 0 -2
# subset the vector so that we can confirm that the strings that contain the text were detected
test_texts_2[distance_lcs_results == 0]
#> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"
#> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
#> [3] "this sentence means plagiarism therefore something bad will occur"
# but we can also get close matches: strings containing text that is similar, but not identical, to the base text
test_texts_2[abs(distance_lcs_results) < 20]
#> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"
#> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
#> [3] "this sentence means plagiarism therefore something bad will occur"
#> [4] "this sentence does not mean plagiarism and therefore something bad not will occur"
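The arithmetic behind `distance_lcs_results` follows from how the lcs distance is defined: it counts the unpaired characters in both strings, i.e. `nchar(a) + nchar(b) - 2 * L`, where `L` is the number of paired characters. A minimal sketch with made-up strings (`a`, `b` and `b_mod` are hypothetical):

```r
library(stringdist)

a <- "ace"
b <- "xxabcdexx"

# "a", "c" and "e" all appear in b in the same order, so all 3 pair up;
# the 6 remaining characters of b are left unpaired
d <- stringdist(a, b, method = "lcs")
d
#> [1] 6

# same transformation as above: 0 means every character of a was paired
score <- nchar(b) - d - nchar(a)
score
#> [1] 0

# if one character of a cannot be paired, the result drops by 2
b_mod <- "xxabcdxx"
score_mod <- nchar(b_mod) - stringdist(a, b_mod, method = "lcs") - nchar(a)
score_mod
#> [1] -2

# note that "ace" is not a literal substring of b, even though the score is 0
literal_match <- grepl(a, b, fixed = TRUE)
literal_match
#> [1] FALSE
```

Because the paired characters only need to appear in order, not next to each other, a score of 0 can in principle occur without a literal copy, especially in long texts. It is worth confirming flagged texts by eye or with an exact check such as `grepl(..., fixed = TRUE)`.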
You could use both methods (or more!) to create a score variable, and then make a decision based on multiple plagiarism metrics.
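As a rough sketch of what such a score could look like, the example below turns both distances into 0 to 1 similarities and averages them. The equal weights, the 0.7 cutoff, and the names `edit_similarity`, `lcs_similarity`, `score` and `flag` are all assumptions made for illustration, and the sample texts are shortened versions of the ones above.

```r
library(stringdist)

base <- "this sentence means plagiarism therefore something bad will occur"
texts <- c(
  "this sentence means plagiarism therefore something bad will occur",
  "intro words this sentence means plagiarism therefore something bad will occur outro words",
  "totally normal text"
)

# metric 1: edit distance relative to the longer string (0 = identical)
edit_rel <- stringdist(base, texts) / pmax(nchar(base), nchar(texts))

# metric 2: share of base-text characters left unpaired by the lcs method
lcs_dist <- stringdist(base, texts, method = "lcs")
unpaired_share <- (nchar(base) + lcs_dist - nchar(texts)) / (2 * nchar(base))

# hypothetical scoring: one similarity per metric, averaged with equal weights
scores <- data.frame(
  text = texts,
  edit_similarity = 1 - edit_rel,
  lcs_similarity = 1 - unpaired_share
)
scores$score <- rowMeans(scores[, c("edit_similarity", "lcs_similarity")])

# arbitrary cutoff, to be tuned on real data
scores$flag <- scores$score >= 0.7
scores[, c("edit_similarity", "lcs_similarity", "score", "flag")]
```

With these inputs the exact copy and the copy wrapped in extra words are flagged, while the unrelated text is not. The edit-distance metric penalises the added words, whereas the lcs metric ignores them, which is why combining the two can be more informative than either alone.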
Created on 2024-07-24 with reprex v2.1.0