Question
How can I simplify this method to replace punctuation while keeping special words intact?
I am making a modular function that will take keywords containing special characters (@&*%) and keep them intact while all other punctuation is deleted from a sentence. I have devised a solution, but it is very bulky and probably more complicated than it needs to be. Is there a simpler way to do this?
In short, my code matches all instances of the special words to find their spans. I then match each punctuation character to find its span, loop over the list of punctuation matches, and skip any match that falls inside the span of a found word. Only the remaining matches are removed from the sentence.
Code:
import re
from string import punctuation
from typing import List

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."

# my attempt to remove punctuation
class SentenceHolder:
    sentence = None
    protected_words = ["Q&A"]

    def __init__(self, sentence):
        self.sentence = sentence

    def remove_punctuation(self):
        for punct in punctuation:
            # escape the symbol so regex metacharacters such as "." or "*" match literally
            symbol_matches: List[re.Match] = [i for i in re.finditer(re.escape(punct), self.sentence)]
            remove_able_matches = self._protected_word_overlap(symbol_matches)
            # replace from right to left so the earlier match offsets stay valid
            for word in reversed(remove_able_matches):
                self.sentence = (self.sentence[:word.start()] + " " + self.sentence[word.end():])

    def _protected_word_overlap(self, symbol_matches):
        # find the span of every protected word in the sentence
        protected_word_locations = []
        for protected_word in self.protected_words:
            protected_word_locations.extend([i for i in re.finditer(re.escape(protected_word), self.sentence)])
        # collect the symbol matches that overlap a protected word
        protected_matches = []
        for protected_word in protected_word_locations:
            for symbol_inst in symbol_matches:
                symbol_range: range = range(symbol_inst.start(), symbol_inst.end())
                protected_word_set = set(range(protected_word.start(), protected_word.end()))
                if len(protected_word_set.intersection(symbol_range)) != 0:
                    protected_matches.append(symbol_inst)
        remove_able_matches = [sm for sm in symbol_matches if sm not in protected_matches]
        return remove_able_matches
The output of the code:
my_string = SentenceHolder(sentence)
my_string.remove_punctuation()
Result:
"I am going to run over to Q&A and ask them a ton of questions about this that that this while surfacing the internet with my raccoon buddy the bar"
I tried to use a regex pattern to identify all the locations of the punctuation, but the pattern I use in re.sub does not work the same way in re.match.
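A minimal sketch of why the two calls behave differently, and of one possible single-pass alternative. re.match only attempts a match at the start of the string, whereas re.sub, re.search and re.finditer scan the whole string. The helper name strip_punctuation and its keep parameter are hypothetical and not part of the class above; the sketch assumes the protected words should be matched literally and that each removed symbol is replaced by a space, as in the original code.

```python
import re
from string import punctuation

text = "this & that"
# re.match is anchored at index 0, so it does not find the "&" here.
at_start = re.match(re.escape("&"), text)    # None
# re.search (like re.sub and re.finditer) scans the whole string.
anywhere = re.search(re.escape("&"), text)   # match starting at index 5


def strip_punctuation(sentence, keep):
    """Hypothetical helper: replace punctuation with spaces, except inside `keep` words."""
    keep_set = set(keep)
    # Protected words go first in the alternation (longest first), so at any
    # position they are tried before the single-character punctuation class.
    parts = [re.escape(w) for w in sorted(keep_set, key=len, reverse=True)]
    parts.append("[" + re.escape(punctuation) + "]")
    pattern = re.compile("|".join(parts))
    # Return protected words unchanged; anything else matched is punctuation.
    return pattern.sub(lambda m: m.group(0) if m.group(0) in keep_set else " ", sentence)


result = strip_punctuation("this & that at Q&A!", ["Q&A"])
# Collapse the runs of spaces left behind by the replacements.
cleaned = " ".join(result.split())
print(cleaned)  # this that at Q&A
```

Because the protected words and the punctuation are matched in one pass, there is no need to compare spans afterwards: a "&" inside "Q&A" is consumed as part of the protected word before the punctuation class ever sees it.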