Question
Python/Regex: Finding a specific pattern to update and then placing it back in the original text
I have a long file containing thousands of lines; two samples are shown below:
\begin{align*}
H_0 \amp : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\
H_1 \amp : \text{Some text} \\
H_2 \amp : \text{More text...} \\
\end{align*}
\begin{table}[htb]
\centering
\begin{tabular}{cc}
Mean \amp = \amp the mean value $\mu$ \\
Median \amp = \amp the median value $\median$ \\
Mode \amp = \amp the mode value $\mode$ \\
\end{tabular}
\end{table}
The objective is to turn \begin{align*}...\end{align*}
into
<md>
<mrow>H_0 \amp : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5</mrow>
<mrow>H_1 \amp : \text{Some text}</mrow>
<mrow>H_2 \amp : \text{More text...}</mrow>
</md>
and \begin{table}[htb]...\end{table}
to
<table>
<tabular halign="center">
<row header="yes" bottom="minor" >
<cell>Mean</cell>
<cell>=</cell>
<cell>the mean value $\mu$</cell>
</row>
<row>
<cell>Median</cell>
<cell>=</cell>
<cell>the median value $\median$</cell>
</row>
<row>
<cell>Mode</cell>
<cell>=</cell>
<cell>the mode value $\mode$</cell>
</row>
</tabular>
</table>
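For reference, the table mapping above could probably be handled along the following lines. This is only a rough sketch under a few assumptions: every table wraps exactly one tabular, rows end with `\\`, cells are separated by `\amp`, the first row is always the header, and `halign="center"` is hard-coded because the sample uses `\centering`. The names `TABLE_RE`, `table_to_xml`, and `convert_tables` are made up for illustration.

```python
import re

# Capture only the tabular body; the column spec ({cc}) is skipped.
TABLE_RE = re.compile(
    r'\\begin\{table\}.*?\\begin\{tabular\}\{[^}]*\}(.*?)'
    r'\\end\{tabular\}.*?\\end\{table\}',
    re.DOTALL,
)

def table_to_xml(match):
    """Build the <table> markup for one matched LaTeX table."""
    body = match.group(1)
    # Rows are separated by the LaTeX line break (two backslashes).
    rows = [row.strip() for row in body.split('\\\\') if row.strip()]
    out = ['<table>', '<tabular halign="center">']
    for index, row in enumerate(rows):
        # Assumption: the first row is the header row.
        out.append('<row header="yes" bottom="minor" >' if index == 0 else '<row>')
        for cell in row.split(r'\amp'):
            out.append(f'<cell>{cell.strip()}</cell>')
        out.append('</row>')
    out += ['</tabular>', '</table>']
    return '\n'.join(out)

def convert_tables(text):
    # A function replacement is inserted verbatim, so nothing needs escaping.
    return TABLE_RE.sub(table_to_xml, text)

sample = r"""\begin{table}[htb]
\centering
\begin{tabular}{cc}
Mean \amp = \amp the mean value $\mu$ \\
Median \amp = \amp the median value $\median$ \\
\end{tabular}
\end{table}"""

print(convert_tables(sample))
```

A table without a tabular inside it, or a `\\` inside a cell, would break this sketch and would need extra handling.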
I am trying to get \begin{align*} working first and haven't started on \begin{table} yet. I wrote a script for it, but it doesn't work as expected, and I believe the cause is my use of re.escape(...): it generates too many unnecessary \ characters. I want to eliminate the extra backslashes and also remove \begin{align*} and \end{align*} in the process. Any assistance is appreciated! The current output, followed by my script, is shown below:
<md><mrow>\begin\{align\*\}\
\ \ H_0\ \amp\ :\ \mu_1\ =\ \mu_2\ =\ \mu_3\ =\ \mu_4\ =\ \mu_5\ </mrow><mrow>\
\ \ H_1\ \amp\ :\ \text\{Some\ text\}\ </mrow><mrow>\ \
\ \ H_2\ \amp\ :\ \text\{More\ text\.\.\.\}\ </mrow>\ \
\end\{align\*\}</md>
\begin{table}[htb]
\centering
\begin{tabular}{cc}
Mean \amp = \amp the mean value $\mu$ \\
Median \amp = \amp the median value $\median$ \\
Mode \amp = \amp the mode value $\mode$ \\
\end{tabular}
\end{table}
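The stray backslashes in this output are consistent with `re.escape` being applied to the replacement argument of `re.sub`. `re.escape` is intended for the pattern side; the replacement string goes through its own, different escape handling, so the escaped spaces and braces survive as literal backslashes. A small demonstration on a made-up one-line sample (the variable names are illustrative only):

```python
import re

text = r"x \amp y"                      # hypothetical one-line sample
wrapped = "<mrow>" + text + "</mrow>"

# Escaping the replacement leaves stray backslashes in the result.
escaped = re.sub(re.escape(text), re.escape(wrapped), text)

# A function replacement is inserted verbatim, with no escape processing.
literal = re.sub(re.escape(text), lambda m: wrapped, text)

print(escaped)
print(literal)
```

When both sides are plain strings, `str.replace(old, new)` avoids the regex machinery altogether.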
import re

with open("sample.txt", "r") as my_file:
    data: str = my_file.read()

result: str = data
original = re.findall(r'\\begin{align\*}[\s\S]*\\end{align\*}', data)
modified = re.findall(r'\\begin{align\*}[\s\S]*\\end{align\*}', data)

for i in range(len(modified)):
    # append the first mrow of the <md> tag
    modified[i] = r'<mrow>' + modified[i]
    # replace \\ with a closing and opening of </mrow> and <mrow>.
    modified[i] = str(modified[i]).replace(r'\\', r'</mrow><mrow>')
    # wrap everything with the math display environment
    modified[i] = '<md>' + modified[i] + r'</md>'
    # Remove the last <mrow> as it is an extra
    modified[i] = (modified[i][::-1].replace(r'<mrow>'[::-1], ''[::-1], 1))[::-1]
    result = re.sub(re.escape(original[i]), re.escape(modified[i]), result)
    # print(modified[i])
    # print(original[i])

print(result)
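One possible way to restructure the script above, offered as a sketch rather than a definitive fix: capture only the body of each align* block with a non-greedy group (the greedy `[\s\S]*` would span from the first `\begin{align*}` to the last `\end{align*}` in the file), and pass a function to `re.sub` so the replacement is inserted as-is. The names `ALIGN_RE`, `align_to_md`, and `convert_align` are hypothetical, and the code assumes every row ends with `\\` and that no nested environment inside the block uses `\\`.

```python
import re

# Non-greedy body so each align* block is matched on its own.
ALIGN_RE = re.compile(r'\\begin\{align\*\}(.*?)\\end\{align\*\}', re.DOTALL)

def align_to_md(match):
    """Convert the body of one align* block into <md>/<mrow> markup."""
    body = match.group(1)
    # Split on the LaTeX line break (two backslashes) and drop empty pieces.
    rows = [row.strip() for row in body.split('\\\\')]
    rows = [row for row in rows if row]
    lines = ['<md>'] + [f'<mrow>{row}</mrow>' for row in rows] + ['</md>']
    return '\n'.join(lines)

def convert_align(text):
    # The function's return value is used verbatim; no escaping is needed.
    return ALIGN_RE.sub(align_to_md, text)

sample = r"""\begin{align*}
H_0 \amp : \mu_1 = \mu_2 \\
H_1 \amp : \text{Some text} \\
\end{align*}"""

print(convert_align(sample))
```

Because the `\begin{align*}` and `\end{align*}` markers sit outside the captured group, they are dropped automatically, and there is no trailing extra `<mrow>` to strip afterwards.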