Question

Python/Regex: Finding a specific pattern to update and then placing it back in the original text

I have a long file containing thousands of lines and a couple of samples are shown below:

\begin{align*}
  H_0 \amp : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\
  H_1 \amp : \text{Some text} \\ 
  H_2 \amp : \text{More text...} \\ 
\end{align*}

\begin{table}[htb]
  \centering
  \begin{tabular}{cc}
    Mean   \amp = \amp the mean value $\mu$       \\
    Median \amp = \amp the median value $\median$ \\
    Mode   \amp = \amp the mode value $\mode$     \\
  \end{tabular}
\end{table}

The objective is to turn \begin{align*}...\end{align*} into

<md>
  <mrow>H_0 \amp : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5</mrow>
  <mrow>H_1 \amp : \text{Some text}</mrow>
  <mrow>H_2 \amp : \text{More text...}</mrow>
</md>

and \begin{table}[htb]...\end{table} to

<table>
  <tabular halign="center">
    <row header="yes" bottom="minor" >
      <cell>Mean</cell>
      <cell>=</cell>
      <cell>the mean value $\mu$</cell>
    </row>
    <row>
      <cell>Median</cell>
      <cell>=</cell>
      <cell>the median value $\median$</cell>
    </row>
    <row>
      <cell>Mode</cell>
      <cell>=</cell>
      <cell>the mode value $\mode$</cell>
    </row>
  </tabular>
</table>

I am trying to get \begin{align*} working first and haven't started on \begin{table} yet. I wrote a script for it, but it doesn't work as expected. I believe the cause is re.escape(...): it generates too many unnecessary \ characters. I want to eliminate the extra backslashes and also remove \begin{align*} and \end{align*} in the process. Any assistance is appreciated! Here is what the script currently prints, followed by the script itself:

<md><mrow>\begin\{align\*\}\
\ \ H_0\ \amp\ :\ \mu_1\ =\ \mu_2\ =\ \mu_3\ =\ \mu_4\ =\ \mu_5\ </mrow><mrow>\
\ \ H_1\ \amp\ :\ \text\{Some\ text\}\ </mrow><mrow>\ \
\ \ H_2\ \amp\ :\ \text\{More\ text\.\.\.\}\ </mrow>\ \
\end\{align\*\}</md>

\begin{table}[htb]
  \centering
  \begin{tabular}{cc}
    Mean   \amp = \amp the mean value $\mu$       \\
    Median \amp = \amp the median value $\median$ \\
    Mode   \amp = \amp the mode value $\mode$     \\
  \end{tabular}
\end{table}

import re

my_file = open("sample.txt", "r")
data: str = my_file.read()
result: str = data

original = re.findall(r'\\begin{align\*}[\s\S]*\\end{align\*}', data,)
modified = re.findall(r'\\begin{align\*}[\s\S]*\\end{align\*}', data,)

for i in range(len(modified)):
  # append the first mrow of the <md> tag
  modified[i] = r'<mrow>' + modified[i] 
  # replace \\ with a closing and opening of </mrow> and <mrow>.
  modified[i] = str(modified[i]).replace(r'\\', r'</mrow><mrow>')
  #wrap everything with the math display environment
  modified[i] = '<md>' + modified[i]+r'</md>'
  # Remove the last <mrow> as it is an extra
  modified[i] =(modified[i][::-1].replace(r'<mrow>'[::-1], ''[::-1], 1))[::-1]
  
  
  result = re.sub(re.escape(original[i]), re.escape(modified[i]), result)
  # print(modified[i])
  # print(original[i])
  print(result)


1 Jan 1970

Solution

Here is an example for the first part; I think you can easily extend it to the table.

For a real implementation, though, you should also handle the other environments, as you attempt in your example. Parsers such as TexSoup are available, so these changes would be better done with a TeX parser or a good LaTeX editor.

import xml.etree.ElementTree as ET

# Map each align* delimiter line to the tag that replaces it.
DELIMITERS = {b'\\begin{align*}': b'<md>', b'\\end{align*}': b'</md>'}

def change_c(textline):
    stripped = textline.strip()
    if stripped in DELIMITERS:
        return DELIMITERS[stripped]
    if textline.startswith(b'  H_'):
        # Wrap the row in <mrow> and drop the trailing LaTeX line break (\\).
        row = ET.Element('mrow')
        row.text = stripped.decode().rstrip('\\')
        return ET.tostring(row)
    # Any other line passes through without its line ending.
    return textline.rstrip(b'\r\n')


with open('example.tex', 'rb') as tex:
    s = tex.readlines()

with open('new_example.tex', 'wb') as f:
    for line in s:
        new = change_c(line)
        f.write(new + b'\n')
        print(new.decode())

Output as file and text:

<md>
<mrow>H_0 \amp : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 </mrow>
<mrow>H_1 \amp : \text{Some text} </mrow>
<mrow>H_2 \amp : \text{More text...} </mrow>
</md>
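
On the extra backslashes in the question: re.escape() is meant for building patterns, so applying it to the replacement text is what inserts them. Below is a minimal sketch of one alternative, assuming each align* block is well formed and not nested. It passes a function as the replacement to re.sub, because the string a function returns is inserted as-is. The names ALIGN_RE, align_to_md and convert_align are made up for this illustration.

```python
import re
from xml.sax.saxutils import escape

# Non-greedy body, so each align* block is matched on its own.
ALIGN_RE = re.compile(r'\\begin\{align\*\}(.*?)\\end\{align\*\}', re.DOTALL)

def align_to_md(match):
    """Build the <md> block for one align* environment (hypothetical helper)."""
    # Rows are separated by the LaTeX line break: two backslash characters.
    rows = [row.strip() for row in match.group(1).split(r'\\')]
    # escape() only touches &, < and >, so LaTeX backslashes stay untouched.
    mrows = ['  <mrow>' + escape(row) + '</mrow>' for row in rows if row]
    return '\n'.join(['<md>'] + mrows + ['</md>'])

def convert_align(text):
    # The return value of a function replacement is used literally,
    # so \amp or \mu need no escaping here.
    return ALIGN_RE.sub(align_to_md, text)

sample = (
    "\\begin{align*}\n"
    "  H_0 \\amp : \\mu_1 = \\mu_2 \\\\\n"
    "  H_1 \\amp : \\text{Some text} \\\\ \n"
    "\\end{align*}\n"
)
print(convert_align(sample))
```

Text outside the align* blocks, such as the table, is left unchanged, so the same call can be run over the whole file.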

2024-07-14

Hermann12

Solution

You can use two patterns, such as:

([^\r\n\s]+)\s+\\amp\s+:\s+([^\r\n]+)
([^\r\n\s]+)\s+\\amp\s+=\s+([^\r\n]+)

and combine them with alternation (|):

([^\r\n\s]+)\s+\\amp\s+:\s+([^\r\n]+)|([^\r\n\s]+)\s+\\amp\s+=\s+([^\r\n]+)

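This captures the lines inside \begin{align*} and \begin{table}: the first alternative matches the align rows (\amp :) and the second matches the table rows (\amp =).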

Example:

import re

s = (
    "\\begin{align*}\n"
    "  H_0 \\amp : \\mu_1 = \\mu_2 = \\mu_3 = \\mu_4 = \\mu_5 \\\\\n"
    "  H_1 \\amp : \\text{Some text} \\\\ \n"
    "  H_2 \\amp : \\text{More text...} \\\\ \n"
    "\\end{align*}\n\n"
    "\\begin{table}[htb]\n"
    "  \\centering\n"
    "  \\begin{tabular}{cc}\n"
    "    Mean   \\amp = \\amp the mean value $\\mu$       \\\\\n"
    "    Median \\amp = \\amp the median value $\\median$ \\\\\n"
    "    Mode   \\amp = \\amp the mode value $\\mode$     \\\\\n"
    "  \\end{tabular}\n"
    "\\end{table}\n"
)

p = r'([^\r\n\s]+)\s+\\amp\s+:\s+([^\r\n]+)|([^\r\n\s]+)\s+\\amp\s+=\s+([^\r\n]+)'

matches = re.findall(p, s)

align = []
table = []
for k1, v1, k2, v2 in matches:
    if k1 and v1:
        align += [(k1, v1)]
    if k2 and v2:
        table += [(k2, v2)]

print(align, "\n\n\n", table)

Prints

[('H_0', '\\mu_1 = \\mu_2 = \\mu_3 = \\mu_4 = \\mu_5 \\\\'), ('H_1', '\\text{Some text} \\\\ '), ('H_2', '\\text{More text...} \\\\ ')] 


 [('Mean', '\\amp the mean value $\\mu$       \\\\'), ('Median', '\\amp the median value $\\median$ \\\\'), ('Mode', '\\amp the mode value $\\mode$     \\\\')]
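
To get from captured rows to the <row>/<cell> markup the question asks for, one possible follow-up is sketched below. It assumes every tabular row sits on one line and ends with \\, and that cells are separated by \amp. ROW_RE and rows_to_xml are hypothetical names, and the header="yes" bottom="minor" attributes from the question are left out.

```python
import re
from xml.sax.saxutils import escape

# One tabular row: the content of a line up to its LaTeX line break (\\).
ROW_RE = re.compile(r'^[ \t]*(.+?)[ \t]*\\\\[ \t]*$', re.MULTILINE)

def rows_to_xml(tabular_body):
    """Turn tabular rows into <row>/<cell> markup (hypothetical helper)."""
    lines = ['<tabular halign="center">']
    for match in ROW_RE.finditer(tabular_body):
        lines.append('  <row>')
        # Cells are separated by \amp; surrounding padding is dropped.
        for cell in match.group(1).split(r'\amp'):
            lines.append('    <cell>' + escape(cell.strip()) + '</cell>')
        lines.append('  </row>')
    lines.append('</tabular>')
    return '\n'.join(lines)

body = (
    "    Mean   \\amp = \\amp the mean value $\\mu$       \\\\\n"
    "    Median \\amp = \\amp the median value $\\median$ \\\\\n"
)
print(rows_to_xml(body))
```

Lines without a trailing \\, such as \centering or \begin{tabular}{cc}, do not match ROW_RE and are skipped.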

2024-07-14

user24714692