Question

Making sense of the regex in grep command

When executing the following command in a bash shell:

grep '^\(.\)*\1$' exp.txt

I expect it to match any line that:

begins and ends with the same letter.

but in reality it matches the following lines:

Ritchie at Bell
Laboratories in the late 1960s. Onee
- first of all my name is ...
TTTTTTTTTTTTTT

I understand why it selected TTTTTTTTTTTTTT since it begins and ends with the same character, but I don't know why it selected the rest.

I know that adding a dot before the asterisk '^\(.\).*\1$' will do what I desire (selecting lines that begin and end with the same character), but I still would like to know why '^\(.\)*\1$' doesn't work as expected.

The full file (exp.txt) is:

The Unix operating system was pioneered
by
Ken
Thompson and Dennis
Ritchie at Bell
Laboratories in the late 1960s. Onee
of the primary
goals in the design of the Unix system was to create an environment that
promoted efficient program
development.
[my name]
.Hi man
HHi man
my mamamamamama my
mohammad
space      spacesm
first char is same as last f
- first of all my name is ...
T
TT
TTTT
TTTTT
TTTTTT
TTTTTTT
TTTTTTTTTTTTTT

3 67 3

1 Jan 1970

Solution

A repeated capturing group, such as (.)*, keeps only what its last iteration captured. That's where the confusion is coming from.

As your group is a wildcard ((.)), it will match anything, and whatever was matched last will remain in \1. A simpler example (in the more common extended regex style):

/(.)*/
aabbcc     # matches everything, but \1 contains `c`

This might be strange at first glance, but it makes sense: the pattern repeats a single-character wildcard, so it matches everything. In the end, whatever was captured last is what remains in the capture group's memory.
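The behaviour described above can be checked in any engine that exposes capture groups. grep itself does not print them, so the following is a minimal sketch using Python's re module as a stand-in (an assumption of this example, not something from the original question); the variable names are illustrative only.

```python
import re

# Python's re module stands in for the regex engine here;
# the pattern is the extended-style (.)* from the example above.
m = re.match(r"(.)*", "aabbcc")

whole_match = m.group(0)   # text consumed by the whole pattern
last_capture = m.group(1)  # only the final iteration of the group survives

print(whole_match)   # aabbcc
print(last_capture)  # c
```

Every iteration of the group overwrites the previous capture, which is why only the trailing c is left in group 1.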

To capture a repeated occurrence, one needs to repeat explicitly:

/(.)\1*/
aabbcc     # matches `aa` and \1 contains `a`

In the example above, (.) matches the first character, and \1* then matches any further repetitions of that same character.
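As a rough illustration of the explicit repeat, here is a small sketch, again assuming Python's re module as the engine (grep cannot show group contents). It also uses re.finditer to list every run of identical characters, which goes slightly beyond the single match shown above.

```python
import re

# (.) captures one character, \1* extends the match over its repetitions.
first = re.match(r"(.)\1*", "aabbcc")
run = first.group(0)          # the whole run of identical characters
run_char = first.group(1)     # the character that was repeated

# Scanning the full string yields one match per run.
all_runs = [found.group(0) for found in re.finditer(r"(.)\1*", "aabbcc")]

print(run, run_char)  # aa a
print(all_runs)       # ['aa', 'bb', 'cc']
```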

Back to your original expression (here expressed in grep's basic notation):

^\(.\)*\1$

Anchored to the start of the line (^), it matches any character (\(.\)) as many times as possible (*), and then requires a repeat of the last captured character (\1) right before the end of the line ($). In plain English:

Match any line whose last two characters are the same.
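To see this on a few lines from exp.txt, here is a hedged sketch in Python. It assumes that Python's re engine treats the extended-syntax pattern ^(.)*\1$ the same way grep treats the basic-syntax original for these inputs; the helper name ends_with_doubled_char is made up for this example.

```python
import re

# Extended-syntax spelling of grep's basic ^\(.\)*\1$ (assumed equivalent here).
PATTERN = re.compile(r"^(.)*\1$")

sample_lines = [
    "Ritchie at Bell",
    "Laboratories in the late 1960s. Onee",
    "- first of all my name is ...",
    "TTTTTTTTTTTTTT",
    "mohammad",
    "first char is same as last f",
]

def ends_with_doubled_char(line):
    """Return True if the pattern matches, i.e. the last two characters are equal."""
    return PATTERN.search(line) is not None

matched = [line for line in sample_lines if ends_with_doubled_char(line)]
for line in matched:
    print(line)
```

The first four sample lines are printed (they end in ll, ee, .. and TT), while "mohammad" and "first char is same as last f" are not, even though the latter begins and ends with f.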

Achieving what you originally wanted (lines starting and ending with the same character) is possible with the regex you already mentioned in your question:

grep '^\(.\).*\1$' filename
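For comparison, the same check can be sketched in Python, assuming its re engine as a stand-in for grep and the extended-syntax spelling ^(.).*\1$; the helper name same_first_and_last is hypothetical.

```python
import re

# Here the group captures exactly once (the first character),
# .* skips the middle, and \1 must match the final character.
PATTERN = re.compile(r"^(.).*\1$")

sample_lines = [
    "first char is same as last f",
    "TTTTTTTTTTTTTT",
    "3 67 3",
    "Ritchie at Bell",
    "- first of all my name is ...",
    "T",
]

def same_first_and_last(line):
    """Return True if the line starts and ends with the same character."""
    return PATTERN.search(line) is not None

matched = [line for line in sample_lines if same_first_and_last(line)]
for line in matched:
    print(line)
```

Only the first three sample lines are printed. Note that the single-character line T is not matched, because the pattern needs one character for the group and another for \1.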

2024-07-21

sidyll