Question
Explanation about awk matching a pattern between a range of lines
I have an Awk one-liner that, while I understand what is being accomplished at a high level, I don't fully understand how Awk is accomplishing the task I give it. I'm asking Awk to give me a range of text only if a certain pattern is met.
Here are the contents of my file.txt:
START
Name: Apple
Kingdom: Plantae
Family: Rosaceae
END
--------------------
START
Name: Cat
Kingdom: Animalia
Family: Felidae
END
--------------------
START
Name: Orange
Kingdom: Plantae
Family: Rutaceae
END
--------------------
START
Name: Dog
Kingdom: Animalia
Family: Canidae
END
--------------------
And here's the one-liner command I'm passing:
awk '/START/ {var=""} {var=var ? var ORS $0 : $0} /END/ {if(var~/Plantae/) print var}' file.txt
This returns, as desired, the range if the condition var~/Plantae/ is met:
$ awk '/START/ {var=""} {var=var ? var ORS $0 : $0} /END/ {if(var~/Plantae/) print var}' file.txt
START
Name: Apple
Kingdom: Plantae
Family: Rosaceae
END
START
Name: Orange
Kingdom: Plantae
Family: Rutaceae
END
$
I know a few things going into this:
- /START/ = Starting flag for my range
- /END/ = Ending flag for my range
- {var=""} = Sets var equal to an empty string
- condition ? Yay : Boo = If the condition is true use the true expression: "Yay", else use the false expression: "Boo"
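To make the ternary concrete, here is a minimal sketch of how Awk evaluates a string used as the condition: an empty string counts as false and a non-empty string counts as true. The function name ternary_demo and the variable v are illustrative only and do not appear in the one-liner.

```shell
# Illustrative helper (hypothetical name): shows which branch of the
# ternary awk picks for an empty versus a non-empty string.
ternary_demo() {
  awk 'BEGIN {
    v = ""
    print (v ? "Yay" : "Boo")   # v is empty, so the false branch is used
    v = "START"
    print (v ? "Yay" : "Boo")   # v is non-empty, so the true branch is used
  }'
}
ternary_demo
```

This prints Boo and then Yay, which is the same test that var ? var ORS $0 : $0 relies on.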
I think I know the fundamentals here:
- I'm using START as my starting flag (flag on) and END as my ending flag (flag off) to set the "range" of lines I want to search between
- If any records within my range contain the pattern Plantae, Awk returns all lines between the flags, including the flags.
- Anything outside of this range of flags is excluded, as are any ranges that do not contain the desired pattern
Very neat, and meets my, admittedly rather niche, scenario of using Awk via CLI instead of more robust and advanced programming languages or methods.
Where I get lost is how exactly this is being accomplished. It assigns var a value built from var itself: as long as var is not empty it uses the expression var ORS $0, otherwise it uses $0. But my understanding would be as follows:
- var is equal to empty at this point as one of the few things done so far is {var=""}
- ORS would be defaulting to the newline character
- $0 is the current record (i.e. line) being read in, and is the only value that has been "set" so far, besides var being set to empty
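One way to see what var holds at each step is to print it after every record. The sketch below is a hedged illustration, not part of the original command: it feeds only the first block of file.txt through the same two rules and adds a third rule that shows var with its embedded newlines replaced by " | ". The names trace_var and shown are made up for this example.

```shell
# Illustrative trace (hypothetical names): prints the record number and the
# current contents of var after each line of the first block.
trace_var() {
  printf '%s\n' 'START' 'Name: Apple' 'Kingdom: Plantae' 'Family: Rosaceae' 'END' |
  awk '
    /START/ { var = "" }                  # only on START lines: empty the buffer
    { var = var ? var ORS $0 : $0 }       # no pattern, so this runs for every record
    { shown = var; gsub(/\n/, " | ", shown); print NR ": " shown }
  '
}
trace_var
```

The first line of output is "1: START" (var was empty, so the false branch var=$0 was taken), and the last is "5: START | Name: Apple | Kingdom: Plantae | Family: Rosaceae | END". Because the middle rule has no pattern, it runs on every record, so the separator lines in file.txt are appended as well and are simply discarded when the next START empties var.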
So, obviously, there are other shenanigans going on behind the scenes and I'm not 100% sure what those may be.
I tried a few things for the final print portion, just to see what was set by the end, and roughly what I concluded was:
- If I print ORS instead of var, I get only newline characters, equal in number to the records with my flags. So if I use my first example var~/Plantae/ I get four newline characters (two START and two END lines, as both Apple and Orange contain the pattern Plantae in their range), but if I only match on var~/Apple/ I only get two newline characters (one START and one END line, as only one range contains Apple)
- If I print $0 I get END, which is either the flag off, or perhaps it's the final record being read, I'm not sure. Those two options might functionally be the same thing (END is being printed here regardless) but perhaps there's an important semantic difference to be noted between the flag off and final record being read? This also returns END a number of times equal to the number of "ranges" that match the if() print statement, as noted by my previous bullet point (so /Apple/ returns END only once whereas /Plantae/ returns END twice)
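The two experiments above can be reproduced on a smaller input. This is a sketch under stated assumptions: the sample function, the build variable and the trimmed two-block input are invented for the example. It counts the bytes produced by print ORS and captures what print $0 emits when only one block matches.

```shell
# Hypothetical two-block sample: only the first block contains Plantae.
sample() {
  printf '%s\n' 'START' 'Name: Apple' 'Kingdom: Plantae' 'END' \
                'START' 'Name: Cat' 'Kingdom: Animalia' 'END'
}

# The first two rules of the one-liner, reused unchanged in both experiments.
build='/START/ {var=""} {var=var ? var ORS $0 : $0}'

# print ORS writes the value of ORS ("\n") and then print appends ORS again.
ors_bytes=$(sample | awk "$build"' /END/ {if(var~/Plantae/) print ORS}' | wc -c | tr -d ' ')

# print $0 writes the record being processed when the /END/ rule fires.
last_record=$(sample | awk "$build"' /END/ {if(var~/Plantae/) print $0}')

echo "bytes from print ORS: $ors_bytes"
echo "print \$0 gives: $last_record"
```

With one matching block this reports 2 bytes and END. That suggests the newlines come in pairs per matching block (the ORS value plus the terminator that print always adds) rather than one per START and END line, and that $0 is simply the line being read at the moment the /END/ rule runs.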
My question boils down to this: how is all of this being set within Awk? How does var end up containing the entire range including the flags, while ORS seems to remain only newline characters, and $0 is only the final flag/last record (again, not sure which this is)?