Question

Explanation about awk matching a pattern between a range of lines

I have an Awk one-liner. I understand what it accomplishes at a high level, but I don't fully understand how Awk accomplishes it. I'm asking Awk to give me a range of text only if that range contains a certain pattern.

Here are the contents I have for my file.txt:

START
        Name: Apple
        Kingdom: Plantae
        Family: Rosaceae
END
--------------------
START
        Name: Cat
        Kingdom: Animalia
        Family: Felidae
END
--------------------
START
        Name: Orange
        Kingdom: Plantae
        Family: Rutaceae
END
--------------------
START
        Name: Dog
        Kingdom: Animalia
        Family: Canidae
END
--------------------

And here's the one-liner command I'm passing:

awk '/START/ {var=""} {var=var ? var ORS $0 : $0} /END/ {if(var~/Plantae/) print var}' file.txt

This returns, as desired, the range if the condition var~/Plantae/ is met:

$ awk '/START/ {var=""} {var=var ? var ORS $0 : $0} /END/ {if(var~/Plantae/) print var}' file.txt
START
        Name: Apple
        Kingdom: Plantae
        Family: Rosaceae
END
START
        Name: Orange
        Kingdom: Plantae
        Family: Rutaceae
END
$

I know a few things going into this:

  • /START/ = Starting flag for my range
  • /END/ = Ending flag for my range
  • {var=""} = Sets var equal to an empty string
  • condition ? Yay : Boo = If the condition is true use the true expression: "Yay", else use the false expression: "Boo"

I think I know the fundamentals here:

  • I'm using START as my starting flag (flag on) and END as my ending flag (flag off) to set the "range" of lines I want to search between
  • If any records within my range contain the pattern Plantae, Awk returns all lines between the flags, including the flags.
  • Anything outside of this range of flags is excluded, as are any ranges that do not contain the desired pattern

Very neat, and meets my, admittedly rather niche, scenario of using Awk via CLI instead of more robust and advanced programming languages or methods.

Where I get lost is how exactly this is being accomplished. It sets var equal to var, then as long as var is not empty uses the expression var ORS $0. But my understanding would be as follows:

  • var is equal to empty at this point as one of the few things done so far is {var=""}
  • ORS would be defaulting to the newline character
  • $0 is the current record (i.e. line) being read in, and is the only value that has been "set" so far, besides var being set to empty

So, obviously, there's other shenanigans going on behind the scene and I'm not 100% on what those may be.

I tried a few things for the final print portion, just to see what was set by the end, and roughly what I concluded was:

  • If I print ORS instead of var, I get only newline characters, equal in number to the records with my flags. So if I use my first example var~/Plantae/ I get four newline characters (two START and two END lines, as both Apple and Orange contain the pattern Plantae in their range), but if I only match on var~/Apple/ I get only two newline characters (one START and one END line, as only one range contains Apple)
  • If I print $0 I get END, which is either the flag off or perhaps the final record being read; I'm not sure. Those two options might functionally be the same thing (END is being printed here regardless), but perhaps there's an important semantic difference between the flag off and the final record being read? This also returns END once per "range" that matches the if() print statement, as noted in my previous bullet point (so /Apple/ returns END only once whereas /Plantae/ returns END twice)

My question would boil down to how is all of this being set within Awk? How is var ending up containing the entire range including the flags, but ORS seems to remain only newline characters, and $0 is only the final flag/last record (again, not sure which this is)?


Solution

 2

"if var is equal to var (which would always be true in this scenario? Maybe?), then use the expression var ORS $0."

You're misreading. = is assignment, not comparison. So var = is assigning to var, it's not comparing var with anything.

It's assigning the result of the ternary expression

var ? var ORS $0 : $0

which means if var is not empty, use var ORS $0, otherwise $0. This makes a newline-separated list of the lines, and the conditional prevents putting an extra newline before the first item in the list.
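
To see what the conditional buys you, here is a small sketch that builds the same list with and without it. The three-line input and the shell variable names with and without are made up for illustration; the square brackets are only there to make a leading newline visible.

```shell
# Three-line sample input (illustrative, not the full file.txt).
input='START
  Name: Apple
END'

# With the conditional: the first line is stored bare, later lines get ORS in front.
with=$(printf '%s\n' "$input" |
    awk '{ var = var ? var ORS $0 : $0 } END { print "[" var "]" }')

# Without the conditional: ORS is prepended even to the first line.
without=$(printf '%s\n' "$input" |
    awk '{ var = var ORS $0 } END { print "[" var "]" }')

printf '%s\n---\n%s\n' "$with" "$without"
```

The second result starts with an empty line right after the opening bracket, which is the extra newline the ternary avoids.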

Perhaps you would have understood it better if it had been written as

var = (var != "") ? var ORS $0 : $0

or more verbosely:

if (var == "") {
    var = $0;
} else {
    var = var ORS $0;
}
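
Putting that verbose form back into the whole program may also help. The following is only a sketch of your one-liner spread over several lines, run on a shortened two-block sample that I made up in the same layout as your file.txt; the behaviour should be the same as the original command.

```shell
# Two-block sample in the same layout as file.txt (shortened for the sketch).
sample='START
        Name: Apple
        Kingdom: Plantae
END
--------------------
START
        Name: Cat
        Kingdom: Animalia
END
--------------------'

out=$(printf '%s\n' "$sample" | awk '
/START/ { var = "" }            # a new block begins: forget the previous one
{                               # no pattern: runs for every line, START and END included
    if (var == "") {
        var = $0                # first line of the block, no separator in front
    } else {
        var = var ORS $0        # later lines: old value, a newline, then this line
    }
}
/END/ {                         # $0 is the END line here; var holds the whole block
    if (var ~ /Plantae/) print var
}')

printf '%s\n' "$out"
```

Each rule is tried in order against every input line, which is why $0 is just the END line by the time the last rule fires, while var has accumulated everything since START.
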
2024-07-09
Barmar

Solution

 1

The crucial detail here is that Awk doesn't use any explicit operator for string concatenation. Thus

"a" "b" "c"

is perfectly equivalent to

"abc"

and

var ORS $0

is the concatenation of var, ORS, and $0.

In other words,

var = var ORS $0

means "append ORS $0 to the end of the current value of var." Or in still other words, in the context of this script, collect the lines in the range into var, separated by ORS.

The ternary operator is used to only assign $0 (no ORS in front) on the first line of the range, when var is still empty.
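
A quick way to convince yourself is to try it in a BEGIN block, where no input is needed. This is a minimal sketch; joined, line, and block are names I picked for the demonstration, with line standing in for $0.

```shell
out=$(awk 'BEGIN {
    joined = "a" "b" "c"          # three string constants side by side
    var = "START"; line = "END"   # hypothetical stand-ins for var and $0
    block = var ORS line          # same shape as var ORS $0
    print joined
    print block
}')
printf '%s\n' "$out"
```

This prints abc on one line, then START and END on separate lines, because ORS defaults to a newline.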

In some other languages, you might use e.g. + for string concatenation, and say something like

# pseudocode from another language; in Awk, + and += are arithmetic, not concatenation
var += (var ? ORS : "") + $0

This is a very common Awk idiom, so a good one to learn and understand.
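
As a footnote to the + example: in Awk that operator always means arithmetic, so using it on strings converts them to numbers first. A small sketch of the difference:

```shell
out=$(awk 'BEGIN {
    print "3" + "4"         # arithmetic: both strings are converted to numbers
    print "3" "4"           # concatenation: the strings are joined
    print "START" + "END"   # non-numeric strings count as 0
}')
printf '%s\n' "$out"
```

The three lines printed are 7, 34, and 0.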

2024-07-10
tripleee

Solution

 1

When you have file.txt like

START
        Name: Apple
        Kingdom: Plantae
        Family: Rosaceae
END
--------------------
START
        Name: Cat
        Kingdom: Animalia
        Family: Felidae
END
--------------------
START
        Name: Orange
        Kingdom: Plantae
        Family: Rutaceae
END
--------------------
START
        Name: Dog
        Kingdom: Animalia
        Family: Canidae
END
--------------------

and you have GNU AWK at your disposal you might do

awk 'BEGIN{RS="\n--------------------\n"}/Plantae/' file.txt

which will output

START
        Name: Apple
        Kingdom: Plantae
        Family: Rosaceae
END
START
        Name: Orange
        Kingdom: Plantae
        Family: Rutaceae
END

Explanation: I inform GNU AWK that the record separator (RS) is a newline (\n) followed by 20 dashes (-) followed by a newline (\n). I filter for records where Plantae is present.

(tested in GNU Awk 5.1.0)
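
To check how the input is being cut into records, something like the following sketch may help. It assumes an Awk that accepts a multi-character RS (GNU Awk does; a strictly POSIX awk may use only the first character), and the shortened sample and the lines array name are my own.

```shell
sample='START
        Name: Apple
END
--------------------
START
        Name: Cat
END
--------------------'

out=$(printf '%s\n' "$sample" | awk '
BEGIN { RS = "\n--------------------\n" }   # multi-character RS: needs GNU Awk or similar
{
    n = split($0, lines, "\n")               # count the lines held in this one record
    print "record " NR ": " n " lines, first=" lines[1] ", last=" lines[n]
}')
printf '%s\n' "$out"
```

Each START to END block arrives as a single record in $0, so a bare /Plantae/ pattern tests the whole block at once.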

2024-07-10
Daweo