Question

gawk hangs when using a regex for RS combined with reading a continuous stream from stdin

I'm streaming data using netcat and piping the output to gawk. Here is an example byte sequence that gawk will receive:

=AAAA;=BBBB;;CCCC==DDDD;

The data can include nearly any arbitrary characters but never contains NUL characters; = and ; are reserved as delimiters. Each chunk of arbitrary characters is always prefixed by one of the delimiters and always suffixed by one of the delimiters, but either delimiter can appear in either position: = is not always the prefix, and ; is not always the suffix. A chunk is never written without an appropriate prefix and suffix. As the data is parsed, I need to distinguish which delimiter was used, so that my downstream code can interpret that information properly.

Since this is a network stream, stdin remains open after this sequence is read, as it waits for future data. I'd want gawk to read until either delimiter is encountered, and then execute the body of my gawk script with whatever data was found, while ensuring that it properly handles the continuous stream of stdin. I explain this in more detail below.

Thus far

Here is what I have attempted thus far (zsh script, using gawk, on macOS). For this post, I simplified the body to just print the data - my full gawk script has a much more complicated body. I also simplified the netcat stream to instead just cat a file (along with cat'ing stdin in order to mimic the stream behavior).

cat example.txt - | gawk '
BEGIN {
    RS = "=|;";
}
{
    if ($0 != "") {
        print $0;
        fflush();
    }
}
'

example.txt

=AAAA;=BBBB;=CCCC;=DDDD;

My attempt successfully handles most of the data, up until the most recent record. It then hangs waiting for more data from stdin and fails to execute the body of my script for the most recent record, despite an appropriate delimiter clearly being available in stdin.

Current output: (fails to process the most-recent record of DDDD)

AAAA
BBBB
CCCC
[hang here, waiting for future data]

Desired output: (successfully processes all records, including the most-recent)

AAAA
BBBB
CCCC
DDDD
[hang here, waiting for future data]

What, exactly, could be the cause of this problem, and how can I potentially address it? I recognize that this seems to be somewhat of an edge-case scenario. Thank you all very much for your help!

Edit: Comment consolidation, misc clarifications, and various observations/realizations

Here are some misc observations I found during debugging, both before and after I originally made this post. These edits also clarify some questions that came up in the comments, and consolidate the info scattered across various comments into a single place. Also includes some realizations I made about how gawk works internally, based on the extremely insightful information in the comments. Info in this edit supersedes any potentially conflicting info that may have been discussed in the comments.

1. I briefly investigated whether this could be a pipe buffering issue imposed by the OS. After using the stdbuf tool to disable all pipe buffering, it seems that buffering is not the problem at all, at least not in the traditional sense (see item #3).
2. I noticed that if stdin is closed and a regex is used for RS, no problems occur. Conversely, if stdin remains open and RS is not a regex (i.e. a plaintext string), no problems occur either. The problem only occurs if stdin remains open and RS is a regex. Thus, we can reasonably assume that it's related to how the regex engine handles a continuous stream on stdin.
3. I noticed that if my regex RS (RS = "=|;";) is 3 characters long and stdin remains open, it stops hanging after exactly 3 additional characters appear in stdin. If I adjust my regex to be 5 characters long (RS = "(=|;)"), the number of additional characters needed to return from hanging adjusts accordingly. Combined with the extremely insightful discussion with Kaz, this establishes that the hanging is an artifact of the regex engine itself. As Kaz said, when the regex engine evaluates RS = "=|;";, it ends up trying to read additional characters from stdin in order to be sure that the regex is a match, even though this additional read is not strictly necessary for the regex in question, and that read is what hangs waiting on stdin. I also tried adding lazy quantifiers to the regex, which in theory would let the regex engine return immediately, but alas it does not, as this is an implementation detail of the regex engine.
4. The gawk docs here and here state that when RS is a single character, it is treated as a plaintext string, and RS is matched without invoking the regex engine. Conversely, if RS has 2 or more characters, it is treated as a regex, and the regex engine will be invoked (subsequently bringing the problem discussed in item #3 into play). However, this seems to be slightly misleading, due to an implementation detail of gawk. I tried RS = "xy"; (and adjusted my data accordingly), and re-ran my experiment from #3. No hanging occurred and the correct output was printed, which must mean that despite RS being 2 characters, it is still being treated as a plaintext string: the regex engine is never invoked, and the hanging problem never occurs. So there seems to be some further filtering on whether RS is treated as plaintext or as a regex.
5. So, now that we've figured out the root cause of the problem, what do we do about it? An obvious idea would be to avoid using a regex, but that points toward writing a custom data parser in C or some other language. This hypothetical custom program would parse the input entirely from scratch, and gawk/regex would never be involved anywhere in the lifecycle of my script. Although I could do this, and it would certainly solve the problem, my full data parsing is somewhat complex, so I'd rather not go down that path.
6. This brings us to Ed Morton's workaround, which is probably the best way to go, or some derivative thereof. Summarizing his approach below:

Basically, use other CLI tools to do an ahead-of-time conversion, before the data is given to gawk, that appends a NUL character after each potential delimiter. Next, invoke gawk with RS set to the NUL character, which makes gawk treat RS as a plaintext string and not a regex, so the hanging problem never comes into play. From there, the real delimiter and data chunk can be decoded and processed in whatever way you want.
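To make that summary concrete, here is a minimal Python sketch of such a conversion stage. It is only an illustration under the assumptions stated in the question (the input never contains NUL bytes, and = and ; are the only delimiters); the names tag_delimiters, write_all and pump are hypothetical and do not come from any of the answers.

```python
import os

DELIMITERS = b"=;"  # the two reserved delimiter bytes from the question


def tag_delimiters(chunk: bytes) -> bytes:
    """Return chunk with a NUL byte appended after every delimiter byte."""
    out = bytearray()
    for byte in chunk:  # iterating over bytes yields ints
        out.append(byte)
        if byte in DELIMITERS:
            out.append(0)
    return bytes(out)


def write_all(fd: int, data: bytes) -> None:
    """Write all of data to fd, retrying after partial writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def pump(read_fd: int, write_fd: int, bufsize: int = 4096) -> None:
    """Copy read_fd to write_fd, tagging delimiters as data arrives."""
    while True:
        # os.read returns as soon as any bytes are available,
        # so nothing is held back waiting for a full buffer.
        chunk = os.read(read_fd, bufsize)
        if not chunk:  # an empty read means end of input
            break
        write_all(write_fd, tag_delimiters(chunk))


# Small in-memory demonstration of the tagging itself.
print(tag_delimiters(b"=AAAA;=BBBB;"))
```

Used as a pipeline stage, the script would call pump(0, 1) instead of the print. With RS set to the NUL character on the gawk side, each non-empty record would then end in the delimiter that closed it, so the last character of $0 says whether it was = or ;.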

Although I have now marked Ed's answer as the solution, I think that my final solution will be a hybrid of Ed's approach, Kaz's insight, some subsequent realizations I made thanks to them, and whatever approach I come up with to add those NUL suffixes. Wish I could mark two answers as solutions! Thank you everyone for your help, especially Ed Morton and Kaz!


Solution

Awk is waiting for the record to be delimited. A record is delimited when one of two things happens: there is a match for the RS regex, or the input ends.

You've not given it either, because you used cat <file> -, which means that cat's output stream continues with standard input (your TTY) after <file> is exhausted.

You must use Ctrl-D on an empty line to generate the necessary EOF condition that Gawk is looking for.

Edit:

The issue is, why does the last record not appear even though it is delimited by the trailing =?

This behavior reproduces exactly in an Awk implementation that I wrote as a macro in a Lisp language, side by side with GNU Awk.

$ (echo -n 'AAAA=AAAA;AAAA;AAAA='; cat) | gawk 'BEGIN { RS = "=|;"; } { print $0; fflush(); }'
AAAA
AAAA
AAAA
# hangs here until Ctrl-D, then:
AAAA

Exactly the same thing:

$ (echo -n 'AAAA=AAAA;AAAA;AAAA='; cat) | txr -e '(awk (:set rs #/=|;/) (t))'
AAAA
AAAA
AAAA
# hangs here until Ctrl-D, then:
AAAA

In the case of the second Awk implementation, since I wrote everything from scratch, including the regex engine, I can explain its behavior, which suggests a hypothesis about why Gawk behaves the same way.

The regex-delimited reading is based on a function written in C called read_until_match which is a wrapper for a helper called scan_until_common. This function works by feeding characters one by one from the stream into a regex state machine, checking the state.

Here is the thing. When the regex state machine says "we have a match!" we cannot stop there. The reason is that we need to find the longest match.

The function does not know that the regex is a trivial one-character regex, for which the first match is already the longest match. Therefore, it needs to feed one more character of the input. At that point, the regex state machine says "fail!". The function then knows that there had been a successful match previously. It backtracks to that point, pushing the extra character back into the stream.

So, of course, if there is no next character available in the stream, we get an I/O blocking hang.

It has to work this way because some regexes successfully match prefixes of the longest match. A trivial example: suppose we have #+ as a delimiter. When one # is seen, that's a match! But when another # is seen, that is also a match! We have to see all the # characters to get the full match, which means we have to see the first non-matching character that follows.
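The lookahead described above can be imitated with a small Python sketch. It is a simplified model, not the actual read_until_match code: it only handles a delimiter of the form "one or more of a single character" (like #+), and the names CountingStream and read_record are made up for this illustration.

```python
class CountingStream:
    """An in-memory character source that counts single-character reads."""

    def __init__(self, data):
        self._chars = list(data)
        self._pos = 0
        self.reads = 0

    def read_char(self):
        """Return the next character, or "" at end of input.

        On a live pipe, this call is where the blocking would happen.
        """
        self.reads += 1
        if self._pos >= len(self._chars):
            return ""
        ch = self._chars[self._pos]
        self._pos += 1
        return ch

    def unread(self):
        """Push the most recently read character back into the stream."""
        self._pos -= 1


def read_record(stream, delim="#"):
    """Read one record terminated by the longest run of delim (like delim+).

    Returns (record, terminator).
    """
    record, run = [], []
    while True:
        ch = stream.read_char()
        if ch == delim:
            run.append(ch)  # matching state, but the run may still grow
        elif run:
            # First non-matching character after a match: the run is over.
            if ch:
                stream.unread()  # give the lookahead character back
            return "".join(record), "".join(run)
        elif ch == "":
            return "".join(record), ""  # input ended without a delimiter
        else:
            record.append(ch)


stream = CountingStream("ab##cd")
print(read_record(stream), stream.reads)
```

In this model the record ab can only be returned after the character c has been read, one character past the delimiter. With input ab# and nothing after it, the fourth read_char call is the one that would block on a stream that stays open.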

GNU Awk cannot easily escape from doing something very similar; the theory calls for it.

A way to solve the problem would be to have a function maxmatchlen(R) which, for a regex R, reports the maximum length of a match for the regex (possibly infinite). maxmatchlen(/.*/) is Inf, but maxmatchlen(/abc/) is 3. You get the picture. With this function, we would know that if we have just fed the regex maxmatchlen characters, and the regex state machine is reporting a matching state, we are done; we don't have to look ahead into the stream.
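As a rough illustration of what such a function could compute, here is a Python sketch over a tiny hand-built regex tree. The tuple representation is purely hypothetical and is not how txr or gawk represent regexes; it only shows that the maximum match length falls out of a simple recursion.

```python
import math

# Hypothetical mini-AST for regexes, built from tuples:
#   ("lit", "abc")        literal text
#   ("any",)              any single character, like .
#   ("cat", r1, r2, ...)  concatenation
#   ("alt", r1, r2, ...)  alternation, like r1|r2
#   ("star", r)           zero or more repetitions, like r*
#   ("plus", r)           one or more repetitions, like r+


def maxmatchlen(node):
    """Return the maximum number of characters the regex can match."""
    kind = node[0]
    if kind == "lit":
        return len(node[1])
    if kind == "any":
        return 1
    if kind == "cat":
        return sum(maxmatchlen(child) for child in node[1:])
    if kind == "alt":
        return max(maxmatchlen(child) for child in node[1:])
    if kind in ("star", "plus"):
        # Repetition is unbounded unless the body can only match "".
        return math.inf if maxmatchlen(node[1]) > 0 else 0
    raise ValueError(f"unknown regex node: {kind!r}")


print(maxmatchlen(("alt", ("lit", "="), ("lit", ";"))))  # the OP's =|;
print(maxmatchlen(("lit", "abc")))
print(maxmatchlen(("star", ("any",))))                   # like .*
```

For the OP's =|; the result is 1, so a scanner that knows this could stop as soon as one delimiter character has matched, without reading ahead.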

2024-07-03

Kaz

Solution

A workaround: insert a shell read loop into the pipeline to carve the original awk input (the OP's actual netcat output) up into individual characters and then feed them to awk one at a time:

cat example.txt - |
while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
awk -v RS='\0' '
    /[;=]/ { if (rec != "") { print rec; fflush() }; rec=""; next }
    { rec=rec $0 }
'
AAAA
AAAA
AAAA
AAAA

That requires GNU awk, or another awk that can handle a NUL character as the RS, since that's non-POSIX behavior. It does assume your input can't contain NUL bytes, i.e. it's a valid POSIX text "file".

Read on for how we got there if interested...

I thought there was at least 1 bug here as I found multiple oddities (see below) so I opened a gawk bug report at https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00006.html but per the gawk provider, Arnold, the differences in behavior in this case are just implementation details of having to read ahead to ensure the regexp matches the right string.

It seems there are 3 issues at play here, e.g. using GNU awk 5.3.0 on cygwin:

Different supposedly equivalent regexps produce different behavior:

$ printf 'A;B;C;\n' > file

$ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
1 A

$ cat file - | awk -v RS=';|=' '{print NR, $0}'
1 A
2 B

$ cat file - | awk -v RS='[;=]' '{print NR, $0}'
1 A
2 B
3 C

(;|=), ;|= and [;=] should be equivalent but clearly they aren't in this case.

The good news is you can apparently work around that problem using a bracket expression as in the 3rd case above instead of an "or".

The output record trails the input record when the record separator character is the last one in the input, e.g. with no newline after the last ;:

$ printf 'A;B;C;' > file

$ cat file - | awk -v RS='(;|=)' '{print $0; fflush()}'

$ cat file - | awk -v RS=';|=' '{print $0; fflush()}'
A

$ cat file - | awk -v RS='[;=]' '{print $0; fflush()}'
A
B

The bad news is that this affects the OP's example:

$ printf ';AAAA;BBBB;CCCC;DDDD;' > file

With a literal character RS:

$ cat file - | awk -v RS=';' '{print $0; fflush()}'

AAAA
BBBB
CCCC
DDDD

With a regexp RS that should also make that char literal:

$ cat file - | awk -v RS='[;]' '{print $0; fflush()}'

AAAA
BBBB
CCCC

$ printf ';AAAA;BBBB;CCCC;DDDD;x' > file

$ cat file - | awk -v RS='[;]' '{print $0; fflush()}'

AAAA
BBBB
CCCC
DDDD

Adding different characters to the RS bracket expression produces inconsistent behavior (I stumbled across this by accident):

$ printf 'A;B;C;\n' > file

$ cat file - | awk -v RS='[;|=]' '{print $0; fflush()}'
A

$ cat file - | awk -v RS='[;a=]' '{print $0; fflush()}'
A
B
C

FWIW I tried setting a timeout:

$ cat file - | awk -v RS='[;]' 'BEGIN{PROCINFO["-", "READ_TIMEOUT"]=100} {print $0; fflush()}'
A
B
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: error reading input file `-': Connection timed out

$ cat file - | awk -v RS='[;]' -v GAWK_READ_TIMEOUT=1 '{print $0; fflush()}'
A
B

and stdbuf to disable buffering:

$ cat file - | stdbuf -i0 -o0 -e0 awk -v RS='[;]' '{print $0; fflush()}'
A
B

and matching every character (thinking I could then use RT ~ /[=;]/ to find the separator):

$ cat file - | awk -v RS='(.)' '{print RT; fflush()}'
A
;
B
;
C

but none of them would let me read the last record separator so at this point I don't know what the OP could do to successfully read the last record of continuing input using a regexp other than something like this:

$ printf 'A;B;C;' > file

$ cat file - |
    while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
    awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'
A
B
C

and using the OP's sample input but with different text per record to make the mapping of input to output records clearer:

$ printf '=AAAA=BBBB;CCCC;DDDD=' > example.txt

$ cat example.txt - |
    while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
    awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'

AAAA
BBBB
CCCC
DDDD

We're using NUL chars as the delimiters and various options above to make the shell read loop robust enough to handle blank lines and other white space in the input, see https://unix.stackexchange.com/a/49585/133219 and https://unix.stackexchange.com/a/169765/133219 for details on those issues. We're additionally using a NUL char for the awk RS so it can distinguish between newlines coming from the original input vs a newline as a terminating character being added by the shell printf, otherwise rec in the awk script could never contain a newline as they'd ALL be consumed by matching the default RS.

We're using a pipe to/from the while-read loop instead of process substitution just for clarity, since the OP is already using pipes.
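For comparison, the same character-at-a-time idea can be sketched in Python. This is not part of the workaround above, just a hedged illustration of why reading single characters avoids the hang: a one-character delimiter needs no lookahead, so each record can be handed over the moment its closing delimiter arrives. The name split_records is hypothetical, and like the awk script it skips empty records.

```python
import io

DELIMITERS = b"=;"  # the two reserved delimiter bytes from the question


def split_records(stream):
    """Yield (record, delimiter) pairs, reading one byte at a time.

    stream is any binary file object; read(1) should return as soon as
    a single byte is available for this to work on a live pipe.
    """
    record = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:  # end of input; an unterminated tail is dropped
            return
        if byte in DELIMITERS:
            if record:  # skip empty records between adjacent delimiters
                yield bytes(record), byte
            record.clear()
        else:
            record.extend(byte)


# In-memory demonstration using the sample input from above.
for rec, delim in split_records(io.BytesIO(b"=AAAA=BBBB;CCCC;DDDD=")):
    print(rec.decode(), delim.decode())
```

On a real stream, the file object could be an unbuffered one such as open(0, "rb", buffering=0), where read(1) asks the OS for a single byte. Each pair also carries the delimiter that closed the record, which is the information the OP said the downstream code needs.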

2024-07-03

Ed Morton

Solution

A combination of the solutions of @daweo and @EdMorton:
The OP wants logic that distinguishes between the two delimiters, and might want to use RT for it.
First, use Ed's workaround for reading the input one character at a time.
When a = is found, add a ; after it as a delimiter.
In awk, fix RT when the = is part of the record.

I will print the RT after printing $0.

cat example.txt - | 
while IFS= read -r -d '' -N1 char; do
  if [[ "$char" == '=' ]]; then
    printf "=;"
  else
    printf '%s' "$char"
  fi
done  | awk '
  BEGIN {
    RS = ";"
  }
  /=/ {
        RT="=";
        sub(/=/,"", $0) 
  }
  {
    if ($0 != "") {
        print $0 "(RT=" RT ")";
        fflush();
    }
  }
'

Result:

AAAA(RT==)
AAAA(RT=;)
AAAA(RT=;)
AAAA(RT==)

2024-07-03

Walter A

Solution

Multiple Line (The GNU Awk User's Guide) says that

RS == any single character

Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records. (...)

RS == regexp

Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records.(...)

Observe that "Leading and trailing" is mentioned only for the latter, so I suspect the source of the trouble might be how this is implemented in GNU AWK.

If you do not need to distinguish between = and ;, I propose the following workaround

cat -u example.txt - | sed -u 'y/;/=/' | gawk '
BEGIN {
    RS = "=";
}
{
    if ($0 != "") {
        print $0;
        fflush();
    }
}
'

which for example.txt content being

=AAAA=AAAA;AAAA;AAAA=

gives output

AAAA
AAAA
AAAA
AAAA

and then hangs, waiting for more input. Explanation: I added GNU sed running in unbuffered mode (-u) with a single y command, which does

Transliterate any characters in the pattern space which match any of the source-chars with the corresponding character in dest-chars.

Here it replaces every ; with =. I then changed RS in the gawk command to the single-character string =.

(tested in GNU sed 4.8 and GNU Awk 5.1.0)
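For readers without GNU sed, the transliteration step itself is easy to picture in Python. This is only a sketch of the same mapping, not a tested drop-in replacement for the sed stage; the name unify_delimiters is hypothetical.

```python
# Translation table that maps ; to = and leaves every other byte alone,
# mirroring what the sed command y/;/=/ does.
TABLE = bytes.maketrans(b";", b"=")


def unify_delimiters(chunk: bytes) -> bytes:
    """Return chunk with every ; replaced by =."""
    return chunk.translate(TABLE)


print(unify_delimiters(b"=AAAA=AAAA;AAAA;AAAA="))
```

As with the sed version, the information about which delimiter was used is lost, so this only fits the case where = and ; do not need to be told apart.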

2024-07-03

Daweo

Solution

A solution which doesn't require changing the awk script: since empty records are ignored by it, we can simply duplicate each record separator in a pipe stage inserted before gawk, e.g.

python -c '
import os
for i in iter(lambda: os.read(0, 1), b""):
    os.write(1, i)
    if i in b"=;": os.write(1, i)
' |

2024-07-04

Armali

Solution

One of the gawk providers, Andy Schorr, was unable to create a Stackoverflow account for some reason so he asked me to post his suggestion for him (see https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00012.html for the original source):

From Andy:

Have you considered trying to use the select extension and its nonblocking feature?

Something like this sort of seems to work:

(echo "A;B;C;D;"; cat -) | gawk -v 'RS=[;=]' -lselect -ltime '
BEGIN {
   fd = input_fd("")
   set_non_blocking(fd)
   PROCINFO[FILENAME, "RETRY"] = 1
   while (1) {
      delete readfds
      readfds[fd] = ""
      select(readfds, writefds, exceptfds)
      while ((rc = getline x) > 0) {
         if (rc > 0)
            printf "%d [%s]\n", ++n, x
         else if (rc != 2) {
            print "Error: non-retry error"
            exit 1
         }
      }
   }
}'

2024-07-05

Ed Morton

Solution

I couldn't replicate it at all with any awk variant I have:

The outputs for gawk -c and gawk -P look out of place by design
None of them triggered the timeout

 for __ in 'mawk1' 'mawk2' 'nawk' 'gawk -e'   'gawk -be' \
           'gawk -ce' 'gawk -Pe'  'gawk -Mbe' 'gawk -nbe'; do

     ( time ( timeout --foreground 10 

       echo '=AAAA;=BBBB;=CCCC;=DDDD;' | $( printf '%s' "$__" ) '

       BEGIN {  RS = "[\n=;]+"
               OFS = "\3"     
           } { 
                print NR, FNR, NR, length(), 
                       "$0 := \""($0)"\"",
                       "$1 := \""($1)"\"", 
                      "$NF := \""($NF)"\"" }' ) | gcat - ) | 

     column -s$'\3' -t

     echo "\f\t$__ done ...\n"
 done

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; ) 
        0.00s user 0.01s system 110% cpu 0.011 total
gcat -  0.00s user 0.00s system 39% cpu 0.010 total
1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"

    mawk1 done ...
 
( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; ) 
        0.00s user 0.01s system 127% cpu 0.008 total
gcat -  0.00s user 0.00s system  38% cpu 0.007 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
 
    mawk2 done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.01s system 112% cpu 0.007 total
gcat -  0.00s user 0.00s system  31% cpu 0.006 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    nawk done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.01s system 61% cpu 0.018 total
gcat -  0.00s user 0.00s system 10% cpu 0.017 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -e done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 106% cpu 0.008 total
gcat -  0.00s user 0.00s system  21% cpu 0.008 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -be done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 104% cpu 0.008 total
gcat -  0.00s user 0.00s system  19% cpu 0.007 total

1  1                                 1        
                            25  $0 := "=AAAA;=BBBB;=CCCC;=DDDD;"  
                                $1 := "=AAAA;=BBBB;=CCCC;=DDDD;" 
                               $NF := "=AAAA;=BBBB;=CCCC;=DDDD;"
 
    gawk -ce done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 108% cpu 0.007 total
gcat -  0.00s user 0.00s system  21% cpu 0.007 total

1  1                                 1        
                            25  $0 := "=AAAA;=BBBB;=CCCC;=DDDD;"  
                                $1 := "=AAAA;=BBBB;=CCCC;=DDDD;" 
                               $NF := "=AAAA;=BBBB;=CCCC;=DDDD;"
    gawk -Pe done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 79% cpu 0.011 total
gcat -  0.00s user 0.00s system 13% cpu 0.010 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
 
    gawk -Mbe done ...    
 
( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 108% cpu 0.007 total
gcat -  0.00s user 0.00s system  23% cpu 0.007 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -nbe done ...

2024-07-05

RARE Kpop Manifesto