Question
gawk hangs when using a regex for RS combined with reading a continuous stream from stdin
I'm streaming data using netcat and piping the output to gawk. Here is an example byte sequence that gawk will receive:
=AAAA;=BBBB;;CCCC==DDDD;
The data includes nearly any arbitrary characters, but never contains NULL characters, where =
and ;
are reserved to be delimiters. As chunks of arbitrary characters are written, each chunk will always be prefixed by one of the delimiters, and always be suffixed by one of the delimiters, but either delimiter can be used at any time: =
is not always the prefix, and ;
is not always the suffix. It will never write a chunk without also writing an appropriate prefix and suffix. As the data is parsed, I need to distuingish between which delimiter was used, so that my downstream code can properly interpret that information.
Since this is a network stream, stdin remains open after this sequence is read, as it waits for future data. I'd want gawk to read until either delimiter is encountered, and then execute the body of my gawk script with whatever data was found, while ensuring that it properly handles the continuous stream of stdin. I explain this in more detail below.
Thus far
Here is what I have attempted thus far (zsh script, using gawk, on macOS). For this post, I simplified the body to just print the data - my full gawk script has a much more complicated body. I also simplified the netcat stream to instead just cat
a file (along with cat
'ing stdin in order to mimic the stream behavior).
cat example.txt - | gawk '
BEGIN {
RS = "=|;";
}
{
if ($0 != "") {
print $0;
fflush();
}
}
'
example.txt
=AAAA;=BBBB;=CCCC;=DDDD;
My attempt successfully handles most of the data......up until the most-recent record. It hangs waiting for more data from stdin, and fails to execute the body of my script for the most-recent record, despite an appropriate delimiter clearly being available in stdin.
Current output: (fails to process the most-recent record of DDDD
)
AAAA
BBBB
CCCC
[hang here, waiting for future data]
Desired output: (successfully processes all records, including the most-recent)
AAAA
BBBB
CCCC
DDDD
[hang here, waiting for future data]
What, exactly, could be the cause of this problem, and how can I potentially address it? I recognize that this seems to be somewhat of an edge-case scenario. Thank you all very much for your help!
Edit: Comment consolidation, misc clarifications, and various observations/realizations
Here are some misc observations I found during debugging, both before and after I originally made this post. These edits also clarify some questions that came up in the comments, and consolidate the info scattered across various comments into a single place. Also includes some realizations I made about how gawk works internally, based on the extremely insightful information in the comments. Info in this edit supersedes any potentially conflicting info that may have been discussed in the comments.
I briefly investigated whether this could be a pipe buffering issue imposed by the OS. After messing with the
stdbuf
tool to disable all pipe buffering, it seems that buffering is not the problem at all, at least not in the traditional sense (see item #3).I noticed that if stdin is closed and a regex is used for RS, no problems occur. Conversely, if stdin remains open and RS is not a regex (i.e. a plaintext string), no problems occur either. The problem only occurs if both stdin remains open and RS is a regex. Thus, we can reasonably assume that it's something related to how regex handles having a continuous stream of stdin.
I noticed that if my RS with regex (
RS = "=|;";
) is 3 characters long...and stdin remains open...it stops hanging after exactly 3 additional characters appear in stdin. If I adjust the length of my regex to be 5 chars (RS = "(=|;)"
), the amount of additional characters necessary to return from hanging adjusts accordingly. Combined with the extremely insightful discussion with Kaz, this establishes that the hanging is an artifact of the regex engine itself. Like Kaz said, when the regex engine parsesRS = "=|;";
, it ends up trying to read additional characters from stdin in order to be sure that the regex is a match, despite this additional read not being strictly necessary for the regex in question, which obviously causes a hang waiting on stdin. I also tried adding lazy quantifiers to the regex, which in theory means the regex engine can return immediately, but alas it does not, as this is an implementation detail of the regex engine.The gawk docs here and here state that when RS is a single character, it is treated as a plaintext string, and causes RS to match without invoking the regex engine. Conversely, if RS has 2 or more characters, it is treated as a regex, and the regex engine will be invoked (subsequently bringing the problem discussed in item #3 into play). However, this seems to be slightly misleading, which is an implementation detail of gawk. I tried
RS = "xy";
(and adjusted my data accordingly), and re-tested my experiment from #3. No hanging occurred and the correct output was printed, which must mean that despite RS being 2 characters, it is still being treated as a plaintext string - the regex engine is never invoked, and the hanging problem never occurs. So, there seems to be some further filtering on whether RS is treated as plaintext or as a regex.So....now that we've figured out the root cause of the problem....what do we do about it? An obvious idea would be to avoid using regex....but that points toward writing a custom data parser in C or some other language. This hypothetical custom program would parse the input entirely from scratch, and gawk/regex would never be involved anywhere in the lifecycle of my script. Although I could do this, and this would certainly solve the problem, the extent of my full data parsing is somewhat complex, so I'd rather not go down this path of weeds.
This brings us to Ed Morton's workaround, which is probably the best way to go, or some derivative thereof. Summarizing his approach below:
Basically, use other CLI tools to do an ahead-of-time conversion, before data is given to gawk, to add a suffixed NULL character after each potential delimiter. Next, invoke gawk with RS as the NULL character, which would treat RS as a plaintext string and not a regex, which means the hanging problem never comes into play. From there, the real delimiter and data chunk could be decoded and processed in whatever way you want.
Although I have now marked Ed's answer as the solution, I think that my final solution will be a hybrid of Ed's approach, Kaz's insight, some subsequent realizations I made thanks to them, and some arbitrary approach that I can come up with in order to add those suffixed NULL characters. Wish I could mark two answers as solutions! Thank you everyone for your help, especially Ed Morton and Kaz!