Question

Using callable_iterator (re.finditer) causes Python to freeze

I have a function that is called for every line of a text.

def tokenize_line(line: str, cmd = ''):
    matches = re.finditer(Patterns.SUPPORTED_TOKENS, line)
    tokens_found, not_found, start_idx = [], [], 0
    print(matches)
    for match in matches:
        pass
        # Rest of code

The result of print(matches) is something like: <callable_iterator object at 0x0000021201445000>

However, when I convert the iterator into a list:

matches = list(re.finditer(Patterns.SUPPORTED_TOKENS, line))

or when I iterate with for:

for match in matches:
   print(match)

...Python freezes.

This issue occurs inconsistently. For example:

tokenize_line('$color AS $length') # Works fine
tokenize_line('FALSE + $length IS GT 7 + $length IS 4') # Freezes

So, the problem arises when converting the callable_iterator into a list or iterating over it.

Here is the pattern (Patterns.SUPPORTED_TOKENS) I'm using:

(°p\d+°|°a\d+°|°m\d+°)|((?<!\S)(?:!\'(?:\\.|[^\'\n\\])*\'|!"(?:\\.|[^\n"\\])*")(?!\S))|((?:\'(?:\\.|[^\'\n\\])*\'|"(?:\\.|[^\n"\\])*"))|((\{(.*)\}))|((?<!\S)([@$][\w]*(?:\.[\w]*)*)(?!\S))|((?<!\d)-?\d*\.?\d+)|(\*\*|[\+\-\*\(\)/%\^]|==|&&|\|\||!=|>=|<=|>|<|~~|!~~|::|!::)|([\:/])|(\b(?:AS|AND|AT|:|BETWEEN|BY|FROM|IN|INTO|ON|OF|OR|THAN|TO|USING|WITH)\b)|(\b[a-zA-Z_][a-zA-Z0-9_]* *((?:[^;()\'""]*|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|\([^)]*\))*?;))|((\b(?:EMPTY|STRING|NUMBER|BOOL|ARRAY|MAP|TRUE|FALSE|NULL|UNKNOWN|DOTALL|IGNORECASE|MULTILINE|ARRAY_ARRAY|ARRAY_STRING|ARRAY_MAP|ARRAY_NUMBER|ARRAY_NULL|DOT|SPACE|NEWLINE|SEMICOLON|COLON|HASH|COMMA|TAB)\b)|(\b(?:IS NOT LT|IS NOT GT|IS NOT GEQ|IS NOT LEQ|IS NOT|IS LT|IS GT|IS GEQ|IS LEQ|IS|NOT IN|NOT|IN|HAS NOT|HAS|AND|OR)\b))

Explanation of the Regular Expression Pattern:

Custom Tokens: Matches specific custom tokens that start with particular characters and are followed by digits.

Quoted Strings: Matches both single and double-quoted strings, including those with escape characters. Curly Braces Content: Matches anything enclosed in curly braces.

Variables: Matches variables that start with specific characters (like @ or $) and can include dots for nested properties.

Numbers: Matches both integers and floating-point numbers, including negative numbers.

Operators: Matches various mathematical and logical operators. Colons and Slashes: Matches specific punctuation characters like colons and slashes.

Keywords: Matches certain keywords that are reserved in the language. Function Definitions: Matches function definitions or similar structures, ensuring they follow specific syntax rules.

Data Types and Modifiers: Matches keywords that represent data types or modifiers.

Logical Operators: Matches complex logical operators used in conditional expressions.

Example:

import re

SUPPORTED_TOKENS = r'(°p\d+°|°a\d+°|°m\d+°)|((?<!\S)(?:!\'(?:\\.|[^\'\n\\])*\'|!"(?:\\.|[^\n"\\])*")(?!\S))|((?:\'(?:\\.|[^\'\n\\])*\'|"(?:\\.|[^\n"\\])*"))|((\{(.*)\}))|((?<!\S)([@$][\w]*(?:\.[\w]*)*)(?!\S))|((?<!\d)-?\d*\.?\d+)|(\*\*|[\+\-\*\(\)/%\^]|==|&&|\|\||!=|>=|<=|>|<|~~|!~~|::|!::)|([\:/])|(\b(?:AS|AND|AT|:|BETWEEN|BY|FROM|IN|INTO|ON|OF|OR|THAN|TO|USING|WITH)\b)|(\b[a-zA-Z_][a-zA-Z0-9_]* *((?:[^;()\'""]*|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|\([^)]*\))*?;))|((\b(?:EMPTY|STRING|NUMBER|BOOL|ARRAY|MAP|TRUE|FALSE|NULL|UNKNOWN|DOTALL|IGNORECASE|MULTILINE|ARRAY_ARRAY|ARRAY_STRING|ARRAY_MAP|ARRAY_NUMBER|ARRAY_NULL|DOT|SPACE|NEWLINE|SEMICOLON|COLON|HASH|COMMA|TAB)\b)|(\b(?:IS NOT LT|IS NOT GT|IS NOT GEQ|IS NOT LEQ|IS NOT|IS LT|IS GT|IS GEQ|IS LEQ|IS|NOT IN|NOT|IN|HAS NOT|HAS|AND|OR)\b))'

def tokenize_line(line: str, cmd = ''):
    if not line:
        return [], []
    
    matches = list(re.finditer(SUPPORTED_TOKENS, line))
    print(list)
    
    

lines = [
    '$color AS $length',
    'EMPTY + $length IS GT 7 + $length IS 4'
]


for x in lines:
    tokenize_line(x)

Any help to understand why this happens and how to fix it would be greatly appreciated!

 2  78  2
1 Jan 1970

Solution

 5

As pointed out in the comments, the hang is caused by a catastrophic backtracking, as you're wrapping the pattern [^;()\'""]*, which can match the entirety of your input or any part of it, in a group that can repeat zero to many times, followed by a ;, which is not matching your input. Upon failure, the regex engine backtracks a character, still matching [^;()\'""]*, but with the outer * it repeats the inner pattern to match the leftover character, only to fail again with the following ;. It would backtrack again with now 2 leftover characters to allow the inner pattern to match them both or one at a time.

In other words, the backtracking goes on until the regex engine exhausts all possible ways to partition the input, amounting to a rapidly out-of-hand 2 ^ (<number of characters> + 1) number of combinations because every gap position in the input can be either a partition or not.

In this case the catastrophic backtracking can be fixed by removing the inner quantifier * from (?:[^;()\'""]*|...)*?; because the outer quantifier *? already allows a zero-to-many repetition of the inner pattern. That is, change:

(?:[^;()\'""]*|...)*?;

to:

(?:[^;()\'""]|...)*?;

Demo: https://regex101.com/r/MszbVf/1

If you're using Python 3.11 or later, you can also avoid catastrophic backtracking by making the group an atomic group so that the regex engine would not attempt to backtrack once the atomic group itself successfully matches, even if the match then fails with the following ;. That is, change:

(?:[^;()\'""]*|...)*?;

to:

(?>[^;()\'""]*|...)*?;

Demo: https://regex101.com/r/MszbVf/2

2024-07-22
blhsing