Question
Using callable_iterator (re.finditer) causes Python to freeze
I have a function that is called for every line of a text.
def tokenize_line(line: str, cmd = ''):
matches = re.finditer(Patterns.SUPPORTED_TOKENS, line)
tokens_found, not_found, start_idx = [], [], 0
print(matches)
for match in matches:
pass
# Rest of code
The result of print(matches)
is something like: <callable_iterator object at 0x0000021201445000>
However, when I convert the iterator into a list:
matches = list(re.finditer(Patterns.SUPPORTED_TOKENS, line))
or when I iterate with for
:
for match in matches:
print(match)
...Python freezes.
This issue occurs inconsistently. For example:
tokenize_line('$color AS $length') # Works fine
tokenize_line('FALSE + $length IS GT 7 + $length IS 4') # Freezes
So, the problem arises when converting the callable_iterator into a list or iterating over it.
Here is the pattern (Patterns.SUPPORTED_TOKENS) I'm using:
(°p\d+°|°a\d+°|°m\d+°)|((?<!\S)(?:!\'(?:\\.|[^\'\n\\])*\'|!"(?:\\.|[^\n"\\])*")(?!\S))|((?:\'(?:\\.|[^\'\n\\])*\'|"(?:\\.|[^\n"\\])*"))|((\{(.*)\}))|((?<!\S)([@$][\w]*(?:\.[\w]*)*)(?!\S))|((?<!\d)-?\d*\.?\d+)|(\*\*|[\+\-\*\(\)/%\^]|==|&&|\|\||!=|>=|<=|>|<|~~|!~~|::|!::)|([\:/])|(\b(?:AS|AND|AT|:|BETWEEN|BY|FROM|IN|INTO|ON|OF|OR|THAN|TO|USING|WITH)\b)|(\b[a-zA-Z_][a-zA-Z0-9_]* *((?:[^;()\'""]*|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|\([^)]*\))*?;))|((\b(?:EMPTY|STRING|NUMBER|BOOL|ARRAY|MAP|TRUE|FALSE|NULL|UNKNOWN|DOTALL|IGNORECASE|MULTILINE|ARRAY_ARRAY|ARRAY_STRING|ARRAY_MAP|ARRAY_NUMBER|ARRAY_NULL|DOT|SPACE|NEWLINE|SEMICOLON|COLON|HASH|COMMA|TAB)\b)|(\b(?:IS NOT LT|IS NOT GT|IS NOT GEQ|IS NOT LEQ|IS NOT|IS LT|IS GT|IS GEQ|IS LEQ|IS|NOT IN|NOT|IN|HAS NOT|HAS|AND|OR)\b))
Explanation of the Regular Expression Pattern:
Custom Tokens: Matches specific custom tokens that start with particular characters and are followed by digits.
Quoted Strings: Matches both single and double-quoted strings, including those with escape characters. Curly Braces Content: Matches anything enclosed in curly braces.
Variables: Matches variables that start with specific characters (like @ or $) and can include dots for nested properties.
Numbers: Matches both integers and floating-point numbers, including negative numbers.
Operators: Matches various mathematical and logical operators. Colons and Slashes: Matches specific punctuation characters like colons and slashes.
Keywords: Matches certain keywords that are reserved in the language. Function Definitions: Matches function definitions or similar structures, ensuring they follow specific syntax rules.
Data Types and Modifiers: Matches keywords that represent data types or modifiers.
Logical Operators: Matches complex logical operators used in conditional expressions.
Example:
import re
SUPPORTED_TOKENS = r'(°p\d+°|°a\d+°|°m\d+°)|((?<!\S)(?:!\'(?:\\.|[^\'\n\\])*\'|!"(?:\\.|[^\n"\\])*")(?!\S))|((?:\'(?:\\.|[^\'\n\\])*\'|"(?:\\.|[^\n"\\])*"))|((\{(.*)\}))|((?<!\S)([@$][\w]*(?:\.[\w]*)*)(?!\S))|((?<!\d)-?\d*\.?\d+)|(\*\*|[\+\-\*\(\)/%\^]|==|&&|\|\||!=|>=|<=|>|<|~~|!~~|::|!::)|([\:/])|(\b(?:AS|AND|AT|:|BETWEEN|BY|FROM|IN|INTO|ON|OF|OR|THAN|TO|USING|WITH)\b)|(\b[a-zA-Z_][a-zA-Z0-9_]* *((?:[^;()\'""]*|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|\([^)]*\))*?;))|((\b(?:EMPTY|STRING|NUMBER|BOOL|ARRAY|MAP|TRUE|FALSE|NULL|UNKNOWN|DOTALL|IGNORECASE|MULTILINE|ARRAY_ARRAY|ARRAY_STRING|ARRAY_MAP|ARRAY_NUMBER|ARRAY_NULL|DOT|SPACE|NEWLINE|SEMICOLON|COLON|HASH|COMMA|TAB)\b)|(\b(?:IS NOT LT|IS NOT GT|IS NOT GEQ|IS NOT LEQ|IS NOT|IS LT|IS GT|IS GEQ|IS LEQ|IS|NOT IN|NOT|IN|HAS NOT|HAS|AND|OR)\b))'
def tokenize_line(line: str, cmd = ''):
if not line:
return [], []
matches = list(re.finditer(SUPPORTED_TOKENS, line))
print(list)
lines = [
'$color AS $length',
'EMPTY + $length IS GT 7 + $length IS 4'
]
for x in lines:
tokenize_line(x)
Any help to understand why this happens and how to fix it would be greatly appreciated!