python - Why is this regex experiencing catastrophic backtracking? -
i've read the articles , other questions on catastrophic backtracking in regular expressions , how can caused nested +
, *
quantifiers. however, regexes still encountering catastrophic backtracking without nested quantifiers. can me understand why?
i wrote these regexes search a specific type of rhyme in lines of welsh poetry. rhyme consists of consonants in beginning of line being repeated @ end, , there must space between beginning , end consonants. i've removed vowels, there 2 exceptions make these regexes ugly. first, there allowed consonants in middle don't repeat, , if there any, it's different type of rhyme. second, letters m, n, r, h, , v allowed interrupt rhyme (appear in beginning not in end or vice versa), can't ignored because rhyme consists of letters.
my script automatically builds regex each line , tests it. works rest of time, 1 line giving catastrophic backtracking. line's text without vowels is:
nn frvvn frv v
the regex automatically finds nn frvvn
rhymes frv v
, tries again last letter (the n
in frvvn
) required in back. if it's not required, rhyme can shortened. here's regex:
^(?p<s_letters> # starting letters [mnrhv]*?\s*n{0,2} # number of optional letters or number # of spaces can come between rhyming letters [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*f{1,2} [mnrhv]*?\s*[rr]?(?:\s*[rr])? # r can rhyme r, that's # not relevant here (i think) [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?) (?p<m_letters> # middle letters [^\s]*?(?p<caesura>\s) # caesura (end of rhyme) # first space after rhyme .*) # end letters come late possible (?p<e_letters> # end group [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*f{1,2} [mnrhv]*?\s*[rr]?(?:\s*[rr])? [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?)$
even though doesn't have nested quantifiers, still takes forever run. regexes other lines generated in same way run quickly. why this?
i'm not seeing nested quantifiers, seeing lot of ambiguities cause high-exponent polynomial runtime. example, consider part of regex:
[mnrhv]*?\s*[rr]?(?:\s*[rr])? # r can rhyme r, that's # not relevant here (i think) [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?) (?p<m_letters> # middle letters [^\s]*?(?p<caesura>\s) # caesura (end of rhyme)
suppose regex engine @ point, , text it's seeing huge block of n
s. n
s can divided between following parts of regex:
[mnrhv]*?\s*[rr]?(?:\s*[rr])? ^^^^^^^^^ [mnrhv]*?\s*v{0,2} ^^^^^^^^^ [mnrhv]*?\s*v{0,2} ^^^^^^^^^ [mnrhv]*?\s*n{1,2} ^^^^^^^^^ ^^^^^^ [mnrhv\s]*?) ^^^^^^^^^^^ (?p<m_letters> [^\s]*?(?p<caesura>\s) ^^^^^^^
if number of n
s n
, there o(n**6)
ways divide n
s, since there 6 *?
blocks match n
here, , in between optional or matches n
.
are \s
parts mandatory? if so, might able improve runtime putting +
instead of *
on them.
Comments
Post a Comment