python - Why is this regex experiencing catastrophic backtracking? -

March 15, 2011

i've read the articles , other questions on catastrophic backtracking in regular expressions , how can caused nested + , * quantifiers. however, regexes still encountering catastrophic backtracking without nested quantifiers. can me understand why?

i wrote these regexes search a specific type of rhyme in lines of welsh poetry. rhyme consists of consonants in beginning of line being repeated @ end, , there must space between beginning , end consonants. i've removed vowels, there 2 exceptions make these regexes ugly. first, there allowed consonants in middle don't repeat, , if there any, it's different type of rhyme. second, letters m, n, r, h, , v allowed interrupt rhyme (appear in beginning not in end or vice versa), can't ignored because rhyme consists of letters.

my script automatically builds regex each line , tests it. works rest of time, 1 line giving catastrophic backtracking. line's text without vowels is:

nn  frvvn  frv v

the regex automatically finds nn frvvn rhymes frv v, tries again last letter (the n in frvvn) required in back. if it's not required, rhyme can shortened. here's regex:

^(?p<s_letters>         # starting letters [mnrhv]*?\s*n{0,2}      # number of optional letters or number                         # of spaces can come between rhyming letters [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*f{1,2} [mnrhv]*?\s*[rr]?(?:\s*[rr])? # r can rhyme r, that's                               # not relevant here (i think) [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?) (?p<m_letters>          # middle letters [^\s]*?(?p<caesura>\s)  # caesura (end of rhyme)                         # first space after rhyme      .*)                     # end letters come late possible (?p<e_letters>          # end group [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*n{0,2} [mnrhv]*?\s*f{1,2} [mnrhv]*?\s*[rr]?(?:\s*[rr])? [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?)$

even though doesn't have nested quantifiers, still takes forever run. regexes other lines generated in same way run quickly. why this?

i'm not seeing nested quantifiers, seeing lot of ambiguities cause high-exponent polynomial runtime. example, consider part of regex:

[mnrhv]*?\s*[rr]?(?:\s*[rr])? # r can rhyme r, that's                               # not relevant here (i think) [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*v{0,2} [mnrhv]*?\s*n{1,2} [mnrhv\s]*?) (?p<m_letters>          # middle letters [^\s]*?(?p<caesura>\s)  # caesura (end of rhyme)

suppose regex engine @ point, , text it's seeing huge block of ns. ns can divided between following parts of regex:

[mnrhv]*?\s*[rr]?(?:\s*[rr])? ^^^^^^^^^  [mnrhv]*?\s*v{0,2} ^^^^^^^^^  [mnrhv]*?\s*v{0,2} ^^^^^^^^^ [mnrhv]*?\s*n{1,2} ^^^^^^^^^   ^^^^^^ [mnrhv\s]*?) ^^^^^^^^^^^ (?p<m_letters> [^\s]*?(?p<caesura>\s) ^^^^^^^

if number of ns n, there o(n**6) ways divide ns, since there 6 *? blocks match n here, , in between optional or matches n.

are \s parts mandatory? if so, might able improve runtime putting + instead of * on them.

Search This Blog

Macro

python - Why is this regex experiencing catastrophic backtracking? -

Comments

Post a Comment

Popular posts from this blog

symfony - TEST environment only: The database schema is not in sync with the current mapping file -

twig - Using Twigbridge in a Laravel 5.1 Package -

jdbc - Not able to establish database connection in eclipse -