regex - Python: make sure an address matches a specific format


I have been playing around with regular expressions, but haven't had any luck yet. I need to introduce some address validation: I need to make sure a user-defined address matches this format:

"717 n 2nd st, mankato, mn 56001" 

or possibly this one too:

"717 n 2nd st, mankato, mn, 56001" 

and throw anything else out, and alert the user that it is an improper format. I have been looking at the documentation and have tried and failed with many regular expression patterns. I have tried this (and many variations) without any luck:

pat = r'\d{1,6}(\w+),\s(w+),\s[a-zA-Z]{2}\s{1,6}'

This one works, but it allows too much junk, because it only makes sure the input starts with a house number and ends with a zip code (I think):

pat = r'\d{1,6}( \w+){1,6}' 

The comma placement is crucial, because I am splitting the input string by comma: the first item is the address, then the city, and then the state and zip are split by a space (here I would like to use a second regex in case they have a comma between state and zip).

Essentially, I would like to do this:

import re

# check for this format: "717 n 2nd st, mankato, mn 56001"
pat_1 = 'regex to match the above pattern'
# check for this format: "717 n 2nd st, mankato, mn, 56001"
pat_2 = 'regex to match the above format'

if re.match(pat_1, addr, re.IGNORECASE):
    pass  # extract the address
elif re.match(pat_2, addr, re.IGNORECASE):
    pass  # extract the address
else:
    raise ValueError('"{}" must match this format: "717 n 2nd st, mankato, mn 56001"'.format(addr))

# do stuff with the address

If anyone could help me form a regex that makes sure there is a pattern match, I would appreciate it!

Here's one that might help. Whenever possible, I prefer to use verbose regular expressions with embedded comments, for maintainability.

Also note the use of (?P<name>pattern). This helps to document the intent of each part of the match, and it also provides a useful mechanism for extracting the data, if your needs go beyond simple regex validation.

import re

# goal:  '717 n 2nd st, mankato, mn 56001',
# goal:  '717 n 2nd st, mankato, mn, 56001',
regex = r'''
    (?P<housenumber>\d+)\s+        # matches '717 '
    (?P<direction>[news])\s+       # matches 'n '
    (?P<streetname>\w+)\s+         # matches '2nd '
    (?P<streetdesignator>\w+),\s+  # matches 'st, '
    (?P<townname>.*),\s+           # matches 'mankato, '
    (?P<state>[a-z]{2}),?\s+       # matches 'mn ' and 'mn, '
    (?P<zip>\d{5})                 # matches '56001'
'''

# Verbose and ignore-case flags are passed here instead of inline (?x)(?i):
# Python 3.11 and later reject inline global flags that are not at the very
# start of the pattern.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 n 2nd st, mankato, mn 56001',
    '717 n 2nd st, mankato, mn, 56001',
    '717 n 2nd, makata, 56001',   # should reject this one
    '1234 n d ave, east boston, ma, 02134',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    house on {direction} side of {townname}".format(**match.groupdict()))
    else:
        print("    invalid entry")
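To tie this back to the question's "raise an error on a bad format" flow, here is a minimal sketch that wraps the same pattern in a function. The names ADDRESS_RE and parse_address are hypothetical, not part of the original answer, and the sketch assumes Python 3. It uses fullmatch() rather than match(), so trailing junk after the zip code is rejected as well.

```python
import re

# Same pattern as in the answer, with the flags passed to re.compile().
ADDRESS_RE = re.compile(r'''
    (?P<housenumber>\d+)\s+        # '717 '
    (?P<direction>[news])\s+       # 'n '
    (?P<streetname>\w+)\s+         # '2nd '
    (?P<streetdesignator>\w+),\s+  # 'st, '
    (?P<townname>.*),\s+           # 'mankato, '
    (?P<state>[a-z]{2}),?\s+       # 'mn ' or 'mn, '
    (?P<zip>\d{5})                 # '56001'
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Return the named fields of addr, or raise ValueError (hypothetical helper)."""
    # fullmatch() requires the whole string to match, not just a prefix.
    match = ADDRESS_RE.fullmatch(addr.strip())
    if match is None:
        raise ValueError(
            '"{}" must match this format: '
            '"717 n 2nd st, mankato, mn 56001"'.format(addr))
    return match.groupdict()


fields = parse_address('717 n 2nd st, mankato, mn, 56001')
print(fields['townname'], fields['state'], fields['zip'])
```

Because the fields come back as a dictionary, there is no need to split the string on commas afterwards.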

To make fields optional, replace + with *, since + means one-or-more, and * means zero-or-more. Here is a version that matches the new requirements in the comments:

import re

# goal:  '717 n 2nd st, mankato, mn 56001',
# goal:  '717 n 2nd st, mankato, mn, 56001',
# goal:  '717 n 2nd st ne, mankato, mn, 56001',
# goal:  '717 n 2nd, mankato, mn, 56001',
regex = r'''
    (?P<housenumber>\d+)\s+         # matches '717 '
    (?P<direction>[news])\s+        # matches 'n '
    (?P<streetname>\w+)\s*          # matches '2nd ', with optional trailing space
    (?P<streetdesignator>\w*)\s*    # optionally matches 'st '
    (?P<streetdirection>[news]*)\s* # optionally matches 'ne'
    ,\s+                            # force a comma after the street
    (?P<townname>.*),\s+            # matches 'mankato, '
    (?P<state>[a-z]{2}),?\s+        # matches 'mn ' and 'mn, '
    (?P<zip>\d{5})                  # matches '56001'
'''

# Verbose and ignore-case flags are passed here instead of inline (?x)(?i):
# Python 3.11 and later reject inline global flags that are not at the very
# start of the pattern.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 n 2nd st, mankato, mn 56001',
    '717 n 2nd st, mankato, mn, 56001',
    '717 n 2nd, makata, 56001',   # should reject this one
    '1234 n d ave, east boston, ma, 02134',
    '717 n 2nd st ne, mankato, mn, 56001',
    '717 n 2nd, mankato, mn, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    house on {direction} side of {townname}".format(**match.groupdict()))
    else:
        print("    invalid entry")
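The difference between + and * is easiest to see on a reduced pattern. The sketch below is only an illustration, assuming Python 3; the names required and optional and the group names are made up for this example. It compares a street fragment where the designator is mandatory against one where it may be missing.

```python
import re

# '+' requires at least one character; '*' also accepts zero characters.
required = re.compile(r'(?P<street>\w+)\s+(?P<designator>\w+),')
optional = re.compile(r'(?P<street>\w+)\s*(?P<designator>\w*),')

for text in ('2nd st,', '2nd,'):
    print(text, bool(required.match(text)), bool(optional.match(text)))

# An optional group that matched nothing yields an empty string, not None.
print(repr(optional.match('2nd,').group('designator')))
```

One consequence worth keeping in mind: a group made optional with * still takes part in the match, so code that reads it should expect an empty string.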

Next, consider the OR operator, |, and the non-capturing group operator, (?:pattern). Together, they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have the direction before the street name, and some have the direction after the street name, but no address has the direction in both places.

import re

# goal:  '717 n 2nd st, mankato, mn 56001',
# goal:  '717 n 2nd st, mankato, mn, 56001',
# goal:  '717 2nd st ne, mankato, mn, 56001',
# goal:  '717 n 2nd, mankato, mn, 56001',
regex = r'''
    (?: # matches either sort of street address
        (?: # matches '717 n 2nd st' or '717 n 2nd'
            (?P<housenumber>\d+)\s+      # matches '717 '
            (?P<direction>[news])\s+     # matches 'n '
            (?P<streetname>\w+)\s*       # matches '2nd ', with optional trailing space
            (?P<streetdesignator>\w*)\s* # optionally matches 'st '
        )
        | # or
        (?:  # matches '717 2nd st ne' or '717 2nd ne'
            (?P<housenumber2>\d+)\s+      # matches '717 '
            (?P<streetname2>\w+)\s+       # matches '2nd '
            (?P<streetdesignator2>\w*)\s* # optionally matches 'st '
            (?P<direction2>[news]+)       # matches 'ne'
        )
    )
    ,\s+                             # force a comma after the street
    (?P<townname>.*),\s+             # matches 'mankato, '
    (?P<state>[a-z]{2}),?\s+         # matches 'mn ' and 'mn, '
    (?P<zip>\d{5})                   # matches '56001'
'''

# Verbose and ignore-case flags are passed here instead of inline (?x)(?i):
# Python 3.11 and later reject inline global flags that are not at the very
# start of the pattern.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 n 2nd st, mankato, mn 56001',
    '717 n 2nd st, mankato, mn, 56001',
    '717 n 2nd, makata, 56001',   # should reject this one
    '1234 n d ave, east boston, ma, 02134',
    '717 2nd st ne, mankato, mn, 56001',
    '717 n 2nd, mankato, mn, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        d = match.groupdict()
        print("    house on {0} side of {1}".format(
            d['direction'] or d['direction2'],
            d['townname']))
    else:
        print("    invalid entry")
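The second branch uses names such as direction2 because Python's re module does not allow the same group name twice in one pattern. Groups in the branch that did not take part in the match come back as None, which is why the `or` fallback works. Here is a reduced sketch of just that mechanism, assuming Python 3; STREET_RE and street_parts are hypothetical names for this illustration.

```python
import re

# Two alternatives: direction before the street name, or after it.
STREET_RE = re.compile(r'''
    (?:
        (?P<direction>[news])\s+(?P<streetname>\w+)      # 'n 2nd'
      |
        (?P<streetname2>\w+)\s+(?P<direction2>[news]+)   # '2nd ne'
    )
''', re.VERBOSE | re.IGNORECASE)


def street_parts(text):
    """Return (streetname, direction), or None if text fits neither branch."""
    match = STREET_RE.fullmatch(text)
    if match is None:
        return None
    d = match.groupdict()
    # Groups from the branch that did not match are None, so 'or' picks
    # whichever branch actually captured a value.
    return (d['streetname'] or d['streetname2'],
            d['direction'] or d['direction2'])


print(street_parts('n 2nd'))     # direction first
print(street_parts('2nd ne'))    # direction last
print(street_parts('n 2nd ne'))  # direction in both places: no match
```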
