regex - Python: make sure an address matches a specific format
I have been playing around with regular expressions but haven't had any luck yet. I need to introduce some address validation. I need to make sure that a user-defined address matches this format:
"717 n 2nd st, mankato, mn 56001"
or possibly this one too:
"717 n 2nd st, mankato, mn, 56001"
and throw everything else out and alert the user that the format is improper. I have been looking at the documentation and have tried and failed with many regular expression patterns. I have tried this (and many variations) without any luck:
    pat = r'\d{1,6}(\w+),\s(w+),\s[A-Za-z]{2}\s{1,6}'
This one works, but it allows too much junk, because it only makes sure the input starts with a house number and ends with a zip code (I think):
pat = r'\d{1,6}( \w+){1,6}'
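The comma placement is crucial, because I split the input string by comma first: the first item is the street address, then the city, and then the state and zip, which are split by a space (here I would use a second regex in case there is a comma between the state and the zip).
Essentially I would like to do this: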
    import re

    # check for the format "717 n 2nd st, mankato, mn 56001"
    pat_1 = 'regex to match the above pattern'
    # check for the format "717 n 2nd st, mankato, mn, 56001"
    pat_2 = 'regex to match the above format'

    if re.match(pat_1, addr, re.IGNORECASE):
        pass  # extract address
    elif re.match(pat_2, addr, re.IGNORECASE):
        pass  # extract address
    else:
        raise ValueError('"{}" must match format: "717 n 2nd st, mankato, mn 56001"'.format(addr))
    # do stuff with the address
If anyone could help me form a regex that makes sure there is a pattern match, I would appreciate it!
Here's one that might help. Whenever possible, I prefer to use verbose regular expressions with embedded comments, for maintainability.
Also note the use of (?P<name>pattern). This helps document the intent of the match, and also provides a useful mechanism to extract the data, if your needs go beyond simple regex validation.
    import re

    # goal: '717 n 2nd st, mankato, mn 56001',
    # goal: '717 n 2nd st, mankato, mn, 56001',
    regex = r'''(?xi)                     # verbose regular expression, ignore case
        (?P<housenumber>\d+)\s+           # matches '717 '
        (?P<direction>[news])\s+          # matches 'n '
        (?P<streetname>\w+)\s+            # matches '2nd '
        (?P<streetdesignator>\w+),\s+     # matches 'st, '
        (?P<townname>.*),\s+              # matches 'mankato, '
        (?P<state>[a-z]{2}),?\s+          # matches 'mn ' and 'mn, '
        (?P<zip>\d{5})                    # matches '56001'
    '''
    regex = re.compile(regex)

    for item in (
        '717 n 2nd st, mankato, mn 56001',
        '717 n 2nd st, mankato, mn, 56001',
        '717 n 2nd, makata, 56001',   # should reject this one
        '1234 n d ave, east boston, ma, 02134',
    ):
        match = regex.match(item)
        print(item)
        if match:
            print("    house on {direction} side of {townname}".format(**match.groupdict()))
        else:
            print("    invalid entry")
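As a rough sketch of how the named groups could be used for extraction rather than just validation, the pattern can be wrapped in a small function that raises the ValueError the question asks for. The names ADDRESS_RE and parse_address are hypothetical and not part of the original answer, and the flags are passed to re.compile here instead of being embedded in the pattern. The sketch also uses fullmatch, because match only anchors at the start of the string and would let trailing junk through.

```python
import re

# Assumed pattern, modelled on the example above; ADDRESS_RE is a
# hypothetical name introduced for this sketch.
ADDRESS_RE = re.compile(r'''
    (?P<housenumber>\d+)\s+         # '717 '
    (?P<direction>[news])\s+        # 'n '
    (?P<streetname>\w+)\s+          # '2nd '
    (?P<streetdesignator>\w+),\s+   # 'st, '
    (?P<townname>.*),\s+            # 'mankato, '
    (?P<state>[a-z]{2}),?\s+        # 'mn ' or 'mn, '
    (?P<zip>\d{5})                  # '56001'
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Return the named fields of addr as a dict, or raise ValueError.

    Hypothetical helper, not part of the original answer.
    """
    # fullmatch requires the whole string to match, so text after the
    # zip code is rejected; match() would accept it.
    match = ADDRESS_RE.fullmatch(addr.strip())
    if match is None:
        raise ValueError(
            '"{}" must match format: "717 n 2nd st, mankato, mn 56001"'.format(addr))
    return match.groupdict()


fields = parse_address('717 n 2nd st, mankato, mn, 56001')
print(fields['townname'], fields['state'], fields['zip'])
```

Because the result is a plain dict keyed by group name, the caller no longer needs to split the input on commas and spaces by hand.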
To make fields optional, replace + with *, since + means one-or-more and * means zero-or-more. Here is a version that matches the new requirements from the comments:
    import re

    # goal: '717 n 2nd st, mankato, mn 56001',
    # goal: '717 n 2nd st, mankato, mn, 56001',
    # goal: '717 n 2nd st ne, mankato, mn, 56001',
    # goal: '717 n 2nd, mankato, mn, 56001',
    regex = r'''(?xi)                     # verbose regular expression, ignore case
        (?P<housenumber>\d+)\s+           # matches '717 '
        (?P<direction>[news])\s+          # matches 'n '
        (?P<streetname>\w+)\s*            # matches '2nd ', with optional trailing space
        (?P<streetdesignator>\w*)\s*      # optionally matches 'st '
        (?P<streetdirection>[news]*)\s*   # optionally matches 'ne'
        ,\s+                              # force a comma after the street
        (?P<townname>.*),\s+              # matches 'mankato, '
        (?P<state>[a-z]{2}),?\s+          # matches 'mn ' and 'mn, '
        (?P<zip>\d{5})                    # matches '56001'
    '''
    regex = re.compile(regex)

    for item in (
        '717 n 2nd st, mankato, mn 56001',
        '717 n 2nd st, mankato, mn, 56001',
        '717 n 2nd, makata, 56001',   # should reject this one
        '1234 n d ave, east boston, ma, 02134',
        '717 n 2nd st ne, mankato, mn, 56001',
        '717 n 2nd, mankato, mn, 56001',
    ):
        match = regex.match(item)
        print(item)
        if match:
            print("    house on {direction} side of {townname}".format(**match.groupdict()))
        else:
            print("    invalid entry")
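To see the difference between + and * in isolation, here is a minimal sketch with a reduced, hypothetical pattern (just a street name, a designator and a comma; the names required and optional are made up for this illustration). One detail worth noticing is that an optional group written with * still takes part in the match, so it yields an empty string rather than None.

```python
import re

# Hypothetical mini-patterns: a street name, a designator, then a comma.
required = re.compile(r'(?P<name>\w+)\s+(?P<designator>\w+),')  # '+': designator required
optional = re.compile(r'(?P<name>\w+)\s*(?P<designator>\w*),')  # '*': designator optional

for street in ('2nd st,', '2nd,'):
    # Show whether each variant accepts the street fragment.
    print(street, bool(required.match(street)), bool(optional.match(street)))

# The optional group matched zero characters, so it is '' and not None.
print(repr(optional.match('2nd,').group('designator')))
```

If the code that consumes the match needs to tell "missing" from "present", testing the group for an empty string is enough in this variant.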
Next, consider the OR operator, |, and the non-capturing group operator, (?:pattern). Together, they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have the direction before the street name and some have it after the street name, but no address has a direction in both places.
    import re

    # goal: '717 n 2nd st, mankato, mn 56001',
    # goal: '717 n 2nd st, mankato, mn, 56001',
    # goal: '717 2nd st ne, mankato, mn, 56001',
    # goal: '717 n 2nd, mankato, mn, 56001',
    regex = r'''(?xi)                         # verbose regular expression, ignore case
        (?:                                   # matches either kind of street address
          (?:                                 # matches '717 n 2nd st' or '717 n 2nd'
            (?P<housenumber>\d+)\s+           # matches '717 '
            (?P<direction>[news])\s+          # matches 'n '
            (?P<streetname>\w+)\s*            # matches '2nd ', with optional trailing space
            (?P<streetdesignator>\w*)\s*      # optionally matches 'st '
          )
          |                                   # or
          (?:                                 # matches '717 2nd st ne' or '717 2nd ne'
            (?P<housenumber2>\d+)\s+          # matches '717 '
            (?P<streetname2>\w+)\s+           # matches '2nd '
            (?P<streetdesignator2>\w*)\s*     # optionally matches 'st '
            (?P<direction2>[news]+)           # matches 'ne'
          )
        )
        ,\s+                                  # force a comma after the street
        (?P<townname>.*),\s+                  # matches 'mankato, '
        (?P<state>[a-z]{2}),?\s+              # matches 'mn ' and 'mn, '
        (?P<zip>\d{5})                        # matches '56001'
    '''
    regex = re.compile(regex)

    for item in (
        '717 n 2nd st, mankato, mn 56001',
        '717 n 2nd st, mankato, mn, 56001',
        '717 n 2nd, makata, 56001',   # should reject this one
        '1234 n d ave, east boston, ma, 02134',
        '717 2nd st ne, mankato, mn, 56001',
        '717 n 2nd, mankato, mn, 56001',
    ):
        match = regex.match(item)
        print(item)
        if match:
            d = match.groupdict()
            print("    house on {0} side of {1}".format(
                d['direction'] or d['direction2'], d['townname']))
        else:
            print("    invalid entry")
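The two alternatives need distinct group names (direction and direction2, and so on) because Python's re module does not allow the same group name twice in one pattern. Groups in the branch that did not take part in the match come back as None, which is why the `or` trick above works. A minimal sketch of folding both branches back into one set of fields, using a reduced pattern and a hypothetical helper named street_fields, might look like this:

```python
import re

# Hypothetical reduced pattern: direction either before or after the street name.
street_re = re.compile(r'''
    (?:
        (?P<direction>[news])\s+(?P<streetname>\w+)       # 'n 2nd'
      |
        (?P<streetname2>\w+)\s+(?P<direction2>[news]+)    # '2nd ne'
    )
    ,                                                     # comma ends the street part
''', re.VERBOSE | re.IGNORECASE)


def street_fields(text):
    """Merge the two alternatives into one dict (hypothetical helper)."""
    match = street_re.match(text)
    if match is None:
        return None
    d = match.groupdict()
    # Groups from the branch that did not match are None, so `or`
    # falls through to the value captured by the other branch.
    return {
        'direction': d['direction'] or d['direction2'],
        'streetname': d['streetname'] or d['streetname2'],
    }


print(street_fields('n 2nd,'))
print(street_fields('2nd ne,'))
```

Merging right after the match keeps the numbered group names out of the rest of the program.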