Wednesday, 21 August 2013

Find a US street address in text (preferably using Python regex)

Disclaimer: I have read this thread very carefully: "Street Address search in a
string - Python or Ruby", along with many other resources.
Nothing has worked for me so far.
In more detail, here is what I am looking for.
The rules are relaxed, and I am definitely not asking for perfect code
that covers all cases; just a few simple, basic ones, assuming that
the address is in this format:
a) Street number (1...N digits);
b) Street name: one or more capitalized words;
b-2) (optional) ideally, it could be prefixed with an abbreviation: "S.", "N.", "E.", "W.";
c) (optional) unit/apartment/etc.: any number (including zero) of arbitrary characters;
d) Street "type": one of ("st.", "ave.", "way");
e) City name: one or more capitalized words;
f) (optional) state abbreviation (2 letters);
g) (optional) zip, which is any 5 digits.
None of the above needs to be valid (e.g. an existing city or zip).
So far I have been trying expressions like this one:
>>> import re
>>> pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
This doesn't work, and it's not easy for me to understand why. Specifically: how
do I separate, in my pattern, a group of arbitrary words from one of the specific
words that should follow, such as a state abbreviation or a street "type" ("st.", "ave.")?
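One likely reason the pattern above finds nothing: the literal ", " before the city group already consumes the space after the comma, while `( \w+){1,5}` demands a leading space before every word, so the text would need two spaces after the comma. The sketch below shows the original pattern failing and a minimally changed variant (the name `fixed` is only illustrative) matching the sample string. It does not address the other requirements in the question.

```python
import re

text = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"

# Pattern as posted: ", " is followed by a group that itself starts with " ",
# so it only matches a comma followed by TWO spaces.
broken = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE)

# Same pattern with the extra literal space before the city group removed.
fixed = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE)

print(broken.search(text))          # no match
print(fixed.search(text).group(0))  # the whole sample string
```

As for separating "any words" from "specific words": putting the specific alternatives, such as `(AZ|CA|CO|NH)`, in their own group right after the generic word group is enough, because the regex engine backtracks until the specific group matches.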
Anyhow, here is an example of what I am hoping to get. Given:

def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None  # for 'This was written in 1999 in Montreal'
"420 Funny Lane, Cupertino CA"
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
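For illustration only, here is one possible sketch of `ex_addr` that follows the relaxed rules a) to g). Several things in it are assumptions rather than part of the stated rules: the street-type list is widened with "street", "avenue" and "lane" because the sample sentences use them; " in " is accepted as a separator before the city; a bare state abbreviation may stand in for the city; and the optional unit clause c) is not handled. The name `ADDR_RE` is hypothetical.

```python
import re

# Assumption: street types extend the question's ('st.', 'ave.', 'way')
# with the spelled-out forms that appear in the sample sentences.
ADDR_RE = re.compile(r"""
    \b\d+                                # a) street number
    (?:\s+[NSEW]\.)?                     # b-2) optional direction such as S.
    (?:\s+[A-Z][a-z]+)+?                 # b) capitalized street-name words (lazy)
    \s+(?:[Ss]t\.|[Ss]treet|[Aa]ve\.|[Aa]venue|[Ww]ay|[Ll]ane)  # d) street type
    (?:,\s*|\s+in\s+)                    # separator: a comma, or the word in
    (?:
        [A-Z][a-z]+(?:\s+[A-Z][a-z]+)*   # e) city: capitalized words
        (?:,?\s+[A-Z]{2}\b)?             # f) optional state after the city
      | [A-Z]{2}\b                       # assumption: state with no city
    )
    (?:,?\s+\d{5}\b)?                    # g) optional 5-digit zip
""", re.VERBOSE)


def ex_addr(text):
    """Return the first address-like substring of text, or None."""
    m = ADDR_RE.search(text)
    return m.group(0) if m else None


print(ex_addr("Cool cafe at 420 Funny Lane, Cupertino CA is way too cool"))
print(ex_addr("This was written in 1999 in Montreal"))
```

The pattern is deliberately compiled without `re.IGNORECASE`, since the rules rely on capitalization to tell name words from ordinary text. The street-name group is lazy so that it stops at the first word that can serve as a street type. To collect all addresses rather than the first, `ADDR_RE.finditer(text)` could be used in place of `search`.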
