Separating address fields using content transformation
added: 2/1/2010
Version: 2.30.1
Web sites often display addresses in a single field (HTML tag), which makes it difficult to extract the address as separate fields for street, city, zip and state.
For example, look at the address on this web page:
http://maps.google.com/maps/place?cid=15533578316548331756...

The address is placed in a single HTML tag:
<span>5408 Lewisburg Road, Birmingham, AL 35207-1336</span>
Because the whole address sits in one tag, the individual parts cannot be selected separately, so I'll now show you how to separate the address fields using content transformation. This is quite easy to do with regex scripts, although regex syntax can sometimes look a bit daunting. Please see this web site for a good regex tutorial:
http://www.regular-expressions.info/reference.html
First I'll extract the street address (5408 Lewisburg Road). This is quite easy since all I need to do is extract everything up to the first comma.
I add a new content element that selects the entire address field and then add the following regex content transformation:
(.*?),
$1
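To try this step outside the tool, here is a minimal Python sketch. It assumes Python's re module behaves like the content transformation for this pattern; the function name extract_street is hypothetical, and match.group(1) plays the role of $1.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_street(address):
    """Return the text before the first comma, or None if there is no comma."""
    # The lazy group (.*?) stops at the first comma; group(1) corresponds to $1.
    match = re.search(r"(.*?),", address)
    return match.group(1) if match else None

print(extract_street(ADDRESS))  # 5408 Lewisburg Road
```

The lazy quantifier matters here: a greedy (.*) would run to the last comma and return the street and the city together.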
I now want to extract the city (Birmingham), so I add another content element that also selects the entire address field and then add the following regex content transformation:
,\s(.*?),
$1
This regex extracts everything between the first comma followed by a blank space and the next comma (\s matches any whitespace character, such as a blank space or a tab).
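The city pattern can be checked the same way. This is only a sketch under the assumption that Python's re module stands in for the content transformation; extract_city is a hypothetical name. In Python, \s matches any whitespace character, so a tab after the comma works as well as a blank space.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_city(address):
    """Return the text between the first ', ' and the next comma, or None."""
    # The search starts at the first comma that is followed by whitespace,
    # then captures lazily up to the next comma.
    match = re.search(r",\s(.*?),", address)
    return match.group(1) if match else None

print(extract_city(ADDRESS))  # Birmingham
```

An address with only one comma has no second comma to stop at, so the function returns None instead of a partial value.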
I now want to extract the state (AL), so I again add a content element that selects the entire address field and then I add the following regex content transformation.
,.*?,\s(.*?)\s
$1
This regex extracts everything between the second comma followed by a blank space and the next blank space.
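Here is the state pattern as a small Python sketch, again assuming Python's re module as a stand-in for the content transformation and using a hypothetical function name, extract_state. Note that the capture ends at the first whitespace character, so the pattern expects a state written without spaces, such as a two-letter abbreviation.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_state(address):
    """Return the word after the second ', ', or None if the pattern fails."""
    # ,.*?, skips from the first comma to the second one; the group then
    # captures lazily until the next whitespace character.
    match = re.search(r",.*?,\s(.*?)\s", address)
    return match.group(1) if match else None

print(extract_state(ADDRESS))  # AL
```

A state written out as "New York" would come back as "New", which is worth keeping in mind when the source data is not abbreviated.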
The last thing I need is the zip code and I can extract that part using this regex:
([0-9-]*)$
$1
This regex looks at the end of the text and extracts the trailing run of digits and dashes ($ means end of the text).
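The zip code pattern can be sketched in Python too, under the same assumption that the re module approximates the content transformation; extract_zip is a hypothetical name. One detail to be aware of: because of the * quantifier, the pattern also matches an empty string when the text does not end in digits or dashes, so the sketch treats an empty result as "not found".

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_zip(address):
    """Return the trailing run of digits and dashes, or None if there is none."""
    # [0-9-]* followed by $ anchors the capture to the end of the text.
    match = re.search(r"([0-9-]*)$", address)
    # The pattern always matches (possibly with an empty group), so an
    # empty capture is mapped to None here.
    return match.group(1) or None

print(extract_zip(ADDRESS))  # 35207-1336
```

The house number 5408 is not returned because it is not at the end of the text, which is what the $ anchor guarantees.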