Highlighted features

Separating address fields using content transformation

added: 2/1/2010 Version: 2.30.1

Web sites often display addresses in a single field (HTML tag), which makes it difficult to extract the address as separate fields for street, city, zip and state.

For example, look at the address on this web page:

http://maps.google.com/maps/place?cid=15533578316548331756...

The address is placed in a single HTML tag:

<span>5408 Lewisburg Road, Birmingham, AL 35207-1336</span>

Because the whole address sits in one tag, you cannot select each part of it individually, so I'll now show you how to separate the address fields using content transformation. It is quite easy to separate the fields using regex scripts, but regex syntax can sometimes look a bit daunting. Please see this web site for a good regex reference:

http://www.regular-expressions.info/reference.html

First I'll extract the street address (5408 Lewisburg Road). This is quite easy since all I need to do is extract everything until the first comma.

I add a new content element that selects the entire address field and then I add the following regex content transformation.

(.*?),
$1
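
If you want to test the pattern outside Visual Web Ripper, here is a minimal Python sketch. It assumes the transformation behaves like "find the first match and return capture group 1" (that is what $1 refers to). The function name extract_street is my own and is not part of the product.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_street(address):
    """Return everything before the first comma, or None if there is no comma."""
    # (.*?) is lazy, so it stops at the first comma it can reach.
    match = re.search(r"(.*?),", address)
    return match.group(1) if match else None

print(extract_street(ADDRESS))
```

The lazy quantifier matters here: a greedy (.*), would run to the last comma and return the street and the city together.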

I now want to extract the city (Birmingham), so I add another content element that also selects the entire address field and then I add the following regex content transformation.

,\s(.*?),
$1

This regex extracts everything between the first comma followed by a blank space and the next comma (\s matches any whitespace character, such as a blank space).
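
The city pattern can be checked the same way. This is only a sketch in Python, again assuming the transformation returns capture group 1 of the first match; extract_city is a hypothetical helper name.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_city(address):
    """Return the text between the first ', ' and the next comma, or None."""
    # The leading comma and whitespace are matched but stay outside the group.
    match = re.search(r",\s(.*?),", address)
    return match.group(1) if match else None

print(extract_city(ADDRESS))
```

Note that an address with only one comma gives no match at all, so the sketch returns None for it.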

I now want to extract the state (AL), so I again add a content element that selects the entire address field and then I add the following regex content transformation.

,.*?,\s(.*?)\s
$1

This regex extracts everything between the second comma followed by a blank space and the next blank space.
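
Here is the state pattern as a small Python sketch, under the same assumption that the tool returns capture group 1 of the first match. The name extract_state is mine, for illustration only.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_state(address):
    """Return the word after the second comma, or None if the layout differs."""
    # ,.*?, skips from the first comma to the second one;
    # (.*?)\s then captures up to the next whitespace character.
    match = re.search(r",.*?,\s(.*?)\s", address)
    return match.group(1) if match else None

print(extract_state(ADDRESS))
```

The pattern relies on the state being followed by whitespace, so an address that ends right after the state would not match.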

The last thing I need is the zip code and I can extract that part using this regex:

([0-9-]*)$
$1

This regex looks at the end of the text and extracts the trailing run of digits and dashes ($ means end of the text).
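
A Python sketch of the zip code pattern follows, with the same assumption about how the transformation picks capture group 1. The helper name extract_zip is hypothetical.

```python
import re

ADDRESS = "5408 Lewisburg Road, Birmingham, AL 35207-1336"

def extract_zip(address):
    """Return the run of digits and dashes at the very end of the text."""
    # Because of the * quantifier the pattern always matches:
    # if the text does not end in digits or dashes, the group is empty.
    match = re.search(r"([0-9-]*)$", address)
    return match.group(1)

print(extract_zip(ADDRESS))
```

Since the pattern can match an empty string, an address without a trailing zip code yields an empty value rather than an error. Replacing * with + would make the pattern fail to match in that case instead.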
 

Comments

3/28/2010

For some reason, my output file doesn't show any extracted data and is left blank.


© 2009-2010 Sequentum