Creating a list from a single block of text

Added: 1/31/2010, Version: 2.30.1

Sometimes web sites list information without any real structure. For example, a news web site may list headlines with short descriptions, but there are no HTML tags that tie each headline to its description. The HTML may look like this:

<div>
<a>Headline1</a>
description1
<br>
<a>headline2</a>
description2
</div>

All news articles sit inside one single HTML tag, and only the headlines have tags of their own (they are inserted as links).

When you extract this data you would normally want headline and description to go into the same data row, but it is impossible to select a list that includes headline and description in one row. If you try to select the description text, the whole page will get selected because all the news articles share the same parent HTML tag.
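To see why the description cannot be selected on its own, it can help to parse the sample markup outside Visual Web Ripper. The sketch below is only an illustration and is not part of the product: it assumes Python's standard xml.etree.ElementTree module and writes <br> as <br/> so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with <br> written as <br/> so it parses as XML.
SAMPLE = """<div>
<a>Headline1</a>
description1
<br/>
<a>headline2</a>
description2
</div>"""

root = ET.fromstring(SAMPLE)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

links = root.findall("a")

# Every headline link has the same parent: the single <div>.
shared_parent = all(parent_of[a] is root for a in links)

# Each description is bare text that follows a link (its "tail"), not an
# element of its own, so a selection can only land on the shared parent.
rows = [(a.text, (a.tail or "").strip()) for a in links]

for headline, description in rows:
    print(headline, "->", description)
```

The descriptions only show up as trailing text of the links, which is why a point-and-click selection ends up on the shared parent element.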

You can view an example of such a scenario on this web page:

http://knowledge.wharton.upenn.edu/category.cfm?cid=1

To extract data from this web site, follow these steps (a link to the demo project is below):

1. Create a list selection that selects all the headline links, and then create a page area template with this selection.

2. Open the page area template and add a new content element. Because the page area template covers only the headline links, you can select only the headline links. To reach the description text, you need to select the parent element of the headline link instead. Do that by manually editing the selection XPath and entering ".." as the path; ".." means the parent element in XPath syntax. You should now see the whole page being selected.

You now need to add a content transformation script that extracts only the description text that belongs to the current headline. You need a regex script that extracts text between the current headline and the date tag below the description.

3. Add a new content element that selects the headline and choose Html as capture type. Name the element titleHtml. You will only use this element for processing, so untick the "Save content" option so it doesn't appear in the output data.

4. Edit the description content element (or add it if you didn't do so in step 2) and enter the following content transformation regex script:

{$titleHtml}(.*?)<EM
$1
strip_html

This regex script extracts all text between the current headline and the date tag below the description (the tag that starts with <EM). {$titleHtml} is a data-bound field that is replaced with the value extracted by the content element titleHtml.

5. Add a new content element that selects the headline.

When you run the project you will now get a data row containing headline and description for each news article on the web page.
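The steps above can be sketched outside Visual Web Ripper to show what the regex transformation does for each headline. This is a minimal illustration in Python, not the product's implementation. The page fragment, the strip_html helper, and the extract_description function are hypothetical. Escaping the headline HTML and letting "." match line breaks are assumptions of this sketch, not documented behavior of the tool.

```python
import re

# Hypothetical page fragment: headline link, description text, then a date
# in an <EM> tag, all inside one shared parent element.
PAGE_HTML = (
    "<div>"
    '<a href="/a1">Headline1</a> description1 <EM>Jan 1</EM><br>'
    '<a href="/a2">headline2</a> description2 <EM>Jan 2</EM><br>'
    "</div>"
)


def strip_html(fragment: str) -> str:
    """Rough stand-in for the strip_html step: drop tags, trim whitespace."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


def extract_description(parent_html: str, title_html: str) -> str:
    """Return the text between title_html and the next '<EM' in parent_html."""
    # Assumption: the headline HTML is matched literally, hence re.escape.
    pattern = re.escape(title_html) + r"(.*?)<EM"
    match = re.search(pattern, parent_html, flags=re.DOTALL)
    return strip_html(match.group(1)) if match else ""


# Step 3 equivalent: capture each headline link as HTML (the titleHtml value).
title_htmls = re.findall(r"<a\b[^>]*>.*?</a>", PAGE_HTML, flags=re.DOTALL)

# Steps 4 and 5 equivalent: one row per headline, with its own description.
rows = [
    {"headline": strip_html(t), "description": extract_description(PAGE_HTML, t)}
    for t in title_htmls
]

for row in rows:
    print(row)
```

Because the pattern starts at the current headline's own HTML and stops at the first <EM that follows, each row picks up only the description that belongs to that headline, even though the input is the whole parent element.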

Download demo project Upenn.zip

Comments

4/10/2010

This is a very tricky method. At least it works :-)
How about a more complicated situation like this:
<div>
Headline1: description1<br>
headline2: description2<br>
</div>
Is it possible to split almost plain text into content elements?
I think about an ability to transform content into more complex html (content element into template). We can transform innerHTML into this:
<div>
<P><B>Headline1</B>: <I>description1</I></P>
<P><B>headline2</B>: <I>description2</I></P>
</div>
So we can use standard methods after this transformation. But it seems there is no such ability yet.

4/10/2010

Sequentum Support

That's a very interesting idea. It's going on the to-do list.

© 2009-2010 Sequentum