Creating a list from a single block of text
added: 1/31/2010
Version: 2.30.1
Sometimes web sites list information without any real structure. For example, a news web site may list headlines with short descriptions, but there are no HTML tags that tie each headline to its description. The HTML may look like this:
<div>
<a>Headline1</a>
description1
<br>
<a>headline2</a>
description2
</div>
All news articles sit inside a single HTML tag, with the headlines inserted as links.
When you extract this data you normally want each headline and its description to go into the same data row, but you cannot directly select a list that puts both in one row. The description is loose text with no tag of its own, so if you try to select it, the whole page gets selected because all the news articles share the same parent HTML tag.
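The shared-parent problem can be reproduced outside the tool. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, not the tool's own selection engine. It assumes a well-formed variant of the sample markup above (the <br> tag is written as <br/> so an XML parser accepts it).

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample markup (assumption: <br/> instead of <br>).
SAMPLE = """<div>
<a>Headline1</a>
description1
<br/>
<a>headline2</a>
description2
</div>"""

root = ET.fromstring(SAMPLE)
links = root.findall("a")

# ".." steps from each link to its parent; every link resolves to the same <div>.
parents = root.findall("a/..")

# Each description is loose text following a link, not an element of its own.
tails = [link.tail.strip() for link in links]

# Selecting the parent therefore captures every headline and description at once.
parent_text = " ".join("".join(parents[0].itertext()).split())

print(len(parents))   # 1
print(tails)          # ['description1', 'description2']
print(parent_text)    # Headline1 description1 headline2 description2
```

Because both links share one parent, no element selection alone can isolate a single description. That is why the steps below fall back on a regex.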
You can view an example of such a scenario on this web page:
http://knowledge.wharton.upenn.edu/category.cfm?cid=1
To extract data from this web site, follow these steps (a link to the demo project is at the end of this article):
1. Create a list selection that selects all the headline links, and then create a page area template with this selection.
2. Open the page area template and add a new content element. You are in a page area template that only covers the headline links, so you can only select the headline links. However, to get to the description text you need to select the parent element of the headline link. You do that by manually editing the selection XPath and entering ".." as the path. In XPath syntax, ".." means the parent element. You should now see that the whole page is selected.
You now need to add a content transformation script that extracts only the description text that belongs to the current headline. You need a regex script that extracts text between the current headline and the date tag below the description.
3. Add a new content element that selects the headline and choose Html as capture type. Name the element titleHtml. You will only use this element for processing, so untick the "Save content" option so it doesn't appear in the output data.
4. Edit the description content element (or add it if you didn't do so in step 2) and enter the following content transformation regex script:
{$titleHtml}(.*?)<EM
$1
strip_html
This regex script extracts all text between the current headline and the date tag below the description (the tag that starts with <EM). {$titleHtml} is a data-bound field and is replaced with the extracted value from the content element titleHtml.
5. Add another content element that selects the headline. Unlike titleHtml, this one supplies the headline text in the output data.
When you run the project you will now get a data row containing headline and description for each news article on the web page.
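The logic of the steps above can be sketched in plain Python. This is an illustration only, not the tool's implementation. The page fragment, the function names strip_html and extract_rows, and the row keys are hypothetical. The sketch also assumes that the bound {$titleHtml} value should match literally (hence re.escape) and that "." may span line breaks (hence re.DOTALL); the article does not say how the tool handles either.

```python
import re

# Hypothetical page fragment: each article is a link, loose description text,
# and an <EM> date tag, all inside one shared parent <div>.
PAGE = (
    '<div>'
    '<a href="/a1">Headline1</a> description1 <EM>date1</EM><br>'
    '<a href="/a2">Headline2</a> description2 <EM>date2</EM>'
    '</div>'
)


def strip_html(fragment: str) -> str:
    """Rough stand-in for the strip_html step: drop tags, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())


def extract_rows(page: str) -> list[dict[str, str]]:
    rows = []
    # Step 1: the list selection, one match per headline link.
    for link in re.finditer(r"<a\b[^>]*>.*?</a>", page, re.DOTALL):
        title_html = link.group(0)  # step 3: the titleHtml element
        # Step 4: {$titleHtml}(.*?)<EM applied to the parent's HTML (step 2).
        # Assumption: the bound value is escaped so it matches literally.
        pattern = re.escape(title_html) + r"(.*?)<EM"
        match = re.search(pattern, page, re.DOTALL)
        rows.append({
            "headline": strip_html(title_html),  # step 5
            "description": strip_html(match.group(1)) if match else "",
        })
    return rows


rows = extract_rows(PAGE)
for row in rows:
    print(row)
```

The non-greedy (.*?) matters here: it stops at the first <EM after the current headline, so each row picks up only its own description rather than everything up to the last date tag on the page.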
Download demo project Upenn.zip