Extracting new data only
added: 2/26/2010
Version: 2.33.1
When extracting data from websites such as forums, it is often desirable to extract only new data that has been posted since the last time data was extracted. This can be achieved by cancelling data extraction when duplicate data is detected.
Visual Web Ripper saves reference data every time a project is run and the reference data is used to detect duplicate data between project runs.
Visual Web Ripper can cancel an entire data table when duplicate data is found, or it can cancel just the duplicate data row. A data table always corresponds to a Visual Web Ripper template, but not all templates create a new data table. Form templates and templates defining a list will create a new data table.
I'll look at the following website as an example:
I'll extract data from all topics on the first 100 pages in the forum and update the data on an hourly basis. Extracting data from 100 pages of topics will take quite a while, and if I extract all topics every time I update the data, I'll be extracting a lot of duplicate data. I'll use the duplicate feature to ensure that I only extract new or updated topics.
First I need to design the project, which is quite simple in this case. I create a page area template to iterate through all the topics on one page and a page navigation template to iterate through all the pages. The page area template will contain the topic type, title and last post date. The page area template will also contain a link template that links through to the topic detail page and extracts some data from there.
The topic title is displayed in two different locations depending on the topic type, so I use an alternative content element for the title to make sure I have both locations covered.
I now need to decide how to detect duplicate data. The topic title and last post date can be used to detect duplicate data, so I set the "Duplicate check" element option on the "title" and "last post date" content elements.
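The idea behind checking two elements together can be sketched in a few lines of C#. This is only an illustration of the concept and not how Visual Web Ripper implements it internally; the class name, the method names and the separator character are all assumptions I made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; none of these names are part of Visual Web Ripper.
public static class DuplicateCheckSketch
{
    // Combine the values of the elements marked with "Duplicate check" into one key.
    // The separator (an assumed choice) keeps "ab" + "c" distinct from "a" + "bc".
    public static string BuildKey(string title, string lastPostDate)
    {
        return title + "\u001F" + lastPostDate;
    }

    // Returns true when the same title and last post date were saved by an earlier run.
    public static bool IsDuplicate(HashSet<string> referenceKeys, string title, string lastPostDate)
    {
        return referenceKeys.Contains(BuildKey(title, lastPostDate));
    }
}
```

Because the last post date is part of the key, a topic that has received a new reply since the last run produces a different key and is treated as new data, which is what I want for updated topics.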
I now need to decide what action to take when duplicate data is detected. The default action is to take no action, so I need to change the "Duplicate action" option, which is found in the "More" options tab. I need to change the option on the template that generates a new data table, which is the page area template in this case. I will change the action to "CancelDataTable", which will cancel data extraction when duplicate data is first detected.
The next problem is the sticky topics at the top of the forum. These topics remain at the top whether or not they have been updated, and since I have set the option to cancel data extraction when the first duplicate row is detected, Visual Web Ripper will cancel data extraction immediately without checking if new non-sticky topics have been posted.
If I know approximately how many sticky topics there will be at the top of the forum, I can use the "Min. CancelDataRow checks" option. This option specifies the minimum number of rows Visual Web Ripper will process before cancelling the data table. Duplicate data will still be removed, but Visual Web Ripper will continue to iterate through data until it has processed this minimum number of data rows.
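To make the effect of a minimum row count concrete, here is a small simulation of the behaviour described above. It is a sketch under my own assumptions, not Visual Web Ripper code: the names are hypothetical, each row is reduced to a single key string, and I assume a duplicate only cancels the table once more than the minimum number of rows has been processed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation; these names are not part of Visual Web Ripper.
public static class MinChecksSketch
{
    // rowKeys: rows in page order. referenceKeys: keys saved by earlier runs.
    // minChecks: assumed meaning of the minimum number of rows to process
    // before a duplicate is allowed to cancel the whole data table.
    public static List<string> ExtractNewRows(IEnumerable<string> rowKeys, HashSet<string> referenceKeys, int minChecks)
    {
        var extracted = new List<string>();
        int processed = 0;
        foreach (string key in rowKeys)
        {
            processed++;
            if (!referenceKeys.Contains(key))
            {
                extracted.Add(key); // new row, keep it
                continue;
            }
            // Duplicate row: it is never added to the result.
            // Past the minimum, the first duplicate also cancels the table.
            if (processed > minChecks)
                break;
        }
        return extracted;
    }
}
```

With two unchanged sticky topics at the top followed by two new topics, a minimum of 2 skips the sticky rows and returns the two new topics, while a minimum of 0 stops at the first sticky topic and returns nothing.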
If I don't know how many sticky topics the forum will have, or if I just want a more exact and reliable approach than guessing the maximum number of sticky topics, then I can use a script to decide if the data table should be cancelled or if Visual Web Ripper should just cancel the data row. In this case I could create a script that checks if the topic type contains the text "Announcement:" or "Sticky:". The duplicate script is only executed when duplicate data is detected and the script must then return the action to take.
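    using System;
    using mshtml;
    using VisualWebRipper;

    public class Script
    {
        // Called only when duplicate data is detected; returns the action to take.
        public static DuplicateAction DecideDuplicateAction(WrDuplicateActionArguments args)
        {
            try
            {
                // Convert to a string first so the comparison is by value, and use
                // Contains because the topic type only needs to contain the marker text.
                string type = Convert.ToString(args.DataRow["type"]) ?? "";

                // Sticky and announcement topics stay at the top of the forum, so
                // drop only the duplicate row and keep checking the rows below.
                if (type.Contains("Sticky:") || type.Contains("Announcement:"))
                    return DuplicateAction.CancelDataRow;

                // Any other duplicate means the remaining topics are already extracted.
                return DuplicateAction.CancelDataTable;
            }
            catch (Exception exp)
            {
                args.WriteDebug(exp.Message);
                return DuplicateAction.NoAction;
            }
        }
    }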
Visual Web Ripper saves duplicate reference data every time the project has completed extracting data. Visual Web Ripper loads any existing reference data into memory before it starts extracting data, so it can compare previously extracted data with new data. After the project has run for a long time, the size of the reference data can become substantial. If I'm cancelling data extraction when I first encounter duplicate data, then I normally only need reference data from the last data extraction and not from the very first time I ran the project. I can set the "Discard old reference data" option to only use reference data from the last time I ran the project.
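The trade-off can be pictured with a rough sketch. How Visual Web Ripper actually stores its reference data is not covered here, so the per-run lists, the class name and the method name below are purely assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of reference data; not Visual Web Ripper's real storage format.
public static class ReferenceDataSketch
{
    // runs: the keys saved by each completed run, oldest first (an assumed layout).
    // discardOld: when true, only the keys from the most recent run are loaded.
    public static HashSet<string> LoadReferenceKeys(IList<List<string>> runs, bool discardOld)
    {
        var keys = new HashSet<string>();
        int start = (discardOld && runs.Count > 0) ? runs.Count - 1 : 0;
        for (int i = start; i < runs.Count; i++)
            keys.UnionWith(runs[i]);
        return keys;
    }
}
```

In this model the set loaded without discarding grows with every run, while the set loaded with discarding stays roughly the size of one run, which is the memory saving the option is meant to give.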
Download demo project Gaiaonline.zip