Extracting Data From CAPTCHA-Protected Websites
Visual Web Ripper supports both semi-automatic and full-automatic data extraction from CAPTCHA-protected websites. Full-automatic data extraction requires an account with a third-party CAPTCHA recognition service, which charges a fee for each CAPTCHA image. Semi-automatic data extraction is free, but you must decode the CAPTCHA images manually while the data extraction project runs.
Using Proxy Servers
Sometimes the easiest way to deal with a CAPTCHA-protected website is to use a list of proxy servers. This is especially true when CAPTCHA pages appear at random after you have browsed the website for a while. Proxy servers will not help if you always have to pass a CAPTCHA page to enter a section of a website.
Semi-Automatic Data Extraction
To configure your data extraction project for semi-automatic CAPTCHA processing, you need to do the following:
- Add a content element that selects the CAPTCHA image. Then use the Misc options tab to uncheck the Save content option.
- Add a FormField element that selects the CAPTCHA input field. Then use the AdvancedOptions tab to select the image element as a CAPTCHA element.
- Add a FormSubmit template that submits the CAPTCHA form. You may need to set the Misc option Optional template if the CAPTCHA form is not always displayed.
When Visual Web Ripper encounters a CAPTCHA element, it will display the CAPTCHA image and request the CAPTCHA code.
Full-Automatic Data Extraction
Full-automatic CAPTCHA processing requires an account with a third-party CAPTCHA recognition service. The service must provide a .NET API, and you must create a Visual Web Ripper script that uses this API to call the service.
Visual Web Ripper includes the API and a standard script for calling the following CAPTCHA recognition service:
http://www.deathbycaptcha.com
This CAPTCHA recognition service currently charges US$1.39 per 1000 CAPTCHAs. We are not affiliated with this company and therefore don't charge any additional fees for this service.
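To budget a project before running it, the quoted rate can be turned into a rough estimate. The following is a minimal sketch, not part of Visual Web Ripper: the class name CaptchaCost and the method Estimate are hypothetical, and it assumes one charge per submitted CAPTCHA image at the rate quoted above, which the service may change.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Visual Web Ripper.
public static class CaptchaCost
{
    // Rate quoted in the text above: US$1.39 per 1000 CAPTCHAs (assumed current).
    public const decimal PricePerThousandUsd = 1.39m;

    // Estimated fee in US dollars, assuming one charge per CAPTCHA image submitted.
    public static decimal Estimate(int captchaCount)
    {
        if (captchaCount < 0)
            throw new ArgumentOutOfRangeException(nameof(captchaCount));
        return captchaCount * PricePerThousandUsd / 1000m;
    }
}
```

For example, CaptchaCost.Estimate(25000) evaluates to 34.75, so a project that meets 25,000 CAPTCHA pages would cost roughly US$34.75 under these assumptions.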
To configure your data extraction project for full-automatic CAPTCHA processing, you need to do the following:
- Add a content element that selects the CAPTCHA image. Then use the Misc options tab to uncheck the Save content option.
- Add a FormField element that selects the CAPTCHA input field. Then use the AdvancedOptions tab to select the image element as a CAPTCHA element.
- Use the AdvancedOptions tab to add a Decode CAPTCHA script to the FormField element that selects the CAPTCHA input field.
- Add a FormSubmit template that submits the CAPTCHA form. You may need to set the Misc option Optional template if the CAPTCHA form is not always displayed.
Decode CAPTCHA Script
A decode CAPTCHA script is used to call a CAPTCHA recognition service. The script receives the CAPTCHA image as an input parameter and should return the decoded CAPTCHA value as a string.
You can add a decode CAPTCHA script to a FormField element by clicking the Decode CAPTCHA script option button in Advanced Options.
The script editor opens after you click the Decode CAPTCHA script button.
The default decode CAPTCHA script is designed to work with the www.deathbycaptcha.com service. If you use this service, you only need to add your login name and password.
A decode CAPTCHA script can be written in C# or VB.NET.
C# and VB.NET Scripts
A decode CAPTCHA script must have one method as shown below.
    using System;
    using mshtml;
    using VisualWebRipper;

    public class Script
    {
        public static string DecodeCaptcha(WrDecodeCaptchaArguments args)
        {
            try
            {
                // Replace "login" and "password" with your own account details.
                string captcha = DeathByCaptchaService.DecodeCaptcha
                    (args.ImagePath, "login", "password");
                return captcha;
            }
            catch (Exception exp)
            {
                // Log the error and return an empty string if decoding fails.
                args.WriteDebug(exp.Message);
                return "";
            }
        }
    }
public static string DecodeCaptcha(WrDecodeCaptchaArguments args)
The script method DecodeCaptcha must have this exact name and signature, so change only the method body. The method must return the decoded CAPTCHA value as a string.
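A recognition service can fail or return nothing, in which case the default script gives up after one call. The following is a minimal sketch of a retry helper that the method body could delegate to. It is an illustration under stated assumptions, not part of Visual Web Ripper: the names CaptchaRetry, Decode, and maxAttempts are hypothetical, and the decode and log delegates stand in for the service call and the debug logger. Whether a script may declare extra classes like this is also an assumption.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Visual Web Ripper.
public static class CaptchaRetry
{
    // decode: any function that maps an image path to the decoded text.
    // log: receives one diagnostic message per failed attempt.
    public static string Decode(string imagePath, Func<string, string> decode,
                                Action<string> log, int maxAttempts = 3)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                string captcha = decode(imagePath);
                if (!string.IsNullOrEmpty(captcha))
                    return captcha;
                log("Attempt " + attempt + ": empty result");
            }
            catch (Exception exp)
            {
                log("Attempt " + attempt + ": " + exp.Message);
            }
        }
        // Same failure value as the default script: an empty string.
        return "";
    }
}
```

Inside DecodeCaptcha, the call might look like CaptchaRetry.Decode(args.ImagePath, path => DeathByCaptchaService.DecodeCaptcha(path, "login", "password"), msg => args.WriteDebug(msg)). Each attempt is a separate request to the service, so it may be billed separately.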
WrDecodeCaptchaArguments Properties
| Name | Type | Description |
|---|---|---|
| ImagePath | String | The CAPTCHA image path. |
| Project | WrProject | The current Visual Web Ripper project. |
| DestinationDataSource | WrDataSource | Destination data source configuration. |
| InputDataSource | WrInputDataSource | Input data source configuration. |
| StartTemplate | WrTemplate | The first template in the project. |
| Database | WrSharedDatabase | An open database connection. |
| InputParameters | WrInputParameters | Input parameters for the current project. |