Blocked by Firewall
Enemia
#1 Posted : Monday, August 15, 2011 10:28:39 AM
Groups: Registered
Joined: 8/6/2010
Posts: 15
Hi VWR Community!

My current problem is as follows:
I scrape a website with many requests. After approx. 90 minutes I am blocked by the website.
I assume that a firewall is blocking me. In less than one hour the block is lifted and I can continue.

What I have already tried:
1. Random page delay, with a delay of 6000 to 10000 ms -> no success
2. A proxy -> it works (but I don't want to use a proxy in my infrastructure)
3. Switching between IE, WebBrowser and WebCrawler -> no success
4. Partial loads (delay of 2-3 seconds): 30 min VWR + 5 min pause + 30 min VWR + 5 min pause and so on -> success

What I would like to have:
1. I want VWR to stop after a specific time or a specific workload, wait a specific amount of time, and then continue working. My idea is to have a global counter or global variable that I can increase using a script.
My first page is a search form, and after every search I would like to increase my counter. When the counter reaches 500, I can use the .NET Sleep function to pause the script. Unfortunately I don't know how to define a global variable and increase its value within my VWR template.

2. If VWR cannot reach the website (because it is blocked by the firewall), I would like VWR to wait a specific amount of time and try again. At the moment VWR just continues and produces lots of errors.

3. Maybe someone has an alternative?

I appreciate any help :-)

greetings,
Marc
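
The counter idea from point 1 above can be sketched outside VWR as follows. This is only an illustration in Python, not VWR script code: the class name `BatchThrottle` and its parameters are hypothetical, and how to keep such a counter alive inside a VWR template is exactly the open question in this post. The sketch counts units of work (for example, searches) and pauses once every `batch_size` units.

```python
import time


class BatchThrottle:
    """Counts units of work and pauses after every `batch_size` units.

    Hypothetical helper for illustration; not part of VWR.
    """

    def __init__(self, batch_size=500, pause_seconds=300, sleep=time.sleep):
        self.batch_size = batch_size
        self.pause_seconds = pause_seconds
        self._sleep = sleep  # injectable so the pause can be tested
        self.count = 0

    def tick(self):
        """Record one unit of work; return True if a pause was taken."""
        self.count += 1
        if self.count % self.batch_size == 0:
            self._sleep(self.pause_seconds)
            return True
        return False
```

Calling `tick()` once after each search would pause for `pause_seconds` after the 500th, 1000th, ... search. The values 500 and 300 are placeholders; the 30 min / 5 min rhythm from point 4 of the first list suggests the pause length would have to be found by experiment.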
Banner
#2 Posted : Monday, August 15, 2011 11:49:20 AM
Groups: Registered
Joined: 3/9/2011
Posts: 23
You might be forced to run with a proxy. There are a lot of professional proxy services out there.

Are you using input data? If so, you could try selecting the top X rows (and flagging or deleting them in the same query) and then run your ripper. It will stop automatically when it is done with those rows. Then just put in a delay and run the same ripper again (getting the next top X rows).

You could do all that with a simple batch file. I have a project that doesn't have the time-out problem, but does run a little slow with 1 ripper, so I used this:

Quote:

UPDATE TOP(250)
    CaseRecords
SET
    GrabbedByMailRipper = 1
OUTPUT
    Inserted.ID AS ID,
    Inserted.Address1 AS Address1,
    Inserted.Address2 AS Address2,
    Inserted.City AS City,
    Inserted.State AS State,
    Inserted.Zipcode AS Zipcode
WHERE
    (
        GrabbedByMailRipper IS NULL OR
        GrabbedByMailRipper = 0
    ) AND
    cAddress1 IS NULL


as the input SQL query, flagging my records, and then run up to 4 rippers at once (same project) in a loop. Seems to work well for the most part. I did have to put a delay between the rippers starting.

Just an FYI: If you have an UPDATE or DELETE statement that returns data as input, it will run every time you open the project in the VWR development environment, not just when the project is run. In my case I have to shut off SQL Server if I open that project in development.


Hope this helps.
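
The batch-file loop described above (run the ripper, wait, run it again for the next rows) could look roughly like the Python sketch below. It assumes the project can be started from a command line; the command itself is passed in by the caller and is a placeholder, since the actual VWR command line is not given in this thread. The function name `run_in_batches` is hypothetical.

```python
import subprocess
import time


def run_in_batches(command, max_runs, pause_seconds,
                   run=subprocess.call, sleep=time.sleep):
    """Run `command` `max_runs` times, pausing between runs.

    `command` is an argument list as accepted by subprocess.call.
    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for i in range(max_runs):
        exit_codes.append(run(command))
        # No pause is needed after the final run.
        if i < max_runs - 1:
            sleep(pause_seconds)
    return exit_codes
```

Each run would pick up the next 250 unflagged rows through the input query, so `max_runs` has to be large enough to cover the table. The sketch runs one ripper at a time; starting several in parallel, as described above, would need a staggered start.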
Enemia
#3 Posted : Thursday, August 25, 2011 10:05:45 AM
Groups: Registered
Joined: 8/6/2010
Posts: 15
Hi,

Thanks for your advice.
I have a further question:

When I get locked out of the website, my VWR log shows this:

11-08-25 15:30 Navigation error (Click). Timeout waiting for browser to navigate. Action:
11-08-25 15:30 Error submitting form: Suche
11-08-25 15:30 Processing start URL http://www.xxxyyy.de
11-08-25 15:35 Navigation error (Direct URL). Timeout waiting for page load. Url: http://www.xxxyyy.de
11-08-25 15:35 Error processing start URL url (Timeout loading URL)
11-08-25 15:35 Processing start URL http://www.xxxyyy.de
11-08-25 15:35 Form element was not found: Artikel
11-08-25 15:35 Unable to connect to the target website. The website may be unreachable or you are not connected to the Internet.
11-08-25 15:35 Unable to recover from error by navigating to: http://www.xxxyyy.de/

These error messages then repeat for each input data row.
How can I stop processing when this error occurs? And how can I pass a specific return code back to the script that calls VWR?

thanks,
marc
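
One way to avoid the flood of errors is to check from the calling script, before starting (or restarting) VWR, whether the site answers at all, and to wait and retry if it does not. The Python sketch below shows that idea under stated assumptions: it runs outside VWR, the function names are hypothetical, and a plain HTTP request is only a rough stand-in for what the browser inside VWR experiences.

```python
import time
import urllib.request


def is_reachable(url, timeout=30):
    """Return True if `url` answers with a successful HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and HTTP error statuses.
        return False


def wait_until_reachable(url, attempts=5, wait_seconds=600,
                         check=is_reachable, sleep=time.sleep):
    """Try `check(url)` up to `attempts` times, waiting between tries.

    Returns True as soon as the site is reachable, False if every
    attempt failed.
    """
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(wait_seconds)
    return False
```

If `wait_until_reachable` returns False, the calling script can give up and report the problem instead of letting every input row fail.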
Sequentum Support
#4 Posted : Monday, September 05, 2011 5:24:21 AM

Groups: Administrators
Joined: 4/10/2010
Posts: 1,239
Location: Sydney, Australia
Hi,

In the Misc tab, you can select an option "Required element". If VWR cannot see the element, the project will be stopped automatically.
Enemia
#5 Posted : Tuesday, September 06, 2011 5:53:42 AM
Groups: Registered
Joined: 8/6/2010
Posts: 15
hi,

Thanks for the help.
I want to catch exactly this error (timeout while loading) and pass a user-defined return code back to the calling script (a PHP background process), so that the script can do further error handling.

Is this possible?
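
Whether VWR itself can return a custom code is not answered in this thread. As a workaround sketch, a small wrapper could scan the log after a run and turn the messages quoted in post #3 into an exit code for the calling process. Everything here is an assumption for illustration: the exit code values, the function names, and that the log is available as a text file whose path is passed as the first argument.

```python
EXIT_OK = 0
EXIT_TIMEOUT = 3        # hypothetical code for "timeout while loading"
EXIT_UNREACHABLE = 4    # hypothetical code for "site unreachable"


def exit_code_for_log(lines):
    """Map log lines to an exit code, based on the messages in post #3."""
    code = EXIT_OK
    for line in lines:
        if "Unable to connect to the target website" in line:
            # Most severe case: report it immediately.
            return EXIT_UNREACHABLE
        if ("Timeout waiting for page load" in line
                or "Timeout loading URL" in line):
            code = EXIT_TIMEOUT
    return code


def main(argv):
    """Read the log file named in argv[1] and return the exit code."""
    with open(argv[1], encoding="utf-8", errors="replace") as log_file:
        return exit_code_for_log(log_file)
```

Ending the wrapper with `sys.exit(main(sys.argv))` (after `import sys`) would hand the code to the calling PHP process, which can then decide whether to wait and restart the run.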