Extracting Data From Web Scraper Protected Web Sites

Many web sites use techniques to prevent web scrapers from extracting their data. The most popular protection techniques are CAPTCHA and IP banning.
A CAPTCHA protected web site displays a word as an image and requires the user to enter the word shown in order to proceed. Web scraping software generally cannot bypass a CAPTCHA screen on its own, because the web scraper is unable to extract the word from the image. OCR technology can be used to recognise words in an image, but most CAPTCHA images include noise, which makes it impossible to consistently recognise the words using OCR.
Visual Web Ripper is an advanced web grabber tool that features semi-automatic processing of CAPTCHA protected web sites. Visual Web Ripper can recognise CAPTCHA screens while extracting data and display the CAPTCHA image in a window. Once the user enters the CAPTCHA word in the form, Visual Web Ripper will automatically enter the word on the website and continue extracting web data. CAPTCHA is normally only used in a few places on a website in order not to annoy ordinary users, so the operator of the web scraping software normally only needs to enter a CAPTCHA word a few times for each web scraping session.
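The semi-automatic idea can be illustrated with a small Python sketch. This is not how Visual Web Ripper is implemented; it only shows the general pattern of pausing for a human when a CAPTCHA appears. The function names, the keyword-based detection, and the retry limit are all assumptions made for illustration, and the fetch, submit, and prompt steps are passed in as plain callables so the loop stays independent of any particular HTTP library.

```python
from typing import Callable, Iterable, List


def looks_like_captcha(html: str) -> bool:
    """Crude heuristic (an assumption): the markup mentions a captcha."""
    return "captcha" in html.lower()


def scrape_with_operator(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    submit_captcha: Callable[[str, str], str],
    ask_operator: Callable[[str], str],
    max_attempts: int = 3,
) -> List[str]:
    """Fetch each URL, asking a human for the CAPTCHA word when needed.

    fetch(url) returns the page HTML.
    ask_operator(url) shows the CAPTCHA to a person and returns the typed word.
    submit_captcha(url, word) sends the word and returns the resulting HTML.
    """
    pages = []
    for url in urls:
        html = fetch(url)
        attempts = 0
        # Keep asking while the site still shows a CAPTCHA (e.g. wrong answer).
        while looks_like_captcha(html):
            if attempts >= max_attempts:
                raise RuntimeError(f"CAPTCHA not solved for {url}")
            word = ask_operator(url)
            html = submit_captcha(url, word)
            attempts += 1
        pages.append(html)
    return pages
```

Because the operator is only consulted when a CAPTCHA page is detected, pages without a CAPTCHA are processed without any interruption.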
If you are extracting large quantities of data from a web site, the web site may recognise your IP-address and ban the IP-address from the website. This means you will no longer be able to visit the web site, or extract data from the web site.
Instead of using your own IP-address to access the web site, you can access the website through a proxy-server, so the web site sees the proxy-server’s IP-address instead of yours. The Visual Web Ripper web scraping software allows you to enter a list of proxy-servers and will automatically cycle through the proxy-servers, so the target website doesn’t see one single IP-address extracting lots of web data.
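For readers who want to see the idea in code, here is a minimal Python sketch of cycling through a list of proxy-servers using only the standard library. It is an illustration of the general technique, not Visual Web Ripper's own mechanism. The proxy addresses are placeholders from a documentation address range, and the class and function names are hypothetical.

```python
import itertools
import urllib.request

# Placeholder proxy addresses for illustration only; replace with real ones.
PROXIES = [
    "http://198.51.100.10:8080",
    "http://198.51.100.11:8080",
    "http://198.51.100.12:8080",
]


class ProxyRotator:
    """Hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy, opener); the opener routes requests via that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


def fetch(rotator, url, timeout=10):
    """Fetch one URL through the next proxy in the rotation."""
    proxy, opener = rotator.build_opener()
    with opener.open(url, timeout=timeout) as response:
        return proxy, response.read()
```

Each call to `fetch` uses the next proxy in the list, so consecutive requests reach the target website from different IP-addresses.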
Another benefit of using a proxy-server is that the target website cannot identify you simply by looking up the owner of your IP-address.
Most free proxy-servers are quite unreliable, and if you are unwilling to pay for stable proxy-servers, you may want to take a look at the free TOR network. TOR is a network of proxies, so your web request will go through multiple proxy-servers before ending up on the target web server. This is obviously a very secure and private way of scraping the web, but it does reduce the web data extraction speed. The Visual Web Ripper web scraping software works well with the TOR network.
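As a rough sketch of what routing a request through TOR can look like in a script, the Python example below assumes a TOR client is already running locally and listening on its usual SOCKS port, 127.0.0.1:9050 (the TOR Browser bundle typically uses 9150 instead). It also assumes the third-party `requests` library is installed with SOCKS support (`pip install "requests[socks]"`). The function names are hypothetical.

```python
# Assumed address of a locally running TOR client's SOCKS port.
# "socks5h" asks the proxy to resolve host names, so DNS lookups also go via TOR.
TOR_SOCKS_URL = "socks5h://127.0.0.1:9050"


def tor_proxies(socks_url=TOR_SOCKS_URL):
    """Build the proxies mapping that requests expects."""
    return {"http": socks_url, "https": socks_url}


def fetch_via_tor(url, timeout=30):
    """Fetch a page through the local TOR client; expect this to be slow."""
    # Imported here so the helper above works even without requests installed.
    import requests  # third-party; needs the "socks" extra for SOCKS proxies

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

The generous timeout reflects the reduced extraction speed mentioned above, since every request passes through several relays.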
