Screen scraping on the server side

My experience with scraping is that if you aren't doing anything especially complex (such as logging into a secure site like an online banking website), Python has some great libraries that will help you out a lot.

To answer your questions:

1) You may need to clarify the question, but the answer really depends on your server/client architecture.

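2) As a matter of fact you do. urllib and urllib2 (built-in Python 2 libraries, merged into urllib.request and urllib.parse in Python 3) let you URL-encode form data before you make a POST. Note that URL-encoding is not encryption: the data is only encrypted in transit when the target URL is https://, in which case the request goes over SSL/TLS. For most applications, that will suffice.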
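To make the encoding/encryption distinction concrete, here is a minimal sketch using the Python 3 names (urllib.parse and urllib.request). The URL, the field names, and the helper build_login_request are made up for illustration. It only builds the request and never sends it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, fields):
    """Build (but do not send) a form POST. Hypothetical helper for illustration."""
    # urlencode only escapes the values; the encryption comes from HTTPS,
    # so refuse plain http:// URLs for anything sensitive.
    if not url.lower().startswith("https://"):
        raise ValueError("refusing to send form data over a non-HTTPS URL")
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, method="POST")


# Placeholder URL and field names, not a real endpoint.
request = build_login_request(
    "https://example.com/login",
    {"user": "alice", "note": "a b&c"},
)
print(request.get_method())  # POST
print(request.data)          # b'user=alice&note=a+b%26c'
```

Passing the request to urllib.request.urlopen(request) would actually send it; with an https:// URL the body travels inside the TLS connection.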

3) I actually have done scraping on online banking sites! I'm not exactly familiar with that tool, but I would recommend using something a little different from a plain scraper. Selenium, which is a web driver, lets you simulate the use of a real browser, meaning that anything the browser does in the background to validate the session is taken care of automatically. The main problem I ran into while trying to scrape the banking site was the loss of important session data.

Selenium – https://pypi.python.org/pypi/selenium
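A rough sketch of what that looks like with Selenium's Python bindings. It assumes the selenium package is installed and that Firefox with its driver is available on the machine; the function name fetch_with_browser is hypothetical. The import sits inside the function only so the snippet can be loaded on a machine without Selenium.

```python
def fetch_with_browser(url):
    """Load a page in a real browser and return its HTML plus session cookies.

    Hypothetical helper; assumes selenium and Firefox are installed.
    """
    # Imported lazily so this file can be loaded without selenium present.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        # The browser handles redirects, JavaScript and cookies itself,
        # which is what keeps the session valid.
        driver.get(url)
        return driver.page_source, driver.get_cookies()
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()
```

Calling fetch_with_browser("https://example.com/") would open a browser window, load the page, and hand back the rendered HTML and the cookies the site set.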

Other libraries you may find useful are: urllib, urllib2, and Mechanize
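If you go the urllib route instead of a browser, one common way to avoid losing session data between requests is to attach a cookie jar to the opener. This is a minimal standard-library sketch in Python 3 (where urllib2 lives in urllib.request); make_session is a made-up name, and no request is actually sent here.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return an opener that remembers cookies, plus the jar it stores them in."""
    jar = CookieJar()
    # HTTPCookieProcessor saves Set-Cookie values into the jar and sends
    # them back on later requests made through the same opener.
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session()
print(len(jar))  # 0, nothing has been requested yet
```

Every call made through opener.open(...) then shares the same cookies, so a login followed by a second page request stays in one session.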

I hope I was somewhat helpful!

I've used screen-scraper to scrape banking sites before. It will impact the site just like your browser does: if the site uses encryption, the connection from screen-scraper to the site will be encrypted too.

If you have a client page sending data to screen-scraper, you should probably encrypt that connection as well. I generally just make the connection via SSH.

1) What do you mean by server side? Your proxy server or the screen-scraper software? Either of them can read and store your information.

2) If you are connecting through HTTPS, then your software should warn you about a malicious proxy server: https://security.stackexchange.com/questions/8145/does-https-prevent-man-in-the-middle-attacks-by-proxy-server
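For a Python client, that warning comes from certificate verification. A small sketch, assuming you use the standard-library ssl module: the default context checks both the certificate chain and the hostname, so a proxy presenting its own certificate makes the connection fail unless its CA has been added to the machine's trust store.

```python
import ssl

# The default client context verifies the server certificate and hostname.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

Passing this context to urllib.request.urlopen(url, context=context) makes the request raise an error instead of silently talking to a server whose certificate does not check out.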

3) I don't think they keep a log that they can read. But if you are concerned, you can try to write your own scraper. There are libraries that let you read HTML easily with jQuery syntax:
https://pypi.python.org/pypi/pyquery or XPath: http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/
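To give an idea of how little code a home-grown scraper needs, here is a sketch that pulls link targets out of a page. It deliberately uses the standard-library html.parser rather than pyquery or XPath, so it runs without installing anything; the class and function names and the sample HTML are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page fragment.
sample = (
    '<p><a href="/accounts">Accounts</a> '
    '<a name="top">x</a> '
    '<a href="/logout">Log out</a></p>'
)
links = extract_links(sample)
print(links)  # ['/accounts', '/logout']
```

With pyquery or an XPath library the selection step becomes a one-line query, but the overall shape (fetch the HTML, parse it, pick out the nodes you care about) stays the same.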
