Scraping DATA from Javascript using SCRAPY and PYTHON

If you look more closely at the page source, each link has the following markup:

<a id="DGMovie_ctl03_lnk" href="javascript:__doPostBack('DGMovie$ctl03$lnk','')">AGNI PARIKSHAYA</a>

Now you know how this JavaScript function is actually called: the href gives you the values of the event target and the event argument. To make sure you are on the right track, you can also inspect the page with your browser's developer tools and watch what happens when you click a link; if you are using Chrome, remember to tick the "Preserve log" checkbox. You will see the first argument to __doPostBack in the href sent as __EVENTTARGET.

The following XPath, combined with a regular expression, will give you the first argument of every postback call:

sel.xpath('//*[contains(@id,"DGMovie")]/@href').re(r"doPostBack\('([^']+)")
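To see what the regular expression captures without opening a Scrapy shell, here is a minimal sketch using only Python's `re` module. The first sample href is the one quoted above; the second is made up for illustration.

```python
import re

# Captures the first argument of __doPostBack('target','argument').
POSTBACK_TARGET = re.compile(r"doPostBack\('([^']+)'")

# The first href comes from the page markup; the second is a hypothetical sibling.
hrefs = [
    "javascript:__doPostBack('DGMovie$ctl03$lnk','')",
    "javascript:__doPostBack('DGMovie$ctl04$lnk','')",
]

targets = []
for href in hrefs:
    match = POSTBACK_TARGET.search(href)
    if match:
        # group(1) is the event target, e.g. DGMovie$ctl03$lnk
        targets.append(match.group(1))

print(targets)
```

Each captured string is what later goes into the __EVENTTARGET form field.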

You need to make a POST request with each parameter to get your information. Note that the web page uses an iframe, so you need to load the iframe source first.

pawel@stack:~/stack$ scrapy shell "http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search"
In [31]: url = sel.xpath('//iframe/@src').extract()[0]

In [33]: url
Out[33]: u'searchresults.aspx?va=a&Type=search'

In [35]: from urlparse import urljoin

In [36]: url = urljoin(response.url, url) 

In [39]: from scrapy.http import Request

In [40]: req = Request(url)
In [41]: fetch(req)

# after fetching request..
In [48]: js_links = sel.xpath('//*[contains(@id,"DGMovie")]/@href').re(r"doPostBack\('([^']+)")
In [49]: param = js_links[0]

In [50]: param
Out[50]: u'DGMovie$ctl03$lnk'

In [51]: from scrapy.http import FormRequest

In [52]: fr = FormRequest.from_response(response, formdata={'__EVENTTARGET': param})

In [53]: fetch(fr)
2014-06-02 21:09:09+0100 [default] DEBUG: Redirecting (302) to <GET http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=15&Loc=Backlog> from <POST http://cbfcindia.gov.in/html/searchresults.aspx?va=a&Type=search>
2014-06-02 21:09:10+0100 [default] DEBUG: Crawled (200) <GET http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=15&Loc=Backlog> (referer: None)
In [54]: view(response)
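The reason FormRequest.from_response is used in step [52] rather than a plain POST is that it resubmits the form fields already on the page along with the one you override. A rough standard-library sketch of that idea follows. It is not Scrapy's implementation; the class and function names, the sample form, and the __VIEWSTATE value are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_body(html, event_target, event_argument=""):
    """Return a urlencoded POST body that mimics a __doPostBack call."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    # Override the two fields that __doPostBack would fill in from JavaScript.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical, heavily shortened form; a real page carries a much longer __VIEWSTATE.
sample_html = """
<form method="post" action="searchresults.aspx?va=a&amp;Type=search">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
</form>
"""

body = build_postback_body(sample_html, "DGMovie$ctl03$lnk")
print(body)
```

In practice you would let Scrapy do this for you, as in the session above; the sketch only shows why the existing hidden fields have to travel with the request.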

In your spider, refactor the parse method so that it yields a FormRequest with parse_items as its callback, then move your parsing logic from parse into parse_items.

Don't forget about pagination; it is done with postbacks as well!
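Since pager links use the same __doPostBack mechanism, one way to handle them is to parse both arguments of the call and submit them the same way as the movie links. The sketch below assumes the pager hrefs follow the same pattern as the links above; the pager control name in the sample is hypothetical and the real one may differ.

```python
import re

# Captures both arguments of __doPostBack('target','argument').
POSTBACK_CALL = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def postback_fields(href):
    """Return the two form fields a __doPostBack href would set, or None."""
    match = POSTBACK_CALL.search(href)
    if match is None:
        return None
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


# Hypothetical pager link; check the actual markup of the page you scrape.
pager_href = "javascript:__doPostBack('DGMovie$ctl24$ctl01','')"
fields = postback_fields(pager_href)
print(fields)
```

The resulting dict has the same shape as the formdata argument passed to FormRequest.from_response in the session above, so a pager request can be built the same way as a detail-page request.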

ASP.NET pages that rely on postbacks are usually among the most difficult to scrape. Read more about them if you are interested.
