python - Need help with this Scrapy regular expression
I'm pretty new to Scrapy. I'm trying to crawl a website using CrawlSpider, and I want to crawl recursively by following the "next" button, but it is not working. I think the problem comes from the regular expression, but I have checked it many times and cannot find the mistake. The spider crawls only the landing page and does not proceed to the next page.
# -*- coding: utf-8 -*-
start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']

rules = (
    Rule(LinkExtractor(allow=r"/merchantrating/;_ylt=anf3hf19r8mgfpwuyujuny4ceb0f\?mid=13652&sort=1&start=\d+"),
         callback='parse_start_url', follow=True),
)

def parse_start_url(self, response):
    sel = Selector(response)
    contents = sel.xpath('//p')
    # Collect the text of every <p> element on the page
    for content in contents:
        item = BedbugsItem()
        item['pagecontent'] = content.xpath('text()').extract()
        self.items.append(item)
    return self.items
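A pattern like the one above can be checked outside Scrapy, since an allow pattern is applied as a regular-expression search against each extracted URL. The sketch below is only an illustration using the standard re module. The second URL's token and the looser pattern are made up, and it is an assumption (not confirmed by the question) that the ;_ylt= token differs between page loads.

```python
import re

# Pattern from the question, with the ;_ylt= token hard-coded.
STRICT = r"/merchantrating/;_ylt=anf3hf19r8mgfpwuyujuny4ceb0f\?mid=13652&sort=1&start=\d+"

# Hypothetical looser pattern: accept any (or no) token before the '?'.
LOOSE = r"/merchantrating/(;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+"

# Hypothetical "next" URLs; the token in the second one is invented.
same_token = ("https://shopping.yahoo.com/merchantrating/"
              ";_ylt=anf3hf19r8mgfpwuyujuny4ceb0f?mid=13652&sort=1&start=10")
other_token = ("https://shopping.yahoo.com/merchantrating/"
               ";_ylt=zzz111?mid=13652&sort=1&start=10")


def matches(pattern, url):
    """Mimic how an allow pattern is applied: a regex search on the URL."""
    return re.search(pattern, url) is not None


strict_results = [matches(STRICT, u) for u in (same_token, other_token)]
loose_results = [matches(LOOSE, u) for u in (same_token, other_token)]
print(strict_results)  # [True, False]
print(loose_results)   # [True, True]
```

If the token really does change, the strict pattern stops matching as soon as it does, which would explain a crawl that never leaves the landing page.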
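Use XPath instead of a regular expression. Restricting the link extractor to the pagination link means you no longer have to match the full URL. Note that XPath's contains() is case-sensitive, so 'next' must match the link text exactly as it appears on the page: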
rules = (
    Rule(LinkExtractor(
        restrict_xpaths=[
            "//div[@class='pagination']//a[contains(., 'next')]"
        ]),
        callback='parse_start_url', follow=True),
)
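To see which links an expression of this shape selects, here is a minimal sketch that emulates it with the standard library. The markup is hypothetical and much simpler than the real page. ElementTree supports only a subset of XPath and has no contains(), so the text check is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's HTML may differ.
PAGE = """
<html><body>
  <div class="pagination">
    <a href="/merchantrating/?mid=13652&amp;start=0">previous</a>
    <a href="/merchantrating/?mid=13652&amp;start=20">next</a>
  </div>
  <div class="footer"><a href="/help">next steps</a></div>
</body></html>
"""


def next_links(markup):
    """Return hrefs of <a> elements inside div.pagination whose text contains 'next'."""
    root = ET.fromstring(markup)
    hrefs = []
    # Step 1: restrict the search to the pagination container.
    for div in root.iterfind(".//div[@class='pagination']"):
        # Step 2: keep only anchors whose text contains 'next' (case-sensitive).
        for a in div.iter("a"):
            if "next" in "".join(a.itertext()):
                hrefs.append(a.get("href"))
    return hrefs


links = next_links(PAGE)
print(links)  # ['/merchantrating/?mid=13652&start=20']
```

The footer link is skipped even though its text contains "next", because it sits outside the pagination div. This is the same narrowing that restrict_xpaths gives the link extractor.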
python regex scrapy