Thursday 15 April 2010

web crawler - Python Scrapy, LinkExtractor doesn't work on a specific redirecting URL -

I am actually new to the web and Scrapy, so please understand if this question is foolish.

Here is what I want: (a) http://www.seoultech.ac.kr/ includes a link to URL (b) ctl.seoultech.ac.kr, and (b)'s domain is a subdomain of (a)'s.

My start_urls is (a), and using allow_domains=(b) in the LinkExtractor, the crawler extracts one page, (b).
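
For reference, the same extractor can be checked in isolation (a sketch, assuming Scrapy 1.x and its interactive shell): run scrapy shell http://www.seoultech.ac.kr/, where response is the fetched page, and then:

from scrapy.linkextractors import LinkExtractor

# Same filter as in the spider below; on page (a) this should print
# only the link to (b).
le = LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",))
for link in le.extract_links(response):
    print(link.url)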

Second, since page (b) includes URLs of the same domain, I expected the crawler to extract the URLs contained within (b) as well, but that doesn't work: crawling stops after (b).

URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php, but I know Scrapy handles redirects by itself, so I don't think that is the problem.
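
To confirm how the redirect is handled, one option (a sketch, assuming Scrapy 1.x, where spiders have a built-in self.logger and the default RedirectMiddleware records the redirect chain in request meta) is to add a log line inside parse_item of the spider below:

def parse_item(self, response):
    # 'redirect_urls' is the chain of URLs the RedirectMiddleware followed.
    redirect_chain = response.meta.get('redirect_urls', [])
    self.logger.info("final URL %s (redirected from %s)",
                     response.url, redirect_chain)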

Below is the simple code.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

import items  # the project's items module, which defines SeoultechItem


class Seoultech(CrawlSpider):
    name = 'seoultech'
    start_urls = ['http://www.seoultech.ac.kr/']
    allowed_domains = ['seoultech.ac.kr']
    rules = (
        Rule(LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        itemobj = items.SeoultechItem()
        itemobj['url'] = response.url
        yield itemobj  # a pipeline stores the URL in JSON format

As I said, URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php, and the LinkExtractor does not process the page at URL (b).
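
One way to narrow this down (a sketch, again assuming the Scrapy shell) is to fetch the redirected page directly with scrapy shell http://ctl.seoultech.ac.kr/web/index.php and run the same extractor against it; that separates "no links are extracted from (b)" from "extracted links are filtered or not followed":

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",))
links = le.extract_links(response)
print(len(links))  # 0 here would point at extraction, not at link following
for link in links:
    print(link.url)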

python-2.7 web-crawler scrapy-spider
