Thursday 15 April 2010

web crawler - Python Scrapy, LinkExtractor doesn't work on a specific redirecting URL -

I am actually new to the web and Scrapy, so please understand if this question is foolish.

Here is what I want: (a) http://www.seoultech.ac.kr/ includes a link to URL (b) ctl.seoultech.ac.kr, and (b)'s domain is a subdomain of (a)'s.

My start_urls is (a), and using allow_domains=(b) in the LinkExtractor, the crawler extracts one page, (b).
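
For reference, the same extractor can be checked in isolation (a sketch, assuming Scrapy 1.x and its interactive shell): run scrapy shell http://www.seoultech.ac.kr/, where response is the fetched page, and then:

from scrapy.linkextractors import LinkExtractor

# Same filter as in the spider below; on page (a) this should print
# only the link to (b).
le = LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",))
for link in le.extract_links(response):
    print(link.url)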

Second, since page (b) includes URLs of the same domain, I expected the crawler to extract the URLs contained within (b) as well, but that doesn't work: crawling stops after (b).

URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php, but I know Scrapy handles redirects by itself, so I don't think that is the problem.
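
To confirm how the redirect is handled, one option (a sketch, assuming Scrapy 1.x, where spiders have a built-in self.logger and the default RedirectMiddleware records the redirect chain in request meta) is to add a log line inside parse_item of the spider below:

def parse_item(self, response):
    # 'redirect_urls' is the chain of URLs the RedirectMiddleware followed.
    redirect_chain = response.meta.get('redirect_urls', [])
    self.logger.info("final URL %s (redirected from %s)",
                     response.url, redirect_chain)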

Below is the simple code.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

import items  # the project's items module, which defines SeoultechItem


class Seoultech(CrawlSpider):
    name = 'seoultech'
    start_urls = ['http://www.seoultech.ac.kr/']
    allowed_domains = ['seoultech.ac.kr']
    rules = (
        Rule(LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        itemobj = items.SeoultechItem()
        itemobj['url'] = response.url
        yield itemobj  # a pipeline stores the URL in JSON format

As I said, URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php, and the LinkExtractor does not process the page at URL (b).
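
One way to narrow this down (a sketch, again assuming the Scrapy shell) is to fetch the redirected page directly with scrapy shell http://ctl.seoultech.ac.kr/web/index.php and run the same extractor against it; that separates "no links are extracted from (b)" from "extracted links are filtered or not followed":

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",))
links = le.extract_links(response)
print(len(links))  # 0 here would point at extraction, not at link following
for link in links:
    print(link.url)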

python-2.7 web-crawler scrapy-spider
