Web crawler - Python Scrapy: LinkExtractor doesn't work on a specific redirecting URL
I'm new to the web and to Scrapy, so please bear with me if this question sounds foolish.
Here is my situation: page (a) http://www.seoultech.ac.kr/ includes a link to URL (b) ctl.seoultech.ac.kr, and (b)'s domain is a subdomain of (a)'s.
I put (a) in start_urls and use allow_domains=(b) in the LinkExtractor. First, the crawler extracts the one page (b) as expected. Second, since page (b) itself contains URLs on that same domain, I expected the crawler to extract the URLs contained within (b) as well, but that doesn't work: crawling stops after (b).
URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php. I know Scrapy processes redirects by itself, so I don't think that is the problem.
Below is my simple code.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    import items  # the project's items module


    class SeoulTech(CrawlSpider):
        name = 'seoultech'
        start_urls = ['http://www.seoultech.ac.kr/']
        allowed_domains = ['seoultech.ac.kr']
        rules = (
            Rule(LinkExtractor(allow_domains=("ctl.seoultech.ac.kr",)),
                 callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            itemobj = items.SeoulTechItem()
            itemobj['url'] = response.url
            yield itemobj  # a pipeline stores the URL in JSON format
As I said, URL (b) is redirected to http://ctl.seoultech.ac.kr/web/index.php, and the LinkExtractor does not address the page that URL (b) redirects to. How can I fix this?
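In case it helps to clarify what I mean by "subdomain of (a)'s domain": the matching that allow_domains (and allowed_domains) performs can be sketched in plain Python. This is a simplified stand-in for Scrapy's own scrapy.utils.url.url_is_from_any_domain, shown in Python 3; the function name here is mine.

```python
from urllib.parse import urlparse

def host_in_domains(url, domains):
    """True if the URL's host equals one of the domains or is a subdomain of it."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in domains)

print(host_in_domains("http://ctl.seoultech.ac.kr/web/index.php",
                      ["ctl.seoultech.ac.kr"]))   # True: exact host match
print(host_in_domains("http://www.seoultech.ac.kr/about/",
                      ["ctl.seoultech.ac.kr"]))   # False: sibling subdomain
print(host_in_domains("http://ctl.seoultech.ac.kr/",
                      ["seoultech.ac.kr"]))        # True: subdomain of allowed domain
```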
python-2.7 web-crawler scrapy-spider