Monday 15 August 2011

Python - Scrapy outputs results into one row of CSV




This is similar to another question, but the answers there didn't work for me. This is a follow-up question to my initial CSV output woes. With dreyescat's help I was able to get my CrawlSpider to output to CSV. However, it prints 2 columns (which correspond to my 2 fields) and only 1 row (dumping all the results into the appropriate column). I recreated the example dreyescat gave me for Hacker News and it works, and that's what I'm trying to replicate.

Here's my code (which is pretty much copied from the Hacker News example):

    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from targets.items import TargetsItem


    class MySpider(CrawlSpider):
        name = 'reuters'
        allowed_domains = ['blogs.reuters.com']
        start_urls = [
            'http://blogs.reuters.com/us/'
        ]

        # Follow every link inside the domain and hand each page to parse_item.
        rules = (
            Rule(LinkExtractor(allow_domains=('blogs.reuters.com', )), callback='parse_item'),
        )

        def parse_item(self, response):
            item = TargetsItem()
            item['title'] = response.xpath('//h2/a/text()').extract()
            item['link'] = response.xpath('//h2/a/@href').extract()
            return item

The edited output from the console looks like this:

    2014-10-24 13:04:04-0400 [reuters] DEBUG: Scraped from <200 http://blogs.reuters.com/hugo-dixon/>
        {'link': [u'//blogs.reuters.com/hugo-dixon/2014/10/20/markets-right-to-worry-about-euro-zone/', u'//blogs.reuters.com/hugo-dixon/2014/10/13/italy-has-no-good-plan-b/', u'//blogs.reuters.com/hugo-dixon/2014/10/06/how-to-manage-a-corporate-crisis/'], 'title': [u'Markets right to worry about euro zone', u'Italy has no good plan B', u'How to manage a corporate crisis']}

But I want the output to look like the example dreyescat gave me:

    2014-10-24 13:14:54-0400 [hackernews] DEBUG: Scraped from <200 https://news.ycombinator.com/item?id=8502433>
        {'comment': [u"I get it - Java people want to work in Java. However, this tool seems to be targeted at the M in the MVC paradigm. You still need to write the views and controllers in Objective-C. Unless your app has a large number of complex model objects, it's quicker to retype the model classes in Objective-C. Of course, if your app does have a lot of complex model objects (as Google does) and you want to have them in sync across platforms without having to retype them, this makes a ton of sense. For the majority of apps, not."], 'title': [u'Google J2ObjC, Java to iOS Objective-C translation tool and runtime']}

I suspect it has something to do with my XPath at this point, but I have little idea what I'm doing wrong. Hopefully someone can help me out. Much appreciated!

Your XPath seems to be targeting elements of the homepage itself. The code doesn't work like that; let me try to explain.

    rules = (
        Rule(LinkExtractor(allow_domains=('blogs.reuters.com', )), callback='parse_item'),
    )

The above code block defines what kind of links are useful (that is, which ones get processed further). The spider picks up all the links in the above domain, opens each page, and passes each individual page to the parse_item function. The XPath in the parse_item function should therefore target the page that opens when you click one of the blogs.reuters.com/... links.
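To make that flow concrete, here is a minimal sketch in plain Python (no Scrapy involved). FAKE_SITE, extract_links and crawl are hypothetical stand-ins I made up for illustration, and the real CrawlSpider does a lot more, but the point carries over: the callback runs once per followed link and only ever sees that linked page.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical stand-in for the website: page URL -> hrefs found on that page.
FAKE_SITE = {
    "http://blogs.reuters.com/us/": [
        "//blogs.reuters.com/hugo-dixon/2014/10/20/markets-right-to-worry-about-euro-zone/",
        "//blogs.reuters.com/hugo-dixon/2014/10/13/italy-has-no-good-plan-b/",
        "http://example.com/elsewhere/",
    ],
}


def extract_links(page_url, allow_domain):
    """Return absolute links on page_url whose host matches allow_domain."""
    absolute = (urljoin(page_url, href) for href in FAKE_SITE.get(page_url, []))
    return [url for url in absolute if urlparse(url).netloc == allow_domain]


def crawl(start_url, allow_domain, callback):
    """Follow allowed links breadth-first, calling callback once per linked page."""
    seen, queue, results = {start_url}, [start_url], []
    while queue:
        page = queue.pop(0)
        for link in extract_links(page, allow_domain):
            if link not in seen:
                seen.add(link)
                queue.append(link)
                results.append(callback(link))
    return results


def parse_item(url):
    # In Scrapy this would receive a response object for the linked page.
    return {"link": url}


items = crawl("http://blogs.reuters.com/us/", "blogs.reuters.com", parse_item)
for item in items:
    print(item)
```

Note that the callback is never called for the start page itself, and the off-domain link is dropped by the domain filter.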

In your case, the links on the homepage lead to individual articles. I checked that the headline of an article can be captured using the XPath //h2/text().

So maybe you should change your parse_item function to this:

    def parse_item(self, response):
        item = TargetsItem()
        item['title'] = response.xpath('//h2/text()').extract()
        item['link'] = response.xpath('<insert xpath obtain link news post>').extract()
        return item

Remember, parse_item gets called for every link that's in the domain blogs.reuters.com. You have to write XPaths that make sense for the page behind every such link.
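Here is a rough illustration of why the same path behaves differently on a listing page and on an article page. It uses the standard library's xml.etree.ElementTree, which only supports a small XPath subset (no text()), rather than Scrapy's selectors. The two markup snippets are made-up simplifications, not the real Reuters pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real pages are far messier.
INDEX_PAGE = """<html><body>
<h2><a href="/a/">First headline</a></h2>
<h2><a href="/b/">Second headline</a></h2>
</body></html>"""

ARTICLE_PAGE = """<html><body>
<h2>First headline</h2>
<p>Article text.</p>
</body></html>"""


def headlines(markup, path):
    """Return the stripped text of every element matching path."""
    root = ET.fromstring(markup)
    return [el.text.strip() for el in root.findall(path) if el.text]


# On a listing page, h2/a matches once per listed article.
index_titles = headlines(INDEX_PAGE, ".//h2/a")
# On an article page, the headline sits directly inside h2.
article_titles = headlines(ARTICLE_PAGE, ".//h2")
# The listing-page path finds nothing on the article page.
wrong_path_titles = headlines(ARTICLE_PAGE, ".//h2/a")

print(index_titles, article_titles, wrong_path_titles)
```

So a path written with the homepage in mind returns a whole list on listing pages and may return nothing on article pages.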

I couldn't find the link to the page within the page itself. You could maybe use the response URL in this case:

    item['link'] = response.url  # or something like that; read the manual
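As for why you ended up with a single CSV row: each item becomes one row, so one item stuffed with lists for a whole page is still one row. The sketch below uses only the standard csv module, not Scrapy's exporter. The to_csv helper and the sample data are hypothetical, and joining list values with a comma is my assumption about how the CSV exporter treats multi-valued fields by default.

```python
import csv
import io


def to_csv(items, fields=("title", "link")):
    """Write items as CSV; list values are joined with commas into one cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for item in items:
        row = []
        for field in fields:
            value = item[field]
            row.append(",".join(value) if isinstance(value, list) else value)
        writer.writerow(row)
    return buffer.getvalue()


# One item holding lists for a whole listing page: a single data row.
listing_item = {
    "title": ["Headline A", "Headline B"],
    "link": ["http://blogs.reuters.com/a/", "http://blogs.reuters.com/b/"],
}

# One item per article page, with the page URL as the link: one row each.
article_items = [
    {"title": "Headline A", "link": "http://blogs.reuters.com/a/"},
    {"title": "Headline B", "link": "http://blogs.reuters.com/b/"},
]

one_row = to_csv([listing_item])
many_rows = to_csv(article_items)
print(one_row)
print(many_rows)
```

The first output has a header plus one crowded row; the second has a header plus one row per article, which is the shape you were after.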

python scrapy
