Wednesday 15 February 2012

python - Scrapy accidentally over-writing items when running concurrently?

I have been running a Scrapy scraper and noticed that (about 10% of the time) it returns duplicate results. In other words, it is assigning the results of one item to another item.

I assume it has something to do with concurrency and global variables, but I'm not sure what. I have set a 250ms delay between requests, but it looks as though results are still being returned in parallel and accidentally over-writing each other.
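To illustrate the kind of shared-state problem I suspect (a simplified sketch, not my real spider; the spider name and URLs are invented): Scrapy keeps many requests in flight regardless of the download delay, so any state stored on the spider instance or in a module-level global is shared between callbacks.

import scrapy

class SharedStateSpider(scrapy.Spider):
    # Hypothetical spider showing the bug pattern: DOWNLOAD_DELAY only
    # spaces out requests; responses are still handled concurrently.
    name = 'shared_state_example'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    def parse(self, response):
        # BUG: self.item is one object shared by every callback, so a
        # later response can over-write its fields before the earlier
        # item has been exported -- the "duplicate results" symptom.
        self.item = {'address': response.url}
        yield self.item

    def parse_safe(self, response):
        # Safe pattern: build a fresh object inside each callback;
        # nothing is shared, so responses cannot touch each other's data.
        yield {'address': response.url}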

This is the spider code:

from scrapy.http import FormRequest
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

def start_requests(self):
    settings = get_project_settings()
    ids = settings.get('ids', None)
    for i, id in enumerate(ids):
        yield FormRequest(
            url=self._form_url,
            formdata={'id': id},
            meta={'id': id},
        )

def parse(self, response):
    addr_xpath = '//div[@class="w80p left floatright"]//text()'
    addresses = response.xpath(addr_xpath).extract()
    if not addresses:
        raise DropItem("Can't find address")
    item = MyItem()
    item['address'] = ', '.join(addresses)
    return item
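One detail: I attach meta={'id': id} to each request in start_requests but never read it back in parse. Stamping each item with it should at least make mismatches visible. A sketch, assuming MyItem has an 'id' field (not shown above):

def parse(self, response):
    addr_xpath = '//div[@class="w80p left floatright"]//text()'
    addresses = response.xpath(addr_xpath).extract()
    if not addresses:
        raise DropItem("Can't find address")
    item = MyItem()
    # response.meta is the per-request dict set in start_requests, so
    # it cannot be clobbered by other responses arriving concurrently.
    item['id'] = response.meta['id']  # assumes an 'id' field on MyItem
    item['address'] = ', '.join(addresses)
    return item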

What am I doing wrong?

python web-scraping scrapy
