Wednesday 15 February 2012

python - Scrapy accidentally over-writing items when running concurrently?

I have been running a Scrapy scraper and noticed that (about 10% of the time) it returns duplicate results. In other words, it is assigning the results of one item to another item.

I assume it has something to do with concurrency and global variables, but I'm not sure what. I have set a 250ms delay between requests, but it looks as though results are still being returned in parallel and accidentally over-writing each other.
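To illustrate the kind of shared-state problem I suspect (a simplified sketch, not my real spider; the spider name and URLs are invented): Scrapy keeps many requests in flight regardless of the download delay, so any state stored on the spider instance or in a module-level global is shared between callbacks.

import scrapy

class SharedStateSpider(scrapy.Spider):
    # Hypothetical spider showing the bug pattern: DOWNLOAD_DELAY only
    # spaces out requests; responses are still handled concurrently.
    name = 'shared_state_example'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    def parse(self, response):
        # BUG: self.item is one object shared by every callback, so a
        # later response can over-write its fields before the earlier
        # item has been exported -- the "duplicate results" symptom.
        self.item = {'address': response.url}
        yield self.item

    def parse_safe(self, response):
        # Safe pattern: build a fresh object inside each callback;
        # nothing is shared, so responses cannot touch each other's data.
        yield {'address': response.url}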

This is the spider code:

from scrapy.http import FormRequest
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

def start_requests(self):
    settings = get_project_settings()
    ids = settings.get('ids', None)
    for i, id in enumerate(ids):
        yield FormRequest(
            url=self._form_url,
            formdata={'id': id},
            meta={'id': id},
        )

def parse(self, response):
    addr_xpath = '//div[@class="w80p left floatright"]//text()'
    addresses = response.xpath(addr_xpath).extract()
    if not addresses:
        raise DropItem("Can't find address")
    item = MyItem()
    item['address'] = ', '.join(addresses)
    return item
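One detail: I attach meta={'id': id} to each request in start_requests but never read it back in parse. Stamping each item with it should at least make mismatches visible. A sketch, assuming MyItem has an 'id' field (not shown above):

def parse(self, response):
    addr_xpath = '//div[@class="w80p left floatright"]//text()'
    addresses = response.xpath(addr_xpath).extract()
    if not addresses:
        raise DropItem("Can't find address")
    item = MyItem()
    # response.meta is the per-request dict set in start_requests, so
    # it cannot be clobbered by other responses arriving concurrently.
    item['id'] = response.meta['id']  # assumes an 'id' field on MyItem
    item['address'] = ', '.join(addresses)
    return item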

What am I doing wrong?

python web-scraping scrapy
