Sunday, 15 April 2012

web scraping - Python HTML table can't find data when running on server




Hi, my code won't work when running online: it returns None when I use find. How can I fix this?

This is the code:

    import time
    import sys
    import urllib
    import re
    from bs4 import BeautifulSoup, NavigableString

    print "Initializing Python script"
    print "The passed arguments are "

    urls = [
        "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/",
    ]
    i = 0

    regex = '<title>(.+?)</title>'
    pattern = re.compile(regex)
    word = "tweakers"
    alternate = "alternate"

    # Download each page and print its specification table
    while i < len(urls):
        dataraw = urllib.urlopen(urls[i])
        data = dataraw.read()
        soup = BeautifulSoup(data)  # no parser given, so bs4 picks one itself
        table = soup.find("table", {"class": "spec-detail"})
        print table
        i += 1

Here is the outcome:

    Initializing Python script
    The passed arguments are
    None
    None
    None
    None
    Script finalized

I have tried using findAll and other methods. I don't understand why it works on the command line but not on the server itself. Any help?

Edit:

    Traceback (most recent call last):
      File "python_script.py", line 35, in <module>
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 406, in open
        response = meth(req, response)
      File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.7/urllib2.py", line 444, in error
        return self._call_chain(*args)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden

I suspect you are experiencing differences between parsers.

Specifying the parser explicitly works for me:

    import urllib2
    from bs4 import BeautifulSoup

    urls = [
        "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/",
    ]

    for url in urls:
        # Name the parser explicitly so every machine parses the page the same way
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table

In this case I'm using html.parser, but you can play around and specify lxml or html5lib, for example.

Note that the 3rd URL doesn't contain a table with class="spec-detail" and, therefore, the script prints None for it.
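The two points above (pin the parser, expect None for pages without the table) can be sketched together. This is only an illustration, written in Python 3 syntax rather than the Python 2 used in the post; the helper name find_spec_table and the sample markup are made up for the example and are not from the original code.

```python
from bs4 import BeautifulSoup


def find_spec_table(html, parser="html.parser"):
    """Return the first <table class="spec-detail"> in html, or None.

    Hypothetical helper: the parser is always named explicitly, so the
    result does not depend on which parsers happen to be installed.
    """
    soup = BeautifulSoup(html, parser)
    return soup.find("table", {"class": "spec-detail"})


# Made-up markup standing in for two downloaded pages.
with_table = (
    '<html><body><table class="spec-detail">'
    "<tr><td>GPU</td></tr></table></body></html>"
)
without_table = "<html><body><p>no specs here</p></body></html>"

for html in (with_table, without_table):
    table = find_spec_table(html)
    if table is None:
        # find() returns None when nothing matches, as with the 3rd URL
        print("no spec-detail table on this page")
    else:
        print(table.get_text(strip=True))
```

Checking for None before touching the result avoids an AttributeError on pages that simply lack the table.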

I've also introduced a few improvements:

- removed unused imports
- replaced the while loop with indexing with a nice for loop
- removed unrelated code
- replaced urllib with urllib2

As for the 403 Forbidden error: you can use the requests module and set an appropriate User-Agent header, pretending to be a real browser:

    from bs4 import BeautifulSoup
    import requests

    urls = [
        "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/",
    ]

    # Browser-like User-Agent so the server does not reject the request
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

    for url in urls:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table

Hope that helps.
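If requests is not available on the server, roughly the same idea can be sketched with the standard library alone. The following assumes Python 3 (urllib.request instead of the urllib2 in the traceback); the names build_request and fetch, the shortened User-Agent string, and the injectable opener argument are all choices made for this illustration, not part of the original answer.

```python
import urllib.error
import urllib.request

# Example browser-like header; the exact string is only a placeholder.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url):
    """Create a request that carries the browser-like User-Agent header."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch(url, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None on an HTTP error such as 403.

    opener is a parameter only so the function can be tried without
    network access; by default the real urlopen is used.
    """
    try:
        with opener(build_request(url)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # Report the status instead of letting the traceback end the script
        print("skipping %s: HTTP %d %s" % (url, err.code, err.reason))
        return None
```

A caller would pass the returned bytes to BeautifulSoup with an explicit parser and skip any URL for which fetch returns None.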

python web-scraping beautifulsoup
