web scraping - Python: HTML table can't find data when running on server
Hi, my code won't work when running online; it returns None when I use find(). How can I fix this?

This is the code:
    import time
    import sys
    import urllib
    import re
    from bs4 import BeautifulSoup, NavigableString

    print "Initializing python script"
    print "The passed arguments"

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    i = 0
    regex = '<title>(.+?)</title>'
    pattern = re.compile(regex)
    word = "tweakers"
    alternate = "alternate"

    while i < len(urls):
        dataraw = urllib.urlopen(urls[i])
        data = dataraw.read()
        soup = BeautifulSoup(data)
        table = soup.find("table", {"class": "spec-detail"})
        print table
        i += 1
Here is the outcome:

    Initializing python script
    The passed arguments
    None
    None
    None
    None
    Script finalized
I have tried using findAll and other methods, but I don't understand why it's working on the command line but not on the server itself. Any help?
Edit:

    Traceback (most recent call last):
      File "python_script.py", line 35, in <module>
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 406, in open
        response = meth(req, response)
      File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.7/urllib2.py", line 444, in error
        return self._call_chain(*args)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden
I suspect you are experiencing differences between parsers. Specifying the parser explicitly works for me:
    import urllib2
    from bs4 import BeautifulSoup

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    for url in urls:
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table
In this case I'm using html.parser, but you can play around and specify lxml or html5lib, for example.
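As a quick illustration of why the parser choice matters, here is a minimal sketch with a made-up malformed snippet (not the actual Tweakers markup): different parsers repair invalid HTML differently, so the same find() call can give different results depending on the parser.

    from bs4 import BeautifulSoup

    # deliberately invalid: a <p> directly inside a <table>
    snippet = "<table class='spec-detail'><p>stray</p></table>"

    # html5lib follows the HTML5 "foster parenting" rules and moves the
    # <p> out of the table; other parsers may leave it where it was written
    for parser in ('html.parser', 'lxml', 'html5lib'):
        soup = BeautifulSoup(snippet, parser)
        table = soup.find("table", {"class": "spec-detail"})
        print parser, table.find("p") is not None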
Note that the 3rd URL doesn't contain a table with class="spec-detail" and, therefore, it prints None for it.
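Since find() returns None when there is no match, it is worth guarding for that case before drilling into the table. A minimal sketch (the row/cell layout assumed below is a guess at the spec table's structure, not something verified against these pages):

    from bs4 import BeautifulSoup

    def extract_specs(html):
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        if table is None:
            # e.g. the 3rd URL, which has no spec-detail table
            return None
        # assumed layout: one specification per <tr>,
        # with the label and value in <th>/<td> cells
        return [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")]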
I've also introduced a few improvements:

- removed unused imports
- replaced the indexing while loop with a nice for loop
- removed unrelated code
- replaced urllib with urllib2
You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:
    from bs4 import BeautifulSoup
    import requests

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

    for url in urls:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table
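Given the 403 Forbidden in your traceback, the missing User-Agent header is very likely why the script fails on the server. If you prefer to stay with urllib2 instead of requests, the same header trick can be applied through a Request object; a sketch of the equivalent fix, not tested against these exact sites:

    import urllib2
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/37.0.2062.124 Safari/537.36'}

    url = "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/"
    # attach the browser-like header to the request to avoid the 403
    request = urllib2.Request(url, headers=headers)
    soup = BeautifulSoup(urllib2.urlopen(request), 'html.parser')
    print soup.find("table", {"class": "spec-detail"})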
Hope that helps.
python web-scraping beautifulsoup