web scraping - Python: HTML table can't find data when running on server
Hi, my code won't work when running online; it returns None when I use find(). How can I fix this?

This is the code:
    import time
    import sys
    import urllib
    import re
    from bs4 import BeautifulSoup, NavigableString

    print "Initializing python script"
    print "The passed arguments"

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    i = 0
    regex = '<title>(.+?)</title>'
    pattern = re.compile(regex)
    word = "tweakers"
    alternate = "alternate"

    while i < len(urls):
        dataraw = urllib.urlopen(urls[i])
        data = dataraw.read()
        soup = BeautifulSoup(data)
        table = soup.find("table", {"class": "spec-detail"})
        print table
        i += 1
Here is the outcome:

    Initializing python script
    The passed arguments
    None
    None
    None
    None
    Script finalized
I have tried using findAll and other methods, but I don't understand why it's working on the command line but not on the server itself. Any help?
Edit:

    Traceback (most recent call last):
      File "python_script.py", line 35, in <module>
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 406, in open
        response = meth(req, response)
      File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.7/urllib2.py", line 444, in error
        return self._call_chain(*args)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden
I suspect you are experiencing differences between parsers. Specifying the parser explicitly works for me:
    import urllib2
    from bs4 import BeautifulSoup

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    for url in urls:
        soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table
In this case I'm using html.parser, but you can play around and specify lxml or html5lib, for example.
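As a quick illustration of why the parser choice matters, here is a minimal sketch with a made-up malformed snippet (not the actual Tweakers markup): different parsers repair invalid HTML differently, so the same find() call can give different results depending on the parser.

    from bs4 import BeautifulSoup

    # deliberately invalid: a <p> directly inside a <table>
    snippet = "<table class='spec-detail'><p>stray</p></table>"

    # html5lib follows the HTML5 "foster parenting" rules and moves the
    # <p> out of the table; other parsers may leave it where it was written
    for parser in ('html.parser', 'lxml', 'html5lib'):
        soup = BeautifulSoup(snippet, parser)
        table = soup.find("table", {"class": "spec-detail"})
        print parser, table.find("p") is not None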
Note that the 3rd URL doesn't contain a table with class="spec-detail" and, therefore, it prints None for it.
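Since find() returns None when there is no match, it is worth guarding for that case before drilling into the table. A minimal sketch (the row/cell layout assumed below is a guess at the spec table's structure, not something verified against these pages):

    from bs4 import BeautifulSoup

    def extract_specs(html):
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        if table is None:
            # e.g. the 3rd URL, which has no spec-detail table
            return None
        # assumed layout: one specification per <tr>,
        # with the label and value in <th>/<td> cells
        return [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")]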
I've also introduced a few improvements:

- removed unused imports
- replaced the indexing while loop with a nice for loop
- removed unrelated code
- replaced urllib with urllib2
You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:
    from bs4 import BeautifulSoup
    import requests

    urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
            "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
            "https://www.alternate.nl/gigabyte/gv-n78toc-3gd-grafische-kaart/html/product/1115798",
            "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

    for url in urls:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find("table", {"class": "spec-detail"})
        print table
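Given the 403 Forbidden in your traceback, the missing User-Agent header is very likely why the script fails on the server. If you prefer to stay with urllib2 instead of requests, the same header trick can be applied through a Request object; a sketch of the equivalent fix, not tested against these exact sites:

    import urllib2
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/37.0.2062.124 Safari/537.36'}

    url = "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/"
    # attach the browser-like header to the request to avoid the 403
    request = urllib2.Request(url, headers=headers)
    soup = BeautifulSoup(urllib2.urlopen(request), 'html.parser')
    print soup.find("table", {"class": "spec-detail"})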
Hope that helps.
python web-scraping beautifulsoup