
python - Why doesn't this code work with all URLs? -




I'm new to Python and just playing around with some code. I'm trying to parse an HTML web page and extract some info from the parsed document:

from urllib import request
from bs4 import BeautifulSoup

# some code here...
link = str(input("enter url: "))
sock = request.urlopen(link)
pagetext = sock.read()
sock.close()
# some code here...
file = open("c:/test.txt", 'w')
file.write(pagetext.decode("utf-8"))
# some code here...

I'm getting an error on the file.write() line, and I've searched the net but still have no clue how to fix it.

The error:

Traceback (most recent call last):
  File "c:/users/monster/pycharmprojects/testpro_1/testfile.py", line 16, in <module>
    file.write(pagetext.decode("utf-8"))
  File "c:\python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 413334-413340: character maps to <undefined>

My code works with sites like www.google.com or www.flipkart.com, but gives this error for URLs like www.facebook.com and www.youtube.com. I think one possible reason it doesn't work with www.facebook.com and youtube.com is that they are developed in PHP or some other language and are not plain HTML web pages. Is that correct?

The problem is that you're trying to write a text file with cp1252 encoding, but the data includes characters that don't exist in cp1252.

In Python, the open function takes an optional encoding argument for text files. As the docs say, if you don't specify anything:

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)

on windows, "preferred encoding" returned function going whatever you've set default system. on version of windows, if haven't changed settings, pre-configured default "code page 1252", microsoft's variation on ibm's variation on latin-1. can handle 256 different characters (almost, not quite, identical first 256 characters in unicode). if have other characters, you're going error.

The reason it works on some pages but not others is that some pages only contain plain English-language characters, which fit in just about every character set.
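You can see the difference in a couple of lines at the interactive prompt (the CJK string below is just an arbitrary example of characters outside cp1252):

>>> 'hello'.encode('cp1252')           # plain ASCII fits fine
b'hello'
>>> '\u4e16\u754c'.encode('cp1252')    # CJK characters don't
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>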

If you want to save a UTF-8 text file, you have to say so explicitly:

f = open('c:/test.txt', 'w', encoding='utf-8')
f.write(pagetext.decode('utf-8'))

If you want to save a cp1252 text file (or, rather, a file in whatever the system's default encoding happens to be, which may be UTF-8 if someone runs your script on a Mac, or the Shift-JIS-based cp932 on a Japanese Windows box) by skipping or replacing or escaping the characters that don't fit, you can do that too:

f = open('c:/test.txt', 'w', errors='replace')
f.write(pagetext.decode('utf-8'))
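If you're not sure which error handler you want, it's easy to compare them at the interactive prompt (again using an arbitrary CJK string that cp1252 can't represent):

>>> '\u4e16\u754c'.encode('cp1252', errors='replace')           # substitute '?'
b'??'
>>> '\u4e16\u754c'.encode('cp1252', errors='ignore')            # drop them
b''
>>> '\u4e16\u754c'.encode('cp1252', errors='backslashreplace')  # escape them
b'\\u4e16\\u754c'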

Or, of course, if you want cp1252 no matter what the system is set to, say so:

f = open('c:/test.txt', 'w', encoding='cp1252', errors='replace')
f.write(pagetext.decode('utf-8'))

If you just want to save the raw bytes without worrying about what they are, open the file in binary mode and don't decode the bytes in the first place:

f = open('c:/test.txt', 'wb')
f.write(pagetext)

Of course if you then open that file in a cp1252 (or Shift-JIS, etc.) text editor, it's going to look like mojibake… but that's not your program's fault anymore. :)

However, you've got another problem here. You're assuming that every web page is UTF-8. That's not true. Pre-HTML5 web pages are, in fact, Latin-1 by default, but they can specify a different encoding in the headers (or in a meta tag, or, for XHTML, in the top-level xml tag). In particular, try that Facebook page:

>>> print(sock.getheader('content-type'))
text/html; charset=utf-8

That's how you know that, in this case, it's UTF-8.
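Putting that together, here's a minimal sketch: the response's headers object is an http.client.HTTPMessage, whose get_content_charset() pulls the charset parameter out of the Content-Type header (or returns None). The Latin-1 fallback and the errors='replace' are my defensive choices, not something the page requires:

sock = request.urlopen(link)
# use the declared charset if there is one, else the pre-HTML5 default
charset = sock.headers.get_content_charset() or 'latin-1'
pagetext = sock.read()
sock.close()
text = pagetext.decode(charset, errors='replace')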

For HTML5, it's… a lot more complicated. Ideally you'll want to use a library to do this for you. (Since you're already using BeautifulSoup, for many common cases its "Unicode, Dammit" will work well enough, and it works pretty well on pre-HTML5 pages too, but a standards-correct implementation would be better.)
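For example, a quick sketch of Unicode, Dammit run on the raw bytes (the class is importable from bs4; the attribute names are per the BeautifulSoup docs):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(pagetext)   # pagetext is the raw bytes from sock.read()
print(dammit.original_encoding)    # the encoding it detected or guessed
text = dammit.unicode_markup       # the decoded text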

python python-3.x unicode urllib
