Tuesday 15 January 2013

unicode - Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis - NLP)




I am playing around with NLTK for an assignment on sentiment analysis. I am using Python 2.7, NLTK 3.0, and NumPy 1.9.1.

This is the code:

__author__ = 'karan'
import nltk
import re
import sys

def main():
    print("start");
    # getting the stop words
    stopwords = open("english.txt","r");
    stop_word = stopwords.read().split();
    allstopwrd = []
    for wd in stop_word:
        allstopwrd.append(wd);
    print("stop words-> ",allstopwrd);

    # sample and cleaning
    tweet1= 'love, new toyí ½í¸í ½í¸#iphone6. http://t.co/shy1cab7sx'
    print("old tweet-> ",tweet1)
    tweet1 = tweet1.lower()
    tweet1 = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet1).split())
    print(tweet1);
    tw = tweet1.split()
    print(tw)

    #tokenize
    sentences = nltk.word_tokenize(tweet1)
    print("tokenized ->", sentences)

    #remove stop words
    otweet =[]
    for w in tw:
        if w not in allstopwrd:
            otweet.append(w);
    print("sans stop word-> ",otweet)

    # taggers for neg/pos/inc/dec/inv words
    taggers ={}
    negwords = open("neg.txt","r");
    neg_word = negwords.read().split();
    print("ned words-> ",neg_word)
    poswords = open("pos.txt","r");
    pos_word = poswords.read().split();
    print("pos words-> ",pos_word)
    incrwords = open("incr.txt","r");
    inc_word = incrwords.read().split();
    print("incr words-> ",inc_word)
    decrwords = open("decr.txt","r");
    dec_word = decrwords.read().split();
    print("dec wrds-> ",dec_word)
    invwords = open("inverse.txt","r");
    inv_word = invwords.read().split();
    print("inverse words-> ",inv_word)
    for nw in neg_word:
        taggers.update({nw:'negative'});
    for pw in pos_word:
        taggers.update({pw:'positive'});
    for iw in inc_word:
        taggers.update({iw:'inc'});
    for dw in dec_word:
        taggers.update({dw:'dec'});
    for ivw in inv_word:
        taggers.update({ivw:'inv'});
    print("tagger-> ",taggers)
    print(taggers.get('little'))

    # parts of speech
    postagger = [nltk.pos_tag(tw)]
    print("postagger-> ",postagger)

main();

This is the error I get when running the code:

SyntaxError: Non-ASCII character '\xc3' in file c:/users/karan/pycharmprojects/mainproject/sentiment.py on line 19, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How do I fix this error?

I also tried the code using Python 3.4.2, NLTK 3.0, and NumPy 1.9.1, but then I get this error:

Traceback (most recent call last):
  File "c:/users/karan/pycharmprojects/mainproject/sentiment.py", line 80, in <module>
    main();
  File "c:/users/karan/pycharmprojects/mainproject/sentiment.py", line 72, in main
    postagger = [nltk.pos_tag(tw)]
  File "c:\python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_pos_tagger)
  File "c:\python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

Add the following to the top of your file:

# coding=utf-8
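As a minimal sketch of where the declaration goes, assuming the file itself is saved as UTF-8 (the shortened sample string is an illustrative stand-in, not the full script from the question):

```python
# coding=utf-8
# The declaration above has to sit on the first or second line of the file.
# It tells Python 2 that the non-ASCII bytes in the string literal below
# are UTF-8, so the file no longer fails to parse with the SyntaxError.

# Illustrative sample containing a non-ASCII character.
tweet1 = 'love, new toy í #iphone6'

# Concatenate so the call prints the same text on Python 2 and Python 3.
print("old tweet-> " + tweet1)
```

The declaration only fixes the SyntaxError raised while parsing the source file under Python 2.7. It is not claimed here to address the UnicodeDecodeError shown in the Python 3.4.2 traceback, which is raised inside pickle.load.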

If you go to the link in the error, you can see the reason why:

Defining the Encoding

Python will default to ASCII as standard encoding if no other encoding hints are given. To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as: # coding=<encoding name>
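The first-or-second-line rule can be checked with the standard library. The sketch below uses Python 3's tokenize.detect_encoding; the helper name declared_encoding and the sample byte strings are illustrative assumptions. Note that Python 3 falls back to UTF-8 when no declaration is found, whereas the quoted text describes the ASCII default of Python 2.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


# Declaration on the first line: honoured.
first_line = b"# coding=latin-1\nprint('hi')\n"

# Declaration on the second line, after a shebang: honoured.
second_line = b"#!/usr/bin/env python\n# coding=latin-1\nprint('hi')\n"

# Declaration on the third line: ignored, so the default applies.
third_line = b"#!/usr/bin/env python\n\n# coding=latin-1\nprint('hi')\n"

for name, src in [("first", first_line),
                  ("second", second_line),
                  ("third", third_line)]:
    print(name, declared_encoding(src))
```

tokenize normalises the name latin-1 to iso-8859-1, so that is the value reported for the first two samples.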

python unicode nlp nltk
