Breedlove: unicode - Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis - NLP)

Tuesday, 15 January 2013


I am playing around with NLTK for an assignment on sentiment analysis. I am using Python 2.7, NLTK 3.0 and NumPy 1.9.1.

This is the code:

    __author__ = 'karan'
    import nltk
    import re
    import sys


    def main():
        print("start");
        # getting stop words
        stopwords = open("english.txt","r");
        stop_word = stopwords.read().split();
        allstopwrd = []
        for wd in stop_word:
            allstopwrd.append(wd);
        print("stop words-> ",allstopwrd);

        # sample tweet and cleaning
        tweet1= 'love, new toyí ½í¸í ½í¸#iphone6. http://t.co/shy1cab7sx'
        print("old tweet-> ",tweet1)
        tweet1 = tweet1.lower()
        tweet1 = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet1).split())
        print(tweet1);
        tw = tweet1.split()
        print(tw)

        # tokenize
        sentences = nltk.word_tokenize(tweet1)
        print("tokenized ->", sentences)

        # remove stop words
        otweet =[]
        for w in tw:
            if w not in allstopwrd:
                otweet.append(w);
        print("sans stop word-> ",otweet)

        # taggers for neg/pos/inc/dec/inv words
        taggers ={}
        negwords = open("neg.txt","r");
        neg_word = negwords.read().split();
        print("ned words-> ",neg_word)
        poswords = open("pos.txt","r");
        pos_word = poswords.read().split();
        print("pos words-> ",pos_word)
        incrwords = open("incr.txt","r");
        inc_word = incrwords.read().split();
        print("incr words-> ",inc_word)
        decrwords = open("decr.txt","r");
        dec_word = decrwords.read().split();
        print("dec wrds-> ",dec_word)
        invwords = open("inverse.txt","r");
        inv_word = invwords.read().split();
        print("inverse words-> ",inv_word)
        for nw in neg_word:
            taggers.update({nw:'negative'});
        for pw in pos_word:
            taggers.update({pw:'positive'});
        for iw in inc_word:
            taggers.update({iw:'inc'});
        for dw in dec_word:
            taggers.update({dw:'dec'});
        for ivw in inv_word:
            taggers.update({ivw:'inv'});
        print("tagger-> ",taggers)
        print(taggers.get('little'))

        # parts of speech
        postagger = [nltk.pos_tag(tw)]
        print("postagger-> ",postagger)

    main();

This is the error I get when running the code:

    SyntaxError: Non-ASCII character '\xc3' in file c:/users/karan/pycharmprojects/mainproject/sentiment.py on line 19, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How do I fix this error?

I also tried the code using Python 3.4.2, NLTK 3.0 and NumPy 1.9.1, but then I get this error:

    Traceback (most recent call last):
      File "c:/users/karan/pycharmprojects/mainproject/sentiment.py", line 80, in <module>
        main();
      File "c:/users/karan/pycharmprojects/mainproject/sentiment.py", line 72, in main
        postagger = [nltk.pos_tag(tw)]
      File "c:\python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
        tagger = load(_pos_tagger)
      File "c:\python34\lib\site-packages\nltk\data.py", line 779, in load
        resource_val = pickle.load(opened_resource)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

Add the following at the top of your file: # coding=utf-8
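As a rough sketch of what the fixed file could start with, the snippet below puts the declaration on the first line and then runs the same cleaning regex on a string literal that contains non-ASCII characters. The sample tweet text and the names `URL_AND_SYMBOLS` and `clean_tweet` are made up for illustration and are not from the question; it also assumes the file itself is saved as UTF-8.

```python
# coding=utf-8
# The declaration above must be on the first or second line of the file.
import re

# Hypothetical sample tweet containing non-ASCII characters.
tweet = u"Love my café ☕ #iPhone6 http://t.co/example"

# Same pattern as in the question, written as a raw string:
# mentions, any character that is not a letter/digit/space/tab, and URLs.
URL_AND_SYMBOLS = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"


def clean_tweet(text):
    """Lowercase the text and replace every pattern match with a space."""
    lowered = text.lower()
    return " ".join(re.sub(URL_AND_SYMBOLS, " ", lowered).split())


cleaned = clean_tweet(tweet)
print(cleaned)
```

Note that the regex throws away every non-ASCII character, so an accented word such as "café" is cut short. The encoding declaration only lets the interpreter read the source file; it does not change what the cleaning step keeps.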

If you follow the link in the error message, you can see the reason why:

Defining the Encoding

Python will default to ASCII as standard encoding if no other encoding hints are given. To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as: # coding=<encoding name>
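To see the first-or-second-line rule in action, here is a small sketch using the standard library function `tokenize.detect_encoding` from Python 3, which reads the magic comment the same way the interpreter does. The helper name `declared_encoding` and the sample byte strings are illustrative assumptions. Keep in mind that Python 3 falls back to UTF-8 rather than ASCII when no declaration is found, which is why the same file does not raise this SyntaxError under Python 3.4.2.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would use for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Declaration on the first line.
first_line = b"# coding=latin-1\nx = 1\n"

# Declaration on the second line, after a comment line (here a shebang).
second_line = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"

# Declaration on the third line: too late, so it is ignored.
third_line = b"# one\n# two\n# coding=latin-1\nx = 1\n"

for sample in (first_line, second_line, third_line):
    print(declared_encoding(sample))
```

The first two samples report the Latin-1 codec (normalised to `iso-8859-1`), while the third falls back to the default because only the first two lines are inspected.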

python unicode nlp nltk
