Monday 15 April 2013

python - UnicodeDecodeError in using StanfordParser for parsing tweets




I am trying to parse a stream of tweets using StanfordParser with the caseless English language model (englishPCFG.caseless.ser.gz), as mentioned in the FAQ: http://nlp.stanford.edu/software/parser-faq.shtml#ca. I encountered the following error while calling the raw_parse method:

    import nltk
    from nltk.parse.stanford import StanfordParser

    parser = StanfordParser(
        path_to_jar="stanford-parser.jar",
        path_to_models_jar="stanford-corenlp-caseless-2014-02-25-models.jar",
        model_path="edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz",
        encoding='utf-8',
    )

    tweet = 'good news™: weather going awesome today ultimate.'
    tweet_unicode = unicode(tweet, 'utf-8')
    parser.raw_parse(tweet_unicode)

    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-21-71163f9030ad> in <module>()
    ----> 1 parser.raw_parse_sents(sentence)

    /Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in raw_parse_sents(self, sentences, verbose)
        174             '-outputFormat', 'penn',
        175         ]
    --> 176         return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
        177
        178     def tagged_parse(self, sentence, verbose=False):

    /Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in _execute(self, cmd, input_, verbose)
        235                 stdout, stderr = java(cmd, classpath=(self._stanford_jar, self._model_jar),
        236                                        stdout=PIPE, stderr=PIPE)
    --> 237             stdout = stdout.decode(encoding)
        238
        239         os.unlink(input_file.name)

    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
         14
         15 def decode(input, errors='strict'):
    ---> 16     return codecs.utf_8_decode(input, errors, True)
         17
         18 class IncrementalEncoder(codecs.IncrementalEncoder):

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 56: invalid start byte

For other tweets containing special characters, the method works fine; in this particular case, it fails because of the trademark character. Any pointers on how to resolve this? From examining the source code of the parser file, it looks like the error occurs while reading the temporary file created by the parser.
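One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the traceback fails at `stdout.decode(encoding)`, and the offending byte is 0xaa. That byte is how the Mac OS Roman encoding represents the trademark sign, whereas UTF-8 uses three different bytes. The short sketch below only illustrates that mismatch; the helper `decodes_as_utf8` is a hypothetical name introduced for the example.

```python
# -*- coding: utf-8 -*-
TRADEMARK = u"\u2122"  # the trademark sign from the failing tweet

# UTF-8 encodes the trademark sign as three bytes, none of them a lone 0xaa.
utf8_bytes = TRADEMARK.encode("utf-8")

# Mac OS Roman encodes it as the single byte 0xaa, as seen in the traceback.
mac_bytes = TRADEMARK.encode("mac_roman")


def decodes_as_utf8(data):
    """Hypothetical helper: return True if `data` is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(utf8_bytes), decodes_as_utf8(utf8_bytes))
print(repr(mac_bytes), decodes_as_utf8(mac_bytes))
```

If this is indeed the cause, the Java process may be writing its output in the platform default encoding instead of UTF-8. Forcing the JVM default, for instance by setting the `JAVA_TOOL_OPTIONS` environment variable to `-Dfile.encoding=UTF-8` before the parser is started, might be worth trying; this has not been verified against the setup in the question.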

And, more generally, how can I ensure that all characters in the English language are accounted for?
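As a rough workaround for the general case, one option is to fold each tweet down to plain ASCII before handing it to the parser. The sketch below uses Unicode NFKD normalization from the standard library; `to_ascii` is a hypothetical helper, and the approach is lossy (characters with no ASCII decomposition are silently dropped), so it may not suit every corpus.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Fold compatibility characters (e.g. the trademark sign) to ASCII.

    Characters that have no ASCII decomposition are dropped, so this is lossy.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


tweet = u"good news\u2122: weather going awesome today ultimate."
print(to_ascii(tweet))
```

The cleaned string could then be passed to `raw_parse` in place of the original tweet. This sidesteps the encoding mismatch without explaining it, so it complements rather than replaces fixing the encoding used between Python and Java.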

python unicode nltk stanford-nlp
