Breedlove: python - UnicodeDecodeError in using StanfordParser for parsing tweets

Monday, 15 April 2013

I am trying to parse a stream of tweets using StanfordParser with the caseless English language model (englishPCFG.caseless.ser.gz), as mentioned in the FAQ: http://nlp.stanford.edu/software/parser-faq.shtml#ca. I encountered the following error while calling the raw_parse method:

    import nltk
    from nltk.parse.stanford import StanfordParser

    # Parser backed by the caseless English PCFG model
    parser = StanfordParser(
        path_to_jar="stanford-parser.jar",
        path_to_models_jar="stanford-corenlp-caseless-2014-02-25-models.jar",
        model_path="edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz",
        encoding='utf-8'
    )

    tweet = 'good news™: weather going awesome today ultimate.'
    # Python 2: turn the UTF-8 byte string into a unicode object
    tweet_unicode = unicode(tweet, 'utf-8')
    parser.raw_parse(tweet_unicode)

    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-21-71163f9030ad> in <module>()
    ----> 1 parser.raw_parse_sents(sentence)

    /Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in raw_parse_sents(self, sentences, verbose)
        174             '-outputFormat', 'penn',
        175         ]
    --> 176         return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
        177
        178     def tagged_parse(self, sentence, verbose=False):

    /Library/Python/2.7/site-packages/nltk/parse/stanford.pyc in _execute(self, cmd, input_, verbose)
        235             stdout, stderr = java(cmd, classpath=(self._stanford_jar, self._model_jar),
        236                                   stdout=PIPE, stderr=PIPE)
    --> 237             stdout = stdout.decode(encoding)
        238
        239         os.unlink(input_file.name)

    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
         14
         15 def decode(input, errors='strict'):
    ---> 16     return codecs.utf_8_decode(input, errors, True)
         17
         18 class IncrementalEncoder(codecs.IncrementalEncoder):

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 56: invalid start byte

For other tweets containing special characters, the method works fine; in this particular case, it fails because of the trademark character. Any pointers on how to resolve this? Examining the source code of the parser file, it looks like the error occurs while reading the temporary file created by the parser.
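
One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the traceback fails at stdout.decode(encoding), and byte 0xaa happens to be how the Mac OS Roman encoding represents the trademark sign, whereas UTF-8 uses three bytes for it. If the Java process wrote its output in a platform default encoding instead of UTF-8, this is the kind of failure that would result. The sketch below only demonstrates that mismatch with Python's built-in codecs; it does not call the parser, and the name try_decode is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: the same character encoded two ways, and what UTF-8 decoding makes of each.
# Assumption: the parser output may have been written in Mac OS Roman, not UTF-8.

TRADEMARK = u"\u2122"  # the trademark sign from the tweet

utf8_bytes = TRADEMARK.encode("utf-8")           # three bytes: E2 84 A2
mac_roman_bytes = TRADEMARK.encode("mac_roman")  # one byte: AA


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding.

    Hypothetical helper for this illustration only.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding the UTF-8 bytes as UTF-8 round-trips the character.
print(repr(try_decode(utf8_bytes, "utf-8")))

# Decoding the single Mac OS Roman byte as UTF-8 fails, because 0xAA
# is a continuation byte and cannot start a UTF-8 sequence.
print(repr(try_decode(mac_roman_bytes, "utf-8")))

# Decoding it with the encoding it was written in recovers the character.
print(repr(try_decode(mac_roman_bytes, "mac_roman")))
```

If this is what is happening, the fix would be to make the writer and the reader agree on one encoding, rather than to change the tweet text.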

And, more generally, how can I ensure that all characters in the English language are accounted for?
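
On the general question, one defensive option, wherever you control the decoding step yourself, is to try a short list of candidate encodings and only fall back to replacement characters as a last resort. This is a minimal sketch under stated assumptions: decode_with_fallback is a hypothetical helper, the candidate list ("utf-8", then "mac_roman") is a guess based on the byte in the error message, and nothing here changes how NLTK decodes the parser output internally.

```python
# -*- coding: utf-8 -*-
# Sketch: decode raw bytes by trying several encodings in order.
# Assumption: the caller has access to the raw bytes before they are decoded.


def decode_with_fallback(raw, encodings=("utf-8", "mac_roman")):
    """Return (text, encoding_used).

    Tries each candidate encoding strictly, in order. If none succeeds,
    decodes as UTF-8 with undecodable bytes replaced by U+FFFD and
    reports None as the encoding. Hypothetical helper, not an NLTK API.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", "replace"), None


# Bytes as UTF-8 would produce them: decoded on the first attempt.
text, used = decode_with_fallback(u"good news\u2122".encode("utf-8"))
print(repr(text), used)

# Bytes containing a lone 0xAA: UTF-8 fails, the second candidate is used.
text, used = decode_with_fallback(b"good news\xaa")
print(repr(text), used)
```

Trying encodings in order can silently pick the wrong one for ambiguous input, so this suits logging and diagnosis better than it suits a permanent fix.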

python unicode nltk stanford-nlp
