Sunday 15 July 2012

Web scraping - removing multiple \n in Python before sentence tokenizing




I'm brand new to programming, teaching myself out of a book and Stack Overflow. I'm trying to remove multiple instances of \n in a chat corpus and then tokenize it into sentences. If I don't remove the \n, I get strings like this:

['answers 10-19-20suser139 ... hi 10-19-20suser101 ;)\n\n\n\n\n\n\n\n\n\ni when it, 10-19-20suser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20suser20 go plan wedding!']

I've tried several different methods (chomp, line, rstrip, etc.), but none of them seem to work. Maybe I'm using them wrong. My whole code looks like this:

    import nltk, re, pprint
    from nltk.corpus import nps_chat
    chat = nltk.Text(nps_chat.words())
    from nltk.corpus import NPSChatCorpusReader
    from bs4 import BeautifulSoup
    chat = nltk.corpus.nps_chat.raw()
    soup = BeautifulSoup(chat)
    soup.get_text()
    text = soup.get_text()
    print(text[:40])
    print(len(text))
    from nltk.tokenize import sent_tokenize
    sent_chat = sent_tokenize(text)
    len(sent_chat)
    text[:] = [line.rstrip('\n') for line in text]
    print(len(sent_chat))
    print(sent_chat[:40])

When I use the line method, I get this error:

    Traceback (most recent call last):
      File "c:\python34\lib\idlelib\testsubjects\sentencelen.py", line 57, in <module>
        text[:] = [line.rstrip('\n') for line in text]
    TypeError: 'str' object does not support item assignment

help?
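
The TypeError comes from the fact that Python strings are immutable: text[:] = ... tries to assign into the string in place. Iterating over a string also yields single characters rather than lines, so rstrip('\n') would not do what was intended even on a list. Below is a minimal sketch of one way around both problems, assuming the goal is to collapse each run of newlines into a single one before tokenizing. The helper name collapse_newlines and the shortened sample string are hypothetical and used only for illustration.

```python
import re

def collapse_newlines(text, replacement="\n"):
    """Return a NEW string with every run of newlines replaced by `replacement`."""
    return re.sub(r"\n+", replacement, text)

# Shortened, illustrative sample in the shape of the chat corpus text.
sample = "hi 10-19-20suser101 ;)\n\n\n\niamahotnipwithpics\n\n10-19-20suser20 go plan wedding!"

# Strings cannot be modified in place, so bind the result to a name
# instead of assigning into the original string.
cleaned = collapse_newlines(sample)
posts = cleaned.split("\n")
print(posts)
```

The cleaned string, or each element of posts, could then be passed to sent_tokenize. Whether to tokenize the whole text or one post at a time depends on whether a sentence should be allowed to span two chat posts.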

    >>> x = 'answers 10-19-20suser139 ... hi 10-19-20suser101 ;)\n\n\n\n\n\n\n\n\n\ni when it, 10-19-20suser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20suser20 go plan wedding!'
    >>> y = "".join([i if i != "\n" else "\t" for i in x])
    >>> z = [i for i in y.split('\t') if i]
    >>> z
    ['answers 10-19-20suser139 ... hi 10-19-20suser101 ;)', 'i when it, 10-19-20suser83', 'iamahotnipwithpics', '10-19-20suser20 go plan wedding!']
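
The tab substitution above works as long as the text contains no tab characters of its own. Here is a sketch of an equivalent that avoids the placeholder character by using str.splitlines() and dropping blank lines. The helper name split_posts and the shortened sample are hypothetical. Note that splitlines() also treats \r\n and a few other characters as line boundaries, which may or may not be wanted for a given corpus.

```python
def split_posts(text):
    """Split on line boundaries and drop lines that are empty or whitespace-only."""
    return [line for line in text.splitlines() if line.strip()]

# Shortened, illustrative sample in the shape of the chat corpus text.
x = 'hi 10-19-20suser101 ;)\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20suser20 go plan wedding!'
z = split_posts(x)
print(z)
```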

python web-scraping nlp nltk data-cleaning
