Sunday 15 March 2015

R - multiple results of one variable when applying tm method "stemCompletion"




I have a corpus containing journal information: 15 observations of 3 variables (id, title, abstract). Using RStudio, I read the data in from a .csv file (one line per observation). While performing text mining operations, I ran into a problem with the stemCompletion method. After applying stemCompletion, the result for each stemmed line of the .csv appears three times. Other tm methods (e.g. stemDocument) produce a single result. I'm wondering why this happens and how to fix the problem.

I used the code below:

    data.corpus <- Corpus(DataframeSource(data))
    data.corpuscopy <- data.corpus
    data.corpus <- tm_map(data.corpus, stemDocument)
    data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy)

After applying stemDocument there is a single result per document, e.g.:

"> data.corpus[[1]] physic environ sourc innov investig attribut innov space investig physic space intersect innov innov relev attribut physic space innov reflect chang natur innov technolog advanc servic mean chang argu develop innov space similar embodi divers set valu collabor open sustain utilize literatur review interview benchmark examin relationship physic environ innov literatur review interview underlin innov communic human centr process result 5 attribut innov space nowadays collabor enabl modifi smart attract reflect provid perspect challeng back upwards innov creation develop physic space add together conceptu develop innov space outlin physic space innov servic"

And after using stemCompletion, the results appear three times:

"$`1` physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result 5 attributes innovation space nowadays collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge back upwards innovation creation develop physical space add-on conceptual develop innovation space outlines physical space innovation service physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result 5 attributes innovation space nowadays collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge back upwards innovation creation develop physical space add-on conceptual develop innovation space outlines physical space innovation service physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service 
meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result 5 attributes innovation space nowadays collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge back upwards innovation creation develop physical space add-on conceptual develop innovation space outlines physical space innovation service"

Below is a sample reproducible example:

A .csv file containing 3 observations of 3 variables:

    id;text a;text b
    1;below first title;innovation , knowledge management
    2;and sec title;organizational performance , learning of import
    3;the 3rd title;knowledge plays of import rule in organizations

And below is the stemming code I've used:

    data = read.csv2("test.csv")
    data[,2]=as.character(data[,2])
    data[,3]=as.character(data[,3])
    corpus <- Corpus(DataframeSource(data))
    corpuscopy <- corpus
    corpus <- tm_map(corpus, stemDocument)
    corpus[[1]]
    corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
    inspect(corpus[1:3])

It seems to me that it depends on the number of variables used in the .csv, but I have no idea why.

There seems to be something odd about the stemCompletion function. It's not obvious how to use stemCompletion in tm version 0.6. There is a nice workaround here that I've used for this answer.

First, create the csv file that you have:

    dat <- read.csv2(text = "id;text a;text b
    1;below first title;innovation , knowledge management
    2;and sec title;organizational performance , learning of import
    3;the 3rd title;knowledge plays of import rule in organizations")
    write.csv2(dat, "test.csv", row.names = FALSE)

Read it in, transform it into a corpus, and stem the words:

    data = read.csv2("test.csv")
    data[,2]=as.character(data[,2])
    data[,3]=as.character(data[,3])
    corpus <- Corpus(DataframeSource(data))
    corpuscopy <- corpus
    library(SnowballC)
    corpus <- tm_map(corpus, stemDocument)

Have a look to see that it's worked:

    inspect(corpus)
    <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    1 below first titl innovat , knowledg manag

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    2 , sec titl organiz perform , larn veri import

    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    3 3rd titl knowledg play import rule in organ

Here's the nice workaround to get stemCompletion working:

    stemcompletion_mod <- function(x, dict=corpuscopy) {
      PlainTextDocument(stripWhitespace(paste(
        stemCompletion(unlist(strsplit(as.character(x), " ")), dictionary=dict, type="shortest"),
        sep="", collapse=" ")))
    }
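The workaround splits each document into single tokens, completes every token on its own, and pastes the results back into one string. The base R sketch below only illustrates that split/complete/join pattern. The helpers complete_shortest and complete_line are hypothetical names made up for this illustration; they are not part of tm and are not how tm implements stemCompletion. The sketch assumes that "shortest" means picking the shortest dictionary word that starts with the stem.

```r
# Hypothetical helper (NOT part of tm): complete one stem to the shortest
# dictionary word that starts with it; return NA when nothing matches.
complete_shortest <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.min(nchar(hits))]
}

# Hypothetical wrapper: split one line into tokens, complete each token,
# then join the completed tokens back into a single string.
complete_line <- function(line, dictionary) {
  tokens <- unlist(strsplit(line, " ", fixed = TRUE))
  completed <- vapply(tokens, complete_shortest, character(1),
                      dictionary = dictionary, USE.NAMES = FALSE)
  paste(completed, collapse = " ")
}

# Toy dictionary and stemmed line, chosen only for illustration.
dictionary <- c("title", "innovation", "knowledge", "management")
result <- complete_line("titl innovat knowledg manag", dictionary)
print(result)
```

In this sketch a token without any match becomes NA, and paste() then writes it into the joined string as the text "NA".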

Inspect the output to see if the stems were completed OK:

    lapply(corpus, stemcompletion_mod)
    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    1 below first title innovation , knowledge management

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    2 , sec title organizational performance , learning NA of import

    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    3 3rd title knowledge plays of import rule in organizations

Success!

r rstudio tm stemming
