Sunday 15 August 2010

R - dplyr: count distinct values in a readable way




I'm new to dplyr and need to calculate the number of distinct values in a group. Here's an example table:

    data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c('a', 'b', 'a', 'c', 'c'))

I know I can do things like:

    by_bb <- group_by(data, bb, add = TRUE)
    summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

But what if I want the count of unique elements?

I can do:

    > summarise(by_bb, length(unique(unlist(aa))))
      bb length(unique(unlist(aa)))
    1  a                          2
    2  b                          1
    3  c                          2

And if I want to exclude NAs, I can do:

    > summarise(by_bb, length(unique(unlist(aa[!is.na(aa)]))))
      bb length(unique(unlist(aa[!is.na(aa)])))
    1  a                                      2
    2  b                                      1
    3  c                                      1

But that's a little unreadable to me. Is there a better way to do this kind of summarization?

How about this option:

    data %>%                                        # take the data.frame "data"
      filter(!is.na(aa)) %>%                        # using "data", filter out all rows with NAs in aa
      group_by(bb) %>%                              # then, with the filtered data, group it by "bb"
      summarise(unique_elements = n_distinct(aa))   # now summarise with the unique elements per group

    #Source: local data frame [3 x 2]
    #
    #  bb unique_elements
    #1  a               2
    #2  b               1
    #3  c               1

Use filter to drop any rows where aa is NA, then group the data by column bb, and then summarise by counting the number of unique elements of column aa within each group of bb.
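As a variation on the steps just described, here is a minimal sketch that skips the filter step and relies on the na.rm argument of n_distinct() instead. It assumes dplyr is installed and reuses the example data from the question; the result name `result` is only illustrative.

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# Count the distinct non-NA values of aa within each group of bb,
# without removing any rows beforehand
result <- data %>%
  group_by(bb) %>%
  summarise(unique_elements = n_distinct(aa, na.rm = TRUE))

print(result)
```

One difference to keep in mind: a group whose aa values are all NA disappears when you filter first, whereas with na.rm = TRUE it stays in the result with a count of 0.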

As you can see, I'm making use of the pipe operator %>%, which you can use to "pipe" or "chain" commands together when using dplyr. This helps you write readable code because it reads more naturally: you write code from left to right and top to bottom, rather than nested from the inside out (as in your example code).
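To make the contrast concrete, here is a small sketch that writes the same computation twice, once nested from the inside out and once as a pipeline. It assumes dplyr is installed; the names `nested` and `piped` are only illustrative.

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# Nested form: has to be read from the innermost call outwards
nested <- summarise(
  group_by(filter(data, !is.na(aa)), bb),
  unique_elements = n_distinct(aa)
)

# Piped form: the same steps, read top to bottom in execution order
piped <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(unique_elements = n_distinct(aa))

# Compare the two results
print(all.equal(nested, piped))
```

Both forms call exactly the same functions with the same arguments; the pipe only changes the order in which the code is written.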

Edit:

In the first part of your question, you wrote:

I know I can do things like:

    by_bb <- group_by(data, bb, add = TRUE)
    summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

Here's another option (applying a number of functions to the same column(s)):

    data %>%
      filter(!is.na(aa)) %>%
      group_by(bb) %>%
      summarise_each(funs(mean, max, sum, n_distinct), aa)

    #Source: local data frame [3 x 5]
    #
    #  bb mean max sum n_distinct
    #1  a    2   3   4          2
    #2  b    2   2   2          1
    #3  c    4   4   4          1
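summarise_each() and funs() have since been deprecated in dplyr. As a hedged sketch of how the same summary might be written in a newer release, the version below uses across() with a named list of functions. It assumes dplyr 1.0.0 or later; the name `result` is only illustrative. Note that across() prefixes each output column with the input column name by default, so the columns come out as aa_mean, aa_max and so on rather than mean, max.

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c("a", "b", "a", "c", "c"))

# Apply several summary functions to column aa within each group of bb
result <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(across(aa, list(mean = mean, max = max,
                            sum = sum, n_distinct = n_distinct)))

print(result)
```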

r dplyr summarization
