Breedlove: r - dplyr count distinct readable way

Sunday 15 August 2010

I'm new to dplyr and need to count the distinct values within each group. Here's an example table:

data <- data.frame(aa = c(1, 2, 3, 4, NA), bb = c('a', 'b', 'a', 'c', 'c'))

I know I can do things like:

by_bb <- group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

But what if I want the count of unique elements?

I can do:

> summarise(by_bb, length(unique(unlist(aa))))
  bb length(unique(unlist(aa)))
1  a                          2
2  b                          1
3  c                          2

And if I want to exclude NAs I can do:

> summarise(by_bb, length(unique(unlist(aa[!is.na(aa)]))))
  bb length(unique(unlist(aa[!is.na(aa)])))
1  a                                      2
2  b                                      1
3  c                                      1

But that's a little unreadable to me. Is there a better way to do this kind of summarization?

How about this option:

data %>%                    # take the data.frame "data"
  filter(!is.na(aa)) %>%    # using "data", filter out all rows with NAs in aa
  group_by(bb) %>%          # then, with the filtered data, group it by "bb"
  summarise(unique_elements = n_distinct(aa))   # summarise with unique elements per group

#Source: local data frame [3 x 2]
#
#  bb unique_elements
#1  a               2
#2  b               1
#3  c               1

Here, filter removes any rows where aa is NA, group_by groups the data by column bb, and summarise counts the number of unique elements of column aa within each group of bb.
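A variation on the approach above, as a sketch: n_distinct() accepts an na.rm argument, so the missing values can be dropped inside the summary rather than in a separate filter() step. The column name unique_elements is reused from the answer; the name res is only for illustration. One difference to keep in mind: filter() removes a group entirely if all of its aa values are NA, whereas na.rm = TRUE keeps such a group and reports 0 for it.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c('a', 'b', 'a', 'c', 'c'))

# Count distinct non-missing values of aa within each group of bb
res <- data %>%
  group_by(bb) %>%
  summarise(unique_elements = n_distinct(aa, na.rm = TRUE))

res
```

For the example data this should give the same counts as the filtered version, since no group consists only of NAs.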

As you can see, I'm making use of the pipe operator %>%, which you can use to "pipe" or "chain" commands together when using dplyr. This helps you write readable code because it's more natural: you write code from left to right and top to bottom, rather than nested from the inside out (as in your example code).
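To make the readability point concrete, here is a small sketch that writes the same computation both ways. The names nested and piped are only for illustration.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c('a', 'b', 'a', 'c', 'c'))

# Nested form: has to be read from the innermost call outwards
nested <- summarise(group_by(filter(data, !is.na(aa)), bb),
                    unique_elements = n_distinct(aa))

# Piped form: reads top to bottom, one step per line
piped <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(unique_elements = n_distinct(aa))

# Both forms describe the same sequence of operations
all.equal(nested, piped)
```

The pipe simply passes the left-hand result as the first argument of the next call, so the two forms are interchangeable; the piped one just keeps the steps in the order they happen.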

Edit:

In the first part of your question, you wrote:

I know I can do things like:

by_bb <- group_by(data, bb, add = TRUE)
summarise(by_bb, mean(aa, na.rm = TRUE), max(aa), sum(!is.na(aa)), length(aa))

Here's another option (applying a number of functions to the same column(s)):

data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise_each(funs(mean, max, sum, n_distinct), aa)

#Source: local data frame [3 x 5]
#
#  bb mean max sum n_distinct
#1  a    2   3   4          2
#2  b    2   2   2          1
#3  c    4   4   4          1
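summarise_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, a roughly equivalent sketch uses across(). The .names = "{.fn}" argument is chosen here only so that the result columns are named after the functions (mean, max, sum, n_distinct); the name res is for illustration.

```r
library(dplyr)

# Example data from the question
data <- data.frame(aa = c(1, 2, 3, 4, NA),
                   bb = c('a', 'b', 'a', 'c', 'c'))

# Apply several summary functions to column aa within each group of bb
res <- data %>%
  filter(!is.na(aa)) %>%
  group_by(bb) %>%
  summarise(across(aa,
                   list(mean = mean, max = max, sum = sum,
                        n_distinct = n_distinct),
                   .names = "{.fn}"))

res
```

Without the .names argument, across() would by default name the columns aa_mean, aa_max and so on, which is usually the safer choice when more than one column is summarised.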

r dplyr summarization
