Thursday 15 April 2010

r - Why does data.table group differently depending on whether I pass it the variable name directly or not?




If I pass the variable bloodpressure to data.table directly, it works fine.

tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))
tdt[, list(
    varname='bloodpressure',
    n=.N,
    mean=mean(bloodpressure, na.rm=TRUE),
    sd=sd(bloodpressure, na.rm=TRUE)
  ), by=(strata.var)]

I get this result:

   strata.var       varname   n     mean       sd
1:          0 bloodpressure 500 100.2821 15.13686
2:          1 bloodpressure 500 100.0392 15.02566

which matches the group means computed directly:

> mean(tdt$bloodpressure[tdt$male==0])
[1] 100.2821
> mean(tdt$bloodpressure[tdt$male==1])
[1] 100.0392

But if I try to do this programmatically, with the column stored in a variable (var):

var_as_string <- 'bloodpressure'
var <- with(tdt, get(var_as_string))
tdt[, list(
    varname='bloodpressure',
    n=.N,
    mean=mean(var, na.rm=TRUE),
    sd=sd(bloodpressure, na.rm=TRUE)
  ), by=(strata.var)]

I get a different result.

   strata.var       varname   n     mean       sd
1:          0 bloodpressure 500 100.1606 15.13686
2:          1 bloodpressure 500 100.1606 15.02566

Notice that the mean is identical in both groups (i.e. it is calculated across the whole sample, not per group).

> mean(tdt$bloodpressure)
[1] 100.1606

You can replace mean=mean(var, na.rm=TRUE) with mean=mean(get(var_as_string), na.rm=TRUE), and it should work. Otherwise data.table uses the full-length numeric vector stored in var rather than the table column you want to use, so the grouping is never applied to it (and it returns mean(var) for both subgroups).

library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))

# Column referenced directly by name
tdt[, list(
    varname='bloodpressure',
    n=.N,
    mean=mean(bloodpressure, na.rm=TRUE),
    sd=sd(bloodpressure, na.rm=TRUE)
  ), by=(strata.var)]
#    strata.var       varname   n      mean       sd
# 1:          0 bloodpressure 500  99.58425 15.55735
# 2:          1 bloodpressure 500 100.06630 15.50188

# Column looked up by name inside j, so it is subset per group
var_as_string <- 'bloodpressure'
tdt[, list(
    varname='bloodpressure',
    n=.N,
    mean=mean(get(var_as_string), na.rm=TRUE),
    sd=sd(bloodpressure, na.rm=TRUE)
  ), by=(strata.var)]
#    strata.var       varname   n      mean       sd
# 1:          0 bloodpressure 500  99.58425 15.55735
# 2:          1 bloodpressure 500 100.06630 15.50188
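To see why the two versions differ, it can help to compare the length of each object inside every group. The sketch below is only an illustration, not part of the original answer: the len_var and len_get column names are made up here, male is built with an explicit rep(c(0, 1), 500) so that it does not rely on recycling, and the .SDcols variant at the end is one possible alternative to get().

```r
library(data.table)

set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean = 100, sd = 15),
                  male = rep(c(0, 1), 500))

var_as_string <- "bloodpressure"
# Full-length copy of the column, created outside the data.table call
var <- tdt[[var_as_string]]

# Inside j, `var` is not a column name, so it is looked up in the calling
# environment and is not subset by group. get(var_as_string) resolves to
# the column itself, which data.table does subset by group.
lens <- tdt[, list(n = .N,
                   len_var = length(var),
                   len_get = length(get(var_as_string))),
            by = male]
print(lens)

# Possible alternative (an assumption, not from the answer above):
# pick the column with .SDcols and read it from .SD, which holds only
# the current group's rows.
res <- tdt[, list(n = .N,
                  mean = mean(.SD[[1]], na.rm = TRUE),
                  sd = sd(.SD[[1]], na.rm = TRUE)),
           by = male, .SDcols = var_as_string]
print(res)
```

In each group len_var stays at the full 1000 rows while len_get is the group size, which is why mean(var) comes out the same for both groups.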

r data.table
