i have searched similar questions, cannot find exact solution required. this question similar, deals issue of summarising multiple continuous variables, not factors.
i have dataframe consisting of 4 factor variables (sex, agegroup, hiv, group), e.g.
set.seed(20150710) df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=t))), agegroup=as.factor(c(sample(1:5,10000, replace=t))), hiv=as.factor(c(sample(1:3,10000, replace=t))), group=as.factor(c(sample(1:2,10000, replace=t))) ) levels(df$sex)<- c("male", "female") levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+") levels(df$hiv)<-c("positive", "negative", "not tested") levels(df$group)<-c("intervention", "control") i create summary table giving counts , proportions each level of exposure variables sex, agegroup , hiv, stratified group.
edit: aiming for:
x n_control percent_control n_intervention percent_intervention 1 sex_female 2517 0.5041057 2480 0.4953066 2 sex_male 2476 0.4958943 2527 0.5046934 3 agegroup_16-24 1005 0.2012818 992 0.1981226 4 agegroup_25-34 1001 0.2004807 996 0.1989215 5 agegroup_35-44 1010 0.2022832 997 0.1991212 6 agegroup_45-54 976 0.1954737 996 0.1989215 7 agegroup_55+ 1001 0.2004807 1026 0.2049131 8 hiv_negative 1679 0.3362708 1642 0.3279409 9 hiv_not tested 1633 0.3270579 1660 0.3315359 10 hiv_positive 1681 0.3366713 1705 0.3405233 but cannot work summarise_each in dplyr; overall variable counts , proportions, , not each factor level, given:
df.out<-df %>% group_by(group) %>% summarise_each(funs(n=n(), percent=n()/sum(n())), sex, agegroup, hiv) print(df.out) group sex_n agegroup_n hiv_n sex_percent agegroup_percent hiv_percent 1 1 4973 4973 4973 1 1 1 2 2 5027 5027 5027 1 1 1 finally, there way reshape table (e.g. using tidyr), exposure variables (sex, agegroup, hiv) reported rows?
thanks
doing in 2 steps give desired result. first, calculate n, calculate percentage group:
library(dplyr) df.out <- df %>% group_by(group, sex, agegroup, hiv) %>% tally() %>% group_by(group) %>% mutate(percent=n/sum(n)) a solution data.table:
library(data.table) dt.out <- setdt(df)[, .n, by=.(group, sex, agegroup, hiv)][, percent:=n/sum(n), by=group] library(microbenchmark) microbenchmark(df.out = df %>% group_by(group, sex, agegroup, hiv) %>% tally() %>% group_by(group) %>% mutate(percent=n/sum(n)), dt.out = df[,.n,by=.(group, sex, agegroup, hiv)][,percent:=n/sum(n),by=group]) # unit: milliseconds # expr min lq mean median uq max neval cld # df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930 100 b # dt.out 2.346632 2.394788 2.540132 2.441777 2.551235 4.344442 100 conclusion: data.table solution faster (3.5x).
to table requested after edit of question, can following:
library(data.table) setdt(df) dt.sex <- dcast(df[,.n, by=.(sex,group)][,percent:=n/sum(n)], sex ~ group, value.var = c("n", "percent")) dt.age <- dcast(df[,.n, by=.(agegroup,group)][,percent:=n/sum(n)], agegroup ~ group, value.var = c("n", "percent")) dt.hiv <- dcast(df[,.n, by=.(hiv,group)][,percent:=n/sum(n)], hiv ~ group, value.var = c("n", "percent")) dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=false) names(dt.out.wide) <- c("x","n_intervention","n_control","percent_intervention","percent_control") this gives:
> dt.out.wide x n_intervention n_control percent_intervention percent_control 1: male 2454 2488 0.2454 0.2488 2: female 2561 2497 0.2561 0.2497 3: 16-24 954 991 0.0954 0.0991 4: 25-34 1033 1002 0.1033 0.1002 5: 35-44 1051 1000 0.1051 0.1000 6: 45-54 983 978 0.0983 0.0978 7: 55+ 994 1014 0.0994 0.1014 8: positive 1717 1664 0.1717 0.1664 9: negative 1637 1659 0.1637 0.1659 10: not tested 1661 1662 0.1661 0.1662
Comments
Post a Comment