r - Summarising levels of multiple factor variables -

i have searched similar questions, cannot find exact solution required. this question similar, deals issue of summarising multiple continuous variables, not factors.

i have dataframe consisting of 4 factor variables (sex, agegroup, hiv, group), e.g.

set.seed(20150710) df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=t))),             agegroup=as.factor(c(sample(1:5,10000, replace=t))),            hiv=as.factor(c(sample(1:3,10000, replace=t))),            group=as.factor(c(sample(1:2,10000, replace=t)))            )  levels(df$sex)<- c("male", "female") levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+") levels(df$hiv)<-c("positive", "negative", "not tested") levels(df$group)<-c("intervention", "control")

i create summary table giving counts , proportions each level of exposure variables sex, agegroup , hiv, stratified group.

edit: aiming for:

                x n_control percent_control n_intervention     percent_intervention 1      sex_female      2517       0.5041057           2480            0.4953066 2        sex_male      2476       0.4958943           2527            0.5046934 3  agegroup_16-24      1005       0.2012818            992            0.1981226 4  agegroup_25-34      1001       0.2004807            996            0.1989215 5  agegroup_35-44      1010       0.2022832            997            0.1991212 6  agegroup_45-54       976       0.1954737            996            0.1989215 7    agegroup_55+      1001       0.2004807           1026            0.2049131 8    hiv_negative      1679       0.3362708           1642            0.3279409 9  hiv_not tested      1633       0.3270579           1660            0.3315359 10   hiv_positive      1681       0.3366713           1705            0.3405233

but cannot work summarise_each in dplyr; overall variable counts , proportions, , not each factor level, given:

df.out<-df %>%   group_by(group) %>%   summarise_each(funs(n=n(), percent=n()/sum(n())), sex, agegroup, hiv) print(df.out)  group sex_n agegroup_n hiv_n sex_percent agegroup_percent hiv_percent 1     1  4973       4973  4973           1                1           1 2     2  5027       5027  5027           1                1           1

finally, there way reshape table (e.g. using tidyr), exposure variables (sex, agegroup, hiv) reported rows?

thanks

doing in 2 steps give desired result. first, calculate n, calculate percentage group:

library(dplyr) df.out <- df %>%   group_by(group, sex, agegroup, hiv) %>%   tally() %>%   group_by(group) %>%   mutate(percent=n/sum(n))

a solution data.table:

library(data.table) dt.out <- setdt(df)[, .n, by=.(group, sex, agegroup, hiv)][, percent:=n/sum(n), by=group]

library(microbenchmark) microbenchmark(df.out = df %>%                  group_by(group, sex, agegroup, hiv) %>%                  tally() %>%                  group_by(group) %>%                  mutate(percent=n/sum(n)),                dt.out = df[,.n,by=.(group, sex, agegroup, hiv)][,percent:=n/sum(n),by=group])  # unit: milliseconds #   expr      min       lq     mean   median       uq       max neval cld # df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930   100   b # dt.out 2.346632 2.394788 2.540132 2.441777 2.551235  4.344442   100

conclusion: data.table solution faster (3.5x).

to table requested after edit of question, can following:

library(data.table)  setdt(df) dt.sex <- dcast(df[,.n, by=.(sex,group)][,percent:=n/sum(n)], sex ~ group, value.var = c("n", "percent")) dt.age <- dcast(df[,.n, by=.(agegroup,group)][,percent:=n/sum(n)], agegroup ~ group, value.var = c("n", "percent")) dt.hiv <- dcast(df[,.n, by=.(hiv,group)][,percent:=n/sum(n)], hiv ~ group, value.var = c("n", "percent"))  dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=false) names(dt.out.wide) <- c("x","n_intervention","n_control","percent_intervention","percent_control")

this gives:

> dt.out.wide              x n_intervention n_control percent_intervention percent_control  1:       male           2454      2488               0.2454          0.2488  2:     female           2561      2497               0.2561          0.2497  3:      16-24            954       991               0.0954          0.0991  4:      25-34           1033      1002               0.1033          0.1002  5:      35-44           1051      1000               0.1051          0.1000  6:      45-54            983       978               0.0983          0.0978  7:        55+            994      1014               0.0994          0.1014  8:   positive           1717      1664               0.1717          0.1664  9:   negative           1637      1659               0.1637          0.1659 10: not tested           1661      1662               0.1661          0.1662

WIKI

Search This Blog

r - Summarising levels of multiple factor variables -

Comments

Post a Comment