I'm trying to identify 'unique' and 'near unique' cases or records in a dataset for a disclosure control project, in particular combinations of variables that appear once, twice, etc.
The records appear in:

    table(age,sex,ethnicity)

I am interested in the elements (those which are TRUE) of:

    table(age,sex,ethnicity)==1
    table(age,sex,ethnicity)==2

I know there are 150 such cases from looking at:

    sum(table(age,sex,ethnicity)==1)

There is an identifier in the dataset that would make a nice output, or a number in 1:length(age)*length(sex)*length(ethnicity) would be good. I was hoping to return a list like:

    [1] 103 207 218.... [41] * * * [81] * * *

where 'identifier' = 103, 207, 218 are the first 3 of the 150 cases where:

    table(age,sex,ethnicity)==1

I was naively hoping something like:

    data$identifier[table(age,sex,mar,emp,edu) == 1]
    names(table(age,sex,ethnicity))

would work, but no such luck. I've looked at unique(), but it returns every combination (that occurs once or more). Any input is appreciated.
Added a reproducible example (hopefully):

    set.seed(1234)
    a <- 1+rpois(100,1)
    b <- 1+rpois(100,1)
    c <- 1+rpois(100,1)
    a[a >= 5] <- 4
    b[b >= 5] <- 4
    c[c >= 5] <- 4
    eg <- cbind(1:100,a,b,c)
    (sum(table(a,b,c)==1))

This should have 12 'unique' combinations, which I would like to identify using the first column of eg (or the identifier in my dataset).
I think the easiest way is using the data.table package:

    library(data.table)
    eg.dt <- as.data.table(eg)
    eg.dt[, list(n=.N), by=.(a,b,c)][n==1]

How it works: eg.dt[, list(n=.N), by=.(a,b,c)] counts the number of occurrences of each (a,b,c) combination. [n==1] then keeps only those that occur precisely once.
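The counting step above returns the combinations rather than the record identifiers the question asks for. A minimal sketch of one way to get the identifiers, assuming the first column is given the hypothetical name id (the original cbind leaves it unnamed): attach the group size to every row with := and then filter.

```r
library(data.table)

# Same reproducible data as in the question
set.seed(1234)
a <- 1 + rpois(100, 1)
b <- 1 + rpois(100, 1)
c <- 1 + rpois(100, 1)
a[a >= 5] <- 4
b[b >= 5] <- 4
c[c >= 5] <- 4

# 'id' is a hypothetical name for the identifier column
eg.dt <- data.table(id = 1:100, a = a, b = b, c = c)

# Attach the size of each (a, b, c) group to every row in that group
eg.dt[, n := .N, by = .(a, b, c)]

# Identifiers of records whose combination occurs once, or exactly twice
unique_ids <- eg.dt[n == 1, id]
near_unique_ids <- eg.dt[n == 2, id]
```

Because n is stored per row, the same table can be filtered for any threshold (n <= 3, say) without recounting.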
Or, if you want to stick with data frames (not data.table), try plyr:

    library(plyr)
    eg <- data.frame(eg)
    subset(ddply(eg, .(a, b, c), nrow), V1 == 1)

This works in the same way: ddply(eg, .(a, b, c), nrow) makes a data frame with a column "V1" holding the number of times each combination occurs; subset then keeps the combinations that occur only once.
I think there might be a way using table(a,b,c), but I can't think of one that isn't convoluted.
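For what it's worth, here is a rough sketch of one possible table(a,b,c) route in base R. It is not from the original answer. It relies on indexing the table with a character matrix so that each record looks up the count of its own combination. The column name id is a hypothetical label for the identifier column.

```r
# Same reproducible data as in the question
set.seed(1234)
a <- 1 + rpois(100, 1)
b <- 1 + rpois(100, 1)
c <- 1 + rpois(100, 1)
a[a >= 5] <- 4
b[b >= 5] <- 4
c[c >= 5] <- 4

# 'id' is a hypothetical name for the identifier column
eg <- cbind(id = 1:100, a, b, c)

tab <- table(a, b, c)

# One row per record; each row holds the dimnames of that record's cell
key <- cbind(as.character(a), as.character(b), as.character(c))

# Matrix indexing by dimnames: the count of each record's own combination
counts <- as.vector(tab[key])

# Identifiers of records whose combination occurs once, or exactly twice
unique_ids <- eg[counts == 1, "id"]
near_unique_ids <- eg[counts == 2, "id"]
```

The lookup works because the table's dimnames are the character forms of the observed values, so as.character() on each variable gives a valid index into the matching dimension.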