R - Aggregate dataframe by user, keeping rows for each user prior to first occurrence of treatment
There are similar problems to mine elsewhere on the site, but none of the answers cover what I need to do.
I have a dataframe that I'm trying to make time-varying. Subjects in the study can change from non-treatment to treatment, but not the other way. Subjects have multiple rows of treatment information, and I want to find the first occurrence of treatment, which is simple enough. The snag is that not everyone has an occurrence of treatment, and hence whenever I run the algorithm for finding the first occurrence these people are deleted. To make the question clearer:
    id treatment start.date  stop.date
     1         0 01/01/2002 01/02/2002
     1         0 01/02/2002 01/03/2002
     1         1 01/03/2002 01/04/2002
     1         0 01/04/2002 01/05/2002
     2         0 01/01/2002 01/02/2002
     2         0 01/02/2002 01/03/2002
     3         0 01/01/2002 01/02/2002
     3         1 01/02/2002 01/03/2002
     3         0 01/03/2002 01/04/2002

As you can see, id 2 never has treatment. When I run the following algorithm, id 2 is removed.
    data$keep <- with(data, ave(treatment==1, id, FUN=function(x) if(1 %in% x) cumsum(x) else 2))
    with(data, data[keep==0 | (treatment==1 & keep==1),])

Is there a way to extend this code so that it keeps those who don't have a first occurrence, and keeps every row up to and including the first occurrence for those who do have it?
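For reference, a minimal sketch of why id 2 disappears, assuming the sample data above is typed in by hand as a data frame named data (keeping the dates as plain character strings is an assumption made here). For an id with no treated row the function returns 2 on every row, so none of those rows can satisfy keep==0 or keep==1 in the subsetting step.

```r
# Sample data from the question; dates kept as character strings (assumption).
data <- data.frame(
  id         = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  treatment  = c(0, 0, 1, 0, 0, 0, 0, 1, 0),
  start.date = c("01/01/2002", "01/02/2002", "01/03/2002", "01/04/2002",
                 "01/01/2002", "01/02/2002",
                 "01/01/2002", "01/02/2002", "01/03/2002"),
  stop.date  = c("01/02/2002", "01/03/2002", "01/04/2002", "01/05/2002",
                 "01/02/2002", "01/03/2002",
                 "01/02/2002", "01/03/2002", "01/04/2002"),
  stringsAsFactors = FALSE
)

# Same keep column as in the question.
keep <- with(data, ave(treatment == 1, id,
                       FUN = function(x) if (1 %in% x) cumsum(x) else 2))

# Inspect the values per id: every row of id 2 carries the value 2.
split(keep, data$id)

# The question's filter, applied to the keep vector: no row of id 2 survives.
kept <- data[keep == 0 | (data$treatment == 1 & keep == 1), ]
unique(kept$id)
```

Changing the else branch so that untreated ids get 0 instead of 2 would be one way to let them pass the keep==0 test, though that is a sketch of a fix rather than the approach taken in the answer below.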
To summarise, I want the data to look like this:
    id treatment start.date  stop.date
     1         0 01/01/2002 01/02/2002
     1         0 01/02/2002 01/03/2002
     1         1 01/03/2002 01/04/2002
     2         0 01/01/2002 01/02/2002
     2         0 01/02/2002 01/03/2002
     3         0 01/01/2002 01/02/2002
     3         1 01/02/2002 01/03/2002
We can do this in different ways. One option is data.table, using an if/else condition on the 'treatment' column grouped by the 'id' column. We check whether there are no values in 'treatment' equal to 1 and, if so, return the Subset of Data.table (.SD), i.e. if(!any(treatment==1)) .SD. Otherwise, i.e. if there are 1 values in 'treatment', we get the position index of the first value in 'treatment' equal to 1 (which(treatment==1)[1L]), build the sequence up to it (seq), and use that numeric index to subset the data.table (.SD).
    library(data.table) # v1.9.5+
    setDT(data)[, if(!any(treatment==1)) .SD else .SD[seq(which(treatment==1)[1L])], by = id]
    #    id treatment start.date  stop.date
    # 1:  1         0 01/01/2002 01/02/2002
    # 2:  1         0 01/02/2002 01/03/2002
    # 3:  1         1 01/03/2002 01/04/2002
    # 4:  2         0 01/01/2002 01/02/2002
    # 5:  2         0 01/02/2002 01/03/2002
    # 6:  3         0 01/01/2002 01/02/2002
    # 7:  3         1 01/02/2002 01/03/2002

Or a more compact method would be to rely on the difference between the current and previous values in 'treatment' and check whether the difference is greater than or equal to 0. We can use diff or -. In this case, we are getting the difference between 'treatment' and the lag of 'treatment' (shift by default gives the 'lag' values; it is a new function in the devel version of data.table).
    setDT(data)[, .SD[(treatment-shift(treatment, fill=0))>=0], by = id]

Or a similar approach using dplyr. We group by 'id' and filter the rows based on the difference between the current and previous values in 'treatment'.
    library(dplyr)
    data %>%
        group_by(id) %>%
        filter(c(0, diff(treatment)) >= 0)
    #   id treatment start.date  stop.date
    # 1  1         0 01/01/2002 01/02/2002
    # 2  1         0 01/02/2002 01/03/2002
    # 3  1         1 01/03/2002 01/04/2002
    # 4  2         0 01/01/2002 01/02/2002
    # 5  2         0 01/02/2002 01/03/2002
    # 6  3         0 01/01/2002 01/02/2002
    # 7  3         1 01/02/2002 01/03/2002

Or with ave from base R:
    data[with(data, as.logical(ave(treatment, id, FUN=function(x) c(0, diff(x))>=0))),]
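One caveat worth checking on real data: the difference-based filters above keep any row whose 'treatment' value did not drop relative to the previous row. That reproduces the desired output for the sample data, but if an id had more than one untreated row after its first treatment, the later untreated rows would be kept again. The following is a minimal base R sketch of a cumulative alternative, not part of the original answer; the data frame toy and the helper keep_through_first are hypothetical names used only for illustration, and the toy data deliberately adds a second trailing untreated row to show the difference.

```r
# Hypothetical toy data: id 1 has two untreated rows after its first treatment.
toy <- data.frame(
  id        = c(1, 1, 1, 1, 2, 2),
  treatment = c(0, 1, 0, 0, 0, 0)
)

# Hypothetical helper: TRUE up to and including the first treated row,
# and TRUE on every row for an id that is never treated.
keep_through_first <- function(x) {
  treated_before <- c(0, head(cumsum(x == 1), -1))  # treated rows seen before this one
  treated_before == 0
}

# Cumulative rule versus the difference-based rule from the answer.
keep_cum  <- as.logical(ave(toy$treatment, toy$id, FUN = keep_through_first))
keep_diff <- as.logical(ave(toy$treatment, toy$id,
                            FUN = function(x) c(0, diff(x)) >= 0))

toy[keep_cum, ]   # rows 1, 2, 5, 6
toy[keep_diff, ]  # rows 1, 2, 4, 5, 6 (row 4 is kept again)
```

On the sample data from the question both rules select the same seven rows, so the difference only matters when several untreated rows follow the first treatment.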