r - Aggregate dataframe by user, keeping rows for each user prior to first occurrence of treatment


There are similar problems to mine elsewhere on the site, but none of the answers encompass what I need to do.

I have a dataframe that I'm trying to change to time-varying. Subjects in the study can change from non-treatment to treatment, but not the other way. Subjects have multiple rows of treatment information, and I want to find the first occurrence of treatment, which is simple enough. The snag is that not everyone has an occurrence of treatment, and hence whenever I run an algorithm for finding the first occurrence these people are deleted. To make the question clearer:

    id    treatment    start.date    stop.date
    1        0         01/01/2002    01/02/2002
    1        0         01/02/2002    01/03/2002
    1        1         01/03/2002    01/04/2002
    1        0         01/04/2002    01/05/2002
    2        0         01/01/2002    01/02/2002
    2        0         01/02/2002    01/03/2002
    3        0         01/01/2002    01/02/2002
    3        1         01/02/2002    01/03/2002
    3        0         01/03/2002    01/04/2002
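For anyone who wants to reproduce this, the table above could be rebuilt roughly as follows. This is only a sketch: it assumes the dates are stored as plain character strings and that `id` and `treatment` are numeric, neither of which is stated in the question.

```r
# Rebuild the example table.
# Assumption: dates are kept as character strings, not Date objects.
data <- data.frame(
  id         = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  treatment  = c(0, 0, 1, 0, 0, 0, 0, 1, 0),
  start.date = c("01/01/2002", "01/02/2002", "01/03/2002", "01/04/2002",
                 "01/01/2002", "01/02/2002",
                 "01/01/2002", "01/02/2002", "01/03/2002"),
  stop.date  = c("01/02/2002", "01/03/2002", "01/04/2002", "01/05/2002",
                 "01/02/2002", "01/03/2002",
                 "01/02/2002", "01/03/2002", "01/04/2002"),
  stringsAsFactors = FALSE
)
```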

As you can see, id 2 never has treatment. When I run the following algorithm, id 2 is removed.

    data$keep <- with(data,
                      ave(treatment==1, id, FUN=function(x) if(1 %in% x) cumsum(x) else 2))
    with(data, data[keep==0 | (treatment==1 & keep==1),])

Is there a way to extend this code so that it keeps those who don't have a first occurrence, and keeps every row up to the first occurrence for those who do have it?

To summarise, I want my data to look like this:

    id    treatment    start.date    stop.date
    1        0         01/01/2002    01/02/2002
    1        0         01/02/2002    01/03/2002
    1        1         01/03/2002    01/04/2002
    2        0         01/01/2002    01/02/2002
    2        0         01/02/2002    01/03/2002
    3        0         01/01/2002    01/02/2002
    3        1         01/02/2002    01/03/2002

We can do this in different ways. One option is data.table, using an if/else condition on the 'treatment' column grouped by the 'id' column. We check whether there are no values in 'treatment' equal to 1 and, if so, return the Subset of Data.table (.SD), i.e. (if(!any(treatment==1)) .SD). Otherwise, i.e. if there are 1 values in 'treatment', we get the position index of the first value in 'treatment' equal to 1 (which(treatment==1)[1L]), take the sequence up to it (seq), and use that numeric index to subset the data.table (.SD).

    library(data.table)#v1.9.5+
    setDT(data)[, if(!any(treatment==1)) .SD
                   else .SD[seq(which(treatment==1)[1L])], by = id]
    #     id treatment start.date  stop.date
    #1:  1         0 01/01/2002 01/02/2002
    #2:  1         0 01/02/2002 01/03/2002
    #3:  1         1 01/03/2002 01/04/2002
    #4:  2         0 01/01/2002 01/02/2002
    #5:  2         0 01/02/2002 01/03/2002
    #6:  3         0 01/01/2002 01/02/2002
    #7:  3         1 01/02/2002 01/03/2002
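To see what the index expression does inside one group, here is a small base R walk-through on the 'treatment' values of id 1 from the example. The variable names (`x`, `first_treated`, `idx`) are just illustrative.

```r
# Treatment values for a single subject (id 1 in the example data)
x <- c(0, 0, 1, 0)

first_treated <- which(x == 1)[1L]  # position of the first 1, here 3
idx <- seq(first_treated)           # row positions 1, 2, 3

x[idx]                              # 0 0 1: every row up to and including the first treatment
```

For a subject with no 1 at all, `which(x == 1)[1L]` is `NA`, which is not a usable index, so that case has to be handled separately (the `if` branch in the data.table call).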

Or a more compact method would be to rely on the difference between the current and previous values in 'treatment' and check whether that difference is greater than or equal to 0. We can use diff or -. In this case, we are getting the difference between 'treatment' and the lag of 'treatment' (shift by default gives the 'lag' values; it is a new function in the devel version of data.table).

    setDT(data)[, .SD[(treatment-shift(treatment, fill=0))>=0], by = id]
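As a rough illustration of the difference test within one group, the same computation can be spelled out in base R. Here `c(0, head(x, -1))` stands in for `shift(treatment, fill=0)`; the variable names are illustrative only.

```r
# Treatment values for a single subject (id 1 in the example data)
x <- c(0, 0, 1, 0)

lagged <- c(0, head(x, -1))  # previous value, with 0 filled in for the first row
change <- x - lagged         # 0  0  1 -1
keep   <- change >= 0        # TRUE TRUE TRUE FALSE

x[keep]                      # 0 0 1: the row where treatment drops back to 0 is removed
```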

Or a similar approach using dplyr. We group by 'id' and filter the rows based on the difference between the current and previous values in 'treatment'.

    library(dplyr)
    data %>%
        group_by(id) %>%
        filter(c(0, diff(treatment)) >=0)
    #  id treatment start.date  stop.date
    #1  1         0 01/01/2002 01/02/2002
    #2  1         0 01/02/2002 01/03/2002
    #3  1         1 01/03/2002 01/04/2002
    #4  2         0 01/01/2002 01/02/2002
    #5  2         0 01/02/2002 01/03/2002
    #6  3         0 01/01/2002 01/02/2002
    #7  3         1 01/02/2002 01/03/2002

Or using ave from base R:

    data[with(data, as.logical(ave(treatment, id,
                       FUN=function(x) c(0, diff(x))>=0))),]
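One thing to watch with the difference-based versions: they only drop a row where 'treatment' falls from 1 to 0, so they match the expected output here because each id has at most one row after its first treatment. If a subject could have several untreated rows after treatment, a possible variant is to count the treated rows that come before each row and keep only rows where that count is 0. The sketch below is not part of the methods above; `keep_until_first` and the `demo` data are hypothetical, and it assumes 'treatment' is coded as numeric 0/1.

```r
# Hypothetical helper: keep every row up to and including the first treated row per id.
# Assumes `treatment` is numeric 0/1 and rows are already in time order within each id.
keep_until_first <- function(df) {
  # Number of treated rows strictly before each row, counted within each id
  prior <- ave(df$treatment, df$id,
               FUN = function(x) cumsum(c(0, head(x, -1))))
  df[prior == 0, ]
}

# Made-up data: id 1 has two untreated rows after treatment, id 2 is never treated
demo <- data.frame(id        = c(1, 1, 1, 1, 2, 2),
                   treatment = c(0, 1, 0, 0, 0, 0))

result <- keep_until_first(demo)

# For comparison, the difference-based filter keeps the last row of id 1 as well
diff_keep <- as.logical(ave(demo$treatment, demo$id,
                            FUN = function(x) c(0, diff(x)) >= 0))
```

On the `demo` data, `result` has four rows (two for each id), while `diff_keep` selects five.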
