each observations in dataframe contains different "before date" , "after date instance". problem dates overlap each id. instance, in table below, id's 1 , 4 contain overlapping date values.
id before date after date 1 10/1/1996 12/1/1996 1 1/1/1998 9/30/2003 1 1/1/2000 12/31/2004 2 1/1/2001 3/31/2006 3 1/1/2001 9/30/2006 4 1/1/2001 9/30/2005 4 10/1/2004 12/30/2004 4 10/3/2004 11/28/2004 i trying this:
id before date after date 1 10/1/1996 12/1/1996 1 1/1/1998 12/31/2004 2 1/1/2001 3/31/2006 3 1/1/2001 9/30/2006 4 1/1/2001 9/30/2005 basically, replace overlapping date values date range of values overlap, leave non-overlapping values alone, , delete unnecessary rows. not sure how go doing this
firstly, should convert string dates date-classed values, make comparison possible. here's how i've defined , coerced data:
df <- data.frame(id=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') ); dcis <- grep('date$',names(df)); df[dcis] <- lapply(df[dcis],as.date,'%m/%d/%y'); df; ## id before.date after.date ## 1 1 1996-10-01 1996-12-01 ## 2 1 1998-01-01 2003-09-30 ## 3 1 2000-01-01 2004-12-31 ## 4 2 2001-01-01 2006-03-31 ## 5 3 2001-01-01 2006-09-30 ## 6 4 2001-01-01 2005-09-30 ## 7 4 2004-10-01 2004-12-30 ## 8 4 2004-10-03 2004-11-28 now, solution involves computing "overlapping grouping" vector i've called og. makes assumption input df ordered id , before.date, in example data. if not, achieved df[order(df$id,df$before.date),]. here's how compute og:
cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); og <- with(df,c(0,cumsum(!(id[-length(id)]==id[-1] & ave(after.date,id,fun=cummax)[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 unfortunately, base r cummax() function doesn't work on date-classed objects, had write cummax.date() shim. i'll explain need ave() , cummax() business @ end of post.
as can see, above computation lags rhs of each of 2 vectorized comparisons excluding first element via [-1]. allows compare record's id equality following record's id, , compare if after.date after before.date of following record. resulting logical vectors anded (&) together. negation of logical vector represents adjacent pairs of records not overlap, , can cumsum() result (and prepend zero, first record must start zero) our grouping vector.
finally, final piece of solution, i've used by() work each overlapping group independently:
do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date)))); ## id before.date after.date ## 0 1 1996-10-01 1996-12-01 ## 1 1 1998-01-01 2004-12-31 ## 2 2 2001-01-01 2006-03-31 ## 3 3 2001-01-01 2006-09-30 ## 4 4 2001-01-01 2005-09-30 since records in group must have same id, , we've made assumption records ordered before.date (after being ordered id, no longer relevant), can correct id , before.date values first record in group. that's why started g[1,]. need greatest after.date group via max(g$after.date), , overwrite first record's after.date that, i've done transform().
a word performance: assumption ordering aids performance, because allows compare each record against following record via lagged vectorized comparisons, rather comparing every record in group every other record.
now, ave() , cummax() business. realized after writing initial version of answer there flaw in solution, happens not exposed example data. there 3 records in group. if first record has range overlaps both of following 2 records, , middle record not overlap third record, (original) code fail identify third record part of same overlapping group of previous 2 records.
the solution not use after.date of current record when comparing against following record, instead use cumulative maximum after.date within group. if earlier record sprawled beyond following record, overlapped record, , after.date what's important in considering overlapping groups subsequent records.
here's demonstration of input data requires fix, using df base:
df2 <- df; df2[7,'after.date'] <- '2004-10-02'; df2; ## id before.date after.date ## 1 1 1996-10-01 1996-12-01 ## 2 1 1998-01-01 2003-09-30 ## 3 1 2000-01-01 2004-12-31 ## 4 2 2001-01-01 2006-03-31 ## 5 3 2001-01-01 2006-09-30 ## 6 4 2001-01-01 2005-09-30 ## 7 4 2004-10-01 2004-10-02 ## 8 4 2004-10-03 2004-11-28 now record 6 overlaps both records 7 , 8, record 7 not overlap record 8. solution still works:
cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); og <- with(df2,c(0,cumsum(!(id[-length(id)]==id[-1] & ave(after.date,id,fun=cummax)[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date)))); ## id before.date after.date ## 0 1 1996-10-01 1996-12-01 ## 1 1 1998-01-01 2004-12-31 ## 2 2 2001-01-01 2006-03-31 ## 3 3 2001-01-01 2006-09-30 ## 4 4 2001-01-01 2005-09-30 here's proof og calculation wrong without ave()/cummax() fix:
og <- with(df2,c(0,cumsum(!(id[-length(id)]==id[-1] & after.date[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 5 minor adjustment solution, overwrite after.date in advance of og computation, , avoid max() call (makes more sense if you're planning on overwriting original df new aggregation):
cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); df$after.date <- ave(df$after.date,df$id,fun=cummax); df; ## id before.date after.date ## 1 1 1996-10-01 1996-12-01 ## 2 1 1998-01-01 2003-09-30 ## 3 1 2000-01-01 2004-12-31 ## 4 2 2001-01-01 2006-03-31 ## 5 3 2001-01-01 2006-09-30 ## 6 4 2001-01-01 2005-09-30 ## 7 4 2004-10-01 2005-09-30 ## 8 4 2004-10-03 2005-09-30 og <- with(df,c(0,cumsum(!(id[-length(id)]==id[-1] & after.date[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)]))); df; ## id before.date after.date ## 0 1 1996-10-01 1996-12-01 ## 1 1 1998-01-01 2004-12-31 ## 2 2 2001-01-01 2006-03-31 ## 3 3 2001-01-01 2006-09-30 ## 4 4 2001-01-01 2005-09-30
Comments
Post a Comment