r - Combining observations with overlapping dates -


each observations in dataframe contains different "before date" , "after date instance". problem dates overlap each id. instance, in table below, id's 1 , 4 contain overlapping date values.

id  before date after date 1   10/1/1996   12/1/1996 1   1/1/1998    9/30/2003 1   1/1/2000    12/31/2004 2   1/1/2001    3/31/2006 3   1/1/2001    9/30/2006 4   1/1/2001    9/30/2005 4   10/1/2004   12/30/2004 4   10/3/2004   11/28/2004 

i trying this:

id  before date after date 1   10/1/1996   12/1/1996 1   1/1/1998    12/31/2004 2   1/1/2001    3/31/2006 3   1/1/2001    9/30/2006 4   1/1/2001    9/30/2005 

basically, replace overlapping date values date range of values overlap, leave non-overlapping values alone, , delete unnecessary rows. not sure how go doing this

firstly, should convert string dates date-classed values, make comparison possible. here's how i've defined , coerced data:

df <- data.frame(id=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') ); dcis <- grep('date$',names(df)); df[dcis] <- lapply(df[dcis],as.date,'%m/%d/%y'); df; ##   id before.date after.date ## 1  1  1996-10-01 1996-12-01 ## 2  1  1998-01-01 2003-09-30 ## 3  1  2000-01-01 2004-12-31 ## 4  2  2001-01-01 2006-03-31 ## 5  3  2001-01-01 2006-09-30 ## 6  4  2001-01-01 2005-09-30 ## 7  4  2004-10-01 2004-12-30 ## 8  4  2004-10-03 2004-11-28 

now, solution involves computing "overlapping grouping" vector i've called og. makes assumption input df ordered id , before.date, in example data. if not, achieved df[order(df$id,df$before.date),]. here's how compute og:

cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); og <- with(df,c(0,cumsum(!(id[-length(id)]==id[-1] & ave(after.date,id,fun=cummax)[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 

unfortunately, base r cummax() function doesn't work on date-classed objects, had write cummax.date() shim. i'll explain need ave() , cummax() business @ end of post.

as can see, above computation lags rhs of each of 2 vectorized comparisons excluding first element via [-1]. allows compare record's id equality following record's id, , compare if after.date after before.date of following record. resulting logical vectors anded (&) together. negation of logical vector represents adjacent pairs of records not overlap, , can cumsum() result (and prepend zero, first record must start zero) our grouping vector.

finally, final piece of solution, i've used by() work each overlapping group independently:

do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date)))); ##   id before.date after.date ## 0  1  1996-10-01 1996-12-01 ## 1  1  1998-01-01 2004-12-31 ## 2  2  2001-01-01 2006-03-31 ## 3  3  2001-01-01 2006-09-30 ## 4  4  2001-01-01 2005-09-30 

since records in group must have same id, , we've made assumption records ordered before.date (after being ordered id, no longer relevant), can correct id , before.date values first record in group. that's why started g[1,]. need greatest after.date group via max(g$after.date), , overwrite first record's after.date that, i've done transform().

a word performance: assumption ordering aids performance, because allows compare each record against following record via lagged vectorized comparisons, rather comparing every record in group every other record.

now, ave() , cummax() business. realized after writing initial version of answer there flaw in solution, happens not exposed example data. there 3 records in group. if first record has range overlaps both of following 2 records, , middle record not overlap third record, (original) code fail identify third record part of same overlapping group of previous 2 records.

the solution not use after.date of current record when comparing against following record, instead use cumulative maximum after.date within group. if earlier record sprawled beyond following record, overlapped record, , after.date what's important in considering overlapping groups subsequent records.

here's demonstration of input data requires fix, using df base:

df2 <- df; df2[7,'after.date'] <- '2004-10-02'; df2; ##   id before.date after.date ## 1  1  1996-10-01 1996-12-01 ## 2  1  1998-01-01 2003-09-30 ## 3  1  2000-01-01 2004-12-31 ## 4  2  2001-01-01 2006-03-31 ## 5  3  2001-01-01 2006-09-30 ## 6  4  2001-01-01 2005-09-30 ## 7  4  2004-10-01 2004-10-02 ## 8  4  2004-10-03 2004-11-28 

now record 6 overlaps both records 7 , 8, record 7 not overlap record 8. solution still works:

cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); og <- with(df2,c(0,cumsum(!(id[-length(id)]==id[-1] & ave(after.date,id,fun=cummax)[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date)))); ##   id before.date after.date ## 0  1  1996-10-01 1996-12-01 ## 1  1  1998-01-01 2004-12-31 ## 2  2  2001-01-01 2006-03-31 ## 3  3  2001-01-01 2006-09-30 ## 4  4  2001-01-01 2005-09-30 

here's proof og calculation wrong without ave()/cummax() fix:

og <- with(df2,c(0,cumsum(!(id[-length(id)]==id[-1] & after.date[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 5 

minor adjustment solution, overwrite after.date in advance of og computation, , avoid max() call (makes more sense if you're planning on overwriting original df new aggregation):

cummax.date <- function(x) as.date(cummax(as.integer(x)),'1970-01-01'); df$after.date <- ave(df$after.date,df$id,fun=cummax); df; ##   id before.date after.date ## 1  1  1996-10-01 1996-12-01 ## 2  1  1998-01-01 2003-09-30 ## 3  1  2000-01-01 2004-12-31 ## 4  2  2001-01-01 2006-03-31 ## 5  3  2001-01-01 2006-09-30 ## 6  4  2001-01-01 2005-09-30 ## 7  4  2004-10-01 2005-09-30 ## 8  4  2004-10-03 2005-09-30 og <- with(df,c(0,cumsum(!(id[-length(id)]==id[-1] & after.date[-length(after.date)]>before.date[-1])))); og; ## [1] 0 1 1 2 3 4 4 4 df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)]))); df; ##   id before.date after.date ## 0  1  1996-10-01 1996-12-01 ## 1  1  1998-01-01 2004-12-31 ## 2  2  2001-01-01 2006-03-31 ## 3  3  2001-01-01 2006-09-30 ## 4  4  2001-01-01 2005-09-30 

Comments