regex - Identifying substrings based on complex rules


Assume I have text strings like this:

a-b-c-i1-i2-d-e-f-i1-i3-d-d-d-d-i1-i1-i2-i1-i1-i3-i3 

Here I want to identify the sequences of markers (a is a marker, i3 is a marker, etc.) that lead up to a subsequence consisting only of ix markers (i.e. i1, i2, or i3) that contains an i3. This subsequence can have a length of 1 (i.e. a single i3 marker) or it can be of unlimited length, but it always needs to contain at least one i3 marker, and it can only contain ix markers. In the subsequence that leads up to the ix subsequence, i1 and i2 can be included, but never i3.

In the string above I need to identify:

a-b-c-i1-i2-d-e-f 

which leads up to the i1-i3 subsequence that contains an i3,

and

d-d-d-d 

which leads up to the i1-i1-i2-i1-i1-i3-i3 subsequence that contains at least one i3.

Here are a few additional examples:

a-b-i3-c-i3 

From this string we should identify a-b, because it is followed by a subsequence of length 1 that contains i3, and c, because it is also followed by a subsequence of length 1 that contains i3.

and:

i3-a-i3 

Here a should be identified, because it is followed by a subsequence of length 1 that contains i3. The first i3 itself is not identified, because we are only interested in subsequences that are followed by a subsequence of ix markers that contains i3.

How can I write a generic function/regex that accomplishes this task?

Use strsplit:

    > x <- "a-b-c-i1-i2-d-e-f-i1-i3-d-d-d-d-i1-i1-i2-i1-i1-i3-i3"
    > strsplit(x, "(?:-?i\\d+)*-?\\bi3-?(?:i\\d+-?)*")
    [[1]]
    [1] "a-b-c-i1-i2-d-e-f" "d-d-d-d"

    > strsplit("a-b-i3-c-i3", "(?:-?i\\d+)*-?\\bi3\\b-?(?:i\\d+-?)*")
    [[1]]
    [1] "a-b" "c"
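One thing to watch with this approach is that strsplit() returns every piece between matches, so a string that starts with an i3 run yields an empty first element, and a trailing piece is returned even when no i3 run follows it. A minimal sketch of a wrapper that drops the empty pieces is shown here; the name `split_on_i3_runs` is hypothetical, and `perl = TRUE` is an assumption added only to make the regex dialect explicit.

```r
# Hypothetical wrapper around the strsplit() approach (not part of the
# original answer). perl = TRUE is an assumption to pin the regex dialect.
split_on_i3_runs <- function(x) {
  pattern <- "(?:-?i\\d+)*-?\\bi3\\b-?(?:i\\d+-?)*"
  pieces  <- strsplit(x, pattern, perl = TRUE)[[1]]
  # A match at the very start of the string leaves "" as the first piece.
  pieces[nzchar(pieces)]
}

split_on_i3_runs("i3-a-i3")    # "a"
split_on_i3_runs("a-b-i3-c")   # "a-b" "c"  (note: no i3 run follows "c")
```

If the trailing piece must be excluded when nothing follows it, the split alone is not enough and the result needs an extra check.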

Or, with a narrower trailing group:

    > strsplit("a-b-i3-c-i3", "(?:-?i\\d+)*-?\\bi3\\b-?(?:i3-?)*")
    [[1]]
    [1] "a-b" "c"
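For comparison, the rules from the question can also be written out without a complex regex, by splitting on "-" and working on the tokens. This is only a sketch under stated assumptions: the function name `leading_segments` and its `target` parameter are hypothetical, and markers are assumed to be separated by single hyphens. Unlike a plain split, it returns only segments that are directly followed by an ix run containing i3.

```r
# Hypothetical helper (not from the original answer): returns the segments
# that are directly followed by a run of ix markers containing `target`.
leading_segments <- function(x, target = "i3") {
  tokens <- strsplit(x, "-", fixed = TRUE)[[1]]
  if (length(tokens) == 0) return(character(0))

  # Give every maximal run of ix / non-ix tokens its own id.
  is_ix  <- grepl("^i[0-9]+$", tokens)
  run_id <- cumsum(c(TRUE, diff(is_ix) != 0))

  # A token acts as a separator if its ix run contains the target marker.
  is_sep <- is_ix & as.logical(ave(tokens == target, run_id, FUN = any))

  # Collapse into alternating separator / non-separator runs.
  runs   <- rle(is_sep)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  # Keep non-separator runs that have a separator run right after them.
  keep <- which(!runs$values & c(runs$values[-1], FALSE))
  vapply(keep,
         function(k) paste(tokens[starts[k]:ends[k]], collapse = "-"),
         character(1))
}

leading_segments("a-b-c-i1-i2-d-e-f-i1-i3-d-d-d-d-i1-i1-i2-i1-i1-i3-i3")
leading_segments("i3-a-i3")
```

The token-based version is longer than the regex, but each rule (ix-only run, at least one i3, segment must be followed by such a run) maps to one line, which may make it easier to adjust if the rules change.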
