i have sequence of elements (usually std::vector<t>), , method overlap_range(iterator, iterator, t) obtaining subset of these elements overlaped t.
if sequence of elements satisfy criteria, result of overlap_range contiguous, , determinable in logarithmic time. otherwise there may multiple discontiguous such sub-ranges, can determined in linear time. takes linear time determine if criteria satisfied.
i following:
overlap_rangepolymorphic, in sense algorithm operates in logarithmic time if criteria known satisfied.- the returned sub-range polymorphic, able take advantage of known properties of sub-range if criteria satisfied.
my proposed solution use default valued flag parameter in overlap_range, takes care of (1). thought solve (2) using boost::filter_iterator<std::function<bool(t)>, iterator> std::function<bool(t)> returns true if criteria known satisfied. however, turned out approx. 50x slower using functor test overlaped in both cases. logic overlap not particularly complex, number of elements in range sufficiently large there significant advantage not having evaluate unnecessarily, if possible.
are there other approaches might me solve problem?
more details
although think above sufficient understand problem, here more background useful.
tobject can reducedgenomicregion, defines set of co-ordinates on single contiguous sequence (i.e. 2 co-ordinates). suchtmappable<t>requires methodget_region.overlapsdefined asinline genomicregion::differencetype overlap_size(const genomicregion& lhs, const genomicregion& rhs) noexcept { return static_cast< genomicregion::differencetype>(std::min(lhs.get_end(), rhs.get_end())) - static_cast< genomicregion::differencetype>(std::max(lhs.get_begin(), rhs.get_begin())); } inline bool overlaps(const genomicregion& lhs, const genomicregion& rhs) noexcept { auto num_bases_overlaped = overlap_size(lhs, rhs); return (num_bases_overlaped == 0) ? !are_adjacent(lhs, rhs) || empty(std::min(lhs, rhs)) : num_bases_overlaped > 0; }- any
mappableclass can sorted begin co-ordinate, , end co-ordinate (if begin co-ordinates equal). - the "criteria" logarithmic overlap searching (and contiguous resulting region), that: if
a <= bend(a) <= end(b), true if exampletsame size. the naive filter iterator defined as
template <typename mappabletype> class isoverlapped { public: isoverlapped() = delete; template <typename mappabletype_> isoverlapped(const mappabletype_& mappable) : region_ {get_region(mappable)} {} bool operator()(const mappabletype& mappable) { return overlaps(mappable, region_); } private: genomicregion region_; }; template <typename iterator> using overlapiterator = boost::filter_iterator<isoverlapped<typename iterator::value_type>, iterator>; template <typename iterator> using overlaprange = boost::iterator_range<overlapiterator<iterator>>;overlap_rangecan declared astemplate <typename forwarditerator, typename mappabletype> overlaprange<forwarditerator> overlap_range(forwarditerator first, forwarditerator last, const mappabletype& mappable, bool is_bidirectional=false)
of course, return range overlaprange<forwarditerator>(first, last), can lot better efficiently finding right bound, , left bound if range is_bidirectional.
the biggest slow down because of dynamic dispatch in std::function<>. try replace custom, non-type-erased predicate type (function object?).
i should asking means elements in range "be overlapped t". maps of numbers called quantifiers but, let me jump bit of conclusion because excuse show off boost interval container library.
note demo hinges on fact boost icl has builtin combining styles , comparers many things. docs map concept:
- maps of sets called collectors and
- maps of numbers called quantifiers
for sense of implemented out-of-the-box chose demo collectors, i.e. t in terminology happens std::set<int>.
#include <boost/icl/interval_map.hpp> #include <set> namespace icl = boost::icl; using container = icl::interval_map< size_t, // mirrors vector index std::set<int> >; // set intersection models our "overlap" container generate_random_data(); #include <iostream> int main() { auto haystack = generate_random_data(); // let's find ranges t overlaps -2: using seive = container::value_type; std::cout << "haystack: " << haystack << "\n"; std::cout << "overlaps -2: \n\t-> " << (haystack & seive {{ 0, 1024 }, { -2 }}) << "\n"; std::cout << "overlaps +7: \n\t-> " << (haystack & seive {{ 0, 1024 }, { +7 }}) << "\n"; // can "multi-sieve" in 1 pass: std::cout << "overlaps of +7 or -2: \n\t-> " << (haystack & seive {{ 0, 1024 }, { -2, +7 }}) << "\n"; // , more interesting, can narrow of "overlap" targets input range: container combined; combined += seive {{ 128, 512 }, { -2 }}; combined += seive {{ 512, 768 }, { +7 }}; std::cout << "complex query " << combined << ": \n\t-> " << (haystack & combined) << "\n"; } #include <random> #include <functional> container generate_random_data() { using namespace std; mt19937 prng {random_device{}()}; // mt19937 prng {3236629322}; auto card = bind(uniform_int_distribution<>(1, 5), ref(prng)); auto domain = bind(uniform_int_distribution<size_t>(0, 1024), ref(prng)); auto codomain = bind(uniform_int_distribution<>(-10, 10), ref(prng)); auto gen_val = [&] { container::value_type::second_type v; generate_n(inserter(v, end(v)), card(), codomain); return v; }; auto gen_range = [&]() -> container::value_type { auto left = domain(), right = domain(); if (left>right) swap(left, right); return { container::interval_type::right_open ( left, right ), gen_val() }; }; { container data; generate_n(icl::inserter(data, end(data)), prng() % 1024, gen_range); return data; } } which prints (for particular seed)
haystack: {([0,1)->{-10 3 })([1,5)->{-10 })([5,13)->{-9 -1 3 7 10 })([13,17)->{-4 1 6 10 })([17,69)->{0 2 4 8 })([69,81)->{-7 -4 9 })([81,97)->{1 8 10 })([97,102)->{6 })([102,701)->{-10 -2 1 3 7 })([701,987)->{3 })([987,1002)->{-9 6 7 8 10 })([1002,1003)->{2 3 4 5 9 })([1003,1004)->{-9 -7 3 9 })([1004,1024)->{-10 -2 2 5 })} overlaps -2: -> {([102,701)->{-2 })([1004,1024)->{-2 })} overlaps +7: -> {([5,13)->{7 })([102,701)->{7 })([987,1002)->{7 })} overlaps of +7 or -2: -> {([5,13)->{7 })([102,701)->{-2 7 })([987,1002)->{7 })([1004,1024)->{-2 })} complex query {([128,512)->{-2 })([512,768)->{7 })}: -> {([128,512)->{-2 })([512,701)->{7 })} you can run sample more times see results other randomly generated data sets.
Comments
Post a Comment