scala - Spark caching strategy


I have a Spark driver that goes like this:

Edit: an earlier version of this code was different and didn't work.

    var totalresult = ... // RDD[(key, value)]
    var stageresult = totalresult
    do {
      stageresult = stageresult.flatMap(
        // code that returns 0 or more outputs per input,
        // and updates `acc` with the number of outputs
        ...
      ).reduceByKey((x, y) => x.sum(y))
      totalresult = totalresult.union(stageresult)
    } while (stageresult.count() > 0)

I know from the properties of my data that this will eventually terminate (I'm essentially aggregating the nodes in a DAG).

I'm not sure of a reasonable caching strategy here. Should I cache stageresult each time through the loop? Am I setting up a horrible tower of recursion, since each totalresult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?

Suggestions are welcome here, thanks.

I would rewrite it as follows:

    do {
      stageresult = stageresult.flatMap(
        // some code that returns 0 or more outputs per input
      ).reduceByKey(_ + _).cache()
      totalresult = totalresult.union(stageresult)
    } while (stageresult.count() > 0)

I am fairly certain (95%) that the stageresult DAG used in the union will be the correct reference (especially since count should trigger it), but that might need to be double-checked.

Then, when you call an action on totalresult, it will put all of the cached data together.
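To make the loop above concrete, here is a minimal runnable sketch under some assumptions that are not in the question: a toy DAG stored in a hypothetical `children` map, `Int` weights summed with `_ + _`, and a local master. The names `CachedLoopSketch` and `aggregate` are made up for illustration. It uses `do`/`while` as in the question, which is Scala 2 syntax.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CachedLoopSketch {
  // Hypothetical loop body: push each node's weight down to its children.
  def aggregate(sc: SparkContext): Array[(String, Int)] = {
    // Hypothetical toy DAG: node -> children. "d" is a leaf.
    val children: Map[String, Seq[String]] = Map(
      "a" -> Seq("b", "c"),
      "b" -> Seq("d"),
      "c" -> Seq("d"),
      "d" -> Seq.empty
    )

    var totalresult: RDD[(String, Int)] = sc.parallelize(Seq("a" -> 1))
    var stageresult: RDD[(String, Int)] = totalresult

    do {
      stageresult = stageresult
        // 0 or more outputs per input: one pair per child of the node
        .flatMap { case (node, weight) =>
          children.getOrElse(node, Seq.empty).map(child => child -> weight)
        }
        .reduceByKey(_ + _)
        .cache() // marks this stage for caching; count() below materializes it
      totalresult = totalresult.union(stageresult)
    } while (stageresult.count() > 0)

    // The action on the union reads the stages that were cached in the loop.
    totalresult.collect().sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("cached-loop-sketch")
      .getOrCreate()
    try aggregate(spark.sparkContext).foreach(println)
    finally spark.stop()
  }
}
```

With this toy input the loop runs three times: the first stage yields b and c, the second yields d with weight 2, and the third is empty, which ends the loop.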

Answer based on the updated question

As long as you have the memory space, then yes, cache along the way: that stores the data of each stageresult, and all of those data points are unioned at the end. In fact, each union does not rely on the past, because that is not the semantics of RDD.union; it merely puts the RDDs together at the end. You could also change your code to use a val, since RDDs are immutable.

As a final note, maybe the DAG visualization will help you understand why there are no recursive ramifications:
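For the "put each RDD in an array and take one big union at the end" variant, a sketch along these lines might work. It is only an illustration: `UnionAtEndSketch`, `unionAllStages`, and the `step` parameter (a stand-in for the flatMap + reduceByKey stage) are hypothetical names, and the `(String, Int)` element type is an assumption. It uses only vals, and `SparkContext.union`, which accepts a sequence of RDDs.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

object UnionAtEndSketch {
  type Pairs = RDD[(String, Int)] // assumed element type, for illustration

  // Applies `step` until a stage comes back empty, then unions every
  // non-empty stage (plus the initial RDD) in a single call.
  def unionAllStages(sc: SparkContext, initial: Pairs, step: Pairs => Pairs): Pairs = {
    @tailrec
    def loop(current: Pairs, stages: Vector[Pairs]): Vector[Pairs] = {
      val next = step(current).cache()
      // count() is an action, so it also materializes the cached stage
      if (next.count() > 0) loop(next, stages :+ next) else stages
    }
    sc.union(loop(initial, Vector(initial)))
  }
}
```

As far as I know, the practical difference is only in the shape of the lineage: chaining `totalresult.union(stageresult)` nests one union inside another per iteration, while `sc.union` over the collected stages builds one union over all of them. Either way the cached stages are what gets read when an action runs.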

[Image: DAG visualization]

