join - Shuffling on Spark cartesian product


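Assume the following problem: I have an RDD x, I calculate its mean m on a single worker node, and I then want to calculate x - m, e.g. to compute standard deviations. I want this to happen in the cluster, not on the driver node, i.e. I want m to be distributed. I thought of implementing this as a cartesian product of the two RDDs: m gets calculated, propagates to the workers, and they calculate x - m. My fear is that Spark will instead shuffle all the x's to where m lives and do the subtraction there. Is there any guarantee on what gets shuffled in the case of x.cartesian(m)?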

The mean/stdev problem above is for illustration purposes only. I know it's not an excellent example, but it's simple enough.

