I am trying to run the following. I am trying to divide the data into multiple parts, apply operations to each part, and join the results. While "take and foreach" works fine, the "count" operation fails with a stack overflow exception.

    // studenttablerdd is an RDD of data read from the student table
    // the student table contains data related to each student
    val studentscoringlist = studenttablerdd
      .map(data => data(student_id_idx))
      .distinct
      .collect
      .map { studentid => studenttablerdd.filter(x => x(student_id_idx) == studentid) }

    val studentprofilingrdd = studentscoringlist
      .map(data => scorestudentdata(1, data, trained_studentmodellist))
      .filter(_ != null)
      .reduce(_.union(_))

    studentprofilingrdd.take(10).foreach(println(_))
    studentprofilingrdd.count // throws stack overflow exception

    val studentscoringlist = studenttablerdd.map(data => data(student_id_idx)).distinct.collect.map{studentid => {studenttablerdd.filter(x => x(student_id_idx) == studentid)}}

You get a list of RDDs from the source RDD. Each RDD holds the data for one unique studentid, and together the set of RDDs equals studenttablerdd, of course. That is strange, to say the least. No work is done on the data here: there is one heavy operation (collect) and a lot of lazy transformations. (Useless splitting and computation?)

    val studentrdd = studentscoringlist.map(data => scorestudentdata(1,data,trained_studentmodellist))

This transforms each datum, which is fine. (One step, useless for now.)

    filter(_!=null)

If scorestudentdata can return null, the code is wrong. It is bad style. (One step, useless for now.)

    reduce(_.union(_))

This joins the RDDs back together. And again, it is one more useless step.

This code gets the same result:

    studenttablerdd map { data =>
      val score = scorestudentdata(1, data, trained_studentmodellist)
      if (score == null) None else Some(score)
    } collect { case Some(score) => score }

But I suppose that is not the purpose.
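
For illustration, here is a minimal, self-contained sketch of the single-pass idea, run in Spark local mode. It assumes that scoring can be done row by row, which may not match the real scorestudentdata. The names SinglePassScoring, StudentIdIdx, scoreRow and scoreAll, the row type Array[String], and the sample rows are all hypothetical stand-ins, since the question shows neither the table schema nor the scoring function.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SinglePassScoring {
  // Hypothetical column index; the real student_id_idx is not shown in the question.
  val StudentIdIdx = 0

  // Hypothetical stand-in for scorestudentdata: returns null for rows it cannot score.
  def scoreRow(row: Array[String]): String =
    if (row.length < 2 || row(1).isEmpty) null
    else s"${row(StudentIdIdx)}:${row(1).toInt * 2}"

  // One lazy transformation over the whole table, with no per-student split
  // and no union. Option(...) turns null into None, and flatMap drops the Nones.
  def scoreAll(table: RDD[Array[String]]): RDD[String] =
    table.flatMap(row => Option(scoreRow(row)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-pass-scoring").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows: (student id, raw score).
      val table = sc.parallelize(Seq(
        Array("s1", "10"), Array("s2", ""), Array("s1", "7"), Array("s3", "3")))

      val scored = scoreAll(table)
      scored.take(10).foreach(println)
      // The row with the empty score column is dropped, so 3 of the 4 rows remain.
      println(scored.count())
    } finally {
      sc.stop()
    }
  }
}
```

Wrapping the result in Option and using flatMap is one way to avoid the separate null filter. The resulting RDD has a single short lineage instead of one union per student.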