apache spark - Which is more efficient: DataFrame, RDD, or HiveQL?


I am a newbie to Apache Spark.

My job is to read two CSV files, select some specific columns from them, merge them, aggregate, and write the result into a single CSV file.

For example,

csv1

name,age,department_id

csv2

department_id,department_name,location

I want a third CSV file with

name,age,department_name

I am loading both CSVs into DataFrames, and I am able to get the third DataFrame using several of the methods present on DataFrame: join, select, filter, and drop.

I am also able to do the same using several rdd.map() calls.

And I am able to do the same by executing HiveQL using HiveContext.

I want to know which is the most efficient way if my CSV files are huge, and why?

Both DataFrames and Spark SQL queries are optimized using the Catalyst engine, so I would guess they will produce similar performance (assuming you are using version >= 1.3).

And both should be better than simple RDD operations, because with RDDs, Spark doesn't have any knowledge about the types of your data, so it can't do any special optimizations.

