I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, merge them, aggregate, and write the result into a single CSV file.
For example:

csv1
name,age,department_id

csv2
department_id,department_name,location

I want to get a third CSV file with
name,age,department_name

I am loading both CSVs into DataFrames, and I am able to get the third DataFrame using several of the methods present on DataFrame (join, select, filter, drop).
I am also able to do the same using several rdd.map() calls,
and I am able to do the same by executing HiveQL through a HiveContext.
I want to know which is the most efficient way if my CSV files are huge, and why?
Both DataFrames and Spark SQL queries are optimized by the Catalyst engine, so I would guess they will produce similar performance (assuming you are using version >= 1.3).
And both should be better than simple RDD operations, because with RDDs Spark doesn't have any knowledge about the types of your data, so it can't do any special optimizations.