pyspark - What are the best practices to partition Parquet files by timestamp in Spark? -


i'm pretty new spark (2 days) , i'm pondering best way partition parquet files.

my rough plan atm is:

  • read in source tsv files com.databricks.spark.csv (these have timestamptype column)
  • write out parquet files, partitioned year/month/day/hour
  • use these parquet files queries that'll occurring in future

it's been ludicrously easy (kudos spark devs) simple version working - except partitioning way i'd to. in python btw:

input = sqlcontext.read.format('com.databricks.spark.csv').load(source, schema=myschema) input.write.partitionby('type').format("parquet").save(dest, mode="append") 

is best approach map rdd, adding new columns year, month, day, hour , use partitionby? queries have manually add year/month etc? given how elegant i've found spark far, seems little odd.

thanks

i've found few ways now, not yet run performance tests on them, caveat emptor:

first need create derived dataframe (three ways shown below) , write out.

1) sql queries (inline functions)

sqlcontext.registerfunction("day",lambda f: f.day, integertype()) input.registertemptable("input") input_ts = sqlcontext.sql(   "select day(inserted_at) inserted_at_day, * input") 

2) sql queries (non-inline) - similar

def day(ts):   return f.day sqlcontext.registerfunction("day",day, integertype()) ... rest before 

3) withcolumn

from pyspark.sql.functions import udf day = udf(lambda f: f.day, integertype()) input_ts = input.withcolumn('inserted_at_day',day(input.inserted_at)) 

to write out just:

input_ts.write.partitionby(['inserted_at_day']).format("parquet").save(dest, mode="append") 

Comments