I'm pretty new to Spark (2 days), and I'm pondering the best way to partition Parquet files.
My rough plan at the moment is:
- Read in the source TSV files with com.databricks.spark.csv (these have a TimestampType column)
- Write out Parquet files, partitioned by year/month/day/hour
- Use these Parquet files for the queries that'll be occurring in future
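For illustration, the year/month/day/hour layout in the second step would look roughly like the Hive-style `column=value` directories that Spark writes for partitioned output. The sketch below is plain Python with no Spark involved; the function name `partition_path`, the base directory `dest` and the sample timestamp are assumptions made up for the example.

```python
from datetime import datetime


def partition_path(base, ts):
    """Build a Hive-style partition directory (col=value/...) for a timestamp."""
    return "{}/year={}/month={}/day={}/hour={}".format(
        base, ts.year, ts.month, ts.day, ts.hour)


# A row stamped 2015-06-03 14:25 would land under this directory.
print(partition_path("dest", datetime(2015, 6, 3, 14, 25)))
# dest/year=2015/month=6/day=3/hour=14
```

A query that only needs one hour of data can then, in principle, read a single directory instead of scanning everything.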
It's been ludicrously easy (kudos to the Spark devs) to get a simple version working, except for partitioning the way I'd like to. This is in Python, by the way:

    input = sqlContext.read.format('com.databricks.spark.csv').load(source, schema=myschema)
    input.write.partitionBy('type').format("parquet").save(dest, mode="append")

Is the best approach to map the RDD, adding new columns for year, month, day and hour, and then use partitionBy? And do queries then have to manually add year/month etc. as well? Given how elegant I've found Spark so far, this seems a little odd.
Thanks
I've found a few ways to do this now. I have not yet run performance tests on them, so caveat emptor:
First we need to create a derived DataFrame (three ways shown below) and then write it out.
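1) SQL queries (inline functions)

    from pyspark.sql.types import IntegerType

    # Register a Python lambda as a SQL function that extracts the day of month
    sqlContext.registerFunction("day", lambda f: f.day, IntegerType())
    input.registerTempTable("input")
    input_ts = sqlContext.sql(
        "select day(inserted_at) AS inserted_at_day, * from input")

2) SQL queries (non-inline) - very similar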
    def day(ts):
        return ts.day

    sqlContext.registerFunction("day", day, IntegerType())
    ... rest as before

3) withColumn
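    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Wrap the lambda as a UDF and add the derived column directly
    day = udf(lambda f: f.day, IntegerType())
    input_ts = input.withColumn('inserted_at_day', day(input.inserted_at))

To write out, just: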

    input_ts.write.partitionBy(['inserted_at_day']).format("parquet").save(dest, mode="append")
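As a further option, the derived columns could be built with the date functions that ship in `pyspark.sql.functions` (`year`, `month`, `dayofmonth`, `hour`) instead of Python UDFs, assuming a Spark version that includes them. The sketch below is untested against any particular cluster. The helper names `with_time_partitions`, `write_partitioned` and `partition_filter` are made up for illustration, and `inserted_at` is assumed to be the timestamp column. The last helper shows how a query might filter explicitly on the partition columns so that only the matching directories need to be read.

```python
from datetime import datetime

# Assumed names for the derived partition columns.
PARTITION_COLS = ["year", "month", "day", "hour"]


def with_time_partitions(df, ts_col="inserted_at"):
    """Return df with year/month/day/hour columns derived from ts_col."""
    # Imported here so this module still loads where Spark is not installed.
    from pyspark.sql import functions as F

    ts = F.col(ts_col)
    return (df
            .withColumn("year", F.year(ts))
            .withColumn("month", F.month(ts))
            .withColumn("day", F.dayofmonth(ts))
            .withColumn("hour", F.hour(ts)))


def write_partitioned(df, dest, ts_col="inserted_at"):
    """Append df to dest as Parquet, partitioned by the derived columns."""
    (with_time_partitions(df, ts_col)
        .write
        .partitionBy(*PARTITION_COLS)
        .format("parquet")
        .save(dest, mode="append"))


def partition_filter(ts):
    """Build a SQL predicate that selects the partition holding timestamp ts."""
    values = [ts.year, ts.month, ts.day, ts.hour]
    return " AND ".join(
        "{} = {}".format(col, val) for col, val in zip(PARTITION_COLS, values))


print(partition_filter(datetime(2015, 6, 3, 14, 25)))
# year = 2015 AND month = 6 AND day = 3 AND hour = 14
```

With the DataFrame from the question this might be used as `write_partitioned(input, dest)`, and later queried with something like `sqlContext.read.parquet(dest).filter(partition_filter(some_ts))`, where `some_ts` is a hypothetical `datetime` value.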