i have simple setup of sparksql connecting postgres db , i'm trying dataframe table, dataframe number x of partitions (lets 2). code following:
map<string, string> options = new hashmap<string, string>(); options.put("url", db_url); options.put("driver", postgres_driver); options.put("dbtable", "select id, other table limit 1000"); options.put("partitioncolumn", "id"); options.put("lowerbound", "100"); options.put("upperbound", "500"); options.put("numpartitions","2"); dataframe housingdataframe = sqlcontext.read().format("jdbc").options(options).load(); for reason, 1 partition of dataframe contains rows.
for can understand lowerbound/upperbound parameters used finetune this. in sparksql's documentation (spark 1.4.0 - spark-sql_2.11) says used define stride, not filter/range partition column. raises several questions:
- the stride frequency (number of elements returned each query) spark query db each executor (partition)?
- if not, purpose of parameters, depend on , how can balance dataframe partitions in stable way (not asking partitions contain same number of elements, there equilibrium - example 2 partitions 100 elements 55/45 , 60/40 or 65/35 do)
can't seem find clear answer these questions around , wondering if maybe of clear points me, because right affecting cluster performance when processing x million rows , heavy lifting goes 1 single executor.
cheers , time.
essentially lower , upper bound , number of partitions used calculate increment or split each parallel task.
let's table has partition column "year", , has data 2006 2016.
if define number of partitions 10, lower bound 2006 , higher bound 2016, have each task fetching data own year - ideal case.
even if incorrectly specify lower , / or upper bound, e.g. set lower = 0 , upper = 2016, there skew in data transfer, but, not "lose" or fail retrieve data, because:
the first task fetch data year < 0.
the second task fetch data year between 0 , 2016/10.
the third task fetch data year between 2016/10 , 2*2016/10.
...
and last task have condition year->2016.
t.
Comments
Post a Comment