apache spark - SparkSQL PostgresQL Dataframe partitions -


i have simple setup of sparksql connecting postgres db , i'm trying dataframe table, dataframe number x of partitions (lets 2). code following:

map<string, string> options = new hashmap<string, string>(); options.put("url", db_url); options.put("driver", postgres_driver); options.put("dbtable", "select id, other table limit 1000"); options.put("partitioncolumn", "id"); options.put("lowerbound", "100"); options.put("upperbound", "500"); options.put("numpartitions","2"); dataframe housingdataframe = sqlcontext.read().format("jdbc").options(options).load(); 

for reason, 1 partition of dataframe contains rows.

for can understand lowerbound/upperbound parameters used finetune this. in sparksql's documentation (spark 1.4.0 - spark-sql_2.11) says used define stride, not filter/range partition column. raises several questions:

  1. the stride frequency (number of elements returned each query) spark query db each executor (partition)?
  2. if not, purpose of parameters, depend on , how can balance dataframe partitions in stable way (not asking partitions contain same number of elements, there equilibrium - example 2 partitions 100 elements 55/45 , 60/40 or 65/35 do)

can't seem find clear answer these questions around , wondering if maybe of clear points me, because right affecting cluster performance when processing x million rows , heavy lifting goes 1 single executor.

cheers , time.

essentially lower , upper bound , number of partitions used calculate increment or split each parallel task.

let's table has partition column "year", , has data 2006 2016.

if define number of partitions 10, lower bound 2006 , higher bound 2016, have each task fetching data own year - ideal case.

even if incorrectly specify lower , / or upper bound, e.g. set lower = 0 , upper = 2016, there skew in data transfer, but, not "lose" or fail retrieve data, because:

the first task fetch data year < 0.

the second task fetch data year between 0 , 2016/10.

the third task fetch data year between 2016/10 , 2*2016/10.

...

and last task have condition year->2016.

t.


Comments