Elasticsearch + Apache Spark performance


I am trying to use Apache Spark to query my data in Elasticsearch, but my Spark job has been running an aggregation for about 20 hours and is still going. The same query in ES takes about 6 seconds.

I understand that the data has to move from the Elasticsearch cluster to my Spark cluster, and that there is some data shuffling in Spark.

The data inside my ES index is approximately 300 million documents, and each document has about 400 fields (1.4 terabytes in total).

I've got a 3-node Spark cluster (1 master, 2 workers) with 60 GB of memory and 8 cores in total.
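For a rough sense of scale, here is a back-of-the-envelope sketch of the numbers above. It assumes 1.4 TB means decimal terabytes and that the 60 GB and 8 cores are cluster-wide totals; the class and method names (`ClusterSizing`, `avgDocBytes`, and so on) are hypothetical and only there for illustration.

```java
// Back-of-the-envelope arithmetic on the figures quoted in the question.
// Assumptions: 1.4 TB = 1.4e12 bytes, 60 GB = 60e9 bytes, both cluster-wide totals.
final class ClusterSizing {
    static final double DATA_BYTES = 1.4e12;       // index size from the question
    static final long DOCUMENTS = 300_000_000L;    // document count from the question
    static final double MEMORY_BYTES = 60e9;       // total Spark memory from the question
    static final int CORES = 8;                    // total Spark cores from the question

    /** Average size of one document in bytes. */
    static double avgDocBytes() {
        return DATA_BYTES / DOCUMENTS;
    }

    /** How many times larger the index is than the cluster memory. */
    static double dataToMemoryRatio() {
        return DATA_BYTES / MEMORY_BYTES;
    }

    /** Documents each core has to process if the work is spread evenly. */
    static long docsPerCore() {
        return DOCUMENTS / CORES;
    }

    /** Sustained read rate (bytes/second) needed to scan the whole index in the given hours. */
    static double requiredBytesPerSecond(double hours) {
        return DATA_BYTES / (hours * 3600.0);
    }

    public static void main(String[] args) {
        System.out.println("avg doc bytes      : " + avgDocBytes());
        System.out.println("data / memory      : " + dataToMemoryRatio());
        System.out.println("docs per core      : " + docsPerCore());
        System.out.println("bytes/s for 20 hrs : " + requiredBytesPerSecond(20));
    }
}
```

Under those assumptions each document is roughly 4.7 KB, the index is about 23 times larger than the cluster memory, and a full scan in 20 hours corresponds to a sustained rate of only about 19 MB/s.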

The time it takes to run is not acceptable. Is there a way to make my Spark job run faster?

Here is my Spark configuration:

    // cpus is defined elsewhere in the application
    SparkConf sparkConf = new SparkConf(true)
            .setAppName("sparkqueryapp")
            .setMaster("spark://10.0.0.203:7077")
            .set("es.nodes", "10.0.0.207")
            .set("es.cluster", "wp-es-reporting-prod")
            .setJars(JavaSparkContext.jarOfClass(Demo.class))
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.default.parallelism", String.valueOf(cpus * 2))
            .set("spark.executor.memory", "8g");

Edited:

    SparkContext sparkCtx = new SparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(sparkCtx);

    // Load the ES index/type as a DataFrame
    DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
    DataFrame dfCleaned = cleanSchema(sqlContext, df);
    dfCleaned.registerTempTable("rpt");

    DataFrame sqlDFTest = sqlContext.sql("SELECT agent, count(request_type) FROM rpt GROUP BY agent");

    for (Row row : sqlDFTest.collect()) {
        System.out.println(">> " + row);
    }

I figured out what was going on. Basically, I was trying to manipulate the DataFrame schema because I have some fields with a dot in the name, e.g. user.firstname. This seems to cause a problem in the collect phase of Spark. To resolve this, I had to re-index my data so that the fields no longer contain a dot but an underscore instead, e.g. user_firstname.
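The renaming itself can be sketched in plain Java, independent of Spark or any Elasticsearch client. This is only an illustration, assuming each document is available as a nested `Map` before it is re-indexed; `FieldNameSanitizer`, `sanitize`, and `sanitizeDocument` are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rewrites dotted field names to use underscores
// before a document is re-indexed.
final class FieldNameSanitizer {

    /** Replaces every '.' in a field name with '_', e.g. user.firstname -> user_firstname. */
    static String sanitize(String fieldName) {
        return fieldName.replace('.', '_');
    }

    /** Returns a copy of the document with all keys sanitized, recursing into nested values. */
    static Map<String, Object> sanitizeDocument(Map<String, ?> doc) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : doc.entrySet()) {
            out.put(sanitize(entry.getKey()), sanitizeValue(entry.getValue()));
        }
        return out;
    }

    @SuppressWarnings("unchecked")
    private static Object sanitizeValue(Object value) {
        if (value instanceof Map) {
            // Nested object: sanitize its keys as well
            return sanitizeDocument((Map<String, ?>) value);
        }
        if (value instanceof List) {
            // Array: sanitize each element, since elements may be objects
            List<Object> copy = new ArrayList<>();
            for (Object item : (List<?>) value) {
                copy.add(sanitizeValue(item));
            }
            return copy;
        }
        return value; // scalars are left untouched
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("user.firstname", "Ada");
        doc.put("agent", "a1");
        System.out.println(sanitizeDocument(doc)); // prints {user_firstname=Ada, agent=a1}
    }
}
```

One caveat with this approach: if a document already contains both `user.firstname` and `user_firstname`, the two keys collide after renaming and the later one wins, so it is worth checking for that before re-indexing.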

