I am trying to use Apache Spark to query my data in Elasticsearch, but my Spark job has been taking about 20 hours to do an aggregation and is still running. The same query in ES takes about 6 seconds.
I understand that the data has to move from the Elasticsearch cluster to my Spark cluster, and that some data shuffling happens in Spark.
The data inside my ES index is approx. 300 million documents, and each document has about 400 fields (1.4 terabytes).
I've got a 3-node Spark cluster (1 master, 2 workers) with 60 GB of memory and 8 cores in total.
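To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Java. It assumes decimal units (1 TB = 10^12 bytes) and that the 60 GB figure is the total memory of the cluster; the class and method names are made up for illustration.

```java
public class ScanEstimate {
    // Figures from the question; decimal units are an assumption.
    static final double INDEX_BYTES = 1.4e12;          // 1.4 TB index
    static final double DOCUMENTS = 300e6;             // approx. 300 million documents
    static final double FIELDS_PER_DOC = 400;          // approx. 400 fields each
    static final double CLUSTER_MEMORY_BYTES = 60e9;   // assumed to be the cluster total

    // Average size of one document in bytes.
    static double bytesPerDocument() {
        return INDEX_BYTES / DOCUMENTS;
    }

    // Average size of one field in bytes.
    static double bytesPerField() {
        return bytesPerDocument() / FIELDS_PER_DOC;
    }

    // How many times larger the index is than the available memory.
    static double dataToMemoryRatio() {
        return INDEX_BYTES / CLUSTER_MEMORY_BYTES;
    }

    public static void main(String[] args) {
        System.out.printf("bytes per document: %.0f%n", bytesPerDocument());
        System.out.printf("bytes per field:    %.1f%n", bytesPerField());
        System.out.printf("data / memory:      %.1f%n", dataToMemoryRatio());
    }
}
```

Under these assumptions each document averages roughly 4.7 KB, and the index is about 23 times larger than the cluster's memory, so a query that reads every field of every document cannot keep the data set in memory.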
The time it takes to run is not acceptable. Is there a way to make my Spark job run faster?
Here is my Spark configuration:
// cpus is a variable from the surrounding program and is not shown in the question
SparkConf sparkConf = new SparkConf(true).setAppName("sparkqueryapp")
    .setMaster("spark://10.0.0.203:7077")
    .set("es.nodes", "10.0.0.207")
    .set("es.cluster", "wp-es-reporting-prod")
    .setJars(JavaSparkContext.jarOfClass(Demo.class))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", String.valueOf(cpus * 2))
    .set("spark.executor.memory", "8g");
SparkContext sparkCtx = new SparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkCtx);
// Load the whole index/type as a DataFrame
DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
// cleanSchema is my own helper (not shown) that rewrites the DataFrame schema
DataFrame dfCleaned = cleanSchema(sqlContext, df);
dfCleaned.registerTempTable("rpt");
DataFrame sqlDFTest = sqlContext.sql("SELECT agent, count(request_type) FROM rpt GROUP BY agent");
for (Row row : sqlDFTest.collect()) {
    System.out.println(">> " + row);
}
I figured out what was going on. Basically, I was trying to manipulate the DataFrame schema because I have some fields with a dot in the name, e.g. user.firstname. This seems to cause a problem in the collect phase of Spark. To resolve this, I had to re-index my data so that the fields no longer have a dot but an underscore, e.g. user_firstname.
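For illustration, the renaming step of such a re-index could look roughly like the sketch below. This is not the code I actually used; FieldNameSanitizer and its methods are hypothetical names, the sample values are made up, and it only shows the key rewrite on a document held as a Map, not the reading from or writing to Elasticsearch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldNameSanitizer {
    // Replaces every dot in a field name with an underscore.
    static String sanitize(String fieldName) {
        return fieldName.replace('.', '_');
    }

    // Returns a copy of the document with sanitized keys.
    // Nested maps (sub-objects) are rewritten recursively.
    static Map<String, Object> sanitizeKeys(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : doc.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                value = sanitizeKeys(nested);
            }
            out.put(sanitize(entry.getKey()), value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("user.firstname", "Ada");
        doc.put("agent", "Mozilla");
        System.out.println(sanitizeKeys(doc)); // prints {user_firstname=Ada, agent=Mozilla}
    }
}
```

One thing to watch for: if a document already contains both user.firstname and user_firstname, the two keys collide after renaming and the later one overwrites the earlier one.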