apache spark - "KeyError: 'SPARK_HOME'", "Can't load main class from JAR" in running PySpark as an Oozie workflow job


This issue is a continuation of a previous question, which was seemingly resolved but leads to the issue described here.

I am using Spark 1.4.0 on the Cloudera QuickStart VM CDH-5.4.0. When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:

    KeyError: 'SPARK_HOME'

I then came across this solution, which is for Spark 1.3.0, although I still tried it. The documentation seems to say that this issue is fixed in Spark versions 1.3.2 and 1.4.0 (but here I am, encountering the same issue).

The suggested solution in the link is that I need to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to anything, even if it is a path that does not point to the actual SPARK_HOME (i.e., /bogus, although I did set these to the actual SPARK_HOME).

Here's my workflow after that change:

    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourcemanager}</job-tracker>
        <name-node>${namenode}</name-node>
        <master>local[2]</master>
        <mode>client</mode>
        <name>${name}</name>
        <jar>${workflowrootlocal}/lib/my_pyspark_job.py</jar>
        <spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
    </spark>

This seems to solve the original problem above. However, it leads to another error when I inspect the stderr of the Oozie container log:

    Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark

If I am using Python, it should not expect a main class, right? Please note that, as described in my previous related post, the Oozie job example shipped with the Cloudera QuickStart VM CDH-5.4.0, which features a SparkAction written in Java, works in my tests. It seems the issue occurs only with Python.

I would appreciate it if anyone can help.
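
One possible reading of the second error, offered as a guess rather than a confirmed diagnosis: the `<spark-opts>` value has `--conf` in front of the first property only, and the path in the error message ends with the second property, which suggests that token was taken as the application JAR. The sketch below is a simplified model of whitespace tokenization, not the actual Oozie or spark-submit parser; the helper name `split_conf_and_stray` is made up for illustration.

```python
import shlex

# The <spark-opts> value from the workflow above.
SPARK_OPTS = (
    "--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark "
    "spark.executorEnv.SPARK_HOME=/usr/lib/spark"
)


def split_conf_and_stray(opts):
    """Split an option string into `--conf` values and stray tokens.

    Simplified model (an assumption, not the real parser): only
    `--conf VALUE` pairs are recognized; every other token is stray.
    """
    tokens = shlex.split(opts)
    confs, stray = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--conf" and i + 1 < len(tokens):
            # A --conf flag consumes exactly one following token.
            confs.append(tokens[i + 1])
            i += 2
        else:
            # Anything else is left over as a positional argument.
            stray.append(tokens[i])
            i += 1
    return confs, stray


confs, stray = split_conf_and_stray(SPARK_OPTS)
print("conf values: ", confs)
print("stray tokens:", stray)
```

If that reading is right, giving each property its own flag (`--conf first=... --conf second=...`) should leave no stray token for spark-submit to mistake for a JAR.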

Rather than setting the spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME variables, try adding the following line of code to your Python script before setting SparkConf():

    os.environ["SPARK_HOME"] = "/path/to/spark/installed/location"

I found the reference here.
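
A slightly fuller sketch of the same idea, in case it helps. It assumes Spark lives at `/usr/lib/spark` (the path used in the question's workflow); the helper `ensure_spark_home`, the constant `DEFAULT_SPARK_HOME`, and the app name are made up for illustration. The `pyspark` import is kept inside `main()` so the environment is prepared first.

```python
import os

# Assumed install location, taken from the question; adjust for your cluster.
DEFAULT_SPARK_HOME = "/usr/lib/spark"


def ensure_spark_home(path=DEFAULT_SPARK_HOME):
    """Set SPARK_HOME only if the container environment does not define it.

    Returns the value that ends up in os.environ.
    """
    return os.environ.setdefault("SPARK_HOME", path)


def main():
    # Prepare the environment before any SparkConf/SparkContext is created.
    ensure_spark_home()

    # Imported here so the module loads even where pyspark is unavailable.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("my_pyspark_job")  # hypothetical app name
    sc = SparkContext(conf=conf)
    try:
        print(sc.parallelize(range(10)).sum())
    finally:
        sc.stop()
```

Calling `main()` at the bottom of the job script runs it. Using `setdefault` rather than plain assignment means a `SPARK_HOME` already provided by the container is left untouched, which may or may not be what you want.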

This helped me resolve the error I was facing, but I then faced the following error:

    Traceback (most recent call last):
      File "/usr/hdp/current/spark-client/analyticsjar/boxplot_outlier.py", line 129, in <module>
        main()
      File "/usr/hdp/current/spark-client/analyticsjar/boxplot_outlier.py", line 60, in main
        sc = SparkContext(conf=conf)
      File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
      File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
      File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
      File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
      File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
