The original question: I am trying to deploy Spark 1.4 on Google Cloud. After downloading it, I set

    SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'

Deployment with bdutil was fine; however, when trying to call sqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception:
    java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
        at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
        at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)

What confused me is that GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem, and I even verified that in the same spark-shell instance:
    scala> var gfs = new com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem()
    gfs: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c

    scala> gfs.isInstanceOf[org.apache.hadoop.fs.FileSystem]
    res3: Boolean = true

    scala> gfs.asInstanceOf[org.apache.hadoop.fs.FileSystem]
    res4: org.apache.hadoop.fs.FileSystem = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@46f105c

Did I miss anything? Is there a workaround? Thanks in advance!
Update: this is my bdutil (version 1.3.1) setting for the deployment:

    import_env hadoop2_env.sh
    import_env extensions/spark/spark_env.sh
    CONFIGBUCKET="my_conf_bucket"
    PROJECT="my_proj"
    GCE_IMAGE='debian-7-backports'
    GCE_MACHINE_TYPE='n1-highmem-4'
    GCE_ZONE='us-central1-f'
    GCE_NETWORK='my-network'
    GCE_MASTER_MACHINE_TYPE='n1-standard-2'
    PREEMPTIBLE_FRACTION=1.0
    PREFIX='my-hadoop'
    NUM_WORKERS=8
    USE_ATTACHED_PDS=true
    WORKER_ATTACHED_PDS_SIZE_GB=200
    MASTER_ATTACHED_PD_SIZE_GB=200
    HADOOP_TARBALL_URI="gs://hadoop-dist/hadoop-2.6.0.tar.gz"
    SPARK_MODE="yarn-client"
    SPARK_HADOOP2_TARBALL_URI='gs://my_conf_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz'
Short Answer
This is indeed related to IsolatedClientLoader, and we've tracked down the root cause and verified a fix. I filed https://issues.apache.org/jira/browse/spark-9206 to track this issue, and built a clean Spark tarball from my fork with a simple fix: https://github.com/apache/spark/pull/7549
There are a few short-term options:
- Use Spark 1.3.1 for now.
- In your bdutil deployment, use HDFS as your default filesystem (--default_fs=hdfs); you'll still be able to directly specify gs:// paths in your jobs, but HDFS will be used for intermediate data and staging files. There are minor incompatibilities using raw Hive in this mode, though.
- Use the raw val sqlContext = new org.apache.spark.sql.SQLContext(sc) instead of a HiveContext if you don't need HiveContext features.
- git clone https://github.com/dennishuo/spark, and run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver to get a fresh tarball that you can specify in bdutil's spark_env.sh.
Long Answer
We've verified that the problem manifests when fs.default.name and fs.defaultFS are set to a gs:// path, regardless of whether you are trying to load a path with parquetFile("gs://...") or parquetFile("hdfs://..."), and that when fs.default.name and fs.defaultFS are set to an HDFS path, loading data from both HDFS and GCS works fine. It is also specific to Spark 1.4+ currently, and is not present in Spark 1.3.1 or older.
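As a rough illustration of the condition described above, the sketch below checks whether a configuration sets either key to a gs:// URI. The object name DefaultFsCheck, the helper usesGcsDefault, and the plain Map standing in for a Hadoop Configuration are assumptions made for illustration; they are not part of bdutil, Hadoop, or Spark.

```scala
import java.net.URI

object DefaultFsCheck {
  // The two key names mentioned in the answer (legacy and current).
  val defaultFsKeys: Seq[String] = Seq("fs.default.name", "fs.defaultFS")

  // Hypothetical helper: true if any default-filesystem key points at a gs:// URI.
  def usesGcsDefault(conf: Map[String, String]): Boolean =
    defaultFsKeys.flatMap(key => conf.get(key)).exists { value =>
      val scheme = URI.create(value).getScheme
      scheme != null && scheme.equalsIgnoreCase("gs")
    }
}
```

With a map such as Map("fs.defaultFS" -> "gs://my_conf_bucket/") the helper returns true, which corresponds to the failing setup; an hdfs:// value corresponds to the setup reported to work.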
The regression appears to have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493, which fixes a prior related classloading issue, SPARK-8368. While that fix is correct for normal cases, there's a method IsolatedClientLoader.isSharedClass that is used to determine which classloader to use, and it interacts with the aforementioned commit to break GoogleHadoopFileSystem classloading.
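The answer does not spell out why a classloader mix-up ends in a ClassCastException, so the following is only a generic JVM illustration, not Spark code. On the JVM a class is identified by its name together with the loader that defined it, so the same bytes defined by two loaders yield two incompatible classes. All names here (LoaderIdentitySketch, Payload, IsolatingLoader, castFails) are hypothetical, and the sketch assumes it is compiled so that the class file is readable as a resource.

```scala
import java.io.ByteArrayOutputStream

object LoaderIdentitySketch {
  // Stand-in for a class that ends up visible through two loaders.
  class Payload

  // Defines exactly one class itself; everything else is delegated to the parent.
  class IsolatingLoader(parentLoader: ClassLoader, isolated: String)
      extends ClassLoader(parentLoader) {

    override def loadClass(name: String, resolve: Boolean): Class[_] =
      if (name != isolated) super.loadClass(name, resolve)
      else {
        val loaded = findLoadedClass(name)
        if (loaded != null) loaded else defineFromParentBytes(name)
      }

    // Reads the class file through the parent and defines a second copy here.
    private def defineFromParentBytes(name: String): Class[_] = {
      val in = parentLoader.getResourceAsStream(name.replace('.', '/') + ".class")
      require(in != null, s"class bytes for $name not found")
      try {
        val out = new ByteArrayOutputStream()
        val buf = new Array[Byte](4096)
        var n = in.read(buf)
        while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
        val bytes = out.toByteArray
        defineClass(name, bytes, 0, bytes.length)
      } finally in.close()
    }
  }

  // True when an instance of the isolated copy cannot be cast to the original class.
  def castFails(): Boolean = {
    val name = classOf[Payload].getName
    val loader = new IsolatingLoader(classOf[Payload].getClassLoader, name)
    val instance: AnyRef =
      loader.loadClass(name).getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    try { classOf[Payload].cast(instance); false }
    catch { case _: ClassCastException => true }
  }
}
```

This would also be consistent with the spark-shell check in the question succeeding: there, both classes are presumably resolved through a single loader.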
The following lines in that file include everything under com.google.* as a "shared class" because of the Guava and possibly protobuf dependencies, which are indeed loaded as shared libraries. Unfortunately, GoogleHadoopFileSystem should be loaded as a "hive class" in this case, just like org.apache.hadoop.hdfs.DistributedFileSystem. We just happen to unluckily share the com.google.* package namespace.
    protected def isSharedClass(name: String): Boolean =
      name.contains("slf4j") ||
      name.contains("log4j") ||
      name.startsWith("org.apache.spark.") ||
      name.startsWith("scala.") ||
      name.startsWith("com.google") ||
      name.startsWith("java.lang.") ||
      name.startsWith("java.net") ||
      sharedPrefixes.exists(name.startsWith)

    ...

    /** The classloader that is used to load an isolated version of Hive. */
    protected val classLoader: ClassLoader = new URLClassLoader(allJars, rootClassLoader) {
      override def loadClass(name: String, resolve: Boolean): Class[_] = {
        val loaded = findLoadedClass(name)
        if (loaded == null) doLoadClass(name, resolve) else loaded
      }

      def doLoadClass(name: String, resolve: Boolean): Class[_] = {
        ...
        } else if (!isSharedClass(name)) {
          logDebug(s"hive class: $name - ${getResource(classToPath(name))}")
          super.loadClass(name, resolve)
        } else {
          // For shared classes, we delegate to baseClassLoader.
          logDebug(s"shared class: $name")
          baseClassLoader.loadClass(name)
        }
      }
    }

This can be verified by adding the following line to ${SPARK_INSTALL}/conf/log4j.properties:
    log4j.logger.org.apache.spark.sql.hive.client=DEBUG

And the output shows:

    ...
    15/07/20 20:59:14 DEBUG IsolatedClientLoader: hive class: org.apache.hadoop.hdfs.DistributedFileSystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/DistributedFileSystem.class
    ...
    15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
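The classification behind those two log lines can be reproduced without Spark. The sketch below copies only the prefix checks quoted from isSharedClass into a standalone object; the object name SharedClassSketch, the label helper, and the empty sharedPrefixes are assumptions for illustration, and the other branches of doLoadClass that are elided in the excerpt are not modeled.

```scala
object SharedClassSketch {
  // Assumption: no extra shared prefixes are configured.
  val sharedPrefixes: Seq[String] = Seq.empty

  // Same prefix checks as the excerpt quoted in the answer.
  def isSharedClass(name: String): Boolean =
    name.contains("slf4j") ||
    name.contains("log4j") ||
    name.startsWith("org.apache.spark.") ||
    name.startsWith("scala.") ||
    name.startsWith("com.google") ||
    name.startsWith("java.lang.") ||
    name.startsWith("java.net") ||
    sharedPrefixes.exists(prefix => name.startsWith(prefix))

  // Hypothetical helper mirroring the wording of the debug log lines.
  def label(name: String): String =
    if (isSharedClass(name)) "shared class" else "hive class"

  def main(args: Array[String]): Unit =
    Seq(
      "org.apache.hadoop.hdfs.DistributedFileSystem",
      "org.apache.hadoop.fs.FileSystem",
      "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    ).foreach(name => println(s"${label(name)}: $name"))
}
```

Under these checks the com.google prefix alone is enough to route GoogleHadoopFileSystem to the base classloader, while the org.apache.hadoop.* classes stay on the isolated side.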