apache spark - GoogleHadoopFileSystem cannot be cast to hadoop FileSystem? -


the original question trying deploy spark 1.4 on google cloud. after downloaded , set

spark_hadoop2_tarball_uri='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz' 

deployment bdutil fine; however, when trying call sqlcontext.parquetfile("gs://my_bucket/some_data.parquet"), runs following exception:

 java.lang.classcastexception: com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem cannot cast org.apache.hadoop.fs.filesystem @ org.apache.hadoop.fs.filesystem.createfilesystem(filesystem.java:2595) @ org.apache.hadoop.fs.filesystem.access$200(filesystem.java:91) @ org.apache.hadoop.fs.filesystem$cache.getinternal(filesystem.java:2630) @ org.apache.hadoop.fs.filesystem$cache.get(filesystem.java:2612) @ org.apache.hadoop.fs.filesystem.get(filesystem.java:370) @ org.apache.hadoop.fs.filesystem.get(filesystem.java:169) @ org.apache.hadoop.fs.filesystem.get(filesystem.java:354) @ org.apache.hadoop.fs.path.getfilesystem(path.java:296) @ org.apache.hadoop.hive.metastore.warehouse.getfs(warehouse.java:112) @ org.apache.hadoop.hive.metastore.warehouse.getdnspath(warehouse.java:144) @ org.apache.hadoop.hive.metastore.warehouse.getwhroot(warehouse.java:159) 

and confused me googlehadoopfilesystem should subclass of org.apache.hadoop.fs.filesystem, , verified in same spark-shell instance:

scala> var gfs = new com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem() gfs: com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem = com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem@46f105c  scala> gfs.isinstanceof[org.apache.hadoop.fs.filesystem] res3: boolean = true  scala> gfs.asinstanceof[org.apache.hadoop.fs.filesystem] res4: org.apache.hadoop.fs.filesystem = com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem@46f105c 

did miss anything, workaround? in advance!

update: bdutil (version 1.3.1) setting deployment:

import_env hadoop2_env.sh import_env extensions/spark/spark_env.sh configbucket="my_conf_bucket" project="my_proj" gce_image='debian-7-backports' gce_machine_type='n1-highmem-4' gce_zone='us-central1-f' gce_network='my-network' gce_master_machine_type='n1-standard-2' preemptible_fraction=1.0 prefix='my-hadoop' num_workers=8 use_attached_pds=true worker_attached_pds_size_gb=200 master_attached_pd_size_gb=200 hadoop_tarball_uri="gs://hadoop-dist/hadoop-2.6.0.tar.gz" spark_mode="yarn-client" spark_hadoop2_tarball_uri='gs://my_conf_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz' 

short answer

indeed related isolatedclientloader, , we've tracked down root cause , verified fix. filed https://issues.apache.org/jira/browse/spark-9206 track issue, , built clean spark tarball fork simple fix: https://github.com/apache/spark/pull/7549

there few short-term options:

  1. use spark 1.3.1 now.
  2. in bdutil deployment, use hdfs default filesystem (--default_fs=hdfs); you'll still able directly specify gs:// paths in jobs, hdfs used intermediate data , staging files. there minor incompatibilities using raw hive in mode, though.
  3. use raw val sqlcontext = new org.apache.spark.sql.sqlcontext(sc) instead of hivecontext if don't need hivecontext features.
  4. git clone https://github.com/dennishuo/spark , run ./make-distribution.sh --name my-custom-spark --tgz --skip-java-test -pyarn -phadoop-2.6 -dhadoop.version=2.6.0 -phive -phive-thriftserver fresh tarball can specify in bdutil's spark_env.sh.

long answer

we've verified manifests when fs.default.name , fs.defaultfs set gs:// path regardless of whether trying load path parquetfile("gs://...") or parquetfile("hdfs://..."), , when fs.default.name , fs.defaultfs set hdfs path, loading data both hdfs , gcs works fine. specific spark 1.4+ currently, , not present in spark 1.3.1 or older.

the regression appears have been introduced in https://github.com/apache/spark/commit/9ac8393663d759860c67799e000ec072ced76493 fixes prior related classloading issue, spark-8368. while fix correct normal cases, there's method isolatedclientloader.issharedclass used determine classloader use, , interacts aforementioned commit break googlehadoopfilesystem classloading.

the following lines in file include under com.google.* "shared class" because of guava , possibly protobuf dependencies indeed loaded shared libraries, unfortunately googlehadoopfilesystem should loaded "hive class" in case, org.apache.hadoop.hdfs.distributedfilesystem. happen unluckily share com.google.* package namespace.

protected def issharedclass(name: string): boolean =   name.contains("slf4j") ||   name.contains("log4j") ||   name.startswith("org.apache.spark.") ||   name.startswith("scala.") ||   name.startswith("com.google") ||   name.startswith("java.lang.") ||   name.startswith("java.net") ||   sharedprefixes.exists(name.startswith)  ...  /** classloader used load isolated version of hive. */ protected val classloader: classloader = new urlclassloader(alljars, rootclassloader) {   override def loadclass(name: string, resolve: boolean): class[_] = {     val loaded = findloadedclass(name)     if (loaded == null) doloadclass(name, resolve) else loaded   }    def doloadclass(name: string, resolve: boolean): class[_] = {     ...     } else if (!issharedclass(name)) {       logdebug(s"hive class: $name - ${getresource(classtopath(name))}")       super.loadclass(name, resolve)     } else {       // shared classes, delegate baseclassloader.       logdebug(s"shared class: $name")       baseclassloader.loadclass(name)     }   } } 

this can verified adding following line ${spark_install}/conf/log4j.properties:

log4j.logger.org.apache.spark.sql.hive.client=debug 

and output shows:

... 15/07/20 20:59:14 debug isolatedclientloader: hive class: org.apache.hadoop.hdfs.distributedfilesystem - jar:file:/home/hadoop/spark-install/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/hadoop/hdfs/distributedfilesystem.class ... 15/07/20 20:59:14 debug isolatedclientloader: shared class: com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem java.lang.runtimeexception: java.lang.classcastexception: com.google.cloud.hadoop.fs.gcs.googlehadoopfilesystem cannot cast org.apache.hadoop.fs.filesystem 

Comments