Spark Word2Vec example using text8 file -


i'm trying run example apache.spark.org (code below & entire tutorial here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html) using text8 file reference on site (http://mattmahoney.net/dc/text8.zip):

import org.apache.spark._ import org.apache.spark.rdd._ import org.apache.spark.sparkcontext._ import org.apache.spark.mllib.feature.{word2vec, word2vecmodel}  val input = sc.textfile("/users/rkita/documents/learning/random/spark/mllib/examples/text8",4).map(line => line.split(" ").toseq)  val word2vec = new word2vec()  val model = word2vec.fit(input)  val synonyms = model.findsynonyms("china", 40)  for((synonym, cosinesimilarity) <- synonyms) {   println(s"$synonym $cosinesimilarity") }  // save , load model model.save(sc, "mymodelpath") val samemodel = word2vecmodel.load(sc, "mymodelpath") 

i working on spark on mac (2 cores, 8gb ram), , think i've set memory allocations correctly in spark-env.sh file following:

export spark_executor_memory=4g export spark_worker_memory=4g 

when try fit model, keep getting java heap errors. got same result in python well. increased java memory sizes using java_opts well.

the file 100mb, think somehow memory settings not correct, i'm not sure if that's root cause.

has else tried example on laptop?

i can't put file on our company servers because we're not supposed import external data, i'm reduced working on personal laptop. if have suggestions, i'd appreciate hearing them. thx!

first of all, newcomer spark, others may have quicker or better solutions. ran same difficulties run sample code. manage make work, by:

  1. running own spark cluster on machine: use start scripts in /sbin/ directory of spark installation. so, have configure conf/spark-env.sh file according needs. not use 127.0.0.1 ip spark.
  2. compile , package scala code jar (sbt package), provide cluster (see addjar(...) in scala code). seems possible provide compiled code spark using classpath / classpath, did not try yet.
  3. set executor memory , driver memory (see scala code)

spark-env.sh:

export spark_master_ip=192.168.1.53 export spark_master_port=7077 export spark_master_webui_port=8080  export spark_daemon_memory=1g # worker : 1 server # number of worker instances run on each machine (default: 1).  # can make more 1 if have have large machines , multiple spark worker processes.  # if set this, make sure set spark_worker_cores explicitly limit cores per worker,  # or else each worker try use cores. export spark_worker_instances=2 # total number of cores allow spark applications use on machine (default: available cores). export spark_worker_cores=7  #total amount of memory allow spark applications use on machine, e.g. 1000m, 2g  # (default: total memory minus 1 gb);  # note each application's individual memory configured using spark.executor.memory property. export spark_worker_memory=8g export spark_worker_dir=/tmp  # executor : 1 application run on server # export spark_executor_instances=4 # export spark_executor_memory=4g  export spark_scala_version="2.10" 

scala file run example:

import org.apache.spark.sparkcontext import org.apache.spark.sparkcontext._ import org.apache.spark.sparkconf import org.apache.log4j.logger import org.apache.log4j.level import org.apache.spark.mllib.feature.{word2vec, word2vecmodel}  object sparkdemo {    def log[a](key:string)(job : =>a) = {     val start = system.currenttimemillis     val output = job     println("===> %s in %s seconds"       .format(key, (system.currenttimemillis - start) / 1000.0))     output   }    def main(args: array[string]):unit ={      val modelname ="w2vmodel"      val sc = new sparkcontext(       new sparkconf()       .setappname("sparkdemo")       .set("spark.executor.memory", "8g")       .set("spark.driver.maxresultsize", "16g")       .setmaster("spark://192.168.1.53:7077") // ip of spark master.       // .setmaster("local[2]") // not work... workers loose contact master after 120s     )      // take target folder if unsure how jar named     // onliner compile / run : sbt package && sbt run     sc.addjar("./target/scala-2.10/sparkling_2.10-0.1.jar")      val input = sc.textfile("./text8").map(line => line.split(" ").toseq)      val word2vec = new word2vec()      val model = log("compute model") { word2vec.fit(input) }     log ("save model") { model.save(sc, modelname) }      val synonyms = model.findsynonyms("china", 40)     for((synonym, cosinesimilarity) <- synonyms) {       println(s"$synonym $cosinesimilarity")     }      val model2 = log("reload model") { word2vecmodel.load(sc, modelname) }   } } 

Comments