I am working with ~120 GB of CSV files (from 1 GB to 20 GB each). I am using a computer with 220 GB of RAM and 36 threads.
I am wondering whether it makes sense to use Spark in standalone mode for this analysis. I like the natural concurrency of Spark, plus (with PySpark) I have a nice notebook environment to use.
I want to do joins/aggregation type stuff, then run machine learning on the transformed dataset. Python tools like pandas only want to use 1 thread, which seems like a massive waste, since using all 36 threads must be faster.
To answer your question: yes, if you only have 1 node available, especially one as powerful as the one you describe (as long as it can handle the size of the data), it does make sense.
I would recommend running your application in "local" mode, since you are only using 1 node. When you run ./spark-submit, specify:
--master local[*]
as in:
./spark-submit --master local[*] <your-app-name> <your-apps-args>
This will run the application on the local node using all available cores.
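For the PySpark side, here is a minimal sketch of what such a local-mode application could look like. It assumes a Spark version that has SparkSession (2.0 or later). The file paths, the column names customer_id and amount, and the function names are all hypothetical, chosen only to illustrate a join followed by an aggregation.

```python
def local_master(threads=None):
    """Return the Spark master URL for local mode.

    None means "use every available core" (local[*]).
    """
    return "local[*]" if threads is None else f"local[{threads}]"


def build_session(app_name="csv-analysis", threads=None):
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)
        .getOrCreate()
    )


def total_per_customer(spark, orders_path, customers_path):
    """Join two CSV inputs and aggregate; all names here are hypothetical."""
    from pyspark.sql import functions as F

    orders = spark.read.csv(orders_path, header=True, inferSchema=True)
    customers = spark.read.csv(customers_path, header=True, inferSchema=True)
    return (
        orders.join(customers, on="customer_id", how="inner")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )
```

In a notebook you would call build_session() once and pass the resulting session to total_per_customer. Keep in mind that a master set in code takes precedence over the --master flag given to spark-submit.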
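Remember that in your application you must specify the amount of executor memory you want the application to use; the default is 512m. If you want to take advantage of all of your memory, you can change this either as a parameter to spark-submit or in your application code when making the SparkConf object. Note that in local mode the executor runs inside the driver JVM, so the driver memory setting (--driver-memory on spark-submit) is what actually sizes the heap, and it has to be given at launch rather than in SparkConf.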
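As a rough sketch of the spark-submit route, the command line could be assembled like this in Python. The script name analysis.py, its argument /data/csv, the function name, and the 150g figure are made-up placeholders; --master, --driver-memory and --executor-memory are standard spark-submit options. Because the driver and executor share one JVM in local mode, the sketch sets the driver memory as well.

```python
import shlex


def spark_submit_command(app, app_args=(), memory="150g", master="local[*]"):
    """Build the argument list for a local-mode spark-submit call."""
    return [
        "./spark-submit",
        "--master", master,
        "--driver-memory", memory,    # sizes the JVM heap in local mode
        "--executor-memory", memory,  # the setting the answer refers to
        app,
        *app_args,
    ]


if __name__ == "__main__":
    cmd = spark_submit_command("analysis.py", ["/data/csv"])
    # shlex.join quotes local[*] so the shell does not glob-expand it.
    print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run, which avoids shell quoting issues altogether. Leave some headroom below the machine's 220 GB for the operating system and for Python worker processes.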