python - Does using Spark in stand-alone mode on 1 large computer make sense?


I am working with ~120 GB of CSV files (from 1 GB to 20 GB each). I am using a computer with 220 GB of RAM and 36 threads.

I am wondering whether it makes sense to use Spark in stand-alone mode for this analysis. I like the natural concurrency of Spark, plus (with PySpark) I have a nice notebook environment to use.

I want to do joins/aggregation-type stuff and then run machine learning on the transformed dataset. Python tools like pandas want to use only 1 thread, which seems like a massive waste, since using 36 threads must be much faster.

To answer your question: yes, if you only have 1 node available, especially one as powerful as you describe (as long as it can handle the size of the data), it does make sense.

I would recommend running your application in "local" mode, since you are only using 1 node. When you run ./spark-submit, specify:

--master local[*] 

as in:

./spark-submit --master local[*] <your-app-name> <your-apps-args> 

This will run your application on the local node using all available cores.
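If you launch the job from a Python script or notebook, the same command can be assembled programmatically. This is only a rough sketch: the helper name `spark_submit_command`, the application file `my_app.py`, and its arguments are made up for illustration. It also shows `local[N]`, which pins the number of worker threads instead of using every core.

```python
import os
import shlex


def spark_submit_command(app_path, app_args=(), cores=None,
                         spark_submit="./spark-submit"):
    """Build a spark-submit argument list for local mode.

    cores=None means local[*] (one worker thread per logical core);
    an integer means local[N]. The helper itself is hypothetical.
    """
    master = "local[*]" if cores is None else f"local[{cores}]"
    return [spark_submit, "--master", master, app_path, *app_args]


# Hypothetical application file and arguments, for illustration only.
cmd = spark_submit_command("my_app.py", ["--input", "data/"])

# shlex.join quotes local[*] so a shell does not treat it as a glob pattern.
print(shlex.join(cmd))
print("logical cores visible to Python:", os.cpu_count())
```

Passing the list to `subprocess.run(cmd)` would start the job without going through a shell, so no quoting is needed in that case.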

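Remember that in your application you must specify the amount of executor memory you want your application to use; the default is small (512m in older releases, 1g in newer ones). If you want to take advantage of all of your memory, you can change this either as a parameter to spark-submit or in your application code when making the SparkConf object.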
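A minimal sketch of the in-code route, assuming PySpark is installed. The `200g` value, the app name, and the helper names are placeholders, not recommendations. One caveat worth checking against your Spark version: as far as I know, in local mode the tasks run inside the driver JVM, so `--driver-memory` on spark-submit may be the setting that actually takes effect.

```python
def memory_settings(executor_memory="200g", master="local[*]"):
    """Return Spark settings as plain key/value pairs.

    The 200g default is an illustrative value; leave headroom
    for the operating system and other processes.
    """
    return {
        "spark.master": master,
        "spark.executor.memory": executor_memory,
    }


def make_spark_conf(settings, app_name="single-node-analysis"):
    """Build a SparkConf from the settings (requires pyspark)."""
    from pyspark import SparkConf  # imported lazily so the rest runs without Spark

    conf = SparkConf().setAppName(app_name)
    for key, value in settings.items():
        conf.set(key, value)
    return conf


settings = memory_settings()
print(settings)
```

The command-line equivalent is to pass `--executor-memory 200g` (or `--driver-memory 200g`) to spark-submit.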

