google cloud dataflow - INTERNAL: IO error: /var/shuffle/sorted-dataset-4/1011: No space left on device when talking to tcp://localhost:12345 -
i'm trying process 2.5tb of data bigquery. code beginning of pipeline:
pipeline p = pipeline.create(options); p.apply(bigqueryio.read.fromquery( "select * table_query(events, 'table_id contains \"20150601\"') id not null")) .apply(pardo.of(new dofn<tablerow, kv<string, tablerow>>() { @override public void processelement(processcontext c) throws exception { c.output(kv.of((string) c.element().get("id"), c.element())); } })).apply(groupbykey.<string, tablerow>create()) for dataflowpipelineoptions set staging location folder on gcs , project.
job started on gcp run while. final job status failed due internal io errors.
jul 16, 2015, 8:45:47 pm(297a156f6f2a50b2): java.lang.runtimeexception:java.io.ioexception: internal: io error: /var/shuffle/sorted-dataset-4/1011: no space left on device when talking tcp://localhost:12345 @ com.google.cloud.dataflow.sdk.repackaged.com.google.common.base.throwables.propagate(throwables.java:160) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:154) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:117) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofncontext.outputwindowedvalue(dofnrunner.java:314) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofnprocesscontext.output(dofnrunner.java:475) @ com.google.cloud.dataflow.sdk.util.reifytimestampandwindowsdofn.processelement(reifytimestampandwindowsdofn.java:40) caused by: java.io.ioexception: internal: io error: /var/shuffle/sorted-dataset-4/1011: no space left on device when talking tcp://localhost:12345 @ com.google.cloud.dataflow.sdk.runners.worker.applianceshufflewriter.write(native method) @ com.google.cloud.dataflow.sdk.runners.worker.chunkingshuffleentrywriter.writechunk(chunkingshuffleentrywriter.java:72) @ com.google.cloud.dataflow.sdk.runners.worker.chunkingshuffleentrywriter.put(chunkingshuffleentrywriter.java:56) @ com.google.cloud.dataflow.sdk.runners.worker.shufflesink$shufflesinkwriter.add(shufflesink.java:258) @ com.google.cloud.dataflow.sdk.runners.worker.shufflesink$shufflesinkwriter.add(shufflesink.java:169) @ com.google.cloud.dataflow.sdk.util.common.worker.writeoperation.process(writeoperation.java:90) @ com.google.cloud.dataflow.sdk.util.common.worker.outputreceiver.process(outputreceiver.java:147) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:152) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:117) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofncontext.outputwindowedvalue(dofnrunner.java:314) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofnprocesscontext.output(dofnrunner.java:475) @ com.google.cloud.dataflow.sdk.util.reifytimestampandwindowsdofn.processelement(reifytimestampandwindowsdofn.java:40) @ com.google.cloud.dataflow.sdk.util.dofnrunner.invokeprocesselement(dofnrunner.java:167) @ com.google.cloud.dataflow.sdk.util.dofnrunner.processelement(dofnrunner.java:152) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase.processelement(pardofnbase.java:188) @ com.google.cloud.dataflow.sdk.util.common.worker.pardooperation.process(pardooperation.java:52) @ com.google.cloud.dataflow.sdk.util.common.worker.outputreceiver.process(outputreceiver.java:147) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:152) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase$1.output(pardofnbase.java:117) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofncontext.outputwindowedvalue(dofnrunner.java:314) @ com.google.cloud.dataflow.sdk.util.dofnrunner$dofnprocesscontext.output(dofnrunner.java:475) @ com.outfit7.dataflow.ante.example$5.processelement(example.java:41) @ com.google.cloud.dataflow.sdk.util.dofnrunner.invokeprocesselement(dofnrunner.java:167) @ com.google.cloud.dataflow.sdk.util.dofnrunner.processelement(dofnrunner.java:152) @ com.google.cloud.dataflow.sdk.runners.worker.pardofnbase.processelement(pardofnbase.java:188) @ com.google.cloud.dataflow.sdk.util.common.worker.pardooperation.process(pardooperation.java:52) @ com.google.cloud.dataflow.sdk.util.common.worker.outputreceiver.process(outputreceiver.java:147) @ com.google.cloud.dataflow.sdk.util.common.worker.readoperation.runreadloop(readoperation.java:171) @ com.google.cloud.dataflow.sdk.util.common.worker.readoperation.start(readoperation.java:117) @ com.google.cloud.dataflow.sdk.util.common.worker.maptaskexecutor.execute(maptaskexecutor.java:66) @ com.google.cloud.dataflow.sdk.runners.worker.dataflowworker.executework(dataflowworker.java:220) @ com.google.cloud.dataflow.sdk.runners.worker.dataflowworker.dowork(dataflowworker.java:167) @ com.google.cloud.dataflow.sdk.runners.worker.dataflowworker.getandperformwork(dataflowworker.java:134) @ com.google.cloud.dataflow.sdk.runners.worker.dataflowworkerharness$workerthread.call(dataflowworkerharness.java:146) @ com.google.cloud.dataflow.sdk.runners.worker.dataflowworkerharness$workerthread.call(dataflowworkerharness.java:131) @ java.util.concurrent.futuretask.run(futuretask.java:266) @ java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1142) @ java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:617) @ java.lang.thread.run(thread.java:745) is there way sure job finish successfully? how should set num of workers , disksizegb or make estimation on size used on 1 worker? groupbykey executed on 1 worker or shared/sharded? understand groupbykey "waits" process data before pass pcollection next element in pipeline.
from https://cloud.google.com/dataflow/faq#question-45:
this error indicates local disk has insufficient space process job. if running job default settings, job running on 3 workers, each 250 gb of local disk space, , no auto-scaling. consider modifying default settings increase number of workers available job, increase default disk size per worker, or enable auto-scaling.
Comments
Post a Comment