In HDFS I have the input files file1, file2, file3, and file4. I have written a mapper in Python that reads the data from these files using:
for lines in sys.stdin
The issue I am facing is that the lines are read from the files in random order. I want the code to read file1 first, then file2, and so on.
Mapper file:

    import sys
    import os

    s = ''
    i = 0
    for line in sys.stdin:
        # Start a new output line on '|', unless one was just started.
        # The "s and" guard avoids indexing into an empty string.
        if line[i % 40] == '|' and s and s[-1] != '\n':
            s = s + '\n'
        elif line[i % 40] != '|':
            s = s + line[i % 40]
        i = i + 1
    print(s)

I am running it via the streaming jar:

    hadoop jar $hadoop_home//share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar -input /home/test/ -output /home/output6/ -mapper /home/eshobsa/hadoop_work/mapper.py

Can anyone please guide me on how to make it read the lines of the files in sequence, i.e. first the lines of file1, then the lines of file2, and so on?
What you want is a sequential scan of the input, which is not how MapReduce works. In MapReduce, files are scanned in parallel, which means a mapper cannot know what state the other mappers are in, and so it cannot wait to read the next file until the others are done reading the previous one. Actually, there is no such thing as a 'previous' or 'next' file in MapReduce.
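If the per-file order still matters downstream, one possible workaround is to have each mapper tag its output with the file it came from, so that a later sort or reducer can group lines by file. The sketch below is only an illustration: it assumes Hadoop Streaming exposes the current input path in the environment variable mapreduce_map_input_file (map_input_file on older releases), which should be verified on your cluster, and the helper names input_file_name and tag_lines are hypothetical. The line counter is relative to the mapper's split, so it only reflects the position in the file when each file fits in a single split.

```python
import os
import sys


def input_file_name(environ=os.environ):
    # Assumption: Hadoop Streaming exports job properties as environment
    # variables with dots replaced by underscores. Verify these names on
    # your cluster before relying on them.
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")
    return os.path.basename(path)


def tag_lines(lines, file_name):
    # Emit "file_name<TAB>line_number<TAB>line" so that sorting the output
    # by the first two fields restores a per-file order.
    # The line number counts from the start of this mapper's split.
    for number, line in enumerate(lines):
        yield "%s\t%09d\t%s" % (file_name, number, line.rstrip("\n"))


def main(stdin=sys.stdin, stdout=sys.stdout):
    name = input_file_name()
    for record in tag_lines(stdin, name):
        stdout.write(record + "\n")

# In an actual mapper.py, finish the script with a call to main().
```

The mappers still run in parallel; the tags only make it possible to reassemble the order after the map phase.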
Are you sure you need MapReduce? Judging by your description, you do not... Otherwise, are you sure you need a sequential scan of the input?
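If a sequential scan really is the requirement and MapReduce is not, a plain script that reads the files one after another may be enough. The following is a minimal sketch with local temporary files standing in for the HDFS inputs; read_in_order is a hypothetical helper, and the file names and contents are made up for the demonstration.

```python
import os
import tempfile


def read_in_order(paths):
    # Open the files one at a time, in sorted name order, and yield their
    # lines. Sorting is lexicographic, so "file10" would sort before "file2".
    for path in sorted(paths):
        with open(path) as handle:
            for line in handle:
                yield line


# Small demonstration: create two files out of order, then read them back.
workdir = tempfile.mkdtemp()
for name, text in [("file2", "c|d\n"), ("file1", "a|b\n")]:
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(text)

paths = [os.path.join(workdir, name) for name in os.listdir(workdir)]
lines = list(read_in_order(paths))
```

For data that lives in HDFS, a similar effect can be had without changing the original script by piping the files to it in the desired order, for example with hdfs dfs -cat /home/test/file1 /home/test/file2 | python mapper.py, since the cat command accepts several paths and prints them in the order given.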