Friday, 15 April 2011

Hadoop streaming with multiple Python files




I have a Hadoop streaming job. The job uses a Python script that imports another Python script. The command works fine from the command line but fails when run through Hadoop streaming. Here is an example of the Hadoop streaming command:

hadoop jar $streamingjar \
    -D mapreduce.map.memory.mb=4096 \
    -files preprocess.py,parse.py \
    -input $input \
    -output $output \
    -mapper "python parse.py" \
    -reducer NONE

And here is the first line in parse.py:

from preprocess import normalize_large_text, normalize_small_text

When I run the command through Hadoop streaming, I see the following output in the logs:

Traceback (most recent call last):
  File "preprocess.py", line 1, in <module>
    from preprocess import normalize_large_text, normalize_small_text, normalize_skill_cluster
ImportError: No module named preprocess

My understanding is that Hadoop puts all the files in the same directory. If that is true, I don't see how this can fail. Does anyone know what's going on?

Thanks.
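
One way to check the assumption that the shipped files share a directory is to have the mapper report its own environment. The sketch below is not part of the original post; the function name describe_environment is hypothetical. It writes to stderr because streaming treats stdout as mapper output, so the report should show up in the task logs next to the traceback.

```python
import os
import sys


def describe_environment():
    """Return a short report of where the script runs and where it imports from."""
    cwd = os.getcwd()
    lines = [
        "cwd: " + cwd,
        "files in cwd: " + ", ".join(sorted(os.listdir(cwd))),
        "script dir: " + os.path.dirname(os.path.abspath(sys.argv[0])),
        "sys.path: " + os.pathsep.join(sys.path),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Diagnostics go to stderr so they do not pollute the mapper's output.
    sys.stderr.write(describe_environment() + "\n")
```

If preprocess.py is listed under the working directory but that directory is missing from sys.path, the import fails even though the file is present.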

You need to put the scripts in the same directory and add that directory using the -files flag.

hadoop jar $streamingjar \
    -D mapreduce.map.memory.mb=4096 \
    -files python_files \
    -input $input \
    -output $output \
    -mapper "python_files\python parse.py" \
    -reducer NONE
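
A possible explanation, offered as an assumption rather than something the post confirms: Python resolves symbolic links when it works out the script's directory for sys.path, so if the files passed with -files are exposed in the task's working directory as links to separate cache locations, parse.py would not find preprocess.py beside it. Shipping a whole directory keeps the two files together, which would be consistent with the answer above. Under the same assumption, a minimal alternative sketch is to put the working directory on sys.path at the top of parse.py before the import. The helper name ensure_cwd_on_path is hypothetical.

```python
import os
import sys


def ensure_cwd_on_path():
    """Put the current working directory at the front of sys.path if absent."""
    cwd = os.getcwd()
    if cwd not in sys.path:
        sys.path.insert(0, cwd)
    return cwd


# Assumption: the files passed with -files are reachable from the task's
# working directory, as the question expects.
ensure_cwd_on_path()

# The import from the question would then follow:
# from preprocess import normalize_large_text, normalize_small_text
```

The helper leaves sys.path unchanged if the working directory is already present, so it should do no harm when the script is run locally from the command line.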

python hadoop hadoop-streaming
