In an age of multi-core machines, the value of a map function resides principally in one feature:
- map functions define operations that are inherently parallel
While languages have map functions (it's 'map' in Perl, Python, Haskell, and Lisp dialects like Clojure; 'lapply' in R; accessible via 'select' in LinQ), to my disappointment, no built-in map function exists for Bash.
So I took it upon myself to cook up something that implements a bare-bones map function (note, there does exist an amazing set of scripts called bashreduce, which implement both map and reduce).
My map function is simple: it applies a given function f to a set of files one directory, and returns a transformed set of files to an output directory. It runs in parallel, spawning as many mappers as CPU cores (or as you explicitly set).
Of note is that, should you need to pass user-defined functions in Bash to other functions -- you must mark them for export to sub-shell environments using the "declare -fx" command (formerly 'typeset' in Korn).
running parallel jobs with map
Here's a basic example that shows off how to use this map bash function.
## make a test 'in' directory consisting of some words mkdir in cd in split -C 500k /usr/share/dict/words ## you should have five or six files, I get ## > ls in ## xaa xab xac xad xae xaf xag xah xai xaj cd .. map 'wc -l' in counts ## output results into files with same names, but in dir 'counts' ## > ls counts ## xaa xab xac xad xae xaf xag xah xai xajA more sophisticated example includes defining a user-defined function -- one that unzips, sorts, uniqifies (N.B. the -u flag on sort), and rezips the file -- and mapping it over a set of files.
gzsort () { gunzip -c $1 | sort -u | gzip --fast; } declare -fx gzsort ## export to subshell ## test on the same example above map 'gzip -c' in gzin map gzsort gzin gzoutThe most recent version of this map function for Bash will always available at my GitHub repository for Big Data tools, but I've also posted the code below to save you a click. In an earlier version I had implemented a queue in which files were pulled from (as they finished). But it turns out that the xargs function now supports the "-P" flag, which allow one to specify how many processes to run in parallel. In essence, my script is a crufty wrapper around xargs, but with a cleaner syntax.
map: not for industrial use
One final comment about where I've found this map script useful: I don't suggest using this for heavy lifting of Big Data -- by which I mean data transforms over terabytes. For that, Hadoop is your friend. But for those in-between tasks, where you'd like to crunch 100GB of log files on an Amazon EC2 c1.xlarge instance -- just for example -- this little tool can be the difference between 8 hours (a day) and 1 hour (lunch).
#!/bin/bash # set -e CMD=$(basename $0) HELP=$(cat <<EOF Usage: $CMD [FUNCTION] [INDIR] [OUTDIR] [MAPPERS] ... \n \n Map a FUNCTION over a set of files in the directory, INDIR, and output \n results to a directory, OUTDIR, with the same base name. FUNCTION \n must accept a file name as its first parameter (such as 'grep', \n 'sort', or 'awk'). A set of parallel processes are launched, equal to \n MAPPERS. If MAPPERS is not given, it defaults to the number of CPUs \n detected on the system, or 2 otherwise. \n \n Examples: $CMD sort /tmp/files /tmp/sorted 4 \n \n # using map with a user-defined function \n gzsort () { gunzip -c $1 | sort | gzip --fast; } \n declare -fx gzsort ## export to subshell \n $CMD gzsort ingz outgz EOF ) if [ $# -eq 4 ]; then nmap=$4 elif [ $# -eq 3 ]; then ## guess no. CPUs, default to 2 nmap=`grep '^processor' /proc/cpuinfo | wc -l` if [ $? -eq 1 ]; then nmap=2 fi elif [ $# -lt 3 ]; then ## too few args echo -e $HELP exit 1 fi func=$1 in=$2 out=$3 export func in out nmap ## make output directory if [ -d $out ]; then echo "output dir $out exists" exit 1 else mkdir $out fi ls $in | xargs -P $nmap -I{} sh -c '$func "$in"/"$1" > "$out"/"$1"' -- {} ## cleanup in event of any failure if [ $? -eq 1 ]; then rm -fr $out fi
No comments:
Post a Comment