Thursday, December 10, 2009

Hacking Up a Map Function for Bash

The map function is an example of an elegant programming pattern loved by language purists, but equally useful to duct tape programmers. Map is an example of a "higher-order" function:  one that operates and creates yet other functions.

In an age of multi-core machines, the value of a map function resides principally in one feature:
  • map functions define operations that are inherently parallel
In practice, most for loops could be re-written as map tasks.  And map's parallel nature is why it is a key element in distributed processing patterns and platforms.

While languages have map functions (it's 'map' in Perl, Python, Haskell, and Lisp dialects like Clojure; 'lapply' in R; accessible via 'select' in LinQ), to my disappointment, no built-in map function exists for Bash.

So I took it upon myself to cook up something that implements a bare-bones map function (note, there does exist an amazing set of scripts called bashreduce, which implement both map and reduce).

My map function is simple: it applies a given function f to a set of files one directory, and returns a transformed set of files to an output directory.  It runs in parallel, spawning as many mappers as CPU cores (or as you explicitly set).

Of note is that, should you need to pass user-defined functions in Bash to other functions -- you must mark them for export to sub-shell environments using the "declare -fx" command (formerly 'typeset' in Korn).


running parallel jobs with map
 Here's a basic example that shows off how to use this map bash function.
## make a test 'in' directory consisting of some words
mkdir in
cd in
split -C 500k /usr/share/dict/words
## you should have five or six files, I get
## > ls in
## xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj
cd ..
map 'wc -l' in counts
## output results into files with same names, but in dir 'counts' 
## > ls counts
## xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj
A more sophisticated example includes defining a user-defined function -- one that unzips, sorts, uniqifies (N.B. the -u flag on sort), and rezips the file -- and mapping it over a set of files.
gzsort () { gunzip -c $1 | sort -u | gzip --fast; }
declare -fx gzsort  ##  export to subshell
## test on the same example above
map 'gzip -c' in gzin
map gzsort gzin gzout
The most recent version of this map function for Bash will always available at my GitHub repository for Big Data tools, but I've also posted the code below to save you a click. In an earlier version I had implemented a queue in which files were pulled from (as they finished). But it turns out that the xargs function now supports the "-P" flag, which allow one to specify how many processes to run in parallel. In essence, my script is a crufty wrapper around xargs, but with a cleaner syntax.

map:  not for industrial use
One final comment about where I've found this map script useful:  I don't suggest using this for heavy lifting of Big Data -- by which I mean data transforms over terabytes.  For that, Hadoop is your friend.  But for those in-between tasks, where you'd like to crunch 100GB of log files on an Amazon EC2 c1.xlarge instance -- just  for example -- this little tool can be the difference between 8 hours (a day) and 1 hour (lunch).

#!/bin/bash
# set -e
CMD=$(basename $0)
HELP=$(cat <<EOF 
Usage: $CMD [FUNCTION] [INDIR] [OUTDIR] [MAPPERS] ...
\n
\n Map a FUNCTION over a set of files in the directory, INDIR, and output
\n results to a directory, OUTDIR, with the same base name.  FUNCTION
\n must accept a file name as its first parameter (such as 'grep',  
\n 'sort', or 'awk').  A set of parallel processes are launched, equal to
\n MAPPERS.  If MAPPERS is not given, it defaults to the number of CPUs 
\n detected on the system, or 2 otherwise. 
\n
\n Examples:  $CMD sort /tmp/files /tmp/sorted 4
\n
\n           # using map with a user-defined function
\n           gzsort () { gunzip -c $1 | sort | gzip --fast; } 
\n           declare -fx gzsort  ##  export to subshell
\n           $CMD gzsort ingz outgz                    
EOF
)

if [ $# -eq 4 ]; then
    nmap=$4       
elif [ $# -eq 3 ]; then   ## guess no. CPUs, default to 2
    nmap=`grep '^processor' /proc/cpuinfo | wc -l`
    if [ $? -eq 1 ];  then
 nmap=2
    fi
elif [ $# -lt 3 ]; then  ## too few args
    echo -e $HELP
    exit 1
fi
 
func=$1
in=$2
out=$3
export func in out nmap
 
## make output directory
if [ -d $out ]; then 
    echo "output dir $out exists"
    exit 1
else 
    mkdir $out
fi
 
ls $in |  xargs -P $nmap -I{} sh -c '$func "$in"/"$1" > "$out"/"$1"' -- {}
 
## cleanup in event of any failure
if [ $? -eq 1 ]; then
    rm -fr $out
fi

No comments: