The map function is an example of an elegant programming pattern loved by
language purists, but equally useful to
duct tape programmers. Map is an example of a
"higher-order" function: one that operates and creates yet other functions.
In an age of multi-core machines, the value of a map function resides principally in one feature:
- map functions define operations that are inherently parallel
In practice, most for loops could be re-written as map tasks. And map's parallel nature is why it is a key element in distributed processing patterns and platforms.
While languages have map functions (it's 'map' in Perl, Python, Haskell, and Lisp dialects like Clojure; 'lapply' in R; accessible via 'select' in LinQ), to my disappointment, no built-in map function exists for Bash.
So I took it upon myself to cook up something that implements a bare-bones map function (note, there does exist an amazing set of scripts called
bashreduce, which implement both map and reduce).
My map function is simple: it applies a given function f to a set of files one directory, and returns a transformed set of files to an output directory. It runs in parallel, spawning as many mappers as CPU cores (or as you explicitly set).
Of note is that, should you need to pass user-defined functions in Bash to other functions -- you must mark them for export to sub-shell environments using the "
declare -fx" command (formerly 'typeset' in Korn).
running parallel jobs with map
Here's a basic example that shows off how to use this map bash function.
## make a test 'in' directory consisting of some words
mkdir in
cd in
split -C 500k /usr/share/dict/words
## you should have five or six files, I get
## > ls in
## xaa xab xac xad xae xaf xag xah xai xaj
cd ..
map 'wc -l' in counts
## output results into files with same names, but in dir 'counts'
## > ls counts
## xaa xab xac xad xae xaf xag xah xai xaj
A more sophisticated example includes defining a user-defined function -- one that unzips, sorts, uniqifies (N.B. the -u flag on sort), and rezips the file -- and mapping it over a set of files.
gzsort () { gunzip -c $1 | sort -u | gzip --fast; }
declare -fx gzsort ## export to subshell
## test on the same example above
map 'gzip -c' in gzin
map gzsort gzin gzout
The most recent version of this
map function for Bash will always available at my
GitHub repository for Big Data tools, but I've also posted the code below to save you a click. In an earlier version I had implemented a queue in which files were pulled from (as they finished). But it turns out that the xargs function now supports the "-P" flag, which allow one to specify how many processes to run in parallel. In essence, my script is a crufty wrapper around
xargs, but with a cleaner syntax.
map: not for industrial use
One final comment about where I've found this map script useful: I don't suggest using this for heavy lifting of Big Data -- by which I mean data transforms over terabytes. For that, Hadoop is your friend. But for those in-between tasks, where you'd like to crunch 100GB of log files on an Amazon EC2 c1.xlarge instance -- just for example -- this little tool can be the difference between 8 hours (a day) and 1 hour (lunch).
#!/bin/bash
# set -e
CMD=$(basename $0)
HELP=$(cat <<EOF
Usage: $CMD [FUNCTION] [INDIR] [OUTDIR] [MAPPERS] ...
\n
\n Map a FUNCTION over a set of files in the directory, INDIR, and output
\n results to a directory, OUTDIR, with the same base name. FUNCTION
\n must accept a file name as its first parameter (such as 'grep',
\n 'sort', or 'awk'). A set of parallel processes are launched, equal to
\n MAPPERS. If MAPPERS is not given, it defaults to the number of CPUs
\n detected on the system, or 2 otherwise.
\n
\n Examples: $CMD sort /tmp/files /tmp/sorted 4
\n
\n # using map with a user-defined function
\n gzsort () { gunzip -c $1 | sort | gzip --fast; }
\n declare -fx gzsort ## export to subshell
\n $CMD gzsort ingz outgz
EOF
)
if [ $# -eq 4 ]; then
nmap=$4
elif [ $# -eq 3 ]; then ## guess no. CPUs, default to 2
nmap=`grep '^processor' /proc/cpuinfo | wc -l`
if [ $? -eq 1 ]; then
nmap=2
fi
elif [ $# -lt 3 ]; then ## too few args
echo -e $HELP
exit 1
fi
func=$1
in=$2
out=$3
export func in out nmap
## make output directory
if [ -d $out ]; then
echo "output dir $out exists"
exit 1
else
mkdir $out
fi
ls $in | xargs -P $nmap -I{} sh -c '$func "$in"/"$1" > "$out"/"$1"' -- {}
## cleanup in event of any failure
if [ $? -eq 1 ]; then
rm -fr $out
fi