Wednesday, December 23, 2009

Three easy functions for manipulating data frames in R

I do a lot of basic manipulation of data, so here are two built-in functions (transform, subset) and one library (sqldf) that I use daily (this post originally appeared as an answer I gave on StackOverflow).

create sample sales data

sales <- expand.grid(country = c('USA', 'UK', 'FR'),
                     product = c(1, 2, 3))
sales$revenue <- rnorm(dim(sales)[1], mean=100, sd=10)

> sales
  country product   revenue
1     USA       1 108.45965
2      UK       1  97.07981
3      FR       1  99.66225
4     USA       2 100.34754
5      UK       2  87.12262
6      FR       2 112.86084
7     USA       3  95.87880
8      UK       3  96.43581
9      FR       3  94.59259

use transform() to add a column

## transform currency to euros
usd2eur <- 1.434
transform(sales, euro = revenue * usd2eur)

>

  country product   revenue     euro
1     USA       1 108.45965 155.5311
2      UK       1  97.07981 139.2125
3      FR       1  99.66225 142.9157
...

use subset() to slice the data

subset(sales,
       country == 'USA' & product %in% c(1, 2),
       select = c('product', 'revenue'))

>
  product  revenue
1       1 108.4597
4       2 100.3475

use sqldf() to slice and aggregate with SQL

The sqldf package provides an SQL interface to R data frames.

##  recast the previous subset() expression in SQL
sqldf('SELECT product, revenue FROM sales \
       WHERE country = "USA" \
       AND product IN (1,2)')

>
  product  revenue
1       1 108.4597
2       2 100.3475

Perform an aggregation or GROUP BY

sqldf('select country, sum(revenue) revenue \
       FROM sales \
       GROUP BY country')

>
  country  revenue
1      FR 307.1157
2      UK 280.6382
3     USA 304.6860

For more sophisticated map-reduce-like functionality on data frames, check out the plyr package. And if you find yourself wanting to pull your hair out, I recommend checking out Data Manipulation with R.

Thursday, December 10, 2009

Hacking Up a Map Function for Bash

The map function is an example of an elegant programming pattern loved by language purists, but equally useful to duct tape programmers. Map is an example of a "higher-order" function: one that takes other functions as arguments, or returns them.

In an age of multi-core machines, the value of a map function resides principally in one feature:
  • map functions define operations that are inherently parallel
In practice, most for loops could be re-written as map tasks.  And map's parallel nature is why it is a key element in distributed processing patterns and platforms.
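To make that concrete, here is a minimal sketch in Python (chosen for brevity; any of the languages mentioned below would do). The tokenize() helper and the sample list are invented purely for illustration: once a loop body has no dependencies between iterations, it can be handed to a serial map or a parallel one without changing the logic.

import multiprocessing

def tokenize(line):                      ## toy per-item function (illustrative)
    return len(line.split())

lines = ["the quick brown fox", "jumps over", "the lazy dog"]

## a plain for loop...
counts = []
for line in lines:
    counts.append(tokenize(line))

## ...the same thing as a map...
counts = list(map(tokenize, lines))

## ...and, because each call is independent, a parallel map
if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        counts = pool.map(tokenize, lines)
    print(counts)                        ## [4, 2, 3]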

While most languages have a map function (it's 'map' in Perl, Python, Haskell, and Lisp dialects like Clojure; 'lapply' in R; and accessible via 'Select' in LINQ), to my disappointment, no built-in map function exists for Bash.

So I took it upon myself to cook up something that implements a bare-bones map function (note, there does exist an amazing set of scripts called bashreduce, which implements both map and reduce).

My map function is simple: it applies a given function f to a set of files in one directory, and returns a transformed set of files to an output directory.  It runs in parallel, spawning as many mappers as there are CPU cores (or as many as you explicitly set).

Of note: should you need to pass user-defined Bash functions to other functions, you must mark them for export to subshell environments using the "declare -fx" command (the equivalent of 'typeset' in the Korn shell).


running parallel jobs with map
Here's a basic example that shows off how to use this map function for Bash.
## make a test 'in' directory consisting of some words
mkdir in
cd in
split -C 500k /usr/share/dict/words
## you should have several files; I get
## > ls in
## xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj
cd ..
map 'wc -l' in counts
## output results into files with same names, but in dir 'counts' 
## > ls counts
## xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj
A more sophisticated example includes defining a user-defined function -- one that unzips, sorts, uniqifies (N.B. the -u flag on sort), and rezips the file -- and mapping it over a set of files.
gzsort () { gunzip -c $1 | sort -u | gzip --fast; }
declare -fx gzsort  ##  export to subshell
## test on the same example above
map 'gzip -c' in gzin
map gzsort gzin gzout
The most recent version of this map function for Bash will always be available at my GitHub repository for Big Data tools, but I've also posted the code below to save you a click. In an earlier version I had implemented a queue from which files were pulled (as mappers finished). But it turns out that xargs now supports the "-P" flag, which allows one to specify how many processes to run in parallel. In essence, my script is a crufty wrapper around xargs, but with a cleaner syntax.

map:  not for industrial use
One final comment about where I've found this map script useful:  I don't suggest using this for heavy lifting of Big Data -- by which I mean data transforms over terabytes.  For that, Hadoop is your friend.  But for those in-between tasks, where you'd like to crunch 100GB of log files on an Amazon EC2 c1.xlarge instance -- just for example -- this little tool can be the difference between 8 hours (a day) and 1 hour (lunch).

#!/bin/bash
# set -e
CMD=$(basename $0)
HELP=$(cat <<EOF 
Usage: $CMD [FUNCTION] [INDIR] [OUTDIR] [MAPPERS] ...
\n
\n Map a FUNCTION over a set of files in the directory, INDIR, and output
\n results to a directory, OUTDIR, with the same base name.  FUNCTION
\n must accept a file name as its first parameter (such as 'grep',  
\n 'sort', or 'awk').  A set of parallel processes are launched, equal to
\n MAPPERS.  If MAPPERS is not given, it defaults to the number of CPUs 
\n detected on the system, or 2 otherwise. 
\n
\n Examples:  $CMD sort /tmp/files /tmp/sorted 4
\n
\n           # using map with a user-defined function
\n           gzsort () { gunzip -c \$1 | sort | gzip --fast; } 
\n           declare -fx gzsort  ##  export to subshell
\n           $CMD gzsort ingz outgz                    
EOF
)

if [ $# -eq 4 ]; then
    nmap=$4       
elif [ $# -eq 3 ]; then   ## guess no. CPUs, default to 2
    nmap=`grep '^processor' /proc/cpuinfo | wc -l`
    if [ "$nmap" -lt 1 ]; then   ## CPU detection failed
        nmap=2
    fi
elif [ $# -lt 3 ]; then  ## too few args
    echo -e $HELP
    exit 1
fi
 
func=$1
in=$2
out=$3
export func in out nmap
 
## make output directory
if [ -d $out ]; then 
    echo "output dir $out exists"
    exit 1
else 
    mkdir $out
fi
 
## use bash (not sh) so that functions exported with 'declare -fx' are visible
ls $in | xargs -P $nmap -I{} bash -c '$func "$in"/"$1" > "$out"/"$1"' -- {}
 
## cleanup in event of any failure
if [ $? -ne 0 ]; then
    rm -fr $out
fi

Sunday, September 6, 2009

Reservoir Sampling Algorithm in Python and Perl

Algorithms that perform calculations on evolving data streams, but in fixed memory, have increasing relevance in the Age of Big Data.

The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N.

These two features -- (i) a constant memory footprint and (ii) a capacity to operate on files of indeterminate size -- make it ideal for working with very large data sets common to event processing.

While it has likely been discovered and implemented many times over, like many algorithms, it was codified in Knuth's The Art of Computer Programming.

The trick of this algorithm is to first fill up the sample buffer, and afterwards, to probabilistically replace its entries with additional lines of input.

Python version

#!/usr/bin/python
import sys
import random

if len(sys.argv) == 3:
    input = open(sys.argv[2], 'r')
elif len(sys.argv) == 2:
    input = sys.stdin
else:
    sys.exit("Usage: python samplen.py <lines> <?file>")

N = int(sys.argv[1])
sample = []

for i, line in enumerate(input):
    if i < N:
        sample.append(line)
    elif i >= N and random.random() < N / float(i + 1):
        replace = random.randint(0, len(sample) - 1)
        sample[replace] = line

for line in sample:
    sys.stdout.write(line)


Perl version

#!/usr/bin/perl -sw

$IN = 'STDIN' if (@ARGV == 1);
open($IN, '<'.$ARGV[1]) if (@ARGV == 2);
die "Usage: perl samplen.pl <lines> <?file>\n" if (!defined($IN));

$N = $ARGV[0];
@sample = ();

while (<$IN>) {
    if ($. <= $N) {
        $sample[$.-1] = $_;
    } elsif (($. > $N) && (rand() < $N/$.)) {
        $replace = int(rand(@sample));
        $sample[$replace] = $_;
    }
}

print foreach (@sample);
close($IN);


For example, imagine we are to sample 5 lines randomly from a 6-line file. Call i the line number of the input, and N the size of the sample desired. For the first 5 lines (where i <= N), our sample fills entirely. (For the non-Perl hackers: the current line number i is held by the variable $., just as the special variable $_ holds the current line value.)

It's at successive lines of input that the probabilistic sampling starts: the 6th line has a 5/6 (N/i) chance of being sampled, and if chosen, it replaces one of the 5 previously chosen lines, each with a 1/5 chance. So each earlier line is displaced with probability (5/6 * 1/5) = 1/6, leaving it a 5/6 chance of remaining in the sample. Thus all 6 lines have an equal 5/6 chance of being sampled.

In general, as more lines are seen, the chance that any additional line is chosen for the sample falls; but the chance that any previously chosen line could be replaced grows. These two balance such that the probability for any given line of input to be sampled is identical.
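If you'd rather see this numerically than algebraically, the short Python check below (not part of the original scripts; the reservoir_sample() helper simply re-implements the same replacement rule) draws a 5-line sample from a 6-line "file" many times and tallies how often each line appears; every count should hover around 5/6 of the trials:

import random
from collections import Counter

def reservoir_sample(lines, N):
    ## single pass, memory proportional to N -- same rule as the scripts above
    sample = []
    for i, line in enumerate(lines):
        if i < N:
            sample.append(line)
        elif random.random() < N / float(i + 1):
            sample[random.randint(0, N - 1)] = line
    return sample

trials = 100000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(6), 5))

for line in sorted(counts):
    print(line, round(counts[line] / float(trials), 3))   ## each close to 0.833 (5/6)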

A more sophisticated variation of this algorithm is one that takes item weights into consideration when sampling; a sketch of one such approach follows.
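For the curious, here is a minimal sketch (my own illustration, not from the original post) of one well-known weighted variant, due to Efraimidis and Spirakis: assign each item the key u ** (1/w), where u is uniform on (0,1) and w is the item's weight, and keep the N items with the largest keys. Heavier items tend to draw larger keys, so they are more likely to survive in the reservoir:

import heapq
import random

def weighted_reservoir(pairs, N):
    ## one pass over a stream of (item, weight) pairs, memory proportional to N
    heap = []          ## min-heap of (key, item); smallest key is evicted first
    for item, weight in pairs:
        key = random.random() ** (1.0 / weight)
        if len(heap) < N:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for key, item in heap]

## toy usage: 'c' carries twice the weight of the others
stream = [('a', 1.0), ('b', 1.0), ('c', 2.0), ('d', 1.0)]
print(weighted_reservoir(stream, 2))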

Thursday, March 12, 2009

How to build a dynamic DNS server with Slicehost

Or, How to give your local machine a permanent home on the interweb.

For years I had a method of accessing Linux boxes or Windows machines that lacked permanent IPs.   I would:  (1)  find my IP at some 'whats-my-ip.com' site, (2)  e-mail it to myself, and (3) pray that it wouldn't change while I was away.

Unfortunately, particularly for DSL connections, this faith-based method frequently failed.

In an ideal world, we wouldn't need local storage or local CPU power.  But the cold, hard truth of network latency -- which ever-larger data sets don't help -- means that most of us need local machines.  And life is a lot easier when these local machines have -- or at least appear to have -- permanent addresses on the interweb, like "bigdata.dataspora.com" -- even if their actual IPs are changing.

What follows is the solution that I've implemented for my setup.  Basically it's a cron script that runs on our office server ('bigdata') and checks its WAN IP address hourly.  If the IP address has changed, it updates the DNS record for 'bigdata.dataspora.com' on our top-level server (hosted at Slicehost).  The script lives in the /etc/cron.hourly directory of my local server.

Slicehost has a nice API that makes this process relatively painless, and this script was modified from their documentation (link can be found in the comments).  Three important prerequisites for this code to work:  

  • you need to Enable API access via  https://manage.slicehost.com/api/   
  • you must have a 'bigdata.dataspora.com' (insert your domain) Type A record at slicehost pointing to some IP address, and... 
  • you must have Jared Kuolt's pyactiveresource Python library installed.

So, without further ado, here's the little script that makes your local machine a self-updating DNS dynamo!



#!/usr/bin/python
## dynamic-dns.py
## author: Michael Driscoll

## 12 mar 2009
##
## A script that dynamically updates DNS records on Slicehost
##
## For more detailed information, see Slicehost's excellent API docs
## http://articles.slicehost.com/2008/5/13/slicemanager-api-documentation
##
## This code requires Jared Kuolt's pyactiveresource library, via
## http://code.google.com/p/pyactiveresource/
##


from pyactiveresource.activeresource import ActiveResource
import urllib
import os

## Let's define some constants for securely connecting to the Slicehost API
## API access key and resulting URL string
api_password = 'your-API-key-goes-here'
api_site = 'https://%s@api.slicehost.com/' % api_password

## the type and name of the single record I'm updating
record_type = 'A'
record_name = 'bigdata'

## The URL for any server that can return your IP in plaintext.
## I hand-rolled mine with a one-line file called 'myip.php'
## (e.g. something like: <?php echo $_SERVER['REMOTE_ADDR']; ?>)

ip_site = 'http://labs.dataspora.com/tools/myip.php'

## where to store the last ip detected
ip_path = '/etc/lastip'

## if last ip exists, read it in
if (os.path.exists(ip_path)):
   f = open(ip_path, 'r')
   lastip = f.read()
   f.close()
else:
   lastip = ''

## Let's get our current ip (strip any trailing newline)
ip = urllib.urlopen(ip_site).read().strip()

## If they differ, update our last ip file and Slicehost
if (ip != lastip):
   ## update our file
   f = open(ip_path, 'w')
   f.write(ip)
   f.close()

   ## update Slicehost
   class Record(ActiveResource):
      _site = api_site

   results = Record.find(record_type=record_type, name=record_name)
   record = results[0]
   record.data = ip
   record.save()

Thursday, February 26, 2009

Four Simple Steps to a Secure Samba Server

In this post I describe how you can get Samba to securely serve files from your Linux box in four easy steps, with a 'smb.conf' file just 14 lines long.

I've always found Samba to be unnecessarily complex, and until now, my minimal-effort hack was to set up a world-writeable '/share' folder.  But necessity calls (my hard drive clicked the click of death last week), and I decided to find a basic Samba setup that does the following:  (i) makes my Linux box the unique home for all my files, (ii) allows access to that box from any other OS client, and (iii) manages security and file ownership properly.  For the approach I take below, you don't need a GUI control panel, just your text editor of choice.

This procedure was modified from the Samba HOWTO and Reference Guide, specifically the "Secure Read-Write File and Print Server" section. This is the clearest documentation I have found anywhere on the subject.

So here are my setup and goals.  My setup is a Linux server named 'kube' running Ubuntu 8.04 with the Samba 3.0 package already installed.  My goal is to read and write to my home directory on kube, /home/mdriscol, from my Windows machine (XP or Vista) -- or any other client (Mac OS X) that lives inside my LAN.

Command-line code should be executed by a root-privileged user (via sudo or directly).

1. Create your smb.conf file (typically found at /etc/samba/smb.conf)

Unlike the needlessly complex smb.conf examples you'll find on the web, mine is just a handful of lines.  It's split into three sections: the most relevant is the 'homes' section, which contains directives about how the server's home directories are shared (I've commented the file below):

[global]               ##  global settings
workgroup = WORKGROUP  ##  sometimes MSHOME
netbios name = KUBE    ##  name that server is broadcast as

[homes]                ##  how /home directories are shared
comment = Home Directories
valid users = %S       ##  %S expands to the share name -- for [homes], the user's own username
read only = No
browseable = No

[public]               ## public dir w/ global read/write
comment = Data
path = /export         ## make sure this exists
force user = mdriscol  ## writes will be assigned this user
force group = mdriscol ## and this group
read only = No

The last section is optional.  It's for a public folder that any user can write to, but files will be stamped with a default user and group (in this case, me).

2.  Create Samba users.  Because Samba keeps its own list of users and passwords, separate from the server's, you must assign Samba passwords to the users in /home  (I keep them the same for sanity's sake) by executing the following as root:

  smbpasswd -a mdriscol

Repeat this for any other users whose home directories you wish to make accessible.

3.  Restart the Samba service

  /etc/init.d/samba restart

4.  Login from your client of choice

For Windows, open the Run prompt with [Windows-Key]+R and enter "\\KUBE" -- the netbios name you gave your server. Log in with your Samba password.  Huzzah -- it works!

Now you have seamless, secure access to a centralized file server. 

Future steps for me:  (i) Since this entire process was motivated by a hard disk crash, I plan to set up nightly incremental backups of my Ubuntu file server, and (ii) some simple jiggering should allow me to mount this volume remotely anywhere I go.

Friday, October 10, 2008

Why Django is not a pure MVC framework

In the acronym jungle of web development, much has been made of the Model-View-Controller (MVC) design pattern.   It's lauded as one of Ruby on Rails' strengths and has clearly influenced Django's design.
Django's creators, Holovaty and Kaplan-Moss, write in Chapter 1 of The Definitive Guide to Django that:
Simply put, MVC defines a way of developing software so that the code for defining and accessing data (the model) is separate from request routing logic (the controller), which in turn is separate from the user interface (the view).
This didn't make intuitive sense to me -- shouldn't the "user interface" be considered the "controller"?  And there's no mention of templates here at all -- a fundamental component of Django.  Templates seemed most naturally part of the "view" in Django.

Looking for some clarity, I went back and read a highly-cited 1988 paper by Glenn Krasner and Stephen Pope, entitled "A cookbook for using the model-view controller user interface paradigm in Smalltalk-80", where they describe the MVC concept as it was originally implemented.   They write:
Models
The model of an application is the domain-specific software simulation or implementation of the application's central structure.  This can be as simple as an integer (as the model of a counter) or string (as the model of a text editor), or it can be a complex object that is an instance of a subclass [...]

Views
In this metaphor, views deal with everything graphical; they request data from their model, and display the data. [...]

Controllers
Controllers contain the interface between their associated models and views and the input devices (e.g. keyboard, pointing device, time).  Controllers also deal with scheduling interactions with other view-controller pairs [...].
A-ha!  This makes more sense.  The views handle how data looks, and the user interface is the controller -- in contrast to what Django's creators stated above.  

However, thinking I had stumped the Django folks, I came across this concession in Chapter 5 of The Definitive Guide to Django:
If you’re familiar with other MVC Web-development frameworks, such as Ruby on Rails, you may consider Django views to be the “controllers” and Django templates to be the “views.” 
Indeed, I think this is an unfortunate accident, a minor (yet confusing) sin of semantics on Django's part.  Quite clearly MVC controllers are Django views, and MVC views are Django templates.  This understanding (and ordering) is reflected in the acronym "MTV" -- Model-Template-View -- that some have used to describe Django.
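To make the mapping concrete, here is a tiny, hypothetical Django sketch (the Book model, book_list view, book_list.html template, and the app name myapp are invented for illustration, and render_to_response reflects the Django API of this era): the model defines and accesses the data, the Django "view" plays the MVC controller's role of fetching data and deciding what happens, and the template plays the MVC view's role of deciding how it looks.

## models.py -- the MVC "model": defining and accessing data
from django.db import models

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.CharField(max_length=100)

## views.py -- the MVC "controller" (a Django "view"): fetch data, pick a template
from django.shortcuts import render_to_response
from myapp.models import Book

def book_list(request):
    books = Book.objects.all()
    return render_to_response('book_list.html', {'books': books})

## templates/book_list.html -- the MVC "view" (a Django "template"): presentation only
##   {% for book in books %}<li>{{ book.title }} by {{ book.author }}</li>{% endfor %}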

Python web frameworks -- who's winning?

After reading Jeff Croft's "back to the great frameworks debate" post last week, and as someone who is wading into the Django framework -- I decided to look at the hard evidence for which Python web frameworks are gaining currency, at least in the eyes of Google.  I compared worldwide search volume for django, zope, web2py, and turbogears.

The result is pictured -- and it turns out that Django is at the top and gaining.  The recent release of Django 1.0, as well as an explicit nod of approval from Guido van Rossum over at Google's App Engine, certainly helps.

Alas, Python's Django is a long way from having the kind of mindshare that Ruby on Rails still enjoys, but as Robert F. Kennedy said about politics -- what's right isn't always popular, and what's popular isn't always right.