Assignment 1 - 11/4/08

This assignment looked into both the subtle and significant differences in how the presidential campaigns communicate with their audiences online. I explored the language used on both presidential candidates' websites, specifically the noun phrases in their campaign blogs. Given the nature of this election, I figured the blog content of the two candidates would be contrasting, to say the least. I also had a few questions: are the campaigns covering the same topics? Is one candidate mentioning the Iraq War more often than the other? Is one candidate mentioning his opponent significantly more than the other?

To answer these questions, I needed perl scripts to retrieve the RSS feeds of the blogs, MontyLingua to process the text, and Many Eyes to visualize my data. Based on my findings, I determined that the blogs may have different purposes: it appears that Obama uses his blog to mobilize his supporters, while McCain uses his as another medium for presenting his talking points.


Using perl scripts, I parsed the RSS feeds from the Barack Obama campaign blog and the John McCain campaign blog. I retrieved as much data as I could from both feeds; I managed to get 26,000 words from Obama's feed and 14,000 words from McCain's. I ended up with two large plain-text files, one for Obama and one for McCain. To extract the noun phrases, I used a script that implements MontyLingua, a natural language processor for English developed at MIT. The resulting files for each candidate looked like the following:

NX	Virginia
VX	have not go
AX	blue
NX	many year
NX	campaign trip
VX	be
AX	symbolic
NX	significant change
NX	that
VX	be take
NX	place
NX	electoral map

NX = noun chunk; VX = verb chunk; AX = adjective chunk
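For illustration, the feed-retrieval step above could also be sketched in Python (this is not the original perl; the RSS fields pulled out here, title and description, are assumptions about the feed layout):

```python
import urllib.request
import xml.etree.ElementTree as ET

def items_text(xml_bytes):
    """Collect the title and description text from every <item> in an RSS feed."""
    root = ET.fromstring(xml_bytes)
    chunks = []
    for item in root.iter("item"):
        for tag in ("title", "description"):
            node = item.find(tag)
            if node is not None and node.text:
                chunks.append(node.text.strip())
    return "\n".join(chunks)

def fetch_feed_text(url):
    """Download an RSS feed and return the text of its items."""
    with urllib.request.urlopen(url) as resp:
        return items_text(resp.read())
```

The plain-text result of `fetch_feed_text` would then be handed off to the noun-phrase chunker.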

I then used the following unix command to get rid of the verb and adjective chunks. I also sorted the noun phrases alphabetically.

cat barack_chunks | grep 'NX' | cut -f2 | sort > barack_NXs

I ended up with files of only noun phrases for both Obama and McCain that looked like this:

ABC News
absentee ballot
absentee voting
abundant evidence
academic year prepare student
accurate mortgage information
actual spot

NOTE: I removed the candidates' own names from their respective files

To finish, I added the data to Many Eyes, an online data visualization tool provided by IBM. I decided to make a wordle for each candidate's noun phrases.
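Many Eyes computes the tag-cloud frequencies itself, but the same tallies are easy to pre-compute; a quick sketch in Python (the input is just an iterable of lines like the sorted noun-phrase files above):

```python
from collections import Counter

def phrase_frequencies(lines):
    """Tally how often each noun phrase occurs, ignoring case and blank lines."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return counts.most_common()

# Usage: phrase_frequencies(open("barack_NXs"))
```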

Which wordle corresponds to Obama, and which to McCain?
What do these mean?

The first wordle was generated from Obama's campaign blog RSS feed, while the second was from McCain's. From Obama's wordle, we can see that his campaign mostly discussed a day, presumably election day. Obama's campaign also uses the words people, phone, and volunteer quite often - this goes along with their extensive phone banking, canvassing and overall grassroots campaign. State and battleground are also prominent - Obama's campaign is focused on the battleground states such as Ohio and Florida, which also appear in the wordle. Obama's poll is larger than McCain's - maybe his campaign is discussing the polls because he is leading in many of them?

The most frequent word in John McCain's wordle is senator - this probably means that the title of Senator precedes his name at all times in the blog. Naturally, this word implies authority and experience. Indeed, one of John McCain's most common talking points is his extensive experience in office and in the armed forces, and the inexperience of his opponent. McCain's Iowa is larger than Obama's - is he targeting that state more than Obama? Well, if he isn't, he should be. According to this poll, he is trailing Obama by roughly 15 points.

I would say that Obama's blog has a different purpose than McCain's. I say this because Obama's wordle highlights words that are meant to mobilize people to take action: people, phone, supporter, volunteer, vote. McCain's wordle contains America, debate, immigration, and Iraq (which curiously does not show up in Obama's). Maybe McCain's campaign blog is aimed more at presenting talking points rather than rallying supporters.

Want more?

Check out this visualization that an anonymous person created from my data set.

to top

Assignment 2 - 11/11/08

The purpose of this assignment was to replicate the first assignment using an sqlite database instead of plain text files. I imported the noun phrases (Obama, McCain) from the Barack Obama campaign blog and the John McCain campaign blog (which has been removed as of 11/5/08) into an sqlite database. From there, I used perl scripts to access the database and retrieve the noun phrases along with the frequency at which they appear.


I started by importing the noun phrases into two tables; one for Obama and one for McCain. Here are the commands I used in unix and sqlite:

sqlite3 campaign.db
sqlite> create table barack (np varchar(50));
sqlite> .import barack_NXs barack
sqlite> create table mccain (np varchar(50));
sqlite> .import mccain_NXs mccain

From there, I used a perl script for each table to read the phrases and group them by their frequency. Here is the script for Obama:

use warnings;
use strict;
use POSIX;
use DBI;

my $q;		# query handle
my $b_np; 	# obama's noun phrase
my $count; 	# frequency of phrase

# connect to database
my $db = DBI->connect("dbi:SQLite:dbname=campaign.db","","",{AutoCommit => 0});

# prepare my query
$q = $db->prepare("SELECT np, COUNT(*) AS num FROM barack GROUP BY np ORDER BY num");

# execute the query, then fetch and print each row
$q->execute();
$q->bind_columns(\$b_np, \$count);
while ($q->fetch) {
  printf "%s\t%d\n", $b_np, $count;
}

Now we can see each noun phrase in Obama's campaign blog along with how many times it appears. The script above outputs the following:

weekend	40
election	43
phone	44
that	44
country	47
poll	47
them	48
It	52
people	53
You	57
Barack	63
what	66
Obama	67
us	68
he	79
who	84
they	87
We	101
it	130
I	194
we	221
you	246

Uh oh. There is both "You" and "you" in the list. No worries, I can change all the words to lower case by modifying the query to the following:

SELECT lower(np), COUNT(*) AS num FROM barack GROUP BY lower(np) ORDER BY num;

Now I get:

country	47
them	48
change	51
poll	51
vote	51
people	62
barack	63
obama	67
us	68
there	74
what	76
who	85
they	103
he	115
it	182
i	194
you	303
we	322

That's better. Now I can do the same thing for McCain's noun phrases and I get the following:

click	15
crowd	16
debate	18
him	19
president	22
there	22
that	28
what	32
senator mccain	33
who	33
they	35
us	39
you	58
i	82
john mccain	82
it	91
he	105
we	117

With these lists of noun phrases and their frequencies, I can recreate the wordles shown above. Mission accomplished.
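As a cross-check, the same case-folded counts can be pulled with Python's sqlite3 module (a sketch; the database and table names follow the ones created above):

```python
import sqlite3

def phrase_counts(db_path, table):
    """Return (phrase, count) pairs, case-folded, least to most common."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT lower(np), COUNT(*) AS num "
            "FROM %s GROUP BY lower(np) ORDER BY num" % table
        ).fetchall()
    finally:
        con.close()

# Usage: phrase_counts("campaign.db", "barack")
```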

to top

Assignment 3 - 11/18/08

The purpose of this assignment was to combine all the data I've used so far and put it into R, an application used for statistics and graphics. First, I connected to my SQLite database from assignment 2 and retrieved the word phrases that are common to both Obama's and McCain's campaign blogs. From there, using two barplots in R, I compared the frequency at which these words appear in both Obama's and McCain's blogs. While the Many Eyes Wordle is capable of comparing word frequencies, simple barplots allow us to focus on certain words and compare them in a more side-by-side manner.


First, I wanted to know which phrases showed up in both candidates' blogs. This would allow me to compare the frequencies at which these phrases appear in both campaign blogs. Using the SQL below, I determined the phrases common to both Obama's and McCain's campaign blogs. I then used these phrases to create a new table called common.

CREATE TABLE common AS
SELECT lower(b.np) word_phrase
FROM barack AS b, mccain AS m
WHERE lower(b.np) = lower(m.np)
GROUP BY word_phrase;

With the common phrases, I now wanted to compare the frequency at which they appear in both blogs. I accomplished this with the following SQL:

SELECT word_phrase, b_num barack_count, (m_num * 1.89) mccain_count 
FROM common AS c 
JOIN
	(SELECT lower(m.np) mccain_np, COUNT(*) m_num 
	 FROM mccain AS m, common AS c 
	 WHERE lower(m.np) = c.word_phrase 
	 GROUP BY mccain_np) 
ON c.word_phrase = mccain_np 
JOIN
	(SELECT lower(b.np) barack_np, COUNT(*) b_num 
	FROM barack AS b, common AS c 
	WHERE lower(b.np) = c.word_phrase 
	GROUP BY barack_np) 
ON c.word_phrase = barack_np;

I multiplied mccain_count by 1.89; this was to account for the fact that Obama had roughly 89% more phrases than McCain. For this reason, McCain's numbers are floats from now on.
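The normalization is simple proportional scaling; a sketch in Python (the totals here are placeholders illustrating the ~1.89 ratio reported above):

```python
barack_total = 18_900   # hypothetical phrase totals; the reported ratio was ~1.89
mccain_total = 10_000

scale = barack_total / mccain_total

def normalize(mccain_count):
    """Scale a raw McCain count so the two blogs are directly comparable."""
    return mccain_count * scale
```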

This SQL gave me the output below. Each word phrase is followed by the frequency at which it appears in Obama's and McCain's campaign blogs, respectively.


From there, I created a script that connects this SQL statement with R to create the following barplots:


Looking back, I should have created two sets of barplots: one for McCain's most popular phrases in addition to the one for Obama's. There are a few things of interest in these barplots. For example, notice that Obama's blog leans on we, you, people, and so on. As I touched on above, the Obama campaign blog uses words that center on us, the citizens who would ultimately elect him into office. We also see that Obama's blog uses phone and weekend far more than McCain's, which may coincide with Obama's overpowering grassroots campaign. Finally, Obama's blog contains many more occurrences of she than McCain's - maybe this refers to Michelle?

to top

Assignment 4 - 11/25/08

For this assignment, I analyzed the server log of a website that contained a gallery of pictures. First, I wanted to know which pictures were the most popular among the site's visitors. Then, I figured out which users had clicked on these pictures. It was also late at night, and I was feeling a bit shady, so I decided to figure out where these users were located. To accomplish this, I used the MaxMind GeoIP Perl API, which gave me cities, states, countries, and even latitude and longitude, based on each user's IP address. Finally, I plotted each of the users onto a Google Map.


I began by writing a simple perl script to narrow down the 13 most popular pictures the users had clicked on. This gave me the following output:

IMG_5667.jpg 	19
IMG_5634.jpg 	19
IMG_5657.jpg 	19
IMG_5663.jpg 	17
IMG_5717.jpg 	16
IMG_5652.jpg 	16
IMG_5699.jpg 	16
IMG_5606.jpg 	16
IMG_5601.jpg 	15
IMG_5718.jpg 	15
IMG_5675.jpg 	15
IMG_5638.jpg 	15
IMG_5622.jpg 	15

Above are the most popular images along with the number of times each was clicked on. From here, I wrote another perl script to capture information on the users that had clicked on the most popular pictures. Then, I used the MaxMind GeoIP Perl API to figure out their latitude and longitude, along with each of the popular pictures they had clicked on. I now had the following information:

  • The user's host
  • The picture(s) the user clicked on
  • The time they clicked on each picture
  • The approximate geographic location of the user

With these data, combined with basic HTML, I created user_data.txt. It looked like the following for each user:

URL(s): 05:30:14 04/May/2006 - IMG_5634.jpg 05:31:43 04/May/2006 - IMG_5652.jpg 05:40:20 04/May/2006 - IMG_5699.jpg 05:44:08 04/May/2006 - IMG_5717.jpg 05:45:06 04/May/2006 - IMG_5718.jpg

The first two numbers (separated by pipes) are the latitude and longitude of the user.

I used this data to plot each user into the following Google Map:


I had a little bit too much fun doing this one. It was my first attempt at putting together a Google Map, and I'm pleased with the results. Naturally, most of the users can be found close to SI. I was most curious, however, to see what the Russian and South African visitors were clicking on! It is also interesting to note that some users revisited certain pictures two, even three times. Finally, it appears that our friend in Russia had problems loading a few of the pictures! Look at the following:

06:13:10 05/May/2006 - IMG_5634.jpg
06:13:11 05/May/2006 - IMG_5634.jpg
06:13:31 05/May/2006 - IMG_5634.jpg

So I'm not the only one who hammers the refresh button on a slow connection... (note: this could also be the browser sending multiple requests)
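The first script's core - tallying GET requests per image - could be sketched like this in Python (the regex assumes an Apache-style access log; the real log's format may differ):

```python
import re
from collections import Counter

# Matches the image filename in a request line such as
# "GET /gallery/IMG_5634.jpg HTTP/1.1" (hypothetical path layout).
IMG_RE = re.compile(r'GET\s+\S*?(IMG_\d+\.jpg)')

def top_pictures(log_lines, n=13):
    """Count requests per image filename and return the n most requested."""
    hits = Counter()
    for line in log_lines:
        m = IMG_RE.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits.most_common(n)
```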

to top

Assignment 5 - 12/2/08

The purpose of this assignment was to revisit assignment 2 and use regular expressions to filter out certain stopwords and characters from Obama's and McCain's campaign RSS feeds. This was an attempt to pull the most important information from the RSS feeds to facilitate data analysis.


I wrote a script (shown below) that filtered out basic stopwords, along with other unnecessary characters such as quotes and dashes. I also chose to replace some characters, such as "!", "?" and "," with a period to maintain congruity among the filtered text. The script below shows how I did this.

# open stopwords text file
open(IN,"stopwords.txt") || die("Could not open file!");

# read each stopword, place in hash
while ($line = <IN>) {
	chomp($line);
	$stopwords{$line} = $line;
}
close(IN);

# open mccain.txt
open(IN,"mccain.txt") || die("Could not open file!");

# print header
print "McCain Filtered\n############################\n";
while ($line = <IN>) {
  $line = lc($line);            # to lower case
  $line =~ s/[\'\"\-\:]//g;     # remove ' " - :
  $line =~ s/[\,\!\?]/\./g;     # replace , ! ? with .
  $line =~ s/\.\s?\.//g;        # get rid of . .
  $line =~ s/\s+/ /g;           # collapse unnecessary whitespace
  # split each line into words, filter out stopwords
  for my $word (split /(\s+)/, $line) {
    if ($stopwords{$word}) {next;}
    print $word;
  }
}
close(IN);

# open obama.txt
open(IN,"obama.txt") || die("Could not open file!");

# print header
print "\n\n\nObama Filtered\n############################\n";
while ($line = <IN>) {
  $line = lc($line);            # to lower case
  $line =~ s/[\'\"\-\:]//g;     # remove ' " - :
  $line =~ s/[\,\!\?]/\./g;     # replace , ! ? with .
  $line =~ s/\.\s?\.//g;        # get rid of . .
  $line =~ s/\s+/ /g;           # collapse unnecessary whitespace
  # split each line into words, filter out stopwords
  for my $word (split /(\s+)/, $line) {
    if ($stopwords{$word}) {next;}
    print $word;
  }
}
close(IN);

This script gave me a text file with filtered campaign content from both presidential candidates.
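For comparison, the same cleanup can be written compactly in Python (a sketch mirroring the perl substitutions, not part of the original assignment):

```python
import re

def filter_text(text, stopwords):
    """Lower-case, strip ' " - :, turn , ! ? into periods, drop stopwords."""
    text = text.lower()
    text = re.sub(r"['\"\-:]", "", text)   # remove ' " - :
    text = re.sub(r"[,!?]", ".", text)     # replace , ! ? with .
    text = re.sub(r"\.\s?\.", "", text)    # get rid of . .
    return " ".join(w for w in text.split() if w not in stopwords)
```

Unlike the perl version, this collapses the leftover whitespace instead of preserving it.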


It's nice that I can read through the resulting text file and get the gist of the blogs' content. To better visualize the data, I left the multiple white spaces in place, since each one marks a word that was filtered out. The stopword list was also helpful in filtering out words that were of no interest to me, and since it is merely a text file, it can be easily revised.

to top

Assignment 6 - 12/9/08

For this assignment, I visualized the relationship between a dog's size/weight and its life expectancy. I have always heard that smaller dogs live longer than large ones, so this was a perfect time to find out. I gathered the relevant data by scraping a dog breed website's HTML and then exporting the results to R, an application used for statistics and graphics. To display the data, I used a heat map, a two-dimensional graphic that displays data in multiple colors. The heatmap in R also clusters the breeds based on statistical similarity.


To begin, I needed to retrieve some data on dogs, specifically basic physical characteristics (height and weight) along with their expected lifespans. I wrote a perl script that scraped the site's breed listing page. I then parsed through each of the links and picked out the ones that led to a specific breed, such as the labrador retriever. I ended up with a text file that looked like the following:

Breed  Min-height  Max-height  Min-weight  Max-weight  Life-exp
Chihuahua	6	9	2	6	14
Affenpinscher	9	12	6	9	15
Cairn terrier	10	12	13	16	14
Norfolk terrier	9	10	11	12	14
Skye terrier	9	10	19	25	13
West highland white terrier	10	11	15	22	14
Australian terrier	10	11	9	14	14
Bolognese	9	12	5	9	15
Border terrier	10	15	13	20	14
Brussels griffon	7	8	6	12	14

The website had information on more than 200 dog breeds, but I only used 55 since the other breeds had incomplete information.
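The completeness check boils down to keeping only rows with all six fields filled; a sketch of that filter in Python (the tab-separated layout is taken from the table above):

```python
def complete_rows(lines):
    """Keep breed rows that have all six tab-separated fields filled in."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6 and all(f.strip() for f in fields):
            kept.append(fields)
    return kept
```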

My next step was to export the data to R in order to make a heatmap of the data, and hopefully see some relationship between size/weight and life expectancy. To accomplish this, I wrote the following code in R:

# change working directory for convenience

# read in text file on dogs (the result of the scraping script)
dogs <- read.table("dogs2.txt", sep="\t", header=F)

# create matrix from numerical dog data 
mydata <- as.matrix(dogs[2:6]);

# create labels for dog variables
labels=c("Min-height", "Max-height", "Min-weight", "Max-weight", "Life-exp")

# create dendrogram/rainbow/heatmap 
rc <- rainbow(nrow(mydata), start=0, end=.3)
hv <- heatmap(mydata, Colv=NA,col = cm.colors(256), scale="column",
              RowSideColors = rc, margin=c(11,10),
              xlab = "Dog Variables", ylab= "Breed",
              main = "Dogs: Their weight, height, and lifespan", 
              labRow=dogs$V1, labCol=labels)

The previous code gave me this visualization (click to enlarge):


At first glance, I'm not sure whether smaller dogs are expected to live longer. While I'm thinking, let's observe some other things the heatmap shows. For instance, the dendrogram on the left-hand side divides the group in half. I would cut the group into two or three chunks using averages as well, but why are the Shiloh Shepherd and King Shepherd excluded from the pack on the bottom? For one, they are by far the heaviest, which is why the colors of their minimum and maximum weights are dark pink. Weighing in at more than 120 lbs each, the Shiloh and King are expected to live to 11 and 13 respectively. Moving on, the breeds with dark pink on the bottom right are also of interest to me, since they must live the longest. The three that really stick out are the Coton de Tuléar, the Dachshund and the Godland Hound, which live to 18, 17 and 17 respectively. They are also quite small, weighing in at less than 30 lbs each. Maybe small dogs actually do live longer...

Regardless, may my new lab puppy, Samson, live a long and happy life with us.

to top