How to read files … one line at a time

Imagine that we have a file (.txt, .csv, .dat, etc.) and that we need to process each line and save the results in another file. The only problem is that the file is so large that it cannot be imported directly into R. What can we do? We can simply read it line by line, applying the required operations and exporting the output to another file as we go. Here’s how:

We need the input file (where the data is), and the output file (where the results will be stored). The general idea is to:
1) Read the input file line by line
2) For each line:
2.1) Apply operations
2.2) Export the outcome to the output file
3) Repeat steps 2.1 and 2.2 until there are no more lines

Let’s say we have an input_file with a lot of lines. We need to apply some function to each line, and then export the outcome to an output_file. Let’s suppose that the outcome will be stored in a vector of length 6 with the following elements:
Id: id number, Name: some name, Var1-Var3 are values for three variables, Status: some label. Here’s one way to do it.


# specify output_file with column names in the first line
 cat(c("Id", "Name", "Var1", "Var2", "Var3", "Status"), "\n",
     file="/path_output_file/output_file")

# define the location of the input file (to be read line by line)
 file_con = file("/path_input_file/input_file")

# open connection
 open(file_con)
 # create a line counter (starts at zero: no lines read yet)
 count_line = 0
 # let's process the data
 while(length(oneline <- readLines(file_con, n=1, warn=FALSE)) > 0)
 {
     # apply some function
     outcome = myfunction(oneline)
     # print results to output file
     cat(outcome, "\n",
         file="/path_output_file/output_file", append=TRUE)
     # increase counter
     count_line = count_line + 1
 }
 # close the connection
 close(file_con)
 # how many lines did we read?
 count_line
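
As a variation on the loop above, here is a minimal self-contained sketch that keeps the output file open as a connection instead of reopening it with append=TRUE on every iteration. The temporary files, the three sample lines, and the count_chars function are all made up for illustration; swap in your own paths and your own per-line function.

```r
# hypothetical per-line operation: count the characters in a line
count_chars = function(line) nchar(line)

# small temporary files so the sketch runs anywhere (illustrative only)
input_path = tempfile(fileext = ".txt")
output_path = tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "gamma"), input_path)

# open both connections once: one for reading, one for writing
in_con = file(input_path, open = "r")
out_con = file(output_path, open = "w")

# line counter starts at zero
count_line = 0
while (length(oneline <- readLines(in_con, n = 1, warn = FALSE)) > 0)
{
    # apply the operation and write the result as one line of output
    writeLines(as.character(count_chars(oneline)), out_con)
    count_line = count_line + 1
}

# close both connections
close(in_con)
close(out_con)

# how many lines did we read, and what did we write?
count_line
readLines(output_path)
```

Keeping the output connection open avoids one open/close cycle per line, which can matter when the input has millions of lines.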

Replicating google line chart in R

R is great for graphics, although sometimes it can be really tricky to replicate charts from other software. Take for example some of the graphics displayed with Google Charts, such as the line chart displayed below.

This Google chart took me a dozen clicks and a couple of minutes to create. Honestly, it’s not that bad at all. I bet we could also create something similar in Excel. However, if we want to produce the same graphic in R we’re going to need a good amount of patience and some guide to get the right color palette (I used the following website to get some colors: http://www.color-hex.com). So, let me show you how to mimic the Google line chart with R. I guarantee you it is possible.

Here’s the code:


# here's the data
numbers = c(0.61, 0.644, 0.586, 0.598, 0.596, 0.584,
 0.67, 0.712, 0.646, 0.68, 0.638, 0.66,
 0.618, 0.638, 0.58, 0.608, 0.592, 0.596,
 0.844, 0.87, 0.85, 0.844, 0.87, 0.838,
 0.342, 0.352, 0.316, 0.382, 0.328, 0.312,
 0.636, 0.704, 0.624, 0.658, 0.63, 0.646,
 0.676, 0.724, 0.666, 0.696, 0.656, 0.666)

# put the data in a matrix
some_data = matrix(numbers, nrow=7, ncol=6, byrow=TRUE)
rownames(some_data) = c("A", "B", "C", "D", "E", "F", "G")
colnames(some_data) = paste("Time", 1:6)

# define color palette
cols = c("#1664d9", "red", "orange", "green4", "purple2", "#1f9eb3", "#d93572")

# choose a graphic layout
layout(rbind(c(1,2), c(1,2)),
    heights = c(lcm(11), 1),
    widths = c(lcm(18), 1))

# set plot window for lines
par(mar = c(5,5,4,1))
# (type="n" draws nothing, so any row works here; it only sets up the axes)
plot(1:6, some_data[1,], type="n", xaxt="n", yaxt="n",
    xlab="x label", ylab="y label", bty="n", ylim=c(0.2,1))
# add major grid lines
abline(h = seq(0.2, 1, 0.2), col=c("black", rep("gray80",4)))
# add title
mtext(side=3, "Google Line Chart in R", at=1.5, line=0.5, font=2)
# add x-axis tick labels
mtext(side=1, colnames(some_data), at=1:6)
# add y-axis tick labels
mtext(side=2, seq(0.2, 1, 0.2), line = 1, at=seq(0.2, 1, 0.2), las=2)
# plot lines
for (i in 1:nrow(some_data))
{
    lines(1:6, some_data[i,], col=cols[i], lwd=3)
}
# plot legend
par(mar = c(5, 0.1, 4, 2))
plot(rep(0.5,7), seq(0, 1, length=7), type="n",
    xlab="", ylab="", bty="n", xaxt="n", yaxt="n",
    xlim=c(0.4,0.7), ylim=c(.2,1))
points(rep(0.5,7), seq(0.98, 0.6, length=7),
    pch=15, col=cols, cex=2.5)
text(rep(0.6,7), seq(0.98, 0.6, length=7),
    rownames(some_data), cex=1.2)

That’s it!
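
If you only need a quick, rough version without the custom grid and legend panel, base R's matplot can draw all the lines in one call. This is just a sketch, not a replacement for the code above: it uses the first three series of the data (rows A, B and C) and the first three colors, and the names sub_data and sub_cols are made up for this example.

```r
# first three series from the data above (rows A, B, C)
sub_data = matrix(c(0.61, 0.644, 0.586, 0.598, 0.596, 0.584,
                    0.67, 0.712, 0.646, 0.68, 0.638, 0.66,
                    0.618, 0.638, 0.58, 0.608, 0.592, 0.596),
                  nrow=3, ncol=6, byrow=TRUE)
rownames(sub_data) = c("A", "B", "C")
colnames(sub_data) = paste("Time", 1:6)
sub_cols = c("#1664d9", "red", "orange")

# reset to a single panel in case an earlier layout is still active
layout(1)

# matplot draws one line per column, so transpose to get one line per row
matplot(1:6, t(sub_data), type="l", lty=1, lwd=3, col=sub_cols,
    xaxt="n", bty="n", ylim=c(0.2, 1), xlab="x label", ylab="y label")
# x-axis tick labels and a simple legend
axis(side=1, at=1:6, labels=colnames(sub_data))
legend("topright", legend=rownames(sub_data), col=sub_cols, lwd=3, bty="n")
```

The result is plainer than the Google look, but it is handy for checking the data before investing time in the styling.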

Catching errors when using tolower

When I’m working in R with text data parsed from online opinion forums and social networks (e.g. Twitter), I need to do some cleaning and pre-processing such as removing punctuation marks, stripping extra white spaces, or converting text to lower case. More often than not, when using the tolower function I run into a really annoying error that is a true pain in the butt.

Consider the following example. Let’s say we have the text from a tweet in an object called some_text. When we print the object in the console, we get a warning message (in red) like this:

> some_text
[1] "No work today, slept through the classes I wanted at the gym. 
Now I need to find something to occupy my time \ud83d\udc4d\ud83d\ude09"
Warning message:
it is not known that wchar_t is Unicode on this platform

But that’s not the main problem. The real nightmare is when we try to convert the text to lower case using the function tolower. R complains and returns this ugly error message:

> tolower(some_text)
Error in tolower(some_text) :
invalid input 'No work today, slept through the classes I wanted at the gym. 
Now I need to find something to occupy my time 👍😉' in 'utf8towcs'

So, how can we solve this error? Meet the tryCatch function! This function will help us catch possible errors. We’ll make a new function combining tryCatch and tolower so we can identify any undesirable text without returning any ugly message and without stopping our programs. Here’s my tryTolower function:


tryTolower = function(x)
{
   # create missing value
   # this is where the returned value will be
   y = NA
   # tryCatch error
   try_error = tryCatch(tolower(x), error=function(e) e)
   # if not an error
   if (!inherits(try_error, "error"))
      y = tolower(x)
   return(y)
}

Let’s test tryTolower.
Suppose you have a character vector with five elements:

# vector with text
text_vector = c(
"Motivation, philosophy and technique in activism. #Assange and #Occupy: http://t.co/89PFkyjh via @RT_com",
"No work today, slept through the classes I wanted at the gym. Now I need to find something to occupy my time \ud83d\udc4d\ud83d\ude09",
"RT @jdavis4100: The Spirit of God and fear never occupy the same space. The presence of one automatically implies the absence of the other...",
"Police given powers to enter homes http://t.co/VXmtfPV5 and tear down anti- #Olympics posters during Games #Occupy #Anonymous #wakeup #fb",
"RT @OccupyWallSt: RT @WSOASP12: I quit my job to join the occupy movement. Time to stand up and speak out, I'm not here to make another man rich @Occupy #OWS")

# apply tolower (you should get an error message)
tolower(text_vector)

# now apply tryTolower with sapply
# (you should get a missing value when tryTolower finds an error) 
sapply(text_vector, function(x) tryTolower(x))
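
The same idea can be written more compactly by letting the error handler itself return the missing value and by looping inside the function. This is only a sketch of an alternative, and safeTolower is a made-up name; it relies on tryCatch and vapply from base R.

```r
# hypothetical variant: vectorized, returns NA for elements that fail
safeTolower = function(x)
{
    vapply(x, function(el) {
        # if tolower fails on this element, the handler returns NA instead
        tryCatch(tolower(el), error = function(e) NA_character_)
    }, character(1), USE.NAMES = FALSE)
}

# works directly on a character vector, no sapply needed
safeTolower(c("Hello World", "R IS FUN"))
```

Using vapply with character(1) guarantees a plain character vector of the same length as the input, so the result can be dropped straight into a data frame column.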

Using Geocoding Google API

Ever wanted to transform an address into longitude and latitude coordinates?

When working with data that can be projected to geographic places, there’s nothing more tempting than trying to visualize it in a map. You just need a set of coordinates (longitude and latitude) and some map visualization tool. So far, so good. The problem arises when the only thing you have is addresses but no longitude and latitude. What can be done in such cases? Meet Geocoding!

According to Wikipedia, geocoding “is the process of finding associated geographic coordinates from other geographic data, such as street addresses or zip codes”. The idea is very simple. Let’s say we have some address like “1600 Amphitheatre Parkway, Mountain View, CA”, and that we want to get its geographic coordinates. By using a geocoding tool we would find that the previous address has latitude 37.423021 and longitude -122.083739.

What can we use to geocodify our data in R?

An interesting -but not the only- option is to use the Geocoding Google API from Google Maps API Web Services. If you check the geocoding documentation, you’ll see that we can make requests and obtain the answers in XML format. The required URL format is something like this:

http://maps.googleapis.com/maps/api/geocode/xml?address=
1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false
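
A request URL in the format shown above can also be assembled with URLencode from base R (package utils), which escapes spaces, commas and any other special characters in an address. This is a sketch under assumptions: buildGeoQuery and geo_base are made-up names, and spaces come out as %20 rather than + as in the URL above.

```r
# base URL taken from the request format shown above
geo_base = "http://maps.googleapis.com/maps/api/geocode/xml"

# hypothetical helper: turn one or more addresses into request URLs
buildGeoQuery = function(address)
{
    # percent-encode each address, including reserved characters such as ","
    encoded = vapply(address, URLencode, character(1),
                     reserved = TRUE, USE.NAMES = FALSE)
    # attach the encoded address and the sensor parameter to the base URL
    paste0(geo_base, "?address=", encoded, "&sensor=false")
}

buildGeoQuery("1600 Amphitheatre Parkway, Mountain View, CA")
```

Encoding the whole address this way is a little more robust than replacing spaces by hand, for example when an address contains characters such as & or #.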

Code in R of a Geocoding API request with XML response


# load package XML
library(XML)

# Let's say we have a vector with 3 addresses
address = c("Newark Bl and Mayhews Landing Rd, Newark, CA",
  "Powell and North Point Dr, San Francisco, CA",
  "Sonoma Bl and Yolano Dr, Vallejo, CA")

# define geocoding google api url
geo_url = "http://maps.googleapis.com/maps/api/geocode/xml?address="

# empty vectors to store latitude and longitude
lat = rep("", length(address))
lon = rep("", length(address))

# parsing
for (i in seq_along(address))
{
  # replace spaces by '+' symbols in current address
  query = gsub(" ", "+", address[i])
  # create query
  geo_query = paste(geo_url, query, "&sensor=false", sep="")
  # the response is XML, so parse it with xmlParse
  doc = xmlParse(geo_query)
  # extract latitude and longitude
  lat[i] = xpathSApply(doc, "//location/lat", xmlValue)
  lon[i] = xpathSApply(doc, "//location/lng", xmlValue)
}

# convert as numeric
lat = as.numeric(lat)
lon = as.numeric(lon)

# lat and lon in a matrix
cbind(lat, lon)

The obtained results should be:

lat = 37.5396808, 37.8068281, 38.1323594
lon = -122.0336495, -122.412118, -122.2561071

BART Ridership 1

Ever wonder how many people exit at BART stations on a typical weekday? You can check the numbers in the BART reports page, where there are a lot of data files to play with. As part of a visualization of BART ridership, I created this graphic with the help of the R package ggmap and its useful functions to get maps from OpenStreetMap (and from other sources like Google Maps). The average number of people exiting most BART stations on a typical weekday is around 5000. However, numbers skyrocket in the stations located on Market Street (San Francisco): Embarcadero, Montgomery, and Powell. Just look at the big red circles on the map; that’s where riders concentrate.

SFO airport by BART? Watch your wallet!

In one of my visualization projects, I was curious to visualize the fares between BART stations. I was shocked by how much it costs to go to San Francisco airport by BART. One way to get a visual representation is to use a network graph. In this case, the nodes of the graph represent the stations, while the fares are represented by the edges. The pricier a fare, the thicker and brighter its corresponding edge. As you can tell from the previous graph, the SFO Int’l Airport station has the thickest and reddest edges, meaning it is the most expensive one. Why is it so fu&#ing expensive to go to SFO airport by BART?

Tweets comparison: illumina -vs- Affymetrix

Affymetrix and illumina are two US-based companies offering technologies for high-throughput genotyping, and they compete with each other across their product lines. They also have a presence on Twitter, where they actively post about a number of different topics. Since I was curious about the illumina tweets and the Affymetrix tweets, I decided to do a quick Twitter analysis in R and get a comparative wordcloud. The size of each word reflects its frequency. The color indicates which company uses a given word more.

We can also make a commonality cloud to see what terms are shared in the tweets of both firms.