R 101: Summarizing Data

When working with large amounts of tabular data, a common task is to summarize it by specific variables. In Microsoft Excel, pivot tables are the feature typically used for this purpose. While not as convenient as Excel pivot tables, R offers several ways to perform similar calculations on large data sets. In the following R code, I summarize a data frame by specific variables using a handful of different packages.

## CREATE DATA
 
dat = data.frame(
  name=c("Tony","James","Sara","Alice","David","Angie","Don","Faith","Becky","Jenny",
         "Kristi","Neil","Brandon","Kara","Kendra","Liz","Gina","Amber","Alice","George"),
  state=c("KS","IA","CA","FL","MI","CO","KA","CO","KS","CA","MN","FL","NM","MS","GA",
          "IA","IL","ID","NY","NJ"),
  gender=c("M","M","F","F","F","M","F","M","F","F","F","M","M","F","F","F","F","F","F","M"),
  marital_status=c("M","S","S","S","M","M","S","M","S","M","M","S","S","S","M","M","S","M","S","M"),
  credit=c("good","good","poor","fair","poor","fair","fair","fair","good","fair",
           "good","good","poor","fair","poor","fair","fair","fair","good","fair"),
  owns_home=c(0,1,0,0,1,0,1,1,1,1,0,1,0,0,1,0,1,1,1,1),
  cost=c(500,200,300,150,200,300,400,450,250,150,500,200,300,150,200,300,400,450,250,150))
## AGGREGATE FUNCTION FROM BASE R
aggregate(cost ~ marital_status, data=dat, FUN=mean)
aggregate(cost ~ marital_status + gender, data=dat, FUN=mean)
aggregate(cost ~ marital_status + credit + gender, data=dat, FUN=mean)
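
aggregate() can also summarize more than one measure at a time; as a quick sketch on the same dat object, wrapping the left-hand side of the formula in cbind() produces a mean for each variable listed:

aggregate(cbind(cost, owns_home) ~ gender, data=dat, FUN=mean)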
 
## SUMMARYBY IN DOBY
library(doBy)
summaryBy(cost ~ marital_status, data=dat, FUN=c(mean, sd))
summaryBy(cost ~ gender, data=dat, FUN=c(mean, sd))
summaryBy(cost ~ credit, data=dat, FUN=c(mean, sd))
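
As with aggregate(), summaryBy() accepts multiple grouping variables on the right-hand side of the formula, so a sketch combining the groupings above might look like:

summaryBy(cost ~ marital_status + gender, data=dat, FUN=c(mean, sd))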
 
## DDPLY IN PLYR
library(plyr)
ddply(dat, .(credit), "nrow")
ddply(dat, .(credit, gender), "nrow")
ddply(dat, .(marital_status), summarise, avg=mean(cost))
ddply(dat, .(marital_status, gender), summarise, avg=mean(cost))
ddply(dat, .(marital_status, gender, credit), summarise, avg=mean(cost))
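
ddply() is not limited to a single summary column; for instance, a count, mean, and standard deviation can be computed per credit rating in one call:

ddply(dat, .(credit), summarise, n=length(cost), avg=mean(cost), sd=sd(cost))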
 
## DPLYR PACKAGE
library(dplyr)
Good = filter(dat, credit=="good")
Good
arrange(Good, desc(cost))
select(Good, owns_home, cost)
mutate(Good, New_Value=cost/5)
by.type <- group_by(Good, gender)
summarise(by.type, num.types = n(), counts = sum(cost))
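
The same dplyr steps can also be chained together; assuming a dplyr version recent enough to re-export the magrittr pipe (%>%), a rough equivalent of the filter/group/summarise steps above is:

dat %>%
  filter(credit == "good") %>%
  group_by(gender) %>%
  summarise(num.types = n(), counts = sum(cost))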
 
## SQLDF PACKAGE
library(sqldf)
sqldf("SELECT gender, COUNT(*) FROM dat GROUP BY gender")
sqldf("SELECT gender, credit, COUNT(*) FROM dat GROUP BY gender, credit")
sqldf("SELECT gender, credit, COUNT(*), AVG(cost) FROM dat GROUP BY gender, credit")

## DATA.TABLE PACKAGE **
library(data.table)
dat <- data.table(dat)
dat[,list(mean=mean(cost),sd=sd(cost)),by=c("gender","credit")]
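
data.table also offers the special symbol .N for group counts, and keyby sorts the output by the grouping variables; a quick sketch:

dat[,list(n=.N,mean=mean(cost)),keyby=c("gender","credit")]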

While that is all fine and dandy, data is often messy, and we are forced to deal with missing values or manipulate strings in order to get the desired output. So how would one go about summarizing data when there are missing values or other “odd” characteristics in the data set? One option is to use complete.cases() to remove all rows with missing values. You can also use na.omit(), or is.na() within a conditional statement, to drop missing values from the data set or from the summary output.

## add a new row with missing values (NA)
dat = as.data.frame(dat) # convert back from data.table so a new row can be added by index
dat[21,] = list("Tom", "GA", "M", NA, NA, 0, NA) # list() preserves column types; c() would coerce everything to character
dat[complete.cases(dat),] # removes all rows with any NA values
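
The na.omit() and is.na() options mentioned above work similarly; na.omit() drops every row containing an NA, while is.na() lets you target a single column:

na.omit(dat) # same effect as complete.cases() here
dat[!is.na(dat$cost),] # drop only the rows where cost is missing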

ddat = ddply(dat, .(marital_status, gender), summarise, avg=mean(cost, na.rm=TRUE))
ddat # oh no, now there are missing categories in my summary output
ddat[complete.cases(ddat),]

Depending on the analysis you’re performing, removing rows may not be acceptable. When you can’t drop every row containing missing values from your data, the main alternative is to exclude those observations from the calculation itself using the na.rm argument of the mean() function.

mean(dat$cost, na.rm=TRUE)
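
Without na.rm, the missing value propagates and the result is NA:

mean(dat$cost) # NA, because the new row has a missing cost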

Let’s take this one step further and add a new observation whose name contains a few characters that it shouldn’t. Besides dealing with missing values, string manipulation is usually the main issue I run into when summarizing data. There are countless ways to clean up strings, but here are the two approaches I use most frequently when I encounter a string containing characters it shouldn’t.

dat[22,] = list("Terminat_/?or", "GA", "M", NA, NA, 0, NA) # again, list() keeps the column types intact
 
library(qdap)
strip(dat$name, lower.case = FALSE)
 
library(stringr)
str_replace_all(dat$name, "[[:punct:]]", "")
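
If you would rather not load another package, base R’s gsub() with the same POSIX character class does the equivalent replacement:

gsub("[[:punct:]]", "", dat$name)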

 

** Credit to Robert Adams for this contribution (see comments)

GraphViz by Example

GraphViz is an open-source software package developed at AT&T Labs for generating directed graphs and flowcharts. Output is created using DOT, a plain-text graph description language that is part of the GraphViz package. GraphViz is a powerful tool that lets users create appealing flowcharts without getting hung up on the layout or positioning of the nodes. Graphs in GraphViz are composed of nodes and edges (lines) and can be directed (lines with arrows) or undirected (lines without arrows). To create a flowchart with dot, the user specifies the type of graph and the name of the graph, and then defines its contents within braces.

graph name1 {
     Node1 [label="Node_1"];
}

[graph4: rendered output of the graph definition above]

digraph name1 { 
      Node1 [label="Node_1"];
      Node2 [label="Node_2"];
      Node1 -> Node2;
}

[graph3: rendered output of the digraph definition above]

digraph name1 { 
     Node1 [label="Node_1"];
     Node2 [label="Node_2"];
     Node3 [label="Node_3"];
     Node1 -> Node2 -> Node3;
     Node1 -> Node3;
     Node2 -> Node1;
}

[graph2: rendered output of the digraph definition above]

digraph name1 { 
   Node1 [label="Node_1"];
   Node2 [label="Node_2", shape=diamond];
   Node3 [label="Node_3", fontcolor=Navy, shape=box, style=filled, color="#d3edea"];
   Node4 [label="Node_4"];
   Node1 -> Node2 -> Node3;
   Node1 -> Node3 [color=red, label="Best Choice", fontsize=10];
   Node2 -> Node1;
   Node2 -> Node4;
}

[graph1: rendered output of the digraph definition above]
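
To turn one of these definitions into an image, save it to a plain-text file and run the dot command-line tool that ships with GraphViz; assuming the last definition were saved as graph1.gv (a hypothetical file name), something like the following produces a PNG:

dot -Tpng graph1.gv -o graph1.png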

Creating ‘Tags’ For PPC Keywords

When performing search engine marketing, it is usually beneficial to have a system for making sense of keywords and their performance. While one could construct Bayesian belief networks to model the process of consumers clicking on ads, I have found that using ‘tags’ to categorize keywords is just as useful for conducting post-hoc analysis on the effectiveness of marketing campaigns. By ‘tags,’ I mean identifiers which categorize keywords according to their characteristics. For example, the following data frame contains six keywords along with their average bid, clicks, conversions, and tag columns for state, model, car, auto, save, and cheap. What we want to do is set each tag to 1 if and only if that tag is mentioned in the keyword.

# CREATE SOME DATA
df = data.frame(keyword=c("best car insurance",
                          "honda auto insurance",
                          "florida car insurance",
                          "cheap insurance online",
                          "free insurance quotes",
                          "iowa drivers save money"),
                average_bid=c(3.12, 2.55, 2.38, 5.99, 4.75, 4.59),
                clicks=c(15, 20, 30, 50, 10, 25),
                conversions=c(5, 2, 10, 15, 3, 5),
                state=0, model=0, car=0, auto=0, save=0, cheap=0)
df

# FUNCTION WHICH SETS EACH TAG TO 1 IF THE SPECIFIED TAG IS PRESENT IN THE KEYWORD
main <- function(df) {
  state <- c("michigan", "missouri", "florida", "iowa", "kansas")
  model <- c("honda", "toyota", "ford", "acura", "audi")
  car <- c("car")
  auto <- c("auto")
  save <- c("save")
  cheap <- c("cheap")
  for (i in 1:nrow(df)) {
    Words = strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if(any(Words %in% state)) df[i, 'state'] <- 1
    if(any(Words %in% model)) df[i, 'model'] <- 1 
    if(any(Words %in% car)) df[i, 'car'] <- 1
    if(any(Words %in% auto)) df[i, 'auto'] <- 1     
    if(any(Words %in% save)) df[i, 'save'] <- 1
    if(any(Words %in% cheap)) df[i, 'cheap'] <- 1
  }
  return(df)
}

one = main(df)

subset(one, state==TRUE | model==TRUE | auto==TRUE)

# AN ALTERNATE METHOD USING THE STRINGR PACKAGE

df

library(stringr)

# CREATE EACH TAG
state <- c("michigan", "missouri", "florida", "iowa", "kansas")
model <- c("honda", "toyota", "ford", "acura", "audi")
car <- c("car")
auto <- c("auto")
save <- c("save")
cheap <- c("cheap")

state_match <- str_c(state, collapse = "|")
model_match <- str_c(model, collapse = "|")
car_match <- str_c(car, collapse = "|")
auto_match <- str_c(auto, collapse = "|")
save_match <- str_c(save, collapse = "|")
cheap_match <- str_c(cheap, collapse = "|")

#FUNCTION TO SET TAG IF PRESENT IN THE KEYWORD
main <- function(df) {
  df$state <- str_detect(df$keyword, state_match)
  df$model <- str_detect(df$keyword, model_match)
  df$car <- str_detect(df$keyword, car_match)
  df$auto <- str_detect(df$keyword, auto_match)
  df$save <- str_detect(df$keyword, save_match)
  df$cheap <- str_detect(df$keyword, cheap_match)
  df
}

two = main(df)

subset(two, state==TRUE | model==TRUE | auto==TRUE)

By now, some of you are probably wondering why we don’t just select the keywords directly from the original data frame based on the desired characteristic. That works too, although the marketing professionals I’ve worked with have preferred the ‘tagging’ method.

## Alternate approach - SELECT DIRECTLY

df

main <- function(df) {
  model <- c("honda", "toyota", "ford", "acura", "audi")
  # keep every keyword (and its bid/click/conversion columns) that mentions a car model
  hits <- sapply(strsplit(as.character(df$keyword), " "),
                 function(words) any(words %in% model))
  df[hits, 1:4]
}

three = main(df)

So there you have it: a method for ‘tagging’ strings according to a specified set of characteristics. The benefit of using ‘tags’ is that they give you a systematic way to document how the presence of certain words or phrases impacts performance.
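
As a quick illustration of that kind of post-hoc analysis, the tagged data frame one created earlier can be rolled up by any tag column using the same summarization tools from the first section; for example, comparing keywords with and without the ‘car’ tag:

# TOTAL CLICKS AND CONVERSIONS BY THE 'CAR' TAG
aggregate(cbind(clicks, conversions) ~ car, data=one, FUN=sum)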