Part 1: 2D Visuals

ggplot2

ggplot2 is a versatile and visually optimized library:

  • Simplified API versus vanilla R graphics API
  • Staple of data analysis process
  • Can be used for data storytelling as well
  • Produces well-formed static visualizations
  • Useful in quickly wireframing an interactive visualization

To get started, we’ll need to first install and load the package.

#install ggplot2
#install.packages('ggplot2')

#load library
library(ggplot2)

ggplot2 ships with a number of datasets that illustrate how to use its functionality. The economics dataset contains data from the US Bureau of Economic Analysis, US Census Bureau, and the US Bureau of Labor Statistics. economics is in wide form (i.e. each row represents a different month, and each column contains a different variable). economics_long, a variant of the economics dataset, is provided in long or stacked form (i.e. each row represents a distinct combination of date and metric, and the metrics are represented by two columns: a variable label and a value). We’ll rely on the economics_long dataset for the ggplot2 tutorial. For more explanation about wide and long form, take a look at this UCLA tutorial.

#Wide form dataset
  head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018
#Long form dataset
  head(economics_long)
##         date variable value      value01
## 1 1967-07-01      pce 507.4 0.0000000000
## 2 1967-08-01      pce 510.5 0.0002660008
## 3 1967-09-01      pce 516.3 0.0007636797
## 4 1967-10-01      pce 512.9 0.0004719369
## 5 1967-11-01      pce 518.1 0.0009181318
## 6 1967-12-01      pce 525.8 0.0015788435
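
To make the wide versus long distinction concrete, here is a minimal base R sketch on a made-up two-row table. The column names sales and costs are hypothetical and exist only for illustration.

```r
#toy wide-form table: one row per date, one column per metric
wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01")),
  sales = c(10, 12),
  costs = c(7, 8)
)

#stack the metric columns into label/value pairs (long form)
metrics <- c("sales", "costs")
long <- data.frame(
  date = rep(wide$date, times = length(metrics)),
  variable = rep(metrics, each = nrow(wide)),
  value = unlist(wide[metrics], use.names = FALSE)
)

#two dates x two metrics = four rows
long
```

Each date now appears once per metric, which is the shape that the color and group aesthetics in ggplot2 expect.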

The syntax for ggplot2 is as follows:

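  • ggplot() initializes the plot and accepts the dataset and the aesthetic mappings
  • aes() maps variables in the dataset to aesthetics such as position, color, and grouping
  • geom_point() indicates the type of graph. This geom can be swapped out for lines, bars, or areas, among other forms

#create a ggplot area and add specific geoms
#(data, x_var, y_var, color_var and group_var are placeholders)
ggplot(data, aes(x = x_var, y = y_var, color = color_var, group = group_var)) + geom_point()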

To put this to use, we can run the following examples:

#Line graph with economics_long
  #x = date variable
  #y = value to be graphed
  #group = variable label since data is in long form
ggplot(economics_long, aes(x=date, y=value01, color=variable, group=variable)) + geom_line()

#example as area graphs
  #fill = variable colors the area under each series
ggplot(economics_long, aes(x=date, y=value01, fill=variable, group=variable)) + geom_area()
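
The examples above plot value01 rather than value. Judging from the output shown earlier (the smallest pce value maps to 0), value01 appears to be each metric rescaled to the 0 to 1 range so that series on very different scales can share one axis. That reading is an assumption; the sketch below shows one way such a column could be computed on made-up data, using a hypothetical helper named rescale01.

```r
#rescale a numeric vector so its minimum maps to 0 and its maximum to 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

#toy long-form data: two metrics on very different scales
toy <- data.frame(
  variable = rep(c("small", "large"), each = 3),
  value = c(1, 2, 3, 1000, 3000, 2000)
)

#apply the rescaling separately within each metric
toy$value01 <- ave(toy$value, toy$variable, FUN = rescale01)
toy
```

Rescaling within each metric keeps the shape of every series while discarding its units, so the result is suited to comparing trends rather than magnitudes.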

#reshape your own tabular data
#install reshape2
#install.packages('reshape2')

#load library
library(reshape2)
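#command to change to long form
#(template: dataframe is your data frame, time_var is the name of the identifier column to keep)
new_var <- melt(dataframe, id.vars = time_var)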
#check example dataset
  head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018
#pick a variable you want to plot
  head(economics[c(1,3)])
##         date    pop
## 1 1967-07-01 198712
## 2 1967-08-01 198911
## 3 1967-09-01 199113
## 4 1967-10-01 199311
## 5 1967-11-01 199498
## 6 1967-12-01 199657
#reshape data assigning to new dataset
  economics_melt <- melt(economics[c(1,3)],'date')

#check new set
  head(economics_melt)
##         date variable  value
## 1 1967-07-01      pop 198712
## 2 1967-08-01      pop 198911
## 3 1967-09-01      pop 199113
## 4 1967-10-01      pop 199311
## 5 1967-11-01      pop 199498
## 6 1967-12-01      pop 199657
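
The same melt() call extends to several columns at once. As a hedged sketch, the block below melts two metrics (psavert and uempmed, chosen here only because both are small numbers) and draws one line per metric. The object names econ_subset, econ_two and p_two are made up for this example.

```r
library(ggplot2)
library(reshape2)

#keep the date plus two metrics measured on similar scales
econ_subset <- as.data.frame(economics[c("date", "psavert", "uempmed")])

#melt both metrics into long form, keeping date as the identifier
econ_two <- melt(econ_subset, id.vars = "date")

#one line per metric, colored by the variable label
p_two <- ggplot(econ_two, aes(x = date, y = value, color = variable, group = variable)) +
  geom_line()
p_two
```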

datatables

The DT package, an R interface to the DataTables JavaScript library, builds interactive, searchable tables from data frames:

  • Present tabular data in a searchable paginated table
  • Useful for presenting the cleaned raw data to the client
  • Convey the shape of data
  • Useful in connecting to subject matter knowledge
  • Useful in discussing specific cases

To get started, let’s install and load the DT library.

#install DataTables
#install.packages('DT')

#load package
  library(DT)

The datatable() function allows for stylized tables, but at a minimum it just needs data.

#create interactive table
datatable(data, options = list(), 
  rownames, colnames, container, caption = NULL, 
  filter = c("none", "bottom", "top"), 
  ...)

Using the economics example, we can render a simple table as well as one that shows 50 entries per page.

#simple example
  datatable(economics)
#customize page length
  datatable(economics, options = list(pageLength = 50))
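
The other arguments in the signature above can be combined in the same call. The following is a sketch rather than a recommended configuration: it adds per-column filter boxes and a caption and hides the row names. The caption text and the object names econ_df and econ_table are made up for illustration.

```r
library(DT)
library(ggplot2)

#use a plain data frame copy of the economics dataset
econ_df <- as.data.frame(economics)

#table with column filters on top, a caption, and no row names
econ_table <- datatable(
  econ_df,
  rownames = FALSE,
  filter = "top",
  caption = "US economic time series (ggplot2 economics dataset)",
  options = list(pageLength = 25)
)
econ_table
```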

Dygraphs

The dygraphs package charts time-series data interactively:

  • R interface to the popular dygraphs JavaScript library
  • Useful for presenting time-series data in interactive form
  • Automatically graphs xts time-series objects
  • Helpful in visualizing long time series

To get started, we need to install the dygraphs and xts packages. Note that dygraphs expects time-series data as xts objects (or objects that can be converted to xts).

#install dygraphs and xts
#install.packages('dygraphs')
#install.packages('xts')

#load packages
  library(dygraphs)
  library(xts)

The syntax is fairly straightforward.

#convert dataframe to xts
xts(x = NULL,
    order.by = index(x),
    frequency = NULL,
    unique = TRUE,
    tzone = Sys.getenv("TZ"),
    ...)

#create a dygraph
dygraph(data, main = NULL, xlab = NULL, ylab = NULL,
  periodicity = NULL,
  group = NULL, width = NULL, height = NULL)

In practice, we’ll need to take the economics dataset and place it into xts form. That means making sure that the date column in the economics dataset is in fact a date (additional resource), then pairing the dates with the variable of interest in a data frame and converting it into xts format, as shown below.

#convert string date to date format
  date_me <- as.Date(economics$date, format='%Y-%m-%d')

#combine with variable of interest: PCE
  value_me <- data.frame(date_me, economics$pce)

#convert data frame to xts format
  #pass only the value column as data; the dates go to order.by
  plot_me <- xts(value_me[, -1, drop=FALSE], order.by=value_me[,1])

Once the data is in xts form, we can drop the xts object into dygraphs.

#call dygraph
  dygraph(plot_me)

If we want to add some flair, we can add a range selector:

#add some flair
  dygraph(plot_me) %>% dyRangeSelector()
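
An xts object can hold more than one series, and dygraphs draws one line per column. The block below is a minimal sketch under that assumption, using psavert (personal savings rate) and uempmed (median duration of unemployment, in weeks) from the economics dataset. The object name rates, the chart title and the series labels are made up for illustration.

```r
library(ggplot2)
library(xts)
library(dygraphs)

#build a two-column xts object indexed by the date column
rates <- xts(
  as.matrix(economics[, c("psavert", "uempmed")]),
  order.by = economics$date
)

#one line per column, with readable legend labels and a range selector
dygraph(rates, main = "Savings rate and unemployment duration") %>%
  dySeries("psavert", label = "Personal savings rate (%)") %>%
  dySeries("uempmed", label = "Median unemployment duration (weeks)") %>%
  dyRangeSelector()
```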

Part 2: 3D Visuals

Get data

In order to get the data for this tutorial, you’ll need to get an API key to use the Census API service here. Once you have it, assign the key to the variable api_key.

api_key <- "put api key here"
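
If you would rather not paste the key into a script, one option is to read it from an environment variable. This is a sketch, not part of the original workflow, and the variable name CENSUS_API_KEY is a hypothetical choice; any name works as long as it is set before R starts (for example in your .Renviron file).

```r
#read the key from an environment variable (CENSUS_API_KEY is a hypothetical name)
api_key <- Sys.getenv("CENSUS_API_KEY", unset = "")

#flag a missing key early instead of failing later during the download
if (!nzchar(api_key)) {
  message("CENSUS_API_KEY is not set; assign your key to api_key before continuing.")
}
```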

Now that you have the API key, run the following code, which is available in the Storytelling repository on GitHub. It pulls in a script that assembles the dataset for the visualizations in this section.

source("https://raw.githubusercontent.com/CommerceDataService/cda_storytelling_in_r/gh-pages/get_data.R")

What’s in the data?

Sometimes two-dimensional visuals are not enough. There is often more to the data that can be used to bring out latent patterns. Many analysts tend to think in two dimensions, as in scatter plots, but there’s more to it. The dataset that you’ve just imported has the following characteristics:

summary(data)
##     GEOID               state        geography             region     
##  Length:3142        Min.   : 1.00   Length:3142        Min.   :1.000  
##  Class :character   1st Qu.:18.00   Class :character   1st Qu.:2.000  
##  Mode  :character   Median :29.00   Mode  :character   Median :3.000  
##                     Mean   :30.28                      Mean   :2.669  
##                     3rd Qu.:45.00                      3rd Qu.:3.000  
##                     Max.   :56.00                      Max.   :4.000  
##                                                                       
##                name       pct_poverty      pct_unemp      pct_hs_grad   
##  South Region    :1422   Min.   : 0.00   Min.   : 0.00   Min.   :46.70  
##  Midwest Region  :1055   1st Qu.: 8.30   1st Qu.: 3.70   1st Qu.:80.70  
##  West Region     : 448   Median :11.50   Median : 4.90   Median :86.40  
##  Northeast Region: 217   Mean   :12.35   Mean   : 4.94   Mean   :84.98  
##  Alabama         :   0   3rd Qu.:15.30   3rd Qu.: 6.10   3rd Qu.:90.10  
##  Alaska          :   0   Max.   :45.50   Max.   :20.90   Max.   :98.70  
##  (Other)         :   0

Let’s say we were provided a nice clean set of data that contains the following:

  • county level data with county ID and region ID
  • variables: % unemployed, % in poverty, % with at least a HS degree

What can you do with that data? Well, it turns out that these quantities are related.
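
A quick numeric check of how strongly three rates move together is a correlation matrix. The sketch below is self-contained, so it simulates a small stand-in table; the simulated values are made up and say nothing about the real counties. With the dataset loaded above, the same cor() call could be applied to data[rate_cols].

```r
#simulated stand-in for the county table (made-up values, for illustration only)
set.seed(1)
toy <- data.frame(pct_unemp = runif(100, min = 0, max = 20))
toy$pct_poverty <- 2 * toy$pct_unemp + rnorm(100, sd = 5)
toy$pct_hs_grad <- 95 - toy$pct_unemp + rnorm(100, sd = 5)

#pairwise correlations between the three rates
rate_cols <- c("pct_unemp", "pct_poverty", "pct_hs_grad")
rate_cor <- cor(toy[rate_cols], use = "complete.obs")
round(rate_cor, 2)
```

A correlation matrix only captures pairwise linear association, which is one reason to look at the three variables together in a single 3D plot.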

Threejs

How did we get to this? The threejs library:

  • Builds upon the three.js visualization engine for web browsers
  • Accepts vectors, matrices and data frames to create different types of interactive visualizations

#Load in Threejs library      
      library(threejs)

We can see that there are direct relationships between unemployment, poverty and educational attainment. But there isn’t much detail and the graphs aren’t pretty.

        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad)

Let’s stylize the plots. First let’s name the axes with axisLabels, which accepts a vector of axis names. The order matters and is as follows: x-axis, z-axis, y-axis

      #Note that axisLabels should follow this order: c(x, z, y)
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,   
                     axisLabels=c("unemployment","hs degree or above","poverty rate"))    

Now let’s change the rendering engine to give more depth to the plot. We do so by setting renderer = "canvas". This tells threejs to use a different rendering backend to draw the points.

      #Depth using renderer
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       renderer="canvas")   

Now, let’s set the color of the points, resize the points, and stop flipping the y axis so that it ascends from the origin. To do so, we:

  • set col = "slategrey"
  • set flip.y = FALSE
  • set size = 0.5

      #Point size, color, don't flip y axis
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       renderer="canvas",  flip.y=FALSE, col="slategrey",
                       size=0.5)   

Ultimately, we want to find more patterns. By coloring the points by region, we can see that some regions are worse off than others. But which? It turns out there are 4 regions, coded 1 through 4 in the region column (the summary above shows the matching region labels in the name column):

    unique(data$region)  
## [1] 1 2 3 4

First, let’s set each region to a different color by creating a new variable for colors, data$colors, and then assigning a hex code to each region.

      #Set up colors by region
        data$colors <- ""
        data$colors[data$region==1] <- "#011efe"
        data$colors[data$region==2] <- "#0bff01"
        data$colors[data$region==3] <- "#fe00f6"
        data$colors[data$region==4] <- "#fdfe02"
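
The same mapping can also be written as a lookup on a named vector, which avoids one assignment per region. This is an alternative sketch, not the approach used above. The palette holds this tutorial's hex codes written as six-digit values, the region vector is a made-up stand-in for data$region, and the names region_palette and point_colors are hypothetical.

```r
#named palette: names are region codes, values are hex colors
region_palette <- c("1" = "#011efe", "2" = "#0bff01", "3" = "#fe00f6", "4" = "#fdfe02")

#made-up region codes standing in for data$region
region <- c(3, 1, 4, 2, 3)

#look up each point's color by its region code
point_colors <- unname(region_palette[as.character(region)])
point_colors
```

With the real dataset, the equivalent assignment would be data$colors <- unname(region_palette[as.character(data$region)]).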

Now, let’s set col = data$colors so that R knows which color corresponds to each of the 3,142 points.

      data <- data[order(data$region),]
      #Grouped patterns
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       col=data$colors,  flip.y=FALSE, 
                       renderer="canvas", 
                       size=0.5)   

It’s a bit annoying to look at the chart without knowing which point corresponds to which county. Let’s add labels for each point that show up upon mousing over.

      #add labels to points (region names are stored in data$name)
      scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,   
                     axisLabels=c("unemployment","hs degree or above","poverty rate"),
                     col=data$colors,
                     labels=paste0(data$name,": ",data$geography), 
                     size=0.5,
                    renderer="canvas")

In short, we can tell the following key insights from this graph.

Maps with Leaflet

Sometimes graphs don’t get the point across. Maps, while overused, can provide a better indication of geographic patterns.

Based on our 3D graphs, we could see clustering in the regions’ economic performance. We can see the mess of points more clearly on a map.

## OGR data source with driver: ESRI Shapefile 
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
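
The code that builds the county map is not shown here, so the following is only a minimal leaflet sketch, not the map described above. It plots a few hypothetical points with approximate, made-up coordinates and colors; the data frame places and its columns are assumptions introduced for illustration.

```r
library(leaflet)

#hypothetical points with approximate coordinates, for illustration only
places <- data.frame(
  label = c("Point A", "Point B", "Point C"),
  lng = c(-77.0, -87.6, -118.2),
  lat = c(38.9, 41.9, 34.1),
  color = c("#fe00f6", "#0bff01", "#fdfe02")
)

#base map tiles plus one colored, labeled marker per row
region_map <- leaflet(places) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~color, label = ~label, radius = 6)
region_map
```

A county-level map like the one described in this section would draw polygons from the shapefile instead of point markers.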