ggplot2 is a versatile and visually optimized plotting library.
To get started, we’ll need to first install and load the package.
#install ggplot2
#install.packages('ggplot2')
#load library
library(ggplot2)
ggplot2 ships with a number of datasets that illustrate how to use its functionality. The economics dataset contains data from the US Bureau of Economic Analysis, US Census Bureau, and the US Bureau of Labor Statistics. economics is in wide form (i.e. each row represents a different month and each column contains a different variable). economics_long, a variant of the economics dataset, is provided in long or stacked form (i.e. each row represents a distinct combination of date and metric, and the variables are represented by two columns: a variable label and a value). We’ll rely on the economics_long dataset for the ggplot2 tutorial. For more explanation about wide and long form, take a look at this UCLA tutorial.
#Wide form dataset
head(economics)
## date pce pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712 12.5 4.5 2944
## 2 1967-08-01 510.5 198911 12.5 4.7 2945
## 3 1967-09-01 516.3 199113 11.7 4.6 2958
## 4 1967-10-01 512.9 199311 12.5 4.9 3143
## 5 1967-11-01 518.1 199498 12.5 4.7 3066
## 6 1967-12-01 525.8 199657 12.1 4.8 3018
#Long form dataset
head(economics_long)
## date variable value value01
## 1 1967-07-01 pce 507.4 0.0000000000
## 2 1967-08-01 pce 510.5 0.0002660008
## 3 1967-09-01 pce 516.3 0.0007636797
## 4 1967-10-01 pce 512.9 0.0004719369
## 5 1967-11-01 pce 518.1 0.0009181318
## 6 1967-12-01 pce 525.8 0.0015788435
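The value01 column is not described in the text. Judging from the output above, where the smallest pce value maps to 0, it appears to be each variable's value rescaled to the 0 to 1 range, which puts series with very different units on a common axis. Below is a minimal sketch of that kind of rescaling on a made-up long-form data frame. The toy_long data and the rescale01 helper are hypothetical and only for illustration; they are not taken from ggplot2.

```r
# Hypothetical long-form data: two metrics on very different scales
toy_long <- data.frame(
  date     = rep(as.Date(c("2000-01-01", "2000-02-01", "2000-03-01")), times = 2),
  variable = rep(c("pce", "pop"), each = 3),
  value    = c(500, 510, 520, 198000, 199000, 200000)
)

# Min-max rescaling: the smallest value maps to 0, the largest to 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the rescaling separately within each variable
toy_long$value01 <- ave(toy_long$value, toy_long$variable, FUN = rescale01)

toy_long
```

Rescaling within each variable is what lets several series share one y axis in a single plot without the largest series flattening the others.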
The syntax for ggplot2 is as follows:
#create a ggplot area and add specific geoms
ggplot(data, aes(x,y,color,group)) + geom_point()
To put this to use, we can run the following examples:
#Line graph with economics_long
#x = date variable
#y = value to be graphed
#group = variable label since data is in long form
ggplot(economics_long, aes(x=date, y=value01, color=variable, group=variable)) + geom_line()
#example as area graphs (fill, rather than color, shades the area under each series)
ggplot(economics_long, aes(x=date, y=value01, fill=variable, group=variable)) + geom_area()
#reshape your own tabular data
#install reshape2
#install.packages('reshape2')
#load library
library(reshape2)
#command to change to long form (dataframe and time_var are placeholders)
new_var <- melt(dataframe, id.vars = time_var)
#check example dataset
head(economics)
## date pce pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712 12.5 4.5 2944
## 2 1967-08-01 510.5 198911 12.5 4.7 2945
## 3 1967-09-01 516.3 199113 11.7 4.6 2958
## 4 1967-10-01 512.9 199311 12.5 4.9 3143
## 5 1967-11-01 518.1 199498 12.5 4.7 3066
## 6 1967-12-01 525.8 199657 12.1 4.8 3018
#pick a variable you want to plot
head(economics[c(1,3)])
## date pop
## 1 1967-07-01 198712
## 2 1967-08-01 198911
## 3 1967-09-01 199113
## 4 1967-10-01 199311
## 5 1967-11-01 199498
## 6 1967-12-01 199657
#reshape data assigning to new dataset
economics_melt <- melt(economics[c(1,3)],'date')
#check new set
head(economics_melt)
## date variable value
## 1 1967-07-01 pop 198712
## 2 1967-08-01 pop 198911
## 3 1967-09-01 pop 199113
## 4 1967-10-01 pop 199311
## 5 1967-11-01 pop 199498
## 6 1967-12-01 pop 199657
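The example above melts a single measure column, so the long form looks much like the wide form. The reshape pays off when several columns are stacked at once. The sketch below assumes ggplot2 and reshape2 are installed; the choice of psavert and uempmed and the object names wide, econ_melt2 and p are illustrative, not part of the original walkthrough.

```r
library(ggplot2)   # provides the economics dataset
library(reshape2)  # provides melt()

# Keep the date plus two measure columns
wide <- as.data.frame(economics[, c("date", "psavert", "uempmed")])

# Stack the two measure columns; date stays as the identifier
econ_melt2 <- melt(wide, id.vars = "date")

# One row per date/variable pair
head(econ_melt2)

# One line per variable, using the same aesthetics as the earlier examples
p <- ggplot(econ_melt2, aes(x = date, y = value, color = variable, group = variable)) +
  geom_line()
p
```

Because the variable label now lives in its own column, adding a third series only means adding a column name to the selection; the plotting call stays the same.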
The DT package, an R interface to the DataTables JavaScript library, builds interactive, searchable tables from data frames:
To get started, let’s install and load the DT library.
#install DataTables
#install.packages('DT')
#load package
library(DT)
The datatable function allows for stylized tables. But at a minimum, it just needs data.
#create interactive table
datatable(data, options = list(),
rownames, colnames, container, caption = NULL,
filter = c("none", "bottom", "top"),
...)
Using the economics example, we can render a simple table as well as one that shows 50 entries per page.
#simple example
datatable(economics)
#customize page length
datatable(economics, options = list(pageLength = 50))
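The other arguments in the signature above can be combined in the same way. As a sketch, assuming DT and ggplot2 are installed, the call below adds per-column filters and a caption and hides the row names. The caption text and the page length of 10 are arbitrary choices for illustration.

```r
library(DT)
library(ggplot2)  # provides the economics dataset

tbl <- datatable(
  economics,
  rownames = FALSE,                 # hide the row-number column
  filter   = "top",                 # per-column filter boxes above the table
  caption  = "US economic time series (ggplot2 economics dataset)",
  options  = list(pageLength = 10)  # rows shown per page
)
tbl
```

The result is an htmlwidget, so it renders in the RStudio viewer or in an R Markdown HTML document.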
To get started, we need to install the dygraphs and xts packages. Note that dygraphs expects time series data as xts objects (or objects that can be converted to xts).
#install dygraphs and xts
#install.packages('dygraphs')
#install.packages('xts')
#load packages
library(dygraphs)
## Warning: package 'dygraphs' was built under R version 3.2.5
library(xts)
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
The syntax is fairly straightforward.
#convert dataframe to xts
xts(x = NULL,
order.by = index(x),
frequency = NULL,
unique = TRUE,
tzone = Sys.getenv("TZ"),
...)
#create a dygraph
dygraph(data, main = NULL, xlab = NULL, ylab = NULL,
periodicity = NULL,
group = NULL, width = NULL, height = NULL)
In practice, we’ll need to take the economics dataset and place it into xts form. That means making sure that the date value in the economics dataset is in fact a date (additional resource), then combining the date with the variable of interest in a data frame and converting it into xts format, as shown below.
#convert string date to date format
date_me <- as.Date(economics$date, format='%Y-%m-%d')
#combine with variable of interest: PCE
value_me <- data.frame(date_me, economics$pce)
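#convert data frame to xts format
#(pass only the numeric column as data; including the date column would coerce the values to character)
plot_me <- xts(value_me[, 2, drop = FALSE], order.by=value_me[, 1])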
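An xts object stores its data as a matrix with the dates kept separately as an index, so it is worth checking that the data part is numeric and the index is a Date before plotting. The sketch below uses a three-row stand-in built from the first pce values shown earlier; the names toy and toy_xts are hypothetical.

```r
library(xts)

# Small stand-in for the date and pce columns
toy <- data.frame(
  date = as.Date(c("1967-07-01", "1967-08-01", "1967-09-01")),
  pce  = c(507.4, 510.5, 516.3)
)

# Pass only the numeric column as data; the dates become the index
toy_xts <- xts(toy$pce, order.by = toy$date)
colnames(toy_xts) <- "pce"

is.numeric(coredata(toy_xts))  # the data part should be numeric, not character
class(index(toy_xts))          # the index should be of class Date
```

If the first check returns FALSE, a non-numeric column (such as the date itself) was most likely included in the data argument.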
Once the data is in xts form, then we can drop the xts object into dygraphs.
#call dygraph
dygraph(plot_me)
If we want to add some flair, we can add a range selector:
#add some flair
dygraph(plot_me) %>% dyRangeSelector()
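An xts object can hold several columns, and dygraph draws one line per column. As a sketch, assuming ggplot2, xts and dygraphs are installed, the example below plots two series from the economics dataset together. The object names rates and g and the chart title are illustrative.

```r
library(ggplot2)   # provides the economics dataset
library(xts)
library(dygraphs)

# Two numeric columns as the data, the dates as the index
rates <- xts(as.matrix(economics[, c("psavert", "uempmed")]),
             order.by = economics$date)

# One line per column, with a range selector underneath
g <- dygraph(rates, main = "Savings rate and median unemployment duration") %>%
  dyRangeSelector()
g
```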
In order to get the data for this tutorial, you’ll need to get an API key to use the Census API service here. Once you have it, assign the key to the variable api_key.
api_key <- "put api key here"
Now that you have the API key, run the following code, which is available in the Storytelling repository on GitHub. It will automatically pull in code that assembles the dataset for the visualizations in this section.
source("https://raw.githubusercontent.com/CommerceDataService/cda_storytelling_in_r/gh-pages/get_data.R")
Sometimes two-dimensional visuals are not enough. There is a lot more to the data that can be used to contextualize latent patterns. Many analysts tend to think in two dimensions, as with scatter plots, but there’s more to it. The dataset that you’ve just imported has the following characteristics:
summary(data)
## GEOID state geography region
## Length:3142 Min. : 1.00 Length:3142 Min. :1.000
## Class :character 1st Qu.:18.00 Class :character 1st Qu.:2.000
## Mode :character Median :29.00 Mode :character Median :3.000
## Mean :30.28 Mean :2.669
## 3rd Qu.:45.00 3rd Qu.:3.000
## Max. :56.00 Max. :4.000
##
## name pct_poverty pct_unemp pct_hs_grad
## South Region :1422 Min. : 0.00 Min. : 0.00 Min. :46.70
## Midwest Region :1055 1st Qu.: 8.30 1st Qu.: 3.70 1st Qu.:80.70
## West Region : 448 Median :11.50 Median : 4.90 Median :86.40
## Northeast Region: 217 Mean :12.35 Mean : 4.94 Mean :84.98
## Alabama : 0 3rd Qu.:15.30 3rd Qu.: 6.10 3rd Qu.:90.10
## Alaska : 0 Max. :45.50 Max. :20.90 Max. :98.70
## (Other) : 0
Let’s say we were provided a nice clean set of data that contains the following:
What can you do with that data? Well, it turns out that these quantities are related.
How did we get to this? The threejs library can be used to:
#Load in Threejs library
library(threejs)
We can see that there are direct relationships between unemployment, poverty and educational attainment. But there isn’t much detail and the graphs aren’t pretty.
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad)
Let’s stylize the plots. First let’s name the axes with axisLabels, which accepts a vector of axis names. The order matters and is as follows: x-axis, z-axis, y-axis
#Note that axis Labels should follow this order= c(x, z, y)
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"))
Now let’s change the rendering engine to give more depth to the plot. We do so by setting renderer = "canvas". This tells threejs to draw the points with a different renderer.
#Depth using render
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
renderer="canvas")
Now, let’s set the color of the points, resize the points, and flip the y axis so it’s ascending from the origin. To do so, we:
- set col = "slategrey"
- set flip.y = FALSE
- set size = 0.5
#Point size, color, don't flip y axis
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
renderer="canvas", flip.y=FALSE, col="slategrey",
size=0.5)
Ultimately, we want to find more patterns. By coloring the points by region, we can see that some regions are worse off than others. But which? It turns out there are 4 regions:
unique(data$region_name)
## NULL
unique(data$region)
## [1] 1 2 3 4
First, let’s set each region to a different color by creating a new variable for colors, data$colors, then assigning a hex code to each region.
#Set up colors by region
data$colors <- ""
data$colors[data$region==1] <- "#011efe"
data$colors[data$region==2] <- "#0bff01"
data$colors[data$region==3] <- "#fe00f6"
data$colors[data$region==4] <- "#fdfe02"
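The same mapping can be written as a single lookup, which scales better if regions are added. This is a sketch on a made-up vector of region codes. The names region, region_colors and colors are hypothetical, and the hex codes mirror the ones used above, with a six-digit code for region 1.

```r
# Hypothetical region codes for six counties
region <- c(1, 3, 2, 4, 3, 1)

# One hex color per region code, keyed by the code as a string
region_colors <- c("1" = "#011efe", "2" = "#0bff01",
                   "3" = "#fe00f6", "4" = "#fdfe02")

# Index the lookup by name to get one color per county
colors <- unname(region_colors[as.character(region)])
colors
```

With the real dataset, the last step would be applied to data$region and the result assigned to data$colors.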
Now, let’s set col = data$colors so that R knows which color corresponds to each of the 3,142 points.
data <- data[order(data$region),]
#Grouped patterns
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
col=data$colors, flip.y=FALSE,
renderer="canvas",
size=0.5)
It’s a bit annoying to look at the chart without knowing which point corresponds to which county. Let’s add labels for each point that show up upon mousing over.
#add labels to points
scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,
axisLabels=c("unemployment","hs degree or above","poverty rate"),
col=data$colors,
labels=paste(data$name,": ",data$geography), #data$name holds the region label (see summary above)
size=0.5,
renderer="canvas")
In short, we can draw the following key insights from this graph.
Sometimes graphs don’t get the point across. Maps, while overused, can provide a better indication of patterns.
Based on our 3-d graphs, we could see clustering of regions’ economic performance. We can see the mess of points more clearly on a map.
We can use the leaflet library to bring a geographic spin to the data:
To initiate a map, we only need to open the leaflet library, then run the following:
library(leaflet)
leaflet()
You’ll see that the map is blank with a zoom control panel on the upper left. That’s because the map doesn’t have data in it. There are dozens of free layers we can use:
leaflet() %>%
addProviderTiles("Stamen.Toner")
leaflet() %>%
addProviderTiles("CartoDB.Positron")
Now let’s center and zoom in on the contiguous US at coordinates lon = -98.3 and lat = 39.5
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4)
We now need to download the shapefiles, which are a popular geospatial vector data format for geographic information system (GIS) software. Shapefiles allow for rendering of various types of data, including points (e.g. coordinates), polygons (e.g. county boundaries), and lines (e.g. streets, creeks). We’re going to use the US County Shapefile from the US Census: http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip. To download it and load it into R, we’ll need to first install the rgdal library. We’ve also written a function shape_direct that can be run to import the shapefile and assign it to an object shp.
shape_direct <- function(url, shp) {
library(rgdal)
temp = tempfile()
download.file(url, temp) ##download the URL target to the temp file
unzip(temp,exdir=getwd()) ##unzip that file
return(readOGR(paste(shp,".shp",sep=""),shp))
}
shp <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip",
shp= "cb_2014_us_county_20m")
## OGR data source with driver: ESRI Shapefile
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
## Warning in readOGR(paste(shp, ".shp", sep = ""), shp): Z-dimension
## discarded
With the new shapefile now imported, we can now set data = shp.
leaflet(data=shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = "blue",
fillOpacity = 0.8,
color = "white",
weight = 0.5)
In order to draw insight from a map, we’ll need to color code county polygons. This is known as a choropleth map – each county is color coded with respect to a certain value of a given variable like %poverty. The shapefile on its own doesn’t have the socioeconomic data and we’ll need to join the data to the shapefile. Let’s just quickly check the data formats of the primary key GEOID.
str(shp@data$GEOID)
## Factor w/ 3220 levels "01001","01003",..: 560 741 2222 2660 2471 2200 2392 2707 770 853 ...
str(data$GEOID)
## chr [1:3142] "09001" "09003" "09005" "09007" "09009" ...
Since the shp primary key is in a factor format and the data primary key is in string or character format, we’ll need to conform the formats, preferably to strings. Then, we can merge the two datasets.
shp@data$GEOID <- as.character(shp@data$GEOID)
shp <- merge(shp,data,by="GEOID")
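Two things are worth checking around a join like this: that the factor key is converted with as.character (converting a factor to a number returns its internal level codes, not the labels), and which keys have no match on the other side. The shapefile has 3220 features while the data has 3142 rows, so some polygons will not find a matching row. The sketch below uses made-up keys; shp_ids, data_ids and unmatched are hypothetical names.

```r
# Hypothetical keys: one stored as a factor, one as character
shp_ids  <- factor(c("09001", "01001", "72001"))
data_ids <- c("01001", "09001")

# as.character() keeps the labels, including leading zeros
shp_chr <- as.character(shp_ids)

# as.integer() on a factor returns internal level codes, not the labels
as.integer(shp_ids)

# Keys present in the shapefile but missing from the data
unmatched <- setdiff(shp_chr, data_ids)
unmatched
```

On the real objects, the equivalent check would compare shp@data$GEOID against data$GEOID before merging.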
With the merged datasets, we’ll now need to specify a color scheme. Using colorQuantile, we can create a function that will slice any continuous variable into bins and assign colors to each bin. The syntax is as follows:
pal <- colorQuantile(<Color Code>, <variable>, n = <number of bins>)
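For our example, we’ll use a Yellow-Green palette, leave the variable as NULL so that it can be supplied when the palette is applied, and set the number of bins to 30.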
palette <- colorQuantile("YlGn", NULL, n = 30)
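To make the idea of quantile binning concrete, here is a base R sketch of roughly what such a palette function does: compute bin edges at evenly spaced quantiles, assign each value to a bin, and look up one color per bin. This is an approximation for illustration, not leaflet's implementation. The data, the choice of 4 bins, and the two endpoint colors are assumptions.

```r
# Hypothetical values for eight counties
x <- c(2, 4, 6, 8, 10, 12, 14, 16)
n_bins <- 4

# Bin edges at evenly spaced quantiles
edges <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))

# Assign each value to a bin (1 = lowest quartile)
bins <- cut(x, breaks = edges, include.lowest = TRUE, labels = FALSE)

# One color per bin, interpolated from light yellow to dark green
ramp <- colorRampPalette(c("#FFFFCC", "#006837"))(n_bins)
cols <- ramp[bins]

data.frame(x, bins, cols)
```

Because the edges come from quantiles, each bin holds about the same number of counties, so a skewed variable still uses the full color range.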
Next we will add popup text for when a user clicks on a county in a map.
county_popup <- paste0("<strong>County: </strong>",
shp@data$geography,
"<br><strong>Poverty Rate (%): </strong>",
shp@data$pct_poverty)
Now we’ll pull it all together.
leaflet(data = shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = ~palette(pct_poverty),
fillOpacity = 0.8,
color = "#BDBDC3",
weight = 0.1,
popup = county_popup)
We can draw the same map for pct_unemp by swapping out pct_poverty:
county_popup <- paste0("<strong>County: </strong>",
shp@data$geography,
"<br><strong>Unemp (%): </strong>",
shp@data$pct_unemp)
leaflet(data = shp) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
addPolygons(fillColor = ~palette(pct_unemp),
fillOpacity = 0.8,
color = "#BDBDC3",
weight = 0.1,
popup = county_popup)