R
is a popular data analysis and visualization environment, with a large number of user contributed packages, and the ability for interested individuals to easily contribute their own packages.
Cytoscape
is a popular network visualization software.
Python
is an alternative popular data analysis environment.
All of the code and data used is available on github.
There are many tools people can use to interact with their data, with Microsoft Excel (and other spreadsheet packages) being very popular. However, the number of different ways that data can be easily plotted or manipulated is very limiting in spreadsheet packages. Other commercial visualization packages such as Tableau and Spotfire are more powerful than spreadsheet software, and enable greater control, but reproducing a plot later can be difficult if one does not record the exact set of options and tick boxes selected in producing a visualization.
I believe that scripting languages such as R
and Python
that allow rather easy data manipulation and visualization provide the best of both worlds. For the Python
section of this, please go here
R
is useful because of the various statistical utilities available. R
has a set of base graphics utilities that allow one to do various types of plots. As an example, we can create a scatterplot of two variables:
var1 <- rnorm(100)
var2 <- rnorm(100, 1)
plot(var1, var2)
Or do a histogram of one of the variables:
hist(var1)
hist(var2)
But doing much else in base graphics can get rather tedious and difficult, especially when you want multiple plots.
As a simple example, lets imagine that we have two replicates of each variable that we want to examine.
varData <- data.frame(var1 = c(rnorm(100), rnorm(100)), var2 = c(rnorm(100,
1), rnorm(100, 1)), sample = rep(c(1, 2), each = 100))
Using base graphics, we can easily plot the two variables against each other:
plot(varData$var1, varData$var2)
But if we want to color each sample differently on the same graphic (a common enough thing), things quickly get more complicated:
plot(varData$var1[varData$sample == 1], varData$var2[varData$sample == 1], col = "blue")
points(varData$var1[varData$sample == 2], varData$var2[varData$sample == 2],
col = "red")
We are no longer using the base plot
command, but a sub-part of it. And plumbing it's depths quickly becomes tedious.
Enter ggplot2
. It is an R
implementation of the grammar of graphics. To better understand it's design and how it treats entries and variables, I recommend reading Hadley Wickham's Tidy Data manuscript.
Using ggplot2
, our previous graph becomes rather trivial.
library(ggplot2)
varData$sample <- factor(varData$sample)
ggplot(varData, aes(x = var1, y = var2, colour = sample)) + geom_point()
We can also easily create multiple sub-plots that are broken out the by sample variable:
ggplot(varData, aes(x = var1, y = var2)) + geom_point() + facet_grid(. ~ sample)
In it's current form, we can easily plot summary histograms of each variable:
ggplot(varData, aes(x = var1, fill = sample)) + geom_bar(position = "identity",
alpha = 0.5)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(varData, aes(x = var1, fill = sample)) + geom_density(alpha = 0.5)
ggplot(varData, aes(x = var1)) + geom_bar() + facet_grid(sample ~ .)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
If we want to do a bigger overview of summary statistics, we can just reshape our data a little bit:asdfasdf
library(reshape2)
varData2 <- melt(varData, measure.vars = c("var1", "var2"))
ggplot(varData2, aes(x = value, fill = sample)) + geom_bar(position = "identity",
alpha = 0.5) + facet_grid(. ~ variable)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(varData2, aes(x = value)) + geom_bar() + facet_grid(sample ~ variable)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The other major strength of ggplot
is the ability to layer different graphical on top of each other, as well as customizing plots.
Layers are the individual plot elements that are added to a given plot. For example, we can do:
p <- ggplot(varData, aes(x = var1, y = var2))
p + geom_point()
p + stat_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
p + geom_point() + stat_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
We can customize just about any aspect of a plot that we want.
From the color of the points plotted:
p + geom_point(color = "red")
To the axis labels:
p + geom_point() + labs(x = "Important Data 1", y = "Dependent Variable 1")
To the fonts, background, etc, or the elements of a particular ggplot
theme:
pPoint <- p + geom_point()
pPoint + theme(axis.text = element_text(colour = "red"))
pPoint + theme(axis.text = element_text(size = 14))
pPoint + theme(axis.title = element_text(color = "blue", size = 20))
There are a couple of built-in themes such as theme_bw
and theme_classic
alongside the default, and there is also a collection of alternative themes in the ggthemes
package.
Metabolomics and network data are courtesy of Dmitry Grapov's Creative Data Solutions metabolomic tutorial.
metabData <- read.table("pumpkin_tomatillo_data.csv", sep = ",", header = T,
stringsAsFactors = FALSE)
colnames(metabData)[4:151] <- seq(1, 148)
metab2 <- melt(metabData, id.vars = seq(1, 3), measure.vars = seq(4, 151))
varNameData <- read.table("metabID.csv", sep = ",", header = T, stringsAsFactors = FALSE)
metab2$metabName <- rep(varNameData$name, each = 24)
metabData <- metab2
newTreat <- rep("fresh", nrow(metabData))
newTreat[grep("lyophilized", metabData$Treatment)] <- "lyoph"
metabData$treat <- newTreat
head(metabData)
## ID Species Treatment variable value
## 1 P07 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 28
## 2 P08 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 70
## 3 P09 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 43
## 4 P10 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 139
## 5 P11 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 42
## 6 P12 pumpkin MeOH:CHCl3:H2O (5:2:2) - fresh frozen 1 112
## metabName treat
## 1 zymosterol fresh
## 2 zymosterol fresh
## 3 zymosterol fresh
## 4 zymosterol fresh
## 5 zymosterol fresh
## 6 zymosterol fresh
write.table(metabData, file = "metabolomics_reshapedData.csv", sep = ",", row.names = FALSE,
col.names = TRUE)
Let's summarize the values for each type of sample. We will stop at a value of 5000 because
ggplot(metabData, aes(x = value, fill = treat)) + geom_bar(alpha = 0.8, position = "identity") +
facet_grid(Species ~ .) + xlim(0, 5000)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(metabData, aes(x = value, fill = Species)) + geom_bar(alpha = 0.8, position = "identity") +
facet_grid(treat ~ .) + xlim(0, 5000)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
As an alternative to histograms, ggplot
provides smoothed density estimates that can nicely capture the shape of a distribution.
ggplot(metabData, aes(x = value, fill = treat)) + geom_density(alpha = 0.8) +
facet_grid(Species ~ ., scales = "free_y") + xlim(0, 5000)
## Warning: Removed 259 rows containing non-finite values (stat_density).
## Warning: Removed 286 rows containing non-finite values (stat_density).
## Warning: Removed 356 rows containing non-finite values (stat_density).
## Warning: Removed 362 rows containing non-finite values (stat_density).
ggplot(metabData, aes(x = value, fill = Species)) + geom_density(alpha = 0.8) +
facet_grid(treat ~ ., scales = "free_y") + xlim(0, 5000)
## Warning: Removed 259 rows containing non-finite values (stat_density).
## Warning: Removed 356 rows containing non-finite values (stat_density).
## Warning: Removed 286 rows containing non-finite values (stat_density).
## Warning: Removed 362 rows containing non-finite values (stat_density).
We can also do the boxplot for all the metabolites, with a separate box for each species.
ggplot(metabData, aes(x = metabName, y = value, fill = Species)) + geom_boxplot()
This is not that easily to decipher, but it was quite easy to do.
Based on the analysis presented in the tutorial, we can pick out some of the top changers:
topChangers <- c(34, 62, 8, 17, 124, 100, 115, 9, 93, 123, 75, 71, 85, 33)
metabTop <- metabData[(metabData$variable %in% topChangers), ]
ggplot(metabTop, aes(x = metabName, y = value, fill = Species)) + geom_boxplot()
ggplot(metabTop, aes(x = metabName, y = value, fill = Species)) + geom_boxplot() +
scale_y_log10()
Finally, what is also quite nice about using ggplot2
is the ability to add incrementally to a graph. In our previous example, we could have achieved the same result piecemeal like this:
g <- ggplot(metabTop, aes(x = metabName, y = value, fill = Species)) + geom_boxplot()
print(g)
g + scale_y_log10()
Networks are very useful things in many fields, due to the fact that we can easily come up with ways to define relationships between objects. These objects can be anything, but in the context of metabolomics they are often genes
, proteins
, and metabolites
. The relationships between metabolites can be defined in a couple of different ways. This example we will define edges between metabolites by links in the KEGG
database, and by their chemical similarity (tanimoto coefficient). All data was generated by metamapR
.
The network graph will be constructed in R
, and visualized in Cytoscape
. This gives us the flexibility to manipulate the graph easily in R
and save groups of relevant nodes, with the visual richness and interactive flexibility of Cytoscape
.
This connection of R
with Cytoscape
uses the RCytoscape
Bioconductor
package, and requires Cytoscape
v2.8, and the CytoscapeRPC
plugin. There is currently no way to do this with Cytoscape
v3.0, however given that a REST
api is being introduced to Cytoscape
, it should be possible in the future (if we can persuade Paul Shannon to write a new R
package).
library(graph)
nodeEdges <- read.table("biochemEdgeList.csv", header = T, sep = ",", stringsAsFactors = FALSE)
# need character for node IDs
nodeEdges[, 1] <- as.character(nodeEdges[, 1])
nodeEdges[, 2] <- as.character(nodeEdges[, 2])
allNodes <- unique(c(nodeEdges$source, nodeEdges$target))
metabNetwork <- graphNEL(nodes = allNodes, edgemode = "directed")
metabNetwork <- addEdge(nodeEdges$source, nodeEdges$target, metabNetwork, nodeEdges$weight)
We also need the node and edge attributes, and to decide how they will be represented in the graph so that the types can be properly passed to Cytoscape.
nodeData <- read.table("node_attributes.csv", header = T, sep = ",", stringsAsFactors = FALSE)
nodeData$Index <- as.character(nodeData$Index)
nodeDataDefaults(metabNetwork, "name") <- ""
attr(nodeDataDefaults(metabNetwork, "name"), "class") <- "STRING"
nodeDataDefaults(metabNetwork, "direction") <- ""
attr(nodeDataDefaults(metabNetwork, "direction"), "class") <- "STRING"
nodeDataDefaults(metabNetwork, "sig.dir") <- ""
attr(nodeDataDefaults(metabNetwork, "sig.dir"), "class") <- "STRING"
nodeDataDefaults(metabNetwork, "size.FC") <- 0
attr(nodeDataDefaults(metabNetwork, "size.FC"), "class") <- "DOUBLE"
nodeDataDefaults(metabNetwork, "size.log.FC") <- 0
attr(nodeDataDefaults(metabNetwork, "size.log.FC"), "class") <- "DOUBLE"
nodeDataDefaults(metabNetwork, "size.LV") <- 0
attr(nodeDataDefaults(metabNetwork, "size.LV"), "class") <- "DOUBLE"
nodeDataDefaults(metabNetwork, "log.mean.Pumpkin") <- 0
attr(nodeDataDefaults(metabNetwork, "log.mean.Pumpkin"), "class") <- "DOUBLE"
nodeDataDefaults(metabNetwork, "log.mean.Tomatillo") <- 0
attr(nodeDataDefaults(metabNetwork, "log.mean.Tomatillo"), "class") <- "DOUBLE"
edgeDataDefaults(metabNetwork, "weight") <- 0
attr(edgeDataDefaults(metabNetwork, "weight"), "class") <- "DOUBLE"
edgeDataDefaults(metabNetwork, "source") <- ""
attr(edgeDataDefaults(metabNetwork, "source"), "class") <- "STRING"
edgeData(metabNetwork, nodeEdges[, 1], nodeEdges[, 2], "source") <- nodeEdges$type
useNodes <- nodeData$Index %in% nodes(metabNetwork)
for (useAttr in c("name", "direction", "sig.dir", "size.FC", "size.log.FC",
"size.LV", "log.mean.Pumpkin", "log.mean.Tomatillo")) {
nodeData(metabNetwork, nodeData$Index[useNodes], useAttr) <- nodeData[useNodes,
useAttr]
}
Now that we've added all the nitty gritty details to our network, we can pass it to Cytoscape and start doing some visualization.
library(RCytoscape)
## Loading required package: XMLRPC
## Note: the specification for S3 class "AsIs" in package 'XMLRPC' seems equivalent to one from package 'BiocGenerics': not turning on duplicate class definitions for this class.
cw <- new.CytoscapeWindow("metabolite similarity", graph = metabNetwork)
displayGraph(cw)
## [1] "name"
## [1] "direction"
## [1] "sig.dir"
## [1] "size.FC"
## [1] "size.log.FC"
## [1] "size.LV"
## [1] "log.mean.Pumpkin"
## [1] "log.mean.Tomatillo"
## [1] "label"
## [1] "weight"
## [1] "source"
Not much to look at yet, we have to tell Cytoscape
to layout the network.
setLayoutProperties(cw, "force-directed", list(edge_attribute = "weight"))
layoutNetwork(cw, "force-directed")
As it stands, we can't get a whole lot of information yet from this network. But lets start adding some visual properties to it and see what happens.
Lets change the edge visuals based on the source. Maybe black for KEGG (biochemical) and gray for Tanimoto (similarity)?
edgeColors <- c("#000000", "#999999")
setEdgeColorRule(cw, "source", c("KEGG", "Tanimoto"), edgeColors, mode = "lookup")
redraw(cw)
So now we can see that most of the edges are from the similarity measure.
We have metabolites that have changes either in the up or down direction. It would be nice to at a glance see which is which.
nodeColors <- c("#99FF00", "#FF9900")
setNodeColorRule(cw, "direction", c("increase", "decrease"), nodeColors, mode = "lookup")
redraw(cw)
Currently, we have to select a node to see it's metabolite name. We can use something else as the display label.
setNodeLabelRule(cw, "name")
## [1] TRUE
redraw(cw)
We can select nodes based on particular properties in our original data. Such as those nodes that had large fold-changes, and their immediate neighbors:
hiFCNodes <- nodeData$Index[abs(nodeData$size.log.FC) >= 1.2]
selectNodes(cw, hiFCNodes)
selectFirstNeighborsOfSelectedNodes(cw)
hiFCNodesNeighbors <- getSelectedNodes(cw)
hiFCNodesNeighborsData <- nodeData[nodeData$Index %in% hiFCNodesNeighbors, ]
We can also use other R
packages like igraph
to analyze our network, and select particular groups of nodes based on the computed properties. In this example we will analyze the network for communities of nodes:
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:graph':
##
## degree, edges
tmpNetwork <- metabNetwork
edgeDataDefaults(tmpNetwork, "source") <- NULL
edgeDataDefaults(tmpNetwork, "weight") <- NULL
nodeData(tmpNetwork, nodes(tmpNetwork), "name") <- nodes(tmpNetwork)
igNetwork <- igraph.from.graphNEL(tmpNetwork, name = TRUE)
igNetwork <- as.undirected(igNetwork)
eb <- edge.betweenness.community(igNetwork)
eb
## Graph community structure calculated with the edge betweenness algorithm
## Number of communities (best split): 14
## Modularity (best split): 0.655
## Membership vector:
## 6 11 8 30 18 25 13 29 48 94 96 9 97 98 56 95 33 108
## 1 2 3 4 4 5 6 6 6 6 4 4 4 7 7 6 4 7
## 88 116 117 34 22 46 131 134 14 49 115 2 3 1 16 15 35 32
## 6 6 6 7 8 6 6 6 7 9 7 7 7 8 7 10 11 10
## 23 45 47 51 31 59 60 63 61 66 50 71 64 72 69 77 57 70
## 12 12 12 6 7 4 4 4 7 7 12 12 4 12 7 7 6 6
## 78 79 73 83 90 92 99 100 101 102 85 103 107 38 44 89 109 113
## 7 6 4 7 4 7 7 7 7 7 6 7 7 10 10 10 7 7
## 7 110 76 122 104 75 124 123 125 105 42 127 132 114 130 137 19 121
## 1 1 8 1 4 12 1 8 4 6 1 1 1 10 4 10 13 7
## 133 65 126 146 142 12 37 43 118 129 135 136 139 140 143 40 52 87
## 1 14 12 12 10 2 5 1 6 8 6 9 7 6 1 11 6 9
## 120 138 144 145 147 148
## 6 4 14 7 12 4
What does this get us? A set of communities, with each node assigned to each community. We can examine them in Cytoscape
by doing:
allComms <- split(eb$names, eb$membership)
allComms <- allComms[order(sapply(allComms, length), decreasing = T)]
selectNodes(cw, allComms[[1]], preserve.current.selection = FALSE)
selectNodes(cw, allComms[[2]], preserve.current.selection = FALSE)
selectNodes(cw, allComms[[3]], preserve.current.selection = FALSE)
selectNodes(cw, allComms[[4]], preserve.current.selection = FALSE)
deleteWindow(cw)
## [1] TRUE
We often times want to save our visualization for later use in publications and reports. One important consideration is what format. PDF is frequently useful for hi-resolution lineart, but so is SVG (scalable vector graphics). If you have lots of points, SVG and PDF are not a good idea due to encoding each individual point. For those types of plots, you want to use bitmaps, such as PNG (preferred), TIFF (next up), or (shudder) JPG.
To save just about any plot on any computer system that runs R
, one can save graphics using the Cairo
package relatively painlessly:
library(Cairo)
CairoPNG(file = "testPNG.png")
plot(var1, var2)
dev.off()
To save our network visualizations generated by Cytoscape
, you should use the saveImage
function. Again, unless there are a lot of nodes, you should probably use SVG or PDF, otherwise use PNG. Note that saveImage
needs a full path on Linux.
saveImage(cw, file.name = file.path(getwd(), "testNetwork.svg"), image.type = "svg")
sessionInfo()
## R version 3.0.3 (2014-03-06)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Cairo_1.5-5 igraph_0.7.1 RCytoscape_1.12.0 XMLRPC_0.3-0
## [5] graph_1.40.1 reshape2_1.4 ggplot2_0.9.3.1 knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] BiocGenerics_0.8.0 colorspace_1.2-4 digest_0.6.4
## [4] evaluate_0.5.3 formatR_0.10 grid_3.0.3
## [7] gtable_0.1.2 labeling_0.2 MASS_7.3-32
## [10] munsell_0.4.2 parallel_3.0.3 plyr_1.8.1
## [13] proto_0.3-10 Rcpp_0.11.1 RCurl_1.95-4.1
## [16] scales_0.2.4 stats4_3.0.3 stringr_0.6.2
## [19] tools_3.0.3 XML_3.98-1.1
purl("wg_visualization_ggplot_cytoscape_matplotlib.Rmd", quiet = TRUE)
## [1] "wg_visualization_ggplot_cytoscape_matplotlib.R"