Robert M Flight's home on the web
I can finally say that the publication on my
categoryCompare is finally published in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This has been a long time coming, and I wanted to give some background on the inspiration and development of the method and software.
The software package has been in development in one form or another since 2010, released to
Bioconductor in summer 2012, and the publication has bounced around and been revised since spring of 2013, and it is finally available to you. All of the supplementary data and methods are available as an
R package on github. Version control using
git was instrumental in getting this work out in a timely manner. There is still a bunch of work to do on the package.
If I did it again, I would:
Rmarkdown, as a vignette in a package
In spring of 2010 I started as a PostDoc with Eric Rouchka. One of his collaborators, Jeff Petruska is interested in the process of collateral sprouting of neurons, especially as it compares to regeneration. Early in my PostDoc, Jeff wanted to do a gene-level comparison of his microarray data of collateral sprouting in skin compared to previously published studies with muscle.
Combing through the literature produced a number of genes differentially expressed in denervated muscle. However, when comparing with the genes resulting from the skin data, there was almost nothing in common and nothing that made sense from a functional standpoint. Skimming around the
Bioconductor literature in
GOStats, I was struck by Robert Gentleman's example of coloring Gene Ontology nodes by which data set they originated from. This is a very simple meta-analysis and visualization. When I tried it with the skin - muscle comparison, I got some very interesting (i.e. the Petruska group thought the results were very interpretable) results.
Note: At this point, because of the data sources (gene lists from publications with little to no original data), I was using the hypergeometric enrichment test in
Category to determine significant GO terms from the two tissues.
I started developing this idea into a simple package (i.e. collection of
R scripts), that was able to do at least GO term enrichment, and that could be hosted on our group webserver to enable others to make use of it. Visualization and interrogation used the
imageMaps function in Robert Gentleman's original demonstration, however any number of data sets could now be compared.
The basic method is to take gene (or really any annotated feature) lists from multiple experiments, and perform annotation (Gene Ontology Terms, KEGG Pathways, etc) enrichment (either hypergeometric or GSEA type) on each gene list, determine significant annotations from each list, and then examine which annotations come from which list.
Because this results in a lot of data to parse through, exploration of the results is facilitated by considering the annotations as a network of annotations related by the number of shared genes between them, and interacting with the networks in Cytoscape.
As I developed this idea, I also started looking into other possible software implementations.
Shen & Tseng (Shen & Tseng, 2010) had recently published similar work in MAPE-I, but at my level of understanding it required the same identifiers. In retrospect, I should have read the paper better. However, they also did not have a released implementation with their publication.
ConceptGen (Sartor et al., 2010) is another interesting application, that allows very similar analyses to
categoryCompare, except that it is not possible to explore how the concepts map between multiple user supplied data sets. The only way one can relate multiple data sets is if the data sets map as a concept to another, but one cannot visualize the interrelated concepts between user supplied data sets.
Enrichment Map (Merico et al., 2010) was another alternative, but did all of comparison mathematics in Cytoscape itself, based on enrichments calculated outside of Cytoscape. The publication does give an example of a similar type of analysis as performed by
categoryCompare. However, I wanted everything, from enrichment calculation to visualization controlled by
R. I did use their method of weighting the edges between annotations, however, ditching the GO directed acyclic graph (DAG) view I used initially.
About this time I realized that I wanted the package to fit into the
Bioconductor ecosystem. This required a complete redesign and rewrite of the code, as I moved to actually used some sort of OOP model (S4 in this case), and creating an actual package.
The package was released to the wild in the fall of 2012, as part of
Bioconductor v 2.10. Of course, this was the last
Bioconductor release based on R 2.15, which brought some particular challenges in
namespaces with the switch in R 3.0.0 the next year.
The original code used the
graphviz package to do layouts for visualization, but it was difficult to install, and did not work the way I wanted. Thankfully Paul Shannon had just developed the
RCytoscape package the year previous (Shannon et al., 2013), and this enabled truly interactive visualization and passing data back and forth between
My use of
RCytoscape has actually lead to finding improvements in the package, and better use of it. I am also hoping to make much more use of
RCytoscape and the
igraph package and combining them in novel ways.
Our first attempt at publication focussed on the original data that inspired the method, and comparing it with gene-gene comparisons directly. We submitted to BMC Bioinformatics, and the reviews were not favorable, and took forever to get back. We actually wondered if the reviewers had read the paper that we submitted. We gave up on this venue. BTW, our last three publications have been submitted there, and the publication process has gotten worse and worse with every submission there. I don't think our papers are getting worse over the years. I don't know if we will bother submitting to BMC Bioinformatics again.
The second attempt was submitting to the current home in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This time, although the reviews were harsh (i.e. they were not immediately favorable), they were fair, and actually contained useful critique, and pointed to a way forward.
Unfortunately for me, revising the manuscript to address the reviewers criticisms meant a lot of work to construct theoretical examples, as well as a lot of thought in order to pare the manuscript down to make sure our primary message was clearly communicated.
I would like to note that the Frontiers interactive review system, with the ability to discuss individual points with the reviewers (still anonymously) really helped make it possible to determine which points were make or break, and discuss different ways to approach things. This was the best review experience I have ever had, and a large part of that I think was due to being able to interact with the reviewers directly, and not just through letters mediated by the editor.
I think it would be nice if Frontiers had an option for making the review history available if the authors and reviewers were agreeable to it.
In the initial publication, I had included the set of scripts used for analysis, data files, results, etc. However, the amount of work required in the rewrite was so substantial, that I created an
R package specifically for the analyses that went into the paper, with separate
R markdown vignettes for each result type (hypothetical, two different experimental comparisons). This package has documents on how raw data was processed, as well as the semi-processed experimental data.
Due to specific changes to the
categoryCompare software needed to address the reviewer comments, a publication specific branch of the development version was created (paper). This allowed me to quickly introduce code and features that the reviewers had asked for, without worrying about breaking the current development version that will be released in Bioconductor shortly.
As with most software projects, there is still plenty to be done.
Incorporate new functionality from
paper branch into
GSEAbuilt in (probably using
roastfunctions), and new visualization options
latex vignette to
R markdown (this is technically done, but hasn't made it into the dev branch)
roxygen2 documentation (will be interesting due to use of S4 objects and methods)
Implement proper testing using
Refactor a lot of code to improve speed, use
Rnewbie when I wrote the package, and as I looked at the code while making changes for the reviewers, I noticed some places where I had done some silly things, mainly because I didn't know better when I did it.
Consider splitting visualization into its own package
categoryCompare, I think it is generic enough that others might benefit from having it available separately. Therefore I need to think about separating it out, and how best to go about it without making it hard for
categoryCompareto keep working as it already does.
Make it easy to investigate the actual genes with particularly interesting annotations, and how they are linked together by annotations or other data sources, as well as their original expression levels in the experiments.
A nice side effect of the package is that any annotation enrichment I do now, I almost always do it in the framework of
categoryCompare, just because it is a lot easier to make sense of using the visualization capabilities and coupling with
Keeping in mind that I started this work almost 4 years ago, when I still didn't know any
R, and had yet to be exposed to the reproducibility and open science movements, or
pandoc, here are some things that if I started today I would do differently (you can probably guess some of these from above):
Put the package under version control immediately! Thankfully I didn't have any moments early in the process when things were not in a
git repo, but I am very thankful later on I was able to do
branch on my code to figure out where things broke and introduce new features.
Start thinking of the analysis as a standalone package from the very beginning, instead of a directory of data and scripts. This is what I do now (blogpost to come), and it makes it much easier if I come up with novel methods to spin them off into a fully fledged
Don't underestimate the novelty of something as simple as visualization, and how much it may make or break your method. We ended up adding a good chunk of text to the manuscript on the visualization because we realized how important it was, but only after the reviewers pointed it out to us.
Write the paper as a vignette of the ccPaper package itself, and generate Word documents for collaborators who insist on Word docs using Pandoc
Start a github repo for the paper, and ask collaborators to try and work on it there
Submit a preprint when submitted, so that we start getting feedback on the manuscript early
Ask for permission to set up reviewer comments as issues on the github repo to easily track how well we are addressing them.