categoryCompare Paper Finally Out!

2014-04-08 10 min read

I can finally say that the publication on my Bioconductor package categoryCompare is finally published in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This has been a long time coming, and I wanted to give some background on the inspiration and development of the method and software.

TL;DR

The software package has been in development in one form or another since 2010, released to Bioconductor in summer 2012, and the publication has bounced around and been revised since spring of 2013, and it is finally available to you. All of the supplementary data and methods are available as an R package on github. Version control using git was instrumental in getting this work out in a timely manner. There is still a bunch of work to do on the package.

If I did it again, I would: * Write the manuscript using R markdown, as a vignette in a package * Ask to be able to make reviewer points issues on Github * Submit a preprint with submission

Inspiration

In spring of 2010 I started as a PostDoc with Eric Rouchka. One of his collaborators, Jeff Petruska is interested in the process of collateral sprouting of neurons, especially as it compares to regeneration. Early in my PostDoc, Jeff wanted to do a gene-level comparison of his microarray data of collateral sprouting in skin compared to previously published studies with muscle.

Combing through the literature produced a number of genes differentially expressed in denervated muscle. However, when comparing with the genes resulting from the skin data, there was almost nothing in common and nothing that made sense from a functional standpoint. Skimming around the Bioconductor literature in GOStats, I was struck by Robert Gentleman’s example of coloring Gene Ontology nodes by which data set they originated from. This is a very simple meta-analysis and visualization. When I tried it with the skin - muscle comparison, I got some very interesting (i.e. the Petruska group thought the results were very interpretable) results.

Note: At this point, because of the data sources (gene lists from publications with little to no original data), I was using the hypergeometric enrichment test in Category to determine significant GO terms from the two tissues.

V 0.0000001

I started developing this idea into a simple package (i.e. collection of R scripts), that was able to do at least GO term enrichment, and that could be hosted on our group webserver to enable others to make use of it. Visualization and interrogation used the imageMaps function in Robert Gentleman’s original demonstration, however any number of data sets could now be compared.

categoryCompare Method – Summary

The basic method is to take gene (or really any annotated feature) lists from multiple experiments, and perform annotation (Gene Ontology Terms, KEGG Pathways, etc) enrichment (either hypergeometric or GSEA type) on each gene list, determine significant annotations from each list, and then examine which annotations come from which list.

Because this results in a lot of data to parse through, exploration of the results is facilitated by considering the annotations as a network of annotations related by the number of shared genes between them, and interacting with the networks in Cytoscape.

If you want to know more, check out the paper, or the vignette in the Bioconductor package

Others

As I developed this idea, I also started looking into other possible software implementations.

Shen & Tseng (Shen & Tseng, 2010) had recently published similar work in MAPE-I, but at my level of understanding it required the same identifiers. In retrospect, I should have read the paper better. However, they also did not have a released implementation with their publication.
ConceptGen (Sartor et al., 2010) is another interesting application, that allows very similar analyses to categoryCompare, except that it is not possible to explore how the concepts map between multiple user supplied data sets. The only way one can relate multiple data sets is if the data sets map as a concept to another, but one cannot visualize the interrelated concepts between user supplied data sets.
Enrichment Map (Merico et al., 2010) was another alternative, but did all of comparison mathematics in Cytoscape itself, based on enrichments calculated outside of Cytoscape. The publication does give an example of a similar type of analysis as performed by categoryCompare. However, I wanted everything, from enrichment calculation to visualization controlled by R. I did use their method of weighting the edges between annotations, however, ditching the GO directed acyclic graph (DAG) view I used initially.

Bioconductor Package

About this time I realized that I wanted the package to fit into the Bioconductor ecosystem. This required a complete redesign and rewrite of the code, as I moved to actually used some sort of OOP model (S4 in this case), and creating an actual package.

The package was released to the wild in the fall of 2012, as part of Bioconductor v 2.10. Of course, this was the last Bioconductor release based on R 2.15, which brought some particular challenges in namespaces with the switch in R 3.0.0 the next year.

Graphviz to RCytoscape

The original code used the graphviz package to do layouts for visualization, but it was difficult to install, and did not work the way I wanted. Thankfully Paul Shannon had just developed the RCytoscape package the year previous (Shannon et al., 2013), and this enabled truly interactive visualization and passing data back and forth between R and Cytoscape.

My use of RCytoscape has actually lead to finding improvements in the package, and better use of it. I am also hoping to make much more use of RCytoscape and the igraph package and combining them in novel ways.

First Publication Attempt

Our first attempt at publication focussed on the original data that inspired the method, and comparing it with gene-gene comparisons directly. We submitted to BMC Bioinformatics, and the reviews were not favorable, and took forever to get back. We actually wondered if the reviewers had read the paper that we submitted. We gave up on this venue. BTW, our last three publications have been submitted there, and the publication process has gotten worse and worse with every submission there. I don’t think our papers are getting worse over the years. I don’t know if we will bother submitting to BMC Bioinformatics again.

Second Publication Attempt

The second attempt was submitting to the current home in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This time, although the reviews were harsh (i.e. they were not immediately favorable), they were fair, and actually contained useful critique, and pointed to a way forward.

Unfortunately for me, revising the manuscript to address the reviewers criticisms meant a lot of work to construct theoretical examples, as well as a lot of thought in order to pare the manuscript down to make sure our primary message was clearly communicated.

Frontiers Interactive Review

I would like to note that the Frontiers interactive review system, with the ability to discuss individual points with the reviewers (still anonymously) really helped make it possible to determine which points were make or break, and discuss different ways to approach things. This was the best review experience I have ever had, and a large part of that I think was due to being able to interact with the reviewers directly, and not just through letters mediated by the editor.

I think it would be nice if Frontiers had an option for making the review history available if the authors and reviewers were agreeable to it.

Github package of supplementary materials

In the initial publication, I had included the set of scripts used for analysis, data files, results, etc. However, the amount of work required in the rewrite was so substantial, that I created an R package specifically for the analyses that went into the paper, with separate R markdown vignettes for each result type (hypothetical, two different experimental comparisons). This package has documents on how raw data was processed, as well as the semi-processed experimental data.

Publication specific branch

Due to specific changes to the categoryCompare software needed to address the reviewer comments, a publication specific branch of the development version was created (paper). This allowed me to quickly introduce code and features that the reviewers had asked for, without worrying about breaking the current development version that will be released in Bioconductor shortly.

To Do

As with most software projects, there is still plenty to be done.

Incorporate new functionality from paper branch into dev
- Specifically ability to do GSEA built in (probably using limma’s romer and roast functions), and new visualization options
Change from latex vignette to R markdown (this is technically done, but hasn’t made it into the dev branch)
Switch to roxygen2 documentation (will be interesting due to use of S4 objects and methods)
Implement proper testing using testthat
Refactor a lot of code to improve speed, use R conventions
- I was still a relatively R newbie when I wrote the package, and as I looked at the code while making changes for the reviewers, I noticed some places where I had done some silly things, mainly because I didn’t know better when I did it.
Consider splitting visualization into its own package
- Although I developed the visualization specifically for categoryCompare, I think it is generic enough that others might benefit from having it available separately. Therefore I need to think about separating it out, and how best to go about it without making it hard for categoryCompare to keep working as it already does.
Make it easy to investigate the actual genes with particularly interesting annotations, and how they are linked together by annotations or other data sources, as well as their original expression levels in the experiments.

Side Effects

A nice side effect of the package is that any annotation enrichment I do now, I almost always do it in the framework of categoryCompare, just because it is a lot easier to make sense of using the visualization capabilities and coupling with Cytoscape.

What Would I Do Differently?

Keeping in mind that I started this work almost 4 years ago, when I still didn’t know any R, and had yet to be exposed to the reproducibility and open science movements, or knitr, or pandoc, here are some things that if I started today I would do differently (you can probably guess some of these from above):

Put the package under version control immediately! Thankfully I didn’t have any moments early in the process when things were not in a git repo, but I am very thankful later on I was able to do diff and branch on my code to figure out where things broke and introduce new features.
Start thinking of the analysis as a standalone package from the very beginning, instead of a directory of data and scripts. This is what I do now (blogpost to come), and it makes it much easier if I come up with novel methods to spin them off into a fully fledged R / Bioconductor package
Don’t underestimate the novelty of something as simple as visualization, and how much it may make or break your method. We ended up adding a good chunk of text to the manuscript on the visualization because we realized how important it was, but only after the reviewers pointed it out to us.
Write the paper as a vignette of the ccPaper package itself, and generate Word documents for collaborators who insist on Word docs using Pandoc
Start a github repo for the paper, and ask collaborators to try and work on it there
Submit a preprint when submitted, so that we start getting feedback on the manuscript early
Ask for permission to set up reviewer comments as issues on the github repo to easily track how well we are addressing them.
- Wouldn’t it be cool in a totally open peer review journal to actually do all of the peer review on a service like Github, and have reviewers leave issues, tag them, and comment directly on the text of the publication using the commenting feature of commenting on commits?

R bioconductor meta-analysis publications