Correlation of PARP1 with Transcript Abundance

We will analyze the correlation of nucleosome associated PARP1 reads with transcript abundance as measured using Affymetrix Gene Chip HGU 133 Plus 2.0. The gene chip data is from GEO GSM307014. The PARP1 reads and expression data has already been processed and is made available as part of the fmdatabreastcaparp1 package.

library(GenomicRanges)
library(ggplot2)
library(hgu133plus2.db)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:BSgenome':
## 
##     species
## 
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
library(fmcorrelationbreastcaparp1)
library(fmdatabreastcaparp1)

Load the data.

data(tss_windows)
data(parp1_ln4_unique)
data(parp1_ln5_unique)
data(expr_data)

Calculate the weighted coverage of the ln4 and ln5 sample reads, and then sum the reads in each TSS window.

mcf7_cov <- coverage(parp1_ln4_unique, weight = "n_count")
mdamb231_cov <- coverage(parp1_ln5_unique, weight = "n_count")

tss_windows <- binned_function(tss_windows, mcf7_cov, "sum", "mcf7_read")
tss_windows <- binned_function(tss_windows, mdamb231_cov, "sum", "mdamb231_read")

Now, we need to translate the Affymetrix ID's to the Ensembl ID's that define the transcription start sites.

affy_2_ensembl <- select(hgu133plus2.db, keys = rownames(expr_data), columns = "ENSEMBLTRANS")
## Warning in .generateExtraRows(tab, keys, jointype): 'select' resulted in
## 1:many mapping between keys and return rows
expr_data$ENSEMBL <- ""
expr_data <- expr_data[affy_2_ensembl[,1],]
expr_data$ENSEMBL <- affy_2_ensembl[,2]

Because we have many cases where there are multiple transcripts per Affy ID, and multiple Affy IDs for some transcripts, we need to go through each of the Ensembl transcripts and do averaging if necessary.

Get the average value by transcript.

mean_transcript <- tapply(expr_data$VALUE, expr_data$ENSEMBL, mean)

And put it onto the tss_windows object.

keep_transcript <- intersect(names(mean_transcript), names(tss_windows))
tss_windows <- tss_windows[keep_transcript]
mean_transcript <- mean_transcript[keep_transcript]
mcols(tss_windows)$mcf7_expr <- mean_transcript

Now do a comparison.

non_zero <- "both"
expr_v_mcf7 <- subsample_nonzeros(mcols(tss_windows), c("mcf7_expr", "mcf7_read"), non_zero = non_zero, n_points = 10000)
ggplot(expr_v_mcf7, aes(x = mcf7_expr, y = mcf7_read)) + geom_point() + scale_y_log10() + scale_x_log10()

plot of chunk graphit

cor(log10(expr_v_mcf7[,1]+1), log10(expr_v_mcf7[,2]+1))
## [1] 0.09127957

Calculate the correlation value using all of the points:

out_cor <- correlate_non_zero(mcols(tss_windows), c("mcf7_expr", "mcf7_read"), log_transform = TRUE, non_zero = non_zero, test = TRUE)
out_cor
## corr_value    p_value 
##  0.1016187  0.0000000

We will save the correlation results in some plain text files.

saveloc <- "../inst/correlation_tables"
cat(out_cor, file = file.path(saveloc, "expression_tss.txt"))

Session Info

## [1] "2014-12-11 11:05:00 EST"
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] hgu133plus2.db_2.14.0               
##  [2] org.Hs.eg.db_2.14.0                 
##  [3] RSQLite_0.11.4                      
##  [4] DBI_0.3.1                           
##  [5] AnnotationDbi_1.26.1                
##  [6] Biobase_2.24.0                      
##  [7] fmdatabreastcaparp1_0.0.1           
##  [8] fmcorrelationbreastcaparp1_0.0.1    
##  [9] pracma_1.7.9                        
## [10] BSgenome.Hsapiens.UCSC.hg19_1.3.1000
## [11] BSgenome_1.32.0                     
## [12] Biostrings_2.32.1                   
## [13] XVector_0.4.0                       
## [14] ggplot2_1.0.0                       
## [15] GenomicRanges_1.16.4                
## [16] GenomeInfoDb_1.0.2                  
## [17] IRanges_1.22.10                     
## [18] BiocGenerics_0.10.0                 
## [19] devtools_1.6.1                      
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6     colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5  
##  [5] formatR_1.0      grid_3.1.1       gtable_0.1.2     knitr_1.7       
##  [9] labeling_0.3     magrittr_1.0.1   markdown_0.7.4   MASS_7.3-35     
## [13] mime_0.2         munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
## [17] Rcpp_0.11.3      reshape2_1.4     Rsamtools_1.16.1 scales_0.2.4    
## [21] stats4_3.1.1     stringr_0.6.2    tools_3.1.1      zlibbioc_1.10.0