Creating Custom CDFs for Affymetrix Chips in Bioconductor

R bioconductor bioinformatics cdf affymetrix microarray random-code-snippets

Examples of messing with Affymetrix CDF data in Bioconductor.

Robert M Flight
2012-07-13

What?

For those who don’t know, CDF files are chip definition format files that define which probes on an Affymetrix microarray chip belong together, and are necessary to use any of the standard summarization methods such as RMA, and others.

Why?

Because we can, and because custom definitions have been shown to be quite useful. See the information over at Brainarray.

Why not somewhere else?

A lot of times other people create custom CDF files based on their own criteria, and make it subsequently available for others to use (see the Brainarray for an example of what some are doing, as well as PlandbAffy)

You have a really nifty idea for a way to reorganize the probesets on an Affymetrix chip to perform a custom analysis, but you don’t want to go to the trouble of actually creating the CDF files and Bioconductor packages normally required to do the analysis, and yet you want to test and develop your analysis method.

How?

It turns out you are in luck. At least for AffyBatch objects in Bioconductor (created by calling ReadAffy), the CDF information is stored as an attached environment that can be easily hacked and modified to your hearts content. Environments in R are quite important and useful, and I wouldn’t have come up with this if I hadn’t been working in R for the past couple of years, but figured someone else might benefit from this knowledge.

The environment

In R, one can access an environment like so:

get("objName", envName) # get the value of object in the environment
ls(envName)

What is also very cool, is that one can extract the objects in an environment to a list, and also create their own environment from a list using list2env. Using this methodology, we can create our own definition of probesets that can be used by standard Bioconductor routines to summarize the probes into probesets.

A couple of disclaimers:

Example

require(affy)
require(estrogen)
require(hgu95av2cdf)

datadir = system.file("extdata", package="estrogen")

pd = read.AnnotatedDataFrame(file.path(datadir, "estrogen.txt"), header=TRUE, sep="", row.names=1)
pData(pd)
             estrogen time.h
low10-1.cel    absent     10
low10-2.cel    absent     10
high10-1.cel  present     10
high10-2.cel  present     10
low48-1.cel    absent     48
low48-2.cel    absent     48
high48-1.cel  present     48
high48-2.cel  present     48
celDat = ReadAffy(filenames = rownames(pData(pd)), 
                  phenoData = pd,
                  verbose=TRUE, celfile.path=datadir)
1 reading /software/R_libs/R400/estrogen/extdata/low10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
Reading in : /software/R_libs/R400/estrogen/extdata/low10-1.cel
Reading in : /software/R_libs/R400/estrogen/extdata/low10-2.cel
Reading in : /software/R_libs/R400/estrogen/extdata/high10-1.cel
Reading in : /software/R_libs/R400/estrogen/extdata/high10-2.cel
Reading in : /software/R_libs/R400/estrogen/extdata/low48-1.cel
Reading in : /software/R_libs/R400/estrogen/extdata/low48-2.cel
Reading in : /software/R_libs/R400/estrogen/extdata/high48-1.cel
Reading in : /software/R_libs/R400/estrogen/extdata/high48-2.cel

This loads up the data, reads in the raw data, and gets it ready for us to use. Now, lets see what is in the actual CDF environment.

topProbes <- head(ls(hgu95av2cdf)) # get a list of probesets
topProbes
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
[6] "1004_at"  
exSet <- get(topProbes[1], hgu95av2cdf)
exSet
          pm     mm
 [1,] 175218 175858
 [2,] 356689 357329
 [3,] 227696 228336
 [4,] 237919 238559
 [5,] 275173 275813
 [6,] 203444 204084
 [7,] 357984 358624
 [8,] 368524 369164
 [9,] 285352 285992
[10,] 304510 305150
[11,] 159937 160577
[12,] 223929 224569
[13,] 282764 283404
[14,] 270003 270643
[15,] 303343 303983
[16,] 389048 389688

We can see here that the first probe set 100_g_at has 16 perfect-match and mis-match probes in associated with it.

Lets summarize the original data using RMA.

rma1 <- exprs(rma(celDat))
Background correcting
Normalizing
Calculating Expression
head(rma1)
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel
100_g_at     9.642896    9.741496     9.537036     9.353625
1000_at     10.398169   10.254362    10.003971     9.903528
1001_at      5.717613    5.881008     5.859563     5.954028
1002_f_at    5.512596    5.801807     5.571065     5.608132
1003_s_at    7.783927    8.007975     8.037999     7.835120
1004_at      7.289162    7.603670     7.488539     7.771506
          low48-1.cel low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.591697    9.570590     9.475796     9.530655
1000_at     10.374866   10.033520    10.345066     9.863321
1001_at      5.960540    6.020889     5.981080     6.285192
1002_f_at    5.390064    5.494511     5.508104     5.630107
1003_s_at    7.926487    8.138870     7.994937     8.233338
1004_at      7.521789    7.599544     7.456149     7.675171

Now lets get the data as a list, and then create a new environment to be used for summarization.

allSets <- ls(hgu95av2cdf)
allSetDat <- mget(allSets, hgu95av2cdf)

allSetDat[1]
$`100_g_at`
          pm     mm
 [1,] 175218 175858
 [2,] 356689 357329
 [3,] 227696 228336
 [4,] 237919 238559
 [5,] 275173 275813
 [6,] 203444 204084
 [7,] 357984 358624
 [8,] 368524 369164
 [9,] 285352 285992
[10,] 304510 305150
[11,] 159937 160577
[12,] 223929 224569
[13,] 282764 283404
[14,] 270003 270643
[15,] 303343 303983
[16,] 389048 389688
hgu2 <- list2env(allSetDat)
celDat@cdfName <- "hgu2"

rma2 <- exprs(rma(celDat))
Background correcting
Normalizing
Calculating Expression
head(rma2)
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel
100_g_at     9.642896    9.741496     9.537036     9.353625
1000_at     10.398169   10.254362    10.003971     9.903528
1001_at      5.717613    5.881008     5.859563     5.954028
1002_f_at    5.512596    5.801807     5.571065     5.608132
1003_s_at    7.783927    8.007975     8.037999     7.835120
1004_at      7.289162    7.603670     7.488539     7.771506
          low48-1.cel low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.591697    9.570590     9.475796     9.530655
1000_at     10.374866   10.033520    10.345066     9.863321
1001_at      5.960540    6.020889     5.981080     6.285192
1002_f_at    5.390064    5.494511     5.508104     5.630107
1003_s_at    7.926487    8.138870     7.994937     8.233338
1004_at      7.521789    7.599544     7.456149     7.675171

What about removing the MM columns? RMA only uses the PM, so it should still work.

allSetDat <- lapply(allSetDat, function(x){
  x[,1, drop=F]
})

allSetDat[1]
$`100_g_at`
          pm
 [1,] 175218
 [2,] 356689
 [3,] 227696
 [4,] 237919
 [5,] 275173
 [6,] 203444
 [7,] 357984
 [8,] 368524
 [9,] 285352
[10,] 304510
[11,] 159937
[12,] 223929
[13,] 282764
[14,] 270003
[15,] 303343
[16,] 389048
hgu3 <- list2env(allSetDat)
celDat@cdfName <- "hgu3"
rma3 <-exprs(rma(celDat))
Background correcting
Normalizing
Calculating Expression
head(rma3)
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel
100_g_at     9.642896    9.741496     9.537036     9.353625
1000_at     10.398169   10.254362    10.003971     9.903528
1001_at      5.717613    5.881008     5.859563     5.954028
1002_f_at    5.512596    5.801807     5.571065     5.608132
1003_s_at    7.783927    8.007975     8.037999     7.835120
1004_at      7.289162    7.603670     7.488539     7.771506
          low48-1.cel low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.591697    9.570590     9.475796     9.530655
1000_at     10.374866   10.033520    10.345066     9.863321
1001_at      5.960540    6.020889     5.981080     6.285192
1002_f_at    5.390064    5.494511     5.508104     5.630107
1003_s_at    7.926487    8.138870     7.994937     8.233338
1004_at      7.521789    7.599544     7.456149     7.675171

What if we only want to use the first 5 probesets?

allSetDat <- allSetDat[1:5]
hgu4 <- list2env(allSetDat)
celDat@cdfName <- "hgu4"
celDat
AffyBatch object
size of arrays=640x640 features (21 kb)
cdf=hgu4 (5 affyids)
number of samples=8
number of genes=5
annotation=hgu95av2
notes=
rma4 <- exprs(rma(celDat))
Background correcting
Normalizing
Calculating Expression
rma4
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel
100_g_at     9.463007    9.554665     9.449050     9.401976
1000_at     10.182753   10.009785    10.009785     9.970396
1001_at      5.943840    6.005177     5.944015     6.089531
1002_f_at    5.787166    5.846225     5.816964     5.814798
1003_s_at    7.750877    7.769401     7.913021     7.864052
          low48-1.cel low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.447562    9.457986     9.401366     9.431078
1000_at     10.102424   10.009785    10.197065     9.889555
1001_at      6.237329    6.147957     6.189200     6.206669
1002_f_at    5.763175    5.763175     5.740757     5.755085
1003_s_at    7.860778    7.917565     7.862614     7.928691
dim(rma4)
[1] 5 8

Custom CDF

To generate our custom CDF, we are going to set our own names, and take random probes from all of the probes on the chip. The actual criteria of which probes should be together can be made using any method the author chooses.

maxIndx <- 640*640

customCDF <- lapply(seq(1,100), function(x){
  tmp <- matrix(sample(maxIndx, 20), nrow=20, ncol=1)
  colnames(tmp) <- "pm"
  return(tmp)
})

names(customCDF) <- seq(1, 100)

hgu5 <- list2env(customCDF)
celDat@cdfName <- "hgu5"
rma5 <- exprs(rma(celDat))
Background correcting
Normalizing
Calculating Expression
head(rma5)
    low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
1      6.557239    6.522249     6.639682     6.683450    6.862754
10     7.145256    7.166927     7.124289     7.227240    7.055032
100    6.494146    6.388879     6.349202     6.605702    6.499393
11     6.355700    6.431657     6.108054     6.213595    6.119581
12     5.602336    5.523976     5.370915     5.692428    5.512562
13     6.320948    6.211828     6.338067     6.178676    6.478441
    low48-2.cel high48-1.cel high48-2.cel
1      7.084571     6.742868     7.289879
10     6.908141     7.164122     7.279369
100    6.455822     6.475354     6.514070
11     6.455912     6.136583     6.310523
12     5.590132     5.515823     5.544244
13     6.280520     6.116337     6.348493

I hope this information is useful to someone else. I know it made my life a lot easier.

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 20.04 LTS

Matrix products: default
BLAS:   /software/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/R-4.0.0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] hgu95av2cdf_2.18.0  estrogen_1.34.0     affy_1.66.0        
[4] Biobase_2.48.0      BiocGenerics_0.34.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6            bslib_0.2.4           compiler_4.0.0       
 [4] BiocManager_1.30.10   jquerylib_0.1.3       tools_4.0.0          
 [7] zlibbioc_1.34.0       bit_4.0.4             digest_0.6.27        
[10] downlit_0.2.1         memoise_2.0.0         jsonlite_1.7.2       
[13] evaluate_0.14         RSQLite_2.2.3         preprocessCore_1.50.0
[16] rlang_0.4.10          DBI_1.1.1             distill_1.2          
[19] yaml_2.2.1            xfun_0.21             fastmap_1.1.0        
[22] stringr_1.4.0         knitr_1.31            IRanges_2.22.2       
[25] S4Vectors_0.26.1      vctrs_0.3.6           sass_0.3.1           
[28] stats4_4.0.0          bit64_4.0.5           R6_2.5.0             
[31] fansi_0.4.2           AnnotationDbi_1.50.3  rmarkdown_2.7        
[34] blob_1.2.1            magrittr_2.0.1        htmltools_0.5.1.1    
[37] stringi_1.5.3         cachem_1.0.4          affyio_1.58.0        

Originally published 2013/07/13, moved to http://rmflight.github.io on 2013/12/04.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Flight (2012, July 13). Deciphering Life: One Bit at a Time: Creating Custom CDFs for Affymetrix Chips in Bioconductor. Retrieved from https://rmflight.github.io/posts/2012-07-13-creating-custom-cdf-s-for-affymetrix-chips-in-bioconductor/

BibTeX citation

@misc{flight2012creating,
  author = {Flight, Robert M},
  title = {Deciphering Life: One Bit at a Time: Creating Custom CDFs for Affymetrix Chips in Bioconductor},
  url = {https://rmflight.github.io/posts/2012-07-13-creating-custom-cdf-s-for-affymetrix-chips-in-bioconductor/},
  year = {2012}
}