Deciphering Life: One Bit at a Time - Creating Custom CDFs for Affymetrix Chips in Bioconductor

What?

For those who don’t know, CDF files are chip definition format files that define which probes on an Affymetrix microarray chip belong together, and are necessary to use any of the standard summarization methods such as RMA, and others.

Why?

Because we can, and because custom definitions have been shown to be quite useful. See the information over at Brainarray.

Why not somewhere else?

A lot of times other people create custom CDF files based on their own criteria, and make it subsequently available for others to use (see the Brainarray for an example of what some are doing, as well as PlandbAffy)

You have a really nifty idea for a way to reorganize the probesets on an Affymetrix chip to perform a custom analysis, but you don’t want to go to the trouble of actually creating the CDF files and Bioconductor packages normally required to do the analysis, and yet you want to test and develop your analysis method.

How?

It turns out you are in luck. At least for AffyBatch objects in Bioconductor (created by calling ReadAffy), the CDF information is stored as an attached environment that can be easily hacked and modified to your hearts content. Environments in R are quite important and useful, and I wouldn’t have come up with this if I hadn’t been working in R for the past couple of years, but figured someone else might benefit from this knowledge.

The environment

In R, one can access an environment like so:

get("objName", envName) # get the value of object in the environment
ls(envName)

What is also very cool, is that one can extract the objects in an environment to a list, and also create their own environment from a list using list2env. Using this methodology, we can create our own definition of probesets that can be used by standard Bioconductor routines to summarize the probes into probesets.

A couple of disclaimers:

I have only tried this on 3’ expression arrays
There might be a better way to do this, but I couldn’t find it (let me know in the comments)

Example

require(affy)

Loading required package: affy

Loading required package: BiocGenerics


Attaching package: 'BiocGenerics'

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

require(estrogen)

Loading required package: estrogen

require(hgu95av2cdf)

Loading required package: hgu95av2cdf

Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
loading 'hgu95av2cdf'

Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
loading 'hgu95av2cdf'

datadir = system.file("extdata", package="estrogen")

pd = read.AnnotatedDataFrame(file.path(datadir, "estrogen.txt"), header=TRUE, sep="", row.names=1)
pData(pd)

             estrogen time.h
low10-1.cel    absent     10
low10-2.cel    absent     10
high10-1.cel  present     10
high10-2.cel  present     10
low48-1.cel    absent     48
low48-2.cel    absent     48
high48-1.cel  present     48
high48-2.cel  present     48

celDat = ReadAffy(filenames = rownames(pData(pd)), 
                  phenoData = pd,
                  verbose=TRUE, celfile.path=datadir)

1 reading /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/low10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/low10-1.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/low10-2.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/high10-1.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/high10-2.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/low48-1.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/low48-2.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/high48-1.cel
Reading in : /home/rmflight/Projects/personal/researchblog_quarto/renv/library/R-4.2/x86_64-pc-linux-gnu/estrogen/extdata/high48-2.cel

This loads up the data, reads in the raw data, and gets it ready for us to use. Now, lets see what is in the actual CDF environment.

topProbes <- head(ls(hgu95av2cdf)) # get a list of probesets
topProbes

[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at" "1004_at"

exSet <- get(topProbes[1], hgu95av2cdf)
exSet

          pm     mm
 [1,] 175218 175858
 [2,] 356689 357329
 [3,] 227696 228336
 [4,] 237919 238559
 [5,] 275173 275813
 [6,] 203444 204084
 [7,] 357984 358624
 [8,] 368524 369164
 [9,] 285352 285992
[10,] 304510 305150
[11,] 159937 160577
[12,] 223929 224569
[13,] 282764 283404
[14,] 270003 270643
[15,] 303343 303983
[16,] 389048 389688

We can see here that the first probe set 100_g_at has 16 perfect-match and mis-match probes in associated with it.

Lets summarize the original data using RMA.

rma1 <- exprs(rma(celDat))

Background correcting
Normalizing
Calculating Expression

head(rma1)

          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
100_g_at     9.642896    9.741496     9.537036     9.353625    9.591697
1000_at     10.398169   10.254362    10.003971     9.903528   10.374866
1001_at      5.717613    5.881008     5.859563     5.954028    5.960540
1002_f_at    5.512596    5.801807     5.571065     5.608132    5.390064
1003_s_at    7.783927    8.007975     8.037999     7.835120    7.926487
1004_at      7.289162    7.603670     7.488539     7.771506    7.521789
          low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.570590     9.475796     9.530655
1000_at     10.033520    10.345066     9.863321
1001_at      6.020889     5.981080     6.285192
1002_f_at    5.494511     5.508104     5.630107
1003_s_at    8.138870     7.994937     8.233338
1004_at      7.599544     7.456149     7.675171

Now lets get the data as a list, and then create a new environment to be used for summarization.

allSets <- ls(hgu95av2cdf)
allSetDat <- mget(allSets, hgu95av2cdf)

allSetDat[1]

$`100_g_at`
          pm     mm
 [1,] 175218 175858
 [2,] 356689 357329
 [3,] 227696 228336
 [4,] 237919 238559
 [5,] 275173 275813
 [6,] 203444 204084
 [7,] 357984 358624
 [8,] 368524 369164
 [9,] 285352 285992
[10,] 304510 305150
[11,] 159937 160577
[12,] 223929 224569
[13,] 282764 283404
[14,] 270003 270643
[15,] 303343 303983
[16,] 389048 389688

hgu2 <- list2env(allSetDat)
celDat@cdfName <- "hgu2"

rma2 <- exprs(rma(celDat))

Background correcting
Normalizing
Calculating Expression

head(rma2)

          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
100_g_at     9.642896    9.741496     9.537036     9.353625    9.591697
1000_at     10.398169   10.254362    10.003971     9.903528   10.374866
1001_at      5.717613    5.881008     5.859563     5.954028    5.960540
1002_f_at    5.512596    5.801807     5.571065     5.608132    5.390064
1003_s_at    7.783927    8.007975     8.037999     7.835120    7.926487
1004_at      7.289162    7.603670     7.488539     7.771506    7.521789
          low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.570590     9.475796     9.530655
1000_at     10.033520    10.345066     9.863321
1001_at      6.020889     5.981080     6.285192
1002_f_at    5.494511     5.508104     5.630107
1003_s_at    8.138870     7.994937     8.233338
1004_at      7.599544     7.456149     7.675171

What about removing the MM columns? RMA only uses the PM, so it should still work.

allSetDat <- lapply(allSetDat, function(x){
  x[,1, drop=F]
})

allSetDat[1]

$`100_g_at`
          pm
 [1,] 175218
 [2,] 356689
 [3,] 227696
 [4,] 237919
 [5,] 275173
 [6,] 203444
 [7,] 357984
 [8,] 368524
 [9,] 285352
[10,] 304510
[11,] 159937
[12,] 223929
[13,] 282764
[14,] 270003
[15,] 303343
[16,] 389048

hgu3 <- list2env(allSetDat)
celDat@cdfName <- "hgu3"
rma3 <-exprs(rma(celDat))

Background correcting
Normalizing
Calculating Expression

head(rma3)

          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
100_g_at     9.642896    9.741496     9.537036     9.353625    9.591697
1000_at     10.398169   10.254362    10.003971     9.903528   10.374866
1001_at      5.717613    5.881008     5.859563     5.954028    5.960540
1002_f_at    5.512596    5.801807     5.571065     5.608132    5.390064
1003_s_at    7.783927    8.007975     8.037999     7.835120    7.926487
1004_at      7.289162    7.603670     7.488539     7.771506    7.521789
          low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.570590     9.475796     9.530655
1000_at     10.033520    10.345066     9.863321
1001_at      6.020889     5.981080     6.285192
1002_f_at    5.494511     5.508104     5.630107
1003_s_at    8.138870     7.994937     8.233338
1004_at      7.599544     7.456149     7.675171

What if we only want to use the first 5 probesets?

allSetDat <- allSetDat[1:5]
hgu4 <- list2env(allSetDat)
celDat@cdfName <- "hgu4"
celDat

AffyBatch object
size of arrays=640x640 features (22 kb)
cdf=hgu4 (5 affyids)
number of samples=8
number of genes=5
annotation=hgu95av2
notes=

rma4 <- exprs(rma(celDat))

Background correcting
Normalizing
Calculating Expression

rma4

          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
100_g_at     9.463007    9.554665     9.449050     9.401976    9.447562
1000_at     10.182753   10.009785    10.009785     9.970396   10.102424
1001_at      5.943840    6.005177     5.944015     6.089531    6.237329
1002_f_at    5.787166    5.846225     5.816964     5.814798    5.763175
1003_s_at    7.750877    7.769401     7.913021     7.864052    7.860778
          low48-2.cel high48-1.cel high48-2.cel
100_g_at     9.457986     9.401366     9.431078
1000_at     10.009785    10.197065     9.889555
1001_at      6.147957     6.189200     6.206669
1002_f_at    5.763175     5.740757     5.755085
1003_s_at    7.917565     7.862614     7.928691

dim(rma4)

[1] 5 8

Custom CDF

To generate our custom CDF, we are going to set our own names, and take random probes from all of the probes on the chip. The actual criteria of which probes should be together can be made using any method the author chooses.

maxIndx <- 640*640

customCDF <- lapply(seq(1,100), function(x){
  tmp <- matrix(sample(maxIndx, 20), nrow=20, ncol=1)
  colnames(tmp) <- "pm"
  return(tmp)
})

names(customCDF) <- seq(1, 100)

hgu5 <- list2env(customCDF)
celDat@cdfName <- "hgu5"
rma5 <- exprs(rma(celDat))

Background correcting
Normalizing
Calculating Expression

head(rma5)

    low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel low48-2.cel
1      6.805156    6.993719     6.899264     6.932725    6.928347    6.957158
10     6.441884    6.319232     6.297822     6.198441    6.274462    6.419477
100    5.700887    5.712753     5.618964     5.717375    5.824081    5.741998
11     5.243884    5.204840     5.156431     5.149247    5.455638    5.249618
12     6.790776    6.660627     6.718511     6.576265    6.828149    7.009592
13     5.666141    5.752682     5.697687     5.696507    5.622639    5.778334
    high48-1.cel high48-2.cel
1       6.915275     6.836921
10      6.204771     5.928888
100     5.944784     5.899380
11      5.284089     5.298048
12      6.744098     6.714799
13      5.660630     5.666974

I hope this information is useful to someone else. I know it made my life a lot easier.

sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /rmflight_stuff/software/R-4.2.1/lib/libRblas.so
LAPACK: /rmflight_stuff/software/R-4.2.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] hgu95av2cdf_2.18.0  estrogen_1.42.0     affy_1.74.0        
[4] Biobase_2.56.0      BiocGenerics_0.42.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9             XVector_0.36.0         GenomeInfoDb_1.32.4   
 [4] compiler_4.2.1         BiocManager_1.30.19    bitops_1.0-7          
 [7] tools_4.2.1            zlibbioc_1.42.0        digest_0.6.29         
[10] bit_4.0.4              jsonlite_1.8.0         evaluate_0.16         
[13] RSQLite_2.2.19         memoise_2.0.1          preprocessCore_1.58.0 
[16] png_0.1-7              rlang_1.0.5            cli_3.4.0             
[19] DBI_1.1.3              rstudioapi_0.14        yaml_2.3.5            
[22] xfun_0.33              fastmap_1.1.0          GenomeInfoDbData_1.2.8
[25] httr_1.4.4             stringr_1.4.1          knitr_1.40            
[28] Biostrings_2.64.1      IRanges_2.30.1         S4Vectors_0.34.0      
[31] htmlwidgets_1.5.4      vctrs_0.4.1            stats4_4.2.1          
[34] bit64_4.0.5            R6_2.5.1               AnnotationDbi_1.58.0  
[37] rmarkdown_2.16         blob_1.2.3             magrittr_2.0.3        
[40] htmltools_0.5.3        KEGGREST_1.36.3        renv_0.15.5           
[43] stringi_1.7.8          RCurl_1.98-1.8         cachem_1.0.6          
[46] crayon_1.5.1           affyio_1.66.0

Originally published 2013/07/13, moved to http://rmflight.github.io on 2013/12/04.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{mflight2012,
  author = {Robert M Flight},
  title = {Creating {Custom} {CDFs} for {Affymetrix} {Chips} in
    {Bioconductor}},
  date = {2012-07-13},
  url = {https://rmflight.github.io/posts/2012-07-13-creating-custom-cdf-s-for-affymetrix-chips-in-bioconductor},
  langid = {en}
}

For attribution, please cite this work as:

Robert M Flight. 2012. “Creating Custom CDFs for Affymetrix Chips in Bioconductor.” July 13, 2012. https://rmflight.github.io/posts/2012-07-13-creating-custom-cdf-s-for-affymetrix-chips-in-bioconductor.