Affymetrix® Mismatch (MM) Probes: Useful After All

Robert M. Flight1, Abdallah M Eteleeb2 and Eric C Rouchka2

1 Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, Kentucky, USA 40292

2 Department of Computer Engineering and Computer Science, University of Louisville, Louisville, Kentucky, USA 40292

Email Contacts

robert dot flight near louisville dot edu

ametel01 near louisville dot edu

eric dot rouchka near louisville dot edu

ABSTRACT

Affymetrix® GeneChip® microarray design define probe sets consisting of 11, 16, or 20 distinct 25 base pair (BP) probes for determining mRNA expression for a specific gene, which may be covered by one or more probe sets. Each probe has a corresponding perfect match (PM) and mismatch (MM) set. Traditional analytical techniques have either used the MM probes to determine the level of cross-hybridization or reliability of the PM probe, or have been completely ignored. Given the availability of reference genome sequences, we have reanalyzed the mapping of both PM and MM probes to reference genomes in transcript regions. Our results suggest that depending of the species of interest, 66%-93% of the PM probes can be used reliably in terms of single unique matches to the genome, while a small number of the MM probes (typically less than 1%) could be incorporated into the analysis. In addition, we have examined the mapping of PM and MM probes to five different human genome projects, resulting in approximately a 70% overlap of uniquely mapping PM probes, and a subset of 161 uniquely mapping MM probes commonly found in all five projects, 24 of which are found within annotated exonic regions. These results suggest that individual variation in transcriptome regions provides an additional complexity to microarray data analysis. Given these results, we conclude that the development of custom chip definition files (CDFs) should include MM probe sequences to provide the most effective means of transcriptome analysis of Affymetrix® GeneChip® arrays.

Categories and Subject Descriptors: J.3 [Life and Medical Sciences]: Biology and Genetics

General Terms: Algorithms, Measurement, Theory

Keywords: Bioinformatics, microarray, probe set, custom definition files.

INTRODUCTION

Oligonucleotide-based microarray technologies provide a methodology whereby a researcher can indirectly measure the expression level of an mRNA molecule being actively transcribed under a set of conditions by labeling a cDNA fragment that hybridizes to a complementary probe sequence specific to a particular transcript. Since their first use on customized cDNA arrays [1] in the mid-1990s, they have been used as the de-facto standard for measuring global transcriptional changes under differing conditions. While RNA-Seq [2] may eventually supplant microarrays as the method of choice, a large number of microarray experiments exist that have been deposited into publicly available repositories such as NCBI’s Gene Expression Omnibus (GEO) [3] and EBI’s ArrayExpress [4]. As a case in point, GEO contains 32,471 series as of 9/6/2012. The majority of the entries in GEO were performed on arrays designed by companies such as Affymetrix®, Inc. (Santa Clara, CA), Agilent Technologies, Inc. (Santa Clara, CA), Illumina®, Inc. (San Diego, CA), and GE Healthcare Lifesciences (Piscataway, NJ), with nearly half of the series (16,181) being performed on various Affymetrix® arrays.

The design of Affymetrix® GeneChip® arrays in particular provides for probe sets consisting of 11, 16, or 20 distinct 25 base pair (BP) probes, with each probe having a corresponding perfect match (PM) and mismatch (MM) probe. The PM and MM differ by the exchange of the complementary base at the 13th position in the probe. While MM probes were originally designed to account for signal in the PM resulting from non-specific cross-hybridization, they are often underutilized or completely ignored. Mismatch probes have been explored for use in long oligonucleotide arrays as well [5], but their utilization is limited to the Affymetrix® platform.

Affymetrix® provides a default GeneChip® analysis package known as the Micro Array Suite 5.0 (MAS 5.0) [6] that measures the signal intensity for a particular probe pair as:

\[ \begin{aligned} \text{signal} = TukeyBiweight ( \log (PM_{j} - MM^{*}_{j}) ) \end{aligned} \]

Where MM* is a modified version of MM that is never bigger than the intensity value of the PM. The motivation behind the modified mismatch intensity MM* is to report all probe-level intensities as positive values, and to remove the influence of the minority of probes where the MM intensity value is significantly higher than the corresponding PM intensity. In addition to the intensity signal, MAS 5.0 also produces a detection p-value which flags a transcript as “P” (present), “M” (marginal), or “A” (absent) based on the reliability of the probe set based on differences between PM and MM intensities.

Known issues in the use of PM and MM probe intensities to generate a single probe set intensity values led to the development of other approaches, including RMA [7] and GCRMA [8] which completely ignore the MM probes.

With the availability of individual probe and reference genome sequences, it is possible to re-map probes based on new sources of genome annotations. This allows custom Chip Description Files (CDFs) wherein probes are grouped into novel probe sets based on exon, transcript, and gene level annotation [9-22]. Most notable is the effort of the BrainArray group [10] which updates custom CDFs for a large number of Affymetrix® GeneChips® by creating probe sets based on annotated features such as Entrez Gene [23], Ensembl transcript, Ensembl gene, and RefSeq Gene [24]. Using custom CDFs has been shown to impact the reliability of expression analysis [10, 20-22, 25]. However, to the authors’ knowledge, only the PM probe sequences are used when generating custom CDFs.

Based on the observation that a small, yet significant number of PM-MM probe pairs exist where the MM intensity is significantly increased over the PM intensity, our initial inclination was that these differences in intensities were not due to cross-hybridization or rogue probes alone. Therefore, keeping in mind that Affymetrix® probes have been designed according to continually evolving genome assemblies, we proceeded to analyze PM and MM probes across eight commonly studied species (Table 1) by looking at PM and MM probes that uniquely map to the respective genome.

In addition to changing functional annotations, one potential problem area for microarray probe design is the presence of single nucleotide polymorphisms (SNPs) within a population. As the probes are designed using a reference genome or transcriptome, a “one-size fits all” approach has been taken for the probes on a particular array. However, SNPs are known to occur relatively frequently throughout the genome, with build 137 of dbSNP [26] containing over 53.5 million reference SNPs for the human genome. We have previously studied the effects of SNPs on Affymetrix® GeneChips® [27] showing that a large number of SNPs lie in the areas where microarray probes have been designed. This has been taken into account in the BrainArray’s custom CDF files which incorporate SNP information. To study the effects that individual variation can play in microarray analysis, we looked at the mappings of PM and MM probes within five distinct publicly available assemblies of human genomes.

METHODS

Mapping of PM and MM Probes

Chromosomal-based genome assemblies were downloaded from the UCSC Goldenpath Genomes ftp server using an anonymous login (ftp://hgdownload.cse.ucsc.edu/goldenPath/) [28] for eight commonly studied species, including C. elegans (roundworm), D. melanogaster (fruit fly), S. cerevisiae (baker’s yeast), X. tropicalis (western clawed frog), D. rerio (zebrafish), M. musculus (house mouse), R. norvegicus (brown Norway rat), and H. sapiens (human) (Table 1). Genome indices were created using bowtie-build version 0.12.8 [29] with the default parameters. Perfect match (PM) probe sequences for Affymetrix® GeneChips® were obtained from Bioconductor (v 2.10) probe packages, which are constructed from data available in NetAffx (Table 2) with each new Bioconductor release. Mismatch (MM) probe sequences were constructed by replacing the 13th base in the supplied PM probe sequence with the complementary base. PM and MM probes were aligned to the indexed genomes using bowtie version 0.12.8 [29] with the parameters –v 0 and –a which used together will report all valid probes matching with 100% identity.

Table 1. Genome assemblies used

Organism Reference Assembly Build Date
Caenorhabditis elegans ce6 May 2008
Drosophila melanogaster dm3 Apr. 2006
Saccaromyces cerevisiae sc3 Apr. 2011
Xenopus tropicalis xt3 Nov. 2009
Danio rerio dr6 Dec. 2008
Mus musculus mm10 Dec. 2011
Rattus norvegicus rn4 Nov. 2004
Homo sapiens hg19 Feb. 2009

Table 2. Affymetrix® GeneChips® used

Organism GeneChip® Name
Caenorhabditis elegans C. elegans Genome
Drosophila melanogaster Drosophila Genome 2.0
Saccaromyces cerevisiae Yeast Genome 2.0
Xenopus tropicalis Xenopus tropicalis Genome
Danio rerio Zebrafish Genome
Mus musculus Mouse Genome 430 2.0
Rattus norvegicus Rat Genome 230 2.0
Homo sapiens Human Genome U133 Plus 2.0

Generation of Exons and Overlap

Exon regions were obtained from the UCSC genome browser as BED files with an entry for each exon. mergeBed from the bedTools suite was used to merge overlapping exons from multiple transcripts into single contiguous exons. These merged exons were used when defining overlaps of probe alignments with an exon. Probe and exon overlaps were defined as any type of overlap with at least 23 bases overlapping on the same strand. Overlaps were determined using Genomic Ranges version 1.8.7 [31].

DNA Microarray Data

For each GeneChip®, CEL files were downloaded from GEO for 20 random samples (with the exception of S. cerevisiae (12) and X. tropicalis (4), the GSMs are listed in gsmFiles.txt). Probe intensities were background corrected using the MAS background correction method implemented in Bioconductor. Depending on the application, intensities were log (base 2), square root transformed, or used as is.

Negative PM-MM Set

A PM-MM set of probes was considered to be negative if nine (two for S. cerevisiae and six for X. tropicalis) or more samples had a negative value for the difference in the PM-MM intensities. For examination of intensity distribution, any PM-MM pair with a negative different greater than 1000 in one or more samples was considered and examined.

Probe Correlations

For each MM probe that uniquely overlapped one merged exon (designated as a true-match MM, (TMmm)), the correlation with all other MM probes in the probe set (mm) and the correlation with all other TM probes that also mapped uniquely to the same exon (if there were three or more other probes also mapped to the exon) was calculated ™.

Human Variation

To gain an understanding of individual variation and the unique mapping of microarray probes, five whole genome assemblies were downloaded for the human genome 32-36). Probes from the HGU133APlus2.0 Affymetrix® GeneChip® were aligned to each of these genomes using the methods previously described for mapping PM and MM probes.

Table 3. Whole human genome sequencing projects

Name Abbr. Assembly Identifier Bioproject Number Race
GRCh37 Hg19 420368 31257 Mixed
Hs_Celera_WGSA Celera 281338 1431 Mixed1
HuRefPrime JCVI 281188 19621 Caucasian
BGIAF BGI 165398 42201 African
HsapALLPATHS1 HSAP1 238948 59877 Caucasian

1Celera assembly consists of one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians.

RESULTS

Probes Matching Genomic Locations

Given the PM probe sequences and the inferred MM sequences, individual probes were mapped to the corresponding genome assembly as outlined in Methods. The percentage of perfect match probes mapping to the genome ranged from a low of 80% (X. tropicalis) to a high of 95% (D. melanogaster), with the exception of S. cerevisiae (Table 4). It must be noted that the lower percentage of S. cerevisiae matches (53%) is expected, as the Yeast Genome 2.0 GeneChip® contains probes for two yeast species, S. cerevisiae and S. pombe.

Table 4. GeneChip® probes mapping to reference genomes

Organism Number of Probe Pairs PM Mapped to Reference MM Mapped to Reference PM Unique MM Unique
Ce 249165 226856 143 213745 96
Dm 265400 251602 89 245712 54
Dr 249752 200608 1282 171282 726
Hs 604258 562673 1094 521642 608
Mm 496468 456674 557 427920 394
Rn 342410 304646 391 286784 282
Sc 120855 63731 1 61942 1
Xt 648548 519177 1884 426237 1014

Affymetrix® probesets are given suffix definitions depending upon the uniqueness of the exemplar sequence used to design a probe set. A designation of _at indicates the probe set perfectly matches a single transcript; _a_at probe sets only perfectly match transcripts of the same gene; _s_at perfectly match multiple transcripts for the same gene family; and _x_at indicates the probe set is identical or highly similar to other genes. One of the difficulties with these designations is that it relies upon a set of annotations at a particular point in time.

Analysis of the probes that map to the genome (Table 4) indicates that 82% to 98% of the mapped probes map uniquely to a single genomic location. The fact that a number of probes map to multiple locations is not to be unexpected due to the restrictions placed on probe set design. However, it is expected that those probes mapping to multiple locations would not be from the _at class of probes.

To determine the reliability of these probes with the fluctuation of unknown transcripts, those probes that map with 100% identity to two or more locations in the genome were considered. While these probes typically represent less than 10% of the total number of probes for a given GeneChip®, their classification could be important in detecting cross hybridization. One might expect that the greatest percentage of these would be within the _x_at and _s_at classes. However, as Table 5 shows, the larger genomes actually contain the greatest percentage in the _at and _a_at classes, with anywhere from 18% (S. cerevisiae) to 91% (R. norvegicus) of the probes matching multiple locations belonging to the _at class. In addition, a small number of MM probes map to the genome as well. To better understand the effects this small set of MM probes might have on gene expression, we further reduced this to a smaller subset where the mapping was within exon regions. For these probes, we analyzed their signal intensities from random samples compared to the overall distribution of PM and MM intensities, and the distribution of PM and MM intensities within the corresponding exonic sequences (Figure 1). As these plots indicate, MM probes mapping within exonic regions closely follow the expression density of PM probes mapping within exonic regions, and are significantly shifted from the overall expression profiles of MM probes. These results suggest that while the number of these probes is small, they offer significant information that should not be ignored, and furthermore, can confound analyses where MM data is incorporated.

Table 5. Probe set classification of probes perfectly matching multiple genomic locations.

Organism _x_at _s_at _a_at _at control
Ce 4040 (31%) 6416 (49%) 0 (0%) 2465 (19%) 190 (1.4%)
Dm 361 (6.1%) 2742 (47%) 224 (3.8%) 2520 (43%) 43 (0.73%)
Dr 1376 (4.7%) 374 (1.3%) 1203 (4.1%) 25879 (88%) 494 (1.7%)
Hs 9402 (23%) 10634 (26%) 803 (2%) 19961 (49%) 231 (0.56%)
Mm 4075 (14%) 2920 (10%) 4893 (17%) 16703 (58%) 163 (0.57%)
Rn 440 (2.5%) 394 (2.2%) 805 (4.5%) 16127 (90%) 96 (0.54%)
Sc 81 (4.5%) 1126 (63%) 0 (0%) 271 (15%) 311 (17%)
Xt 14092 (15%) 12086 (13%) 34877 (38%) 31754 (34%) 131 (0.14%)

plot of chunk allPlots

Figure 1. Density profile of probe intensities (log2). PM: perfect match, MM: mismatch. mm.all: background MM intensities; pm.all: background PM intensities; mm.exon: intensities of MM probes in exonic regions; pm.exon: intensities of PM probes in exonic regions.

As some of these MM probes may bind to transcripts, we further considered those MM probes that uniquely mapped to exons (irrespective of whether the corresponding PM probe mapped zero, one or multiple times to exons or the full genome), examining the differences in signal intensity between the MM and its associated PM. In many cases there is a significant negative difference in the expression level of the PM-MM pair (an example is shown in Figures 2 and 3).

plot of chunk pmmmDensity

Figure 2. Density of MM and PM probe set intensities not including the PM-MM pair that had a large negative difference. Intensities from zebrafish probeset Dr.5545.1.S1_at, in GEO sample GSM604808.CEL.gz.

plot of chunk pmmmIntensity

Figure 3. Square root transformed intensities for each PM-MM pair. The negative difference pair is at the extreme right end of the figure. Intensities from zebrafish probeset Dr.5545.1.S1_at, in GEO sample GSM604808.CEL.gz.

If these MM probes are grouped instead with the other probes within the transcriptional region for which they uniquely match (we have renamed these probes as “true match” ™ probes since they truly match the region in the genome), there is a much better association between the probe intensities, as shown in Figure 4.

plot of chunk tmIntensity

Figure 4. Plot of probe set intensities for zebrafish where the TM probes overlap with the same TMmm probe in Figure 3 Intensities from zebrafish probeset Dr.5545.1.S1_at, in GEO sample GSM604808.CEL.gz.

Further analysis was performed to test the correlation of the TM probes with the expression levels of both the annotated probe group MM probes and with the TM-mapped transcript probes (Figure 5). The box plot in Figure 5 clearly indicates that the TM intensities more closely correlate with those from the group based on mapping to the same exon.

plot of chunk correlationBoxPlot

Figure 5. Box plot of correlations of the TMmm with MM intensities of annotated probe set (red, MM) and TM intensities of custom probe sets based on shared mapping to exons (blue, TM)

To determine if the observed difference in the correlations from Figure 5 is due to measurement of different mRNA entities, we considered the MM probe both within its annotated location as well as within the new mapped location (TMmm). The intensity of the MM probe was compared with the intensity of its corresponding neighbor probes (MM probes for the annotated location; PM probes for the new mapped location). An average correlation value between intensities was calculated on a per-exon probe basis. The resulting correlations are summarized in Table 6.

tm mm ProbeSet Probe ID Annotated RefSeq Exon RefSeq Annotated Symbol Exon Symbol
0.8654 0.53475 209135_at mm.209135_at.763.274 NM_001164750, NM_001164751, NM_001164752, NM_001164753, NM_001164754, NM_001164755, NM_001164756, NM_004318, NM_020164, NM_032466, NM_032467, NM_032468 NM_032466, NM_032468, NM_001164755, NM_001164754, NM_001164753, NM_001164752, NM_001164751 ASPH ASPH
0.8602 0.61192 204041_at mm.204041_at.895.14 NM_000898 NM_000898 MAOB MAOB
0.8444 0.59839 206432_at mm.206432_at.534.594 NM_005328 NM_005328 HAS2 HAS2
0.8080 0.54951 205004_at mm.205004_at.1030.1044 NM_001173487, NM_001173488, NM_017544 NM_001173488, NM_001173487, NM_017544 NKRF NKRF
0.8059 0.78041 201622_at mm.201622_at.1129.72 NM_014390 NM_014390 SND1 SND1
0.7922 0.81628 235958_at mm.235958_at.810.810 NM_213600, NR_033151 NM_213600, NR_033151 PLA2G4F PLA2G4F
0.7890 0.74357 233052_at mm.233052_at.882.6 NM_001206927, NM_001371 NM_001206927 DNAH8 DNAH8
0.7728 0.36086 206084_at mm.206084_at.139.1104 NM_001207015, NM_001207016, NM_002849, NM_130846 NM_002849, NM_001207015, NM_130846, NM_001207016 PTPRR PTPRR
0.7546 0.39994 217416_x_at mm.217416_x_at.747.452 NM_152924, NM_007011 ABHD2
0.7474 0.55101 203646_at mm.203646_at.468.274 NM_004109 NM_004109 FDX1 FDX1
0.7416 0.60912 210467_x_at mm.210467_x_at.1162.450 NM_001166386, NM_001166387, NM_005367 NM_004988 MAGEA12 MAGEA1
0.7221 0.64148 207687_at mm.207687_at.947.968 NM_005538 NM_005538 INHBC INHBC
0.7125 0.48949 211741_x_at mm.211741_x_at.916.1066 NM_021016 NM_002781, NM_001130014 PSG3 PSG5
0.6724 0.40608 211493_x_at mm.211493_x_at.351.512 NM_001128175, NM_001198938, NM_001198939, NM_001198940, NM_001198941, NM_001198942, NM_001198943, NM_001198944, NM_001198945, NM_001390, NM_001391, NM_001392, NM_032975, NM_032978, NM_032979, NM_032980, NM_032981 NM_001198939, NM_001198940, NM_001390, NM_032975, NM_001198938, NM_001198944, NM_001198943, NM_001198942, NM_032980 DTNA DTNA
0.6605 0.55326 219337_at mm.219337_at.935.290 NM_017891 NM_017891 C1orf159 C1orf159
0.6565 0.65743 221351_at mm.221351_at.390.354 NM_000524 NM_000524 HTR1A HTR1A
0.6487 0.54945 238916_at mm.238916_at.697.790 NR_028408 NR_028408 LOC400027 LOC400027
0.6463 0.49228 222221_x_at mm.222221_x_at.912.300 NM_006795 NM_014600 EHD1 EHD3
0.6446 0.41951 203399_x_at mm.203399_x_at.917.1066 NM_021016 NM_002781, NM_001130014 PSG3 PSG5
0.6403 0.77766 201844_s_at mm.201844_s_at.803.352 NM_012234 NM_012234 RYBP RYBP
0.6367 0.34014 240239_at mm.240239_at.132.1114 NM_001145343, NM_001145344, NM_001145345, NM_032838 NM_001145344, NM_001145343, NM_032838, NM_001145345 ZNF566 ZNF566
0.6315 0.37627 201220_x_at mm.201220_x_at.905.256 NM_001083914, NM_001329, NM_022802 NR_003682 CTBP2 MGC70870
0.6299 0.23979 210835_s_at mm.210835_s_at.906.256 NM_001083914, NM_001329, NM_022802 NR_003682 CTBP2 MGC70870
0.5943 0.55241 223485_at mm.223485_at.745.316 NM_032304, NM_207112 NM_032304 HAGHL HAGHL
0.5930 0.45395 206281_at mm.206281_at.529.784 NM_001099733, NM_001117 NM_001117, NM_001099733 ADCYAP1 ADCYAP1
0.5866 0.55602 1562659_at mm.1562659_at.440.914 NR_033984 NR_033984 LOC400548 LOC400548
0.5845 0.51707 229852_at mm.229852_at.330.998 NM_022787 NM_022787 NMNAT1 NMNAT1
0.5725 0.24840 1553901_x_at mm.1553901_x_at.258.150 NM_052852 NM_178558 ZNF486 ZNF680
0.5597 0.53161 220547_s_at mm.220547_s_at.462.628 NM_019054 NM_019054 FAM35A FAM35A
0.5565 0.56330 209811_at mm.209811_at.257.468 NM_001224, NM_032982, NM_032983 NM_032982, NM_032983, NM_001224 CASP2 CASP2
0.5440 0.60341 219710_at mm.219710_at.1132.106 NM_024577 NM_024577 SH3TC2 SH3TC2
0.4971 0.38237 223838_at mm.223838_at.145.282 NM_025244, NM_182911 NM_182911, NM_025244 TSGA10 TSGA10
0.4883 0.63939 221691_x_at mm.221691_x_at.746.320 NM_001037738, NM_002520, NM_199185 NR_036693, NM_001004419, NM_001197317, NM_001197318, NM_001197319, NM_013269 NPM1 CLEC2D
0.4527 0.62369 200724_at mm.200724_at.575.1144 NM_001256577, NM_001256580, NM_006013, NR_026898 NM_001256580, NM_001256577, NM_006013 RPL10 RPL10
0.4048 0.51515 204431_at mm.204431_at.1040.48 NM_001144761, NM_001144762, NM_003260 NM_003260, NM_001144761, NM_001144762 TLE2 TLE2
0.3851 0.09551 217547_x_at mm.217547_x_at.860.982 NM_138330 NM_016220, NM_001013746 ZNF675 ZNF107
0.3284 0.44156 228128_x_at mm.228128_x_at.1075.350 NM_002581 NM_002581 PAPPA PAPPA

What is interesting is that with few exceptions, the transcripts and genes being measured by the probe sets are the same, implying that the TMmm probe aligns to the same gene as its complimentary PM probe. Upon further examination it appears that many of the PM complements of the TMmm do not align to the genome at all (data not shown). One possibility is that these probes are in regions that have seen changes in the reference sequence over the years, or that the original sequencing of the ESTs used to design the probe sets was of poor quality.

Effects of Individual Variation

To gain an understanding of the effect of individual variation, the unique mapping of PM and MM probes to five distinct human genome assemblies was analyzed (Table 7). Four of the five projects have roughly the same number of uniquely mapped PM probes (within 2% variation). The fifth project (BGI) provides an exception to this trend. While there are a number of potential explanations for this (including sequence and assembly quality and coverage of the sequencing), one potential feature to be considered is the fact that this sequencing project involves the sequencing of an African individual, and it is the only project not to have a large component of the library consisting of Caucasian individuals.

Table 7. Number of probes mapping uniquely to individual human genomes.

Assembly Total Perfect Match Mismatch
Hg19 522250 521642 608
Celera 515111 514518 593
JCVI 530213 529569 644
BGI 469973 469714 259
HSAP1 522480 521922 558

While there is a large agreement for the number of probes, we also checked if the probes represented were consistent among all of the projects. A Venn diagram depicting the number of overlapping perfectly matching probes is given in Figure 6. As can be seen from this figure, a total of 422279 probes uniquely map for all five assemblies. Of these, 422118 are perfect match probes, indicating that 70% of the HGU133APlus2.0 perfect match probes are reliable in terms of their mapping to the genome for these assemblies. One of the interesting results is that there are 161 shared perfectly matching mismatch probes. Of these, 24 fall within RefSeq annotated exonic regions (results not shown), with 16 of the 24 showing higher correlation in the TMmm assignments calculated in Table 6.

plot of chunk insertVenn

Figure 6. Venn diagram of overlapping perfectly matching Affymetrix® HGU133A Plus2 probes to each of the five human genome assemblies.

Conclusion

MM probes are theoretically designed to capture background and non-specific binding. Alignment of the MM probes to the genome shows that in a very small percentage of cases, MM probes align uniquely to the genome in transcribed regions. Signal from these probes should be useful for quantifying true transcriptional events rather than for PM signal adjustment.

In addition, current custom CDF generation workflows ignore the MM probes during the probe alignment process. Given that some MM probes align to reference genomes, they should be considered for inclusion when creating custom CDFs. The utility of the probes may be limited due to variation among individuals.

ACKNOWLEDGEMENTS

This work was partially funded by National Institutes of Health (NIH) grant 8P20GM103436-12. Its contents are solely the responsibility of the authors and do not represent the official views of NIH or the National Institute of General Medical Sciences.

REFERENCES

[1] Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 5235 (Oct 20 1995), 467-470.

[2] Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320, 5881 (Jun 6 2008), 1344-1349.

[3] Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Muertter, R. N., Holko, M., Ayanbule, O., Yefanov, A. and Soboleva, A. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic acids research, 39, Database issue (Jan 2011), D1005-1010.

[4] Parkinson, H., Sarkans, U., Kolesnikov, N., Abeygunawardena, N., Burdett, T., Dylag, M., Emam, I., Farne, A., Hastings, E., Holloway, E., Kurbatova, N., Lukk, M., Malone, J., Mani, R., Pilicheva, E., Rustici, G., Sharma, A., Williams, E., Adamusiak, T., Brandizi, M., Sklyar, N. and Brazma, A. ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic acids research, 39, Database issue (Jan 2011), D1002-1004.

[5] Deng, Y., He, Z., Van Nostrand, J. D. and Zhou, J. Design and analysis of mismatch probes for long oligonucleotide microarrays. BMC genomics, 92008), 491.

[6] Hubbell, E., Liu, W. M. and Mei, R. Robust estimators for expression analysis. Bioinformatics, 18, 12 (Dec 2002), 1585-1592.

[7] Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B. and Speed, T. P. Summaries of Affymetrix GeneChip probe level data. Nucleic acids research, 31, 4 (Feb 15 2003), e15.

[8] Wu, Z. and Irizarry, R. A. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Journal of computational biology : a journal of computational molecular cell biology, 12, 6 (Jul-Aug 2005), 882-893.

[9] Carter, S. L., Eklund, A. C., Mecham, B. H., Kohane, I. S. and Szallasi, Z. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC bioinformatics, 62005), 107.

[10] Dai, M., Wang, P., Boyd, A. D., Kostov, G., Athey, B., Jones, E. G., Bunney, W. E., Myers, R. M., Speed, T. P., Akil, H., Watson, S. J. and Meng, F. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic acids research, 33, 20 2005), e175.

[11] de Leeuw, W. C., Rauwerda, H., Jonker, M. J. and Breit, T. M. Salvaging Affymetrix probes after probe-level re-annotation. BMC research notes, 12008), 66.

[12] Ferrari, F., Bortoluzzi, S., Coppe, A., Sirota, A., Safran, M., Shmoish, M., Ferrari, S., Lancet, D., Danieli, G. A. and Bicciato, S. Novel definition files for human GeneChips based on GeneAnnot. BMC bioinformatics, 82007), 446.

[13] Langer, W., Sohler, F., Leder, G., Beckmann, G., Seidel, H., Grone, J., Hummel, M. and Sommer, A. Exon array analysis using re-defined probe sets results in reliable identification of alternatively spliced genes in non-small cell lung cancer. BMC genomics, 112010), 676.

[14] Liu, H., Zeeberg, B. R., Qu, G., Koru, A. G., Ferrucci, A., Kahn, A., Ryan, M. C., Nuhanovic, A., Munson, P. J., Reinhold, W. C., Kane, D. W. and Weinstein, J. N. AffyProbeMiner: a web resource for computing or retrieving accurately redefined Affymetrix probe sets. Bioinformatics, 23, 18 (Sep 15 2007), 2385-2390.

[15] Lu, J., Lee, J. C., Salit, M. L. and Cam, M. C. Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: high-resolution annotation for microarrays. BMC bioinformatics, 82007), 108.

[16] Moll, A. G., Lindenmeyer, M. T., Kretzler, M., Nelson, P. J., Zimmer, R. and Cohen, C. D. Transcript-specific expression profiles derived from sequence-based analysis of standard microarrays. PloS one, 4, 3 2009), e4702.

[17] Moreews, F., Rauffet, G., Dehais, P. and Klopp, C. SigReannot-mart: a query environment for expression microarray probe re-annotations. Database : the journal of biological databases and curation, 20112011), bar025.

[18] Neerincx, P. B., Rauwerda, H., Nie, H., Groenen, M. A., Breit, T. M. and Leunissen, J. A. OligoRAP - an Oligo Re-Annotation Pipeline to improve annotation and estimate target specificity. BMC proceedings, 3 Suppl 42009), S4.

[19] Risueno, A., Fontanillo, C., Dinger, M. E. and De Las Rivas, J. GATExplorer: genomic and transcriptomic explorer; mapping expression probes to gene loci, transcripts, exons and ncRNAs. BMC bioinformatics, 112010), 221.

[20] Sandberg, R. and Larsson, O. Improved precision and accuracy for microarrays using updated probe set definitions. BMC bioinformatics, 82007), 48.

[21] Yin, J., McLoughlin, S., Jeffery, I. B., Glaviano, A., Kennedy, B. and Higgins, D. G. Integrating multiple genome annotation databases improves the interpretation of microarray gene expression data. BMC genomics, 112010), 50.

[22] Yu, H., Wang, F., Tu, K., Xie, L., Li, Y. Y. and Li, Y. X. Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data. BMC bioinformatics, 82007), 194.

[23] Maglott, D., Ostell, J., Pruitt, K. D. and Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic acids research, 39, Database issue (Jan 2011), D52-57.

[24] Pruitt, K. D., Tatusova, T., Brown, G. R. and Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research, 40, Database issue (Jan 2012), D130-135.

[25] Lu, X. and Zhang, X. The effect of GeneChip gene definitions on the microarray study of cancers. BioEssays : news and reviews in molecular, cellular and developmental biology, 28, 7 (Jul 2006), 739-746.

[26] Sayers, E. W., Barrett, T., Benson, D. A., Bolton, E., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., Dicuccio, M., Federhen, S., Feolo, M., Fingerman, I. M., Geer, L. Y., Helmberg, W., Kapustin, Y., Krasnov, S., Landsman, D., Lipman, D. J., Lu, Z., Madden, T. L., Madej, T., Maglott, D. R., Marchler-Bauer, A., Miller, V., Karsch-Mizrachi, I., Ostell, J., Panchenko, A., Phan, L., Pruitt, K. D., Schuler, G. D., Sequeira, E., Sherry, S. T., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T. A., Wagner, L., Wang, Y., Wilbur, W. J., Yaschenko, E. and Ye, J. Database resources of the National Center for Biotechnology Information. Nucleic acids research, 40, Database issue (Jan 2012), D13-25.

[27] Rouchka, E. C., Phatak, A. W. and Singh, A. V. Effect of single nucleotide polymorphisms on Affymetrix match-mismatch probe pairs. Bioinformation, 2, 9 2008), 405-411.

[28] Dreszer, T. R., Karolchik, D., Zweig, A. S., Hinrichs, A. S., Raney, B. J., Kuhn, R. M., Meyer, L. R., Wong, M., Sloan, C. A., Rosenbloom, K. R., Roe, G., Rhead, B., Pohl, A., Malladi, V. S., Li, C. H., Learned, K., Kirkup, V., Hsu, F., Harte, R. A., Guruvadoo, L., Goldman, M., Giardine, B. M., Fujita, P. A., Diekhans, M., Cline, M. S., Clawson, H., Barber, G. P., Haussler, D. and James Kent, W. The UCSC Genome Browser database: extensions and updates 2011. Nucleic acids research, 40, Database issue (Jan 2012), D918-923.

[29] Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10, 3 2009), R25.

[30] Liu, G., Loraine, A. E., Shigeta, R., Cline, M., Cheng, J., Valmeekam, V., Sun, S., Kulp, D. and Siani-Rose, M. A. NetAffx: Affymetrix probesets and annotations. Nucleic acids research, 31, 1 (Jan 1 2003), 82-86.

[31] Aboyoun, H., Pages, H. and Lawrence, M. GenomicRanges: Representation and manipulation of genomic intervals. . City.

[32] Gnerre, S., Maccallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., Sharpe, T., Hall, G., Shea, T. P., Sykes, S., Berlin, A. M., Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E. S. and Jaffe, D. B. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America, 108, 4 (Jan 25 2011), 1513-1518.

[33] Istrail, S., Sutton, G. G., Florea, L., Halpern, A. L., Mobarry, C. M., Lippert, R., Walenz, B., Shatkay, H., Dew, I., Miller, J. R., Flanigan, M. J., Edwards, N. J., Bolanos, R., Fasulo, D., Halldorsson, B. V., Hannenhalli, S., Turner, R., Yooseph, S., Lu, F., Nusskern, D. R., Shue, B. C., Zheng, X. H., Zhong, F., Delcher, A. L., Huson, D. H., Kravitz, S. A., Mouchard, L., Reinert, K., Remington, K. A., Clark, A. G., Waterman, M. S., Eichler, E. E., Adams, M. D., Hunkapiller, M. W., Myers, E. W. and Venter, J. C. Whole-genome shotgun assembly and comparison of human genome assemblies. Proceedings of the National Academy of Sciences of the United States of America, 101, 7 (Feb 17 2004), 1916-1921.

[34] Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer, S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Weinstock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W., Federspiel, N. A., Abola, A. P., Proctor, M. J., Myers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting, C. P., Schuler, G., Schultz, J., Slater, G., Smit, A. F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh, R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Morgan, M. J., de Jong, P., Catanese, J. J., Osoegawa, K., Shizuya, H., Choi, S. and Chen, Y. J. Initial sequencing and analysis of the human genome. Nature, 409, 6822 (Feb 15 2001), 860-921.

[35] Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., Axelrod, N., Huang, J., Kirkness, E. F., Denisov, G., Lin, Y., MacDonald, J. R., Pang, A. W., Shago, M., Stockwell, T. B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S. A., Busam, D. A., Beeson, K. Y., McIntosh, T. C., Remington, K. A., Abril, J. F., Gill, J., Borman, J., Rogers, Y. H., Frazier, M. E., Scherer, S. W., Strausberg, R. L. and Venter, J. C. The diploid genome sequence of an individual human. PLoS biology, 5, 10 (Sep 4 2007), e254.

[36] Li, R., Li, Y., Zheng, H., Luo, R., Zhu, H., Li, Q., Qian, W., Ren, Y., Tian, G., Li, J., Zhou, G., Zhu, X., Wu, H., Qin, J., Jin, X., Li, D., Cao, H., Hu, X., Blanche, H., Cann, H., Zhang, X., Li, S., Bolund, L., Kristiansen, K., Yang, H. and Wang, J. Building the sequence map of the human pan-genome. Nature biotechnology, 28, 1 (Jan 2010), 57-63.

RECREATION OF RESULTS

All of the necessary code to regenerate this work is hosted at https://github.com/rmflight/affymm. To recreate this html document, you need to run:

knit2html("flightetal_draft.Rmd", "flightetal_draft.html")

Alternatively use the Knit HTML button in RStudio v0.97 or newer. This document was generated using R version 2.15.0 (2012-03-30). The following information about packages used is also supplied:

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GenomicRanges_1.8.13    IRanges_1.14.4         
 [3] hgu133plus2probe_2.10.0 hgu133plus2.db_2.7.1   
 [5] org.Hs.eg.db_2.7.1      RSQLite_0.11.2         
 [7] DBI_0.2-5               AnnotationDbi_1.18.3   
 [9] Biobase_2.16.0          BiocGenerics_0.2.0     
[11] VennDiagram_1.5.1       plyr_1.7.1             
[13] ggplot2_0.9.2.1         knitr_0.8.1            

loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2      
 [4] evaluate_0.4.2     formatR_0.6        gtable_0.1.1      
 [7] labeling_0.1       MASS_7.3-21        memoise_0.1       
[10] munsell_0.4        proto_0.3-9.2      RColorBrewer_1.0-5
[13] reshape2_1.2.1     scales_0.2.2       stats4_2.15.0     
[16] stringr_0.6.1      tools_2.15.0