Microarray Lab

1. Reasons for generating custom CDF files for Affymetrix GeneChips

2. Procedures for generating custom CDF files

3. Statistics of Affymetrix and custom CDF files

4. Known shortcomings of custom CDF files

5. Effects of custom CDF files on the detection of differential expression

6. How to use the custom CDF files

7. Bulletin board for comments and suggestions

1. Reasons for generating custom CDF files for Affymetrix GeneChips

Affymetrix GeneChips were designed using the best UniGene clustering and genomic sequence information available at the time of chip design. Because EST, cDNA, and genomic sequence information has increased significantly over the last few years, some oligonucleotide probes in these older designs can now be assigned to different genes or transcripts under the current UniGene clustering and genome annotation. Although Affymetrix’s current annotation system remaps each probe set to the latest UniGene build every few months, it does not handle cases where a subset of the probes in a probe set should be assigned to another gene, or to more than one gene, under the current UniGene clustering and genome annotation.
In addition, a significant portion of UniGene clusters are represented by more than one oligonucleotide probe set on GeneChips, but there is no standard approach for handling signals from different probe sets that represent the same gene. A one-to-one relationship between probe sets and targets is highly desirable for interpreting the data.
To support different analysis purposes, we also developed a function called "removeprobe" in our CustomCDF R package (http://arrayanalysis.mbni.med.umich.edu/MBNIUM.html#CustomCDF ) that allows further customization of CDFs, such as removing allele-specific probes and distant 5' probes, based on a user-generated list of "bad probes".
Furthermore, based on the chimpanzee genome sequence, we generated chimpanzee versions of the CDF files for popular human GeneChips.

2. Procedures for generating custom CDF files

After probe sequences are BLASTed against the latest UniGene build and genome sequence, a series of filtering and grouping criteria is applied to produce the different CDF files.

2.1. UniGene CDF files: These are based on UniGene clustering and genome sequence. They are the closest to the Affymetrix annotation in terms of gene definition.

· A probe must have a perfect match (hit) to both cDNA/EST sequences and the genome sequence.

· A probe must hit only one UniGene cluster and one genomic location.

· All probes representing the same gene must align sequentially in the same direction within the same genomic region

· Each probe set must contain at least three oligonucleotide probes and probes in a set are ordered according to their genomic location.
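The criteria above can be sketched in a few lines of base R. This is only an illustration on a made-up alignment table, not the pipeline actually used to build the CDF files: the column names, probe IDs, and the "most common chromosome/strand" rule standing in for the same-direction, same-region requirement are all assumptions, and the table is assumed to contain perfect matches only.

```r
# Hypothetical alignment table: one row per (probe, UniGene cluster, genomic hit).
# Column names and values are illustrative, not the site's actual file format.
hits <- data.frame(
  probe   = c("p1", "p2", "p3", "p4", "p5", "p5", "p6", "p7"),
  cluster = c("Hs.1", "Hs.1", "Hs.1", "Hs.1", "Hs.2", "Hs.3", "Hs.2", "Hs.2"),
  chr     = c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  strand  = c("+", "+", "+", "-", "+", "+", "+", "+"),
  start   = c(300, 100, 200, 150, 500, 500, 600, 700),
  stringsAsFactors = FALSE
)

# 1. Keep probes that hit exactly one cluster and one genomic location.
n_cluster <- tapply(hits$cluster, hits$probe, function(x) length(unique(x)))
loc       <- paste(hits$chr, hits$strand, hits$start)
n_loc     <- tapply(loc, hits$probe, function(x) length(unique(x)))
unique_probes <- names(n_cluster)[n_cluster == 1 & n_loc == 1]
kept <- hits[hits$probe %in% unique_probes, ]

build_set <- function(d) {
  # 2. Keep probes on the most common chromosome/strand of the cluster
  #    (simplified stand-in for "same direction, same genomic region").
  key <- paste(d$chr, d$strand)
  d <- d[key == names(which.max(table(key))), ]
  # 3. Require at least three probes and order them by genomic position.
  if (nrow(d) < 3) return(NULL)
  d$probe[order(d$start)]
}

probe_sets <- lapply(split(kept, kept$cluster), build_set)
probe_sets <- Filter(Negate(is.null), probe_sets)
print(probe_sets)
```

In this toy table, probe p5 is dropped because it hits two clusters, p4 is dropped because it lies on the opposite strand, and cluster Hs.2 is discarded because fewer than three probes remain.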

2.2. CDF files for Reference sequence, Entrez Gene and Exon, ENSEMBL Gene, Transcript and Exon and VEGA Gene, Transcript and Exon

· A probe must hit only one genomic location

· Probes that can be mapped to the same target sequence in the correct direction are grouped together in the same probe set.

· Each probe set must contain at least three oligonucleotide probes and probes in a set are ordered according to their location in the corresponding exon.

2.3. Chimpanzee CDF definitions.

· Affymetrix-Chimp: Probes in the human Affymetrix CDF that are not present in the chimpanzee genome are eliminated from the corresponding human Affymetrix CDF.

· Human UniGene-Chimp: Probes in the human UniGene-based CDF with no hit or with more than one hit on the chimpanzee genome are eliminated.
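The two chimpanzee rules differ only in how multiple hits are treated. A minimal sketch, assuming a hypothetical vector of hit counts per probe on the chimpanzee genome (in practice the two rules start from different human CDFs; one toy probe list is used here for brevity):

```r
# Hypothetical number of perfect-match hits per probe on the chimpanzee genome.
chimp_hits <- c(p1 = 1, p2 = 0, p3 = 2, p4 = 1, p5 = 1)

# Affymetrix-Chimp: drop only probes that are absent from the chimpanzee genome.
affy_chimp <- names(chimp_hits)[chimp_hits >= 1]

# Human UniGene-Chimp: keep only probes with exactly one chimpanzee hit.
unigene_chimp <- names(chimp_hits)[chimp_hits == 1]

print(affy_chimp)
print(unigene_chimp)
```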

3. Statistics of Affymetrix and custom CDF files

· Version 1 Statistics

· Version 2 Statistics

· Version 3 Statistics

· Version 4 Statistics

· Version 5 Statistics

· Version 6 Statistics

· Version 7 Statistics

· Version 8 Statistics

· Version 9 Statistics

· Version 10 Statistics

4. Known shortcomings of custom CDF files

4.1. Probe sets in these custom CDF files contain from 3 to several dozen probes, so the within-chip error differs widely between probe sets.

4.2. UniGene CDF files: While our criteria ensure the purity of the redefined probe sets based on the available information, we may discard some good probes, because large UniGene clusters may contain a small percentage of sequences from other genes due to chimeric clones or significantly homologous sequences.

4.3. ENSEMBL exon CDF files: There is still significant overlap and redundancy in the ENSEMBL exon definitions. Exons represented by the same probe set can be identified with the probe-exon query function on our website.

4.4. ENSEMBL transcript CDF files: Although ENSEMBL probably provides the most extensive and clearest transcript definitions in the public domain, it may not include all known transcripts because of issues such as database synchronization. Probes targeting different regions of a transcript may not behave the same way.

5. Effects of custom CDF files on the detection of differential expression

Preliminary studies using human brain samples show that the most obvious effect of the Hs_U133A_UG167 CDF file is the “averaging” of expression values across multiple Affymetrix probe sets representing the same UniGene cluster (~4000 clusters). The within-chip standard error for those merged probe sets is significantly smaller because of the increased number of probes. However, when only 3-5 probes represent a UniGene cluster, the within-chip standard error is usually 2-3 times larger than that of the corresponding Affymetrix probe set. Among probe sets that do not involve merging, about 900 showed at least a 25% change in absolute expression value compared with the Affymetrix probe set representing the same gene.
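The direction of the probe-number effect can be illustrated with a deliberately simple model in which probe-level noise is independent with a common standard deviation, so the standard error of a probe-set mean scales as 1/sqrt(n). This is an assumption made for illustration only; the reference size of 11 probes per original probe set is also assumed, and the model is not meant to reproduce the observed figures.

```r
# Simplified model: independent probe-level noise with a common standard deviation.
# Real probe-level errors are neither independent nor equal across probes.
se_mean <- function(sigma, n_probes) sigma / sqrt(n_probes)

sigma <- 1               # arbitrary probe-level standard deviation
n <- c(3, 5, 11, 22)     # 11 = assumed size of an original probe set

# Standard error relative to an 11-probe set.
ratio <- se_mean(sigma, n) / se_mean(sigma, 11)
print(round(setNames(ratio, n), 2))
```

Under this model a 3-probe set has a standard error roughly 1.9 times that of an 11-probe set, while merging two 11-probe sets lowers it by about 30%. The larger 2-3 fold increase reported above suggests that real probe-level errors do not follow this idealized model.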

A pilot study comparing the Hs_U133A_UG167 CDF with the Affymetrix U133A CDF, using data from 14 human brain samples, suggests that 20%-30% of the genes classified as differentially expressed can differ under several cutoff criteria.
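A comparison of this kind amounts to measuring how much two differentially expressed gene lists overlap. The sketch below uses made-up gene lists, assumed to be already mapped to a common identifier, and is not the analysis performed in the pilot study.

```r
# Hypothetical differentially expressed gene lists from the same samples,
# analysed once with each CDF and mapped to a common identifier.
de_affy   <- c("Hs.1", "Hs.2", "Hs.3", "Hs.4", "Hs.5")
de_custom <- c("Hs.1", "Hs.2", "Hs.3", "Hs.6", "Hs.7")

shared <- intersect(de_affy, de_custom)

# Fraction of each list that is not reproduced with the other CDF.
frac_changed_affy   <- 1 - length(shared) / length(de_affy)
frac_changed_custom <- 1 - length(shared) / length(de_custom)

print(c(affy = frac_changed_affy, custom = frac_changed_custom))
```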

Systematic analysis of the impact of custom CDFs can be found in the following papers:

M. Dai, P. Wang, A. D. Boyd, G. Kostov, B. Athey, E. G. Jones, W. E. Bunney, R. M. Myers, T. P. Speed, H. Akil, S. J. Watson, F. Meng (2005) Evolving Gene/Transcript Definitions Significantly Alter the Interpretation of GeneChip Data. Nucleic Acids Research 33 (20), e175 (http://nar.oxfordjournals.org/cgi/content/full/33/20/e175 )

Lu X, and Zhang X. (2006) The effect of GeneChip gene definitions on the microarray study of cancers. Bioessays. 28(7):739-46. (http://www3.interscience.wiley.com/cgi-bin/abstract/112708504/ABSTRACT?CRETRY=1&SRETRY=0 )

Rickard Sandberg and Ola Larsson (2007) Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics 8:48. (http://www.biomedcentral.com/1471-2105/8/48 )

6. How to use the custom CDF files

Custom CDF files can be selected based on species, Affymetrix GeneChip type, CDF file type and CDF file format on our CDF download webpage.

6.1. Affymetrix MAS5 and dCHIP

The ASCII format CDF is for Affymetrix MAS5 and standalone dCHIP analysis. After unzipping the ASCII CDF package, the custom CDF file can be used in exactly the same way as Affymetrix CDF files. Please note that dCHIP only accepts Affymetrix CDF names, so the custom CDF file must be renamed to the corresponding Affymetrix name.
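The renaming step for dCHIP can be done by hand or scripted. A minimal sketch in R, where both file names are assumptions to be replaced with the custom CDF you unzipped and the Affymetrix CDF name that matches your chip type:

```r
# Hypothetical file names: adjust both to your own chip and download.
custom_cdf <- file.path(tempdir(), "HS133A_HS_UG.cdf")
affy_name  <- file.path(tempdir(), "HG-U133A.cdf")

# Stand-in for the unzipped custom CDF, so that the sketch runs as written.
file.create(custom_cdf)

# Copy rather than rename, so the original custom file name is preserved.
ok <- file.copy(custom_cdf, affy_name, overwrite = TRUE)
print(ok)
```

Copying keeps a record of which custom CDF sits behind the Affymetrix name, which helps when several custom CDF types are tested on the same chip.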

6.2. Bioconductor

The R packages for Win32/Linux are for use with the GeneChip analysis functions in Bioconductor on the corresponding platforms. Since version 11, Bioconductor redirects requests to our own repository, http://brainarray.mbni.med.umich.edu/bioc. Alternatively, you can add our repository by editing the file $R_HOME/etc/repositories.
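As an alternative to editing the repositories file, the repository can be added for the current R session only. This is a sketch, not an official installation procedure: the entry name "BrainArray" is arbitrary, and the package name in the commented-out line is an assumption that should be checked against the download page.

```r
# Add the custom CDF repository to this session's repository list.
# The URL is the one given above; the entry name "BrainArray" is arbitrary.
repos <- getOption("repos")
repos["BrainArray"] <- "http://brainarray.mbni.med.umich.edu/bioc"
options(repos = repos)

print(getOption("repos"))

# Installation then works as for any other package. The package name below is
# an assumption (lower-cased CDF name without underscores, plus "cdf").
# install.packages("hs133ahsugcdf")
```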

A. Use custom CDF files Version 8 in the R environment (for Bioconductor 1.9).

Version 8 of our custom CDFs is included in the Bioconductor 1.9 repository, so it is downloaded and installed automatically, just like Affymetrix's original CDF packages. All you need to do is replace the AffyBatch object's cdfName with the custom CDF name. For example,

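library(affy)
# Read all CEL files in the working directory using the custom CDF.
# A custom CDF name consists of the original CDF name, species and custom CDF type.
data <- ReadAffy(cdfname = "HS133A_HS_UG")
result <- rma(data)  # RMA summarization based on the custom probe sets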

Please note that the version number has been removed from cdfName since version 8.

B. Use custom CDF files Version 7 in the R environment (for Bioconductor 1.8).

Version 7 of our custom CDFs is included in the Bioconductor 1.8 repository, so it is downloaded and installed automatically, just like Affymetrix's original CDF packages. All you need to do is replace the AffyBatch object's cdfName. For example,

library(affy)
data <- ReadAffy()                  # read all CEL files in the working directory
data@cdfName <- "HS133A_HS_UG_7"    # switch to the version 7 custom CDF
result <- rma(data)

C. Download and install custom CDF files in the R environment (for all older Bioconductor versions).

Under Linux/Unix, use command "R CMD INSTALL ?.tar.gz".

Under Windows, select menu "Packages->Install package(s) from local zip files".

To use the custom CDF files in data analysis after installation, add a single R command that replaces the default Affymetrix CDF name. The following are some examples for different chip and custom probe set combinations:

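data <- ReadAffy()                          # read all CEL files in the working directory
data@cdfName <- "HS133A_HS_UG_[V]"          # UniGene-based custom CDF for the U133A chip

data <- read.affybatch("1.cel", "2.cel")    # read specific CEL files
data@cdfName <- "HS133B_HS_ENSG_[V]"        # ENSEMBL gene-based custom CDF for the U133B chip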

Note: substitute [V] with the version number of the CDF you downloaded.
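The naming convention used in the examples above (original CDF name, species, custom CDF type, plus a version suffix for version 7 and earlier) can be wrapped in a small helper. The function name custom_cdf_name is hypothetical and not part of any package.

```r
# Hypothetical helper: build a custom CDF name from its parts.
# Leave `version` as NULL for version 8 and later, where the suffix is dropped.
custom_cdf_name <- function(chip, species, type, version = NULL) {
  parts <- c(chip, species, type)
  if (!is.null(version)) parts <- c(parts, version)
  paste(parts, collapse = "_")
}

name_v8 <- custom_cdf_name("HS133A", "HS", "UG")       # no version suffix
name_v7 <- custom_cdf_name("HS133B", "HS", "ENSG", 7)  # with version suffix

print(c(name_v8, name_v7))
```

The result can be assigned to data@cdfName or passed to ReadAffy(cdfname = ...) as in the examples above.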

6.3. The probe mapping file matches individual probes in the custom CDF file to those in the corresponding Affymetrix CDF file.

6.4. The grouping file can be used to find all targets (exons, transcripts) represented by the same probe set. It also contains the range spanned by each probe set on the genome or transcript, to facilitate RT-PCR primer design.

6.5. The best acc file contains the best nucleic acid accession numbers in the corresponding UniGene databases. In general, the most reliable (RefSeq > cDNA > EST) short sequence with the maximum probe match count is selected to represent a probe set. Affymetrix’s “Representative Public ID” entries are also updated, and in most cases our chosen accession numbers have more probe hits than the original ones.
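One possible reading of this selection rule is a three-level sort: highest probe match count first, then sequence reliability, then shorter length. The order in which the criteria are applied is an assumption, as are the column names and accession labels in this sketch.

```r
# Hypothetical candidate sequences for one probe set (labels are made up).
cand <- data.frame(
  acc        = c("refseq_A", "cdna_B", "est_C", "refseq_D"),
  type       = c("RefSeq", "cDNA", "EST", "RefSeq"),
  probe_hits = c(9, 11, 11, 11),
  length     = c(2500, 3100, 600, 4000),
  stringsAsFactors = FALSE
)

# Lower rank = more reliable sequence type (RefSeq > cDNA > EST).
rank_type <- c(RefSeq = 1, cDNA = 2, EST = 3)

# Assumed priority: most probe hits, then reliability, then shorter sequence.
ord <- order(-cand$probe_hits, rank_type[cand$type], cand$length)
best_acc <- cand$acc[ord[1]]

print(best_acc)
```

Here refseq_D is chosen: it ties for the most probe hits and is the most reliable type among the tied candidates.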

6.6. The structure of a custom probe set name is “database entry ID_at”. New probe set names can be linked to their corresponding UniGene and ENSEMBL entries using the “Batch query custom probe sets identity” function at http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp .

6.7. The effect of various custom CDF files can be tested on CEL files deposited in NCBI GEO and EBI ArrayExpress through the “GeneChip Analysis using Custom CDF files” function at http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp .
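Because probe set names follow the “database entry ID_at” pattern, the database entry ID can be recovered by stripping the suffix. The probe set names below are made up for illustration.

```r
# Hypothetical custom probe set names following the "database entry ID_at" pattern.
probe_set_names <- c("Hs.12345_at", "ENSG00000000001_at", "1234_at")

# Remove the trailing "_at" to recover the database entry ID.
entry_ids <- sub("_at$", "", probe_set_names)

print(entry_ids)
```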

7. Bulletin board for comments and suggestions

Comments and suggestions are welcome. You can post your opinions and discuss relevant issues in our forum.