-
Affymetrix GeneChips were based on the best UniGene
clustering and genomic sequence information available at the time of
chip design. Due to the significant increase in EST/cDNA/Genomic
sequence information in the last couple of years, some
oligonucleotide probes in these old designs can now be assigned to
different genes/transcripts based on the current UniGene clustering
and genome annotation. While Affymetrix’s current annotation system
maps each probe set to the latest UniGene build every couple of
months, it does not deal with situations where a subset of
oligonucleotide probes in a probe set may be assigned to another
gene or more than one gene based on the current UniGene clustering
and genome annotation.
-
In addition, a significant portion of UniGene
clusters are represented by more than one oligonucleotide probe
set on GeneChips, but there is no standard approach for handling
signals from different probe sets that represent the same gene. A
one-to-one relationship between probe sets and targets is highly
desirable for interpreting the data.
-
To satisfy different analysis purposes, we also
developed a new function called "removeprobe" in our CustomCDF R
package (http://arrayanalysis.mbni.med.umich.edu/MBNIUM.html#CustomCDF)
that gives more flexibility in further customizing CDFs, such as removing
allele-specific probes and distant 5' probes based on a user-generated
list of "bad probes".
-
Furthermore, based on the chimpanzee genome
sequence, we generated chimpanzee versions of the CDF files for
popular human GeneChips.
After probe sequences are BLASTed against the
latest UniGene Build and genome sequence, a series of filtering and
grouping criteria are applied for different CDF files.
2.1. UniGene CDF files:
They are based on UniGene clustering and genome sequence. They are
the closest to the Affymetrix annotation in terms of gene definition.
·
A probe must have a perfect match (hit) to both cDNA/EST sequences and
the genome sequence.
·
A probe must hit only one UniGene cluster and one genomic location.
·
All probes representing the same gene must align sequentially in the
same direction within the same genomic region.
·
Each probe set must contain at least three oligonucleotide probes and
probes in a set are ordered according to their genomic location.
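The criteria above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the pipeline actually used to build the CDF files: the table layout, the column names (`n_clusters`, `n_loci`, `strand`, `pos`) and the function `build_probe_sets` are hypothetical, and the "same direction" rule is simplified to keeping the probes on the most common strand of each cluster.

```r
# Hypothetical per-probe summary of the BLAST results (all values invented).
probes <- data.frame(
  probe      = paste0("p", 1:8),
  cluster    = c("Hs.1", "Hs.1", "Hs.1", "Hs.1", "Hs.2", "Hs.2", "Hs.2", "Hs.1"),
  n_clusters = c(1, 1, 1, 1, 1, 1, 2, 1),   # UniGene clusters hit by the probe
  n_loci     = c(1, 1, 1, 1, 1, 1, 1, 3),   # genomic locations hit by the probe
  strand     = c("+", "+", "+", "-", "+", "+", "+", "+"),
  pos        = c(300, 100, 200, 150, 10, 20, 30, 400),
  stringsAsFactors = FALSE
)

build_probe_sets <- function(probes, min_probes = 3) {
  # Keep probes that hit exactly one cluster and one genomic location.
  unique_hits <- probes[probes$n_clusters == 1 & probes$n_loci == 1, ]
  sets <- lapply(split(unique_hits, unique_hits$cluster), function(s) {
    # Simplification: keep only probes on the cluster's most common strand.
    main_strand <- names(which.max(table(s$strand)))
    s <- s[s$strand == main_strand, ]
    # Order the probes by genomic location.
    s[order(s$pos), ]
  })
  # Drop probe sets with fewer than min_probes probes.
  sets[vapply(sets, nrow, integer(1)) >= min_probes]
}

sets <- build_probe_sets(probes)
```

In this toy input only the cluster labelled "Hs.1" survives, because the other cluster is left with two probes after the multi-cluster probe is removed.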
2.2. CDF files for Reference sequence, Entrez Gene and Exon, ENSEMBL Gene, Transcript and Exon, and VEGA Gene, Transcript and Exon:
·
A probe must hit only one genomic location.
·
Probes that can be mapped to the same target sequence in the correct
direction are grouped together in the same probe set.
·
Each probe set must contain at least three oligonucleotide probes and
probes in a set are ordered according to their location in the
corresponding exon.
2.3. Chimpanzee CDF definitions:
·
Affymetrix-Chimp: Probes that are in the Affymetrix CDF but not present in the
chimpanzee genome are eliminated from the corresponding human Affymetrix
CDF.
·
Human UniGene-Chimp: Probes in the human UniGene-based CDF with no hit
or with more than one hit on the chimpanzee genome are eliminated.
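The two chimpanzee rules differ only in how many chimpanzee genome hits a probe is allowed to have, which the following minimal sketch makes explicit. The function `filter_chimp`, its arguments and the hit counts are hypothetical and serve only to illustrate the rules as stated above.

```r
# chimp_hits: hypothetical number of hits each human probe has on the
# chimpanzee genome; probes absent from the vector are treated as no hit.
filter_chimp <- function(probe_ids, chimp_hits, unique_only = FALSE) {
  n <- chimp_hits[probe_ids]
  n[is.na(n)] <- 0
  if (unique_only) {
    probe_ids[n == 1]   # Human UniGene-Chimp rule: exactly one hit
  } else {
    probe_ids[n >= 1]   # Affymetrix-Chimp rule: present in the chimp genome
  }
}

chimp_hits <- c(p1 = 1, p2 = 0, p3 = 2, p4 = 1)
probe_ids  <- c("p1", "p2", "p3", "p4", "p5")

affy_chimp    <- filter_chimp(probe_ids, chimp_hits)
unigene_chimp <- filter_chimp(probe_ids, chimp_hits, unique_only = TRUE)
```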
·
Version 1 Statistics
·
Version 2 Statistics
·
Version 3 Statistics
·
Version 4 Statistics
·
Version 5 Statistics
·
Version 6 Statistics
·
Version 7 Statistics
·
Version 8 Statistics
·
Version 9 Statistics
·
Version 10 Statistics
4.1. Probe sets in these custom
CDF files contain from 3 to several dozen probes. The within-chip error
differs considerably between probe sets.
4.2.
UniGene CDF files: While
our criteria ensure the purity of the redefined probe sets based on the available
information, we may discard some good probes, since large UniGene clusters may
contain a small percentage of sequences from other genes due to the presence of
chimeric clones or sequences with significant homology.
4.3.
ENSEMBL exon CDF files:
There is still significant overlap and redundancy in the ENSEMBL exon definitions.
Exons represented by the same probe set can be identified with the probe-exon query
function on our website.
4.4.
ENSEMBL transcript CDF
files: Although ENSEMBL probably provides the most extensive and clearest
transcript definitions in the public domain, it may not include all known
transcripts due to issues such as database synchronization. Probes targeting
different regions of a transcript may not behave the same way.
Preliminary studies
using human brain samples show that the most obvious effect of
the Hs_U133A_UG167 CDF file is the “averaging” of expression
values for multiple Affy probe sets representing the same
UniGene cluster (~4000 clusters). The within-chip standard
error for those merged probe sets is significantly smaller
due to the increase in probe number. However, the
within-chip standard error is usually 2-3 times that of
the Affy probe set if only 3-5 probes are used to represent
a UniGene cluster. Compared to the Affy probe sets representing
the same genes, about 900 probe sets that do not involve
probe set merging showed at least a 25% change in absolute
expression values.
A pilot study comparing the
Hs_U133A_UG167 CDF and the Affymetrix U133A CDF, based on data
derived from 14 human brain samples, suggests that 20%-30% of the
genes classified as differentially expressed can
differ under several cutoff criteria.
Systematic analysis of
the impact of custom CDFs can be found in the following
papers:
-
M. Dai, P. Wang, A. D. Boyd, G. Kostov, B. Athey, E. G. Jones, W. E. Bunney,
R. M. Myers, T. P. Speed, H. Akil, S. J. Watson, F. Meng (2005)
Evolving Gene/Transcript Definitions Significantly Alter the
Interpretation of GeneChip Data. Nucleic Acids Research 33(20), e175.
(http://nar.oxfordjournals.org/cgi/content/full/33/20/e175)
-
X. Lu and X. Zhang (2006) The effect of GeneChip gene definitions on the
microarray study of cancers. Bioessays 28(7):739-46.
(http://www3.interscience.wiley.com/cgi-bin/abstract/112708504/ABSTRACT?CRETRY=1&SRETRY=0)
-
R. Sandberg and O. Larsson (2007) Improved precision and accuracy for
microarrays using updated probe set definitions. BMC
Bioinformatics 8:48.
(http://www.biomedcentral.com/1471-2105/8/48)
Custom CDF files can be
selected based on species, Affymetrix GeneChip type, CDF file type and
CDF file format on our CDF download webpage.
6.1. Affymetrix MAS5 and dCHIP
The ASCII format CDF is for
Affymetrix MAS5 and standalone dCHIP analysis. After unzipping the ASCII CDF
package, the custom CDF file can be used in exactly the same way as
Affymetrix CDF files. Please note that dCHIP only accepts Affymetrix CDF
names, so the custom CDF file has to be renamed to the
corresponding Affymetrix name.
6.2. BioConductor
The R packages for Win32/LINUX
are for using the GeneChip analysis functions in BioConductor on the
corresponding platforms. Since version 11, Bioconductor redirects requests
to our own repository,
http://brainarray.mbni.med.umich.edu/bioc.
Alternatively, you can add our repository to the file
$R_HOME/etc/repositories.
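As a third option, the repository can be registered for the current R session only. The sketch below uses the standard `repos` option; the URL is the one given above, while the label "MBNI" and the package name in the commented-out line are assumptions made for illustration.

```r
# Repository URL as given in the text; the label "MBNI" is an arbitrary choice.
mbni_repo <- "http://brainarray.mbni.med.umich.edu/bioc"

# Append the repository to those already configured for this session.
repos <- getOption("repos")
repos["MBNI"] <- mbni_repo
options(repos = repos)

# A custom CDF package could then be installed by name, for example
# (hypothetical package name, not run here):
# install.packages("hs133ahsugcdf")
```

Unlike editing $R_HOME/etc/repositories, this setting does not persist after the session ends.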
A. Use custom CDF files
Version 8 in the R environment (for Bioconductor 1.9).
Version 8 of our custom CDF is included in the Bioconductor 1.9 repository,
so it is downloaded and installed automatically, just like
Affymetrix's original CDF packages. All you need to do is replace the
AffyBatch object's cdfName with the custom CDF name. For example,
library(affy)
# The custom CDF name consists of the original CDF name, species and custom CDF type
data <- ReadAffy(cdfname = "HS133A_HS_UG")
result <- rma(data)
Please note that the version number has been removed from cdfName
since version 8.
B. Use custom CDF files
Version 7 in the R environment (for Bioconductor 1.8).
Version 7 of our custom CDF is included in the Bioconductor 1.8 repository,
so it is downloaded and installed automatically, just like
Affymetrix's original CDF packages. All you need to do is replace the
AffyBatch object's cdfName. For example,
library(affy)
data <- ReadAffy()
data@cdfName <- "HS133A_HS_UG_7"
result <- rma(data)
C. Download and install
custom CDF files in the R environment (for all older Bioconductor
versions).
Under Linux/Unix, use the command
"R CMD INSTALL ?.tar.gz".
Under Windows, select the menu
"Packages->Install package(s) from local zip files".
To use the custom CDF
files in data analysis after installation, add a single R command
that replaces the default Affymetrix CDF file. The
following are examples for different chip and custom probe set
combinations:
data <- ReadAffy()
data@cdfName <- "HS133A_HS_UG_[V]"
data <- read.affybatch("1.cel", "2.cel")
data@cdfName <- "HS133B_HS_ENSG_[V]"
Note: substitute [V] with the version number of the CDF you downloaded.
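The naming convention described in this section (original CDF name, species, custom CDF type, plus a version suffix before version 8) can be captured in a small helper. `custom_cdf_name` is a hypothetical convenience function written for this illustration; it is not part of the affy package or of our CDF packages.

```r
# Build a custom CDF name from its parts. Leave `version` as NULL for
# version 8 and later, where the suffix was dropped.
custom_cdf_name <- function(chip, species, type, version = NULL) {
  parts <- c(chip, species, type)
  if (!is.null(version)) parts <- c(parts, version)
  paste(parts, collapse = "_")
}

name_v8 <- custom_cdf_name("HS133A", "HS", "UG")       # "HS133A_HS_UG"
name_v7 <- custom_cdf_name("HS133B", "HS", "ENSG", 7)  # "HS133B_HS_ENSG_7"
```

The resulting string can be assigned to data@cdfName as in the examples above.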
6.3.
The probe mapping file matches
individual probes in the custom CDF file to those in the corresponding
Affymetrix CDF file.
6.4.
The grouping file can be used to find all targets (exons, transcripts)
represented by the same probe set. It also contains the probe set
spanning range on genome or transcripts to facilitate RT-PCR primer
design.
6.5.
The best acc file contains the
best nucleic acid accession numbers in the corresponding UniGene
databases. Basically, the most reliable (RefSeq > cDNA > EST) short sequence
with the maximum probe match count is selected to represent a probe set.
Affymetrix’s “Representative Public ID” entries are also updated, and our choice
of accession number has more probe hits than the original accession in
most situations.
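One way to read this selection rule is sketched below. It assumes that sequence reliability takes precedence and that the probe match count breaks ties within a reliability class; the text does not spell out how the two criteria are combined, so this ordering is an assumption. The accession labels, counts and the function `pick_best_acc` are placeholders invented for the example.

```r
# Hypothetical candidate sequences for one probe set (placeholder labels).
candidates <- data.frame(
  acc           = c("refseq_1", "cdna_1", "est_1", "refseq_2"),
  type          = c("RefSeq", "cDNA", "EST", "RefSeq"),
  probe_matches = c(9, 11, 11, 7),
  stringsAsFactors = FALSE
)

pick_best_acc <- function(cand) {
  # Lower rank means more reliable: RefSeq > cDNA > EST.
  reliability <- match(cand$type, c("RefSeq", "cDNA", "EST"))
  # Assumed ordering: reliability first, then the highest probe match count.
  cand$acc[order(reliability, -cand$probe_matches)[1]]
}

best <- pick_best_acc(candidates)
```

Under this reading a RefSeq entry with 9 matching probes is preferred over a cDNA entry with 11.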
6.6.
The structure of a custom probe set name is “database entry ID_at”. New
probe set names can be linked to their corresponding UniGene and ENSEMBL
entries using the “Batch query custom probe sets identity” function at
http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp.
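Because the probe set name is simply the database entry ID followed by "_at", the ID can be recovered locally by stripping that suffix. `probe_set_target` is a hypothetical helper written for this illustration, and the two probe set names are made-up examples of the pattern.

```r
# Strip the trailing "_at" to recover the database entry ID.
probe_set_target <- function(probe_set_names) {
  sub("_at$", "", probe_set_names)
}

ids <- probe_set_target(c("Hs.1234_at", "ENSG00000000001_at"))
```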
6.7.
The effect of various custom CDF files can be tested on CEL files
deposited in NCBI GEO and EBI ArrayExpress through the “GeneChip
Analysis using Custom CDF files” function at
http://brainarray.mhri.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp.
Comments and suggestions are welcome. You can post your opinions and
discuss relevant issues in
our forum.