IMS Microarray Core Facility (IMCF) - Bioinformatics Suite
The Bioinformatics Suite is located in the IMS atrium and is an IMS facility, available to all registered UA staff or students wishing to utilise the software on these workstations. Affymetrix Microarray analysis suite GCOSv4, Expression Console, Data Mining Tool (DMTv3) and GTYPEv4 software are available on these computers. Several free gene expression analysis software programs (see below) can also be accessed from these computers.
There is a charge of £100 per annum per user for individuals running their chips through the on-site Affymetrix Microarray Core Facility and a charge of £250 per user per annum for those running their chips off-site. This charge will allow access to the Bioinformatics Suite for use of all available Gene Expression Analysis software. A booking system is in operation and computers can be booked through IMS reception.
Several 'floating' copies of GCOSv4 are also available for installation on individual computers. These copies are provided to registered users with active projects on a loan basis. We currently provide all groups using the Affymetrix Core Facility with their own copy of GCOSv4. A high specification computer running Windows NT or 2000 is required to run the software.
Payment of the Bioinformatics Suite Access charge does not guarantee the provision of a personal copy of GCOSv4 on your own computer. Users of the on-site Affymetrix Core Facility will be given priority for the 'floating' copies of GCOSv4. Individuals running their chips off-site will be provided with a personal copy of GCOSv4, when available.
Any individuals wishing to access the Bioinformatics Suite must raise an internal requisition via QSP. Please contact Elaina Collie-Duguid to set up your registration following payment (please state your QSP requisition number or PO number).
Affymetrix Analysis Software
Manuals outlining data analysis fundamentals and an overview of experimental design, statistical analysis and biological interpretation of Affymetrix gene expression data using Affymetrix software are available on the Affymetrix website. An overview of Affymetrix data analysis software (GCOSv4, MicroDBv3 and DMTv3) is also available. If you do not have the library file for the genome you are analysing on your computer, library files for all Affymetrix GeneChips can be downloaded from the Affymetrix website.
GCOSv4
Affymetrix GCOSv4 software is used for normalisation, absolute expression analysis and pairwise comparisons of Affymetrix gene expression data. GCOSv4 must be set up on each computer the first time that you use your log-on name. This will set the defaults to provide optimal visualisation and analysis of your data.
Affymetrix Microarray Analysis Suite (GCOSv4) software utilises two distinct statistical algorithms to generate a 'detection call' in absolute analysis (present, marginal or absent) and a 'change call' (no change, increase, decrease, marginal increase or marginal decrease) in comparative analysis. Affymetrix's in-house data, obtained by hybridising a single sample to 2 human arrays of the same type (HG-U133 set), demonstrate a detection call concordance of over 90% and a false change rate of less than 1%, where a two-fold cut-off was used for analysis of fold change. Additional information on the performance of the human genome set as well as information on the design and performance of other GeneChip arrays is available in the datasheets. An overview of the statistical algorithm and gene expression ('latin square') data used to develop GCOSv4, as well as a statistical reference guide which provides a basic description of the mathematical concepts behind the algorithm, are described on the Affymetrix website. A PowerPoint demonstration outlining basic absolute or comparative analysis of your data using GCOSv4 is available on the Bioinformatics Suite computers.
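The concordance and false change figures quoted above can be estimated for your own replicate hybridisations. The following is a minimal Python sketch, not the Affymetrix algorithm: it assumes detection calls have been exported as 'P', 'M' or 'A' strings, and it approximates a 'false change' as any probe set whose signal differs by at least the fold cut-off between two hybridisations of the same sample (the real change call also uses a statistical test). The function names are hypothetical.

```python
def call_concordance(calls_a, calls_b):
    """Fraction of probe sets given the same detection call on two arrays."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("call lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    return matches / len(calls_a)


def false_change_rate(signals_a, signals_b, fold_cutoff=2.0):
    """Fraction of probe sets differing by >= fold_cutoff between replicates.

    Simplification: only the fold change is considered, and all signals are
    assumed to be positive.
    """
    if len(signals_a) != len(signals_b) or not signals_a:
        raise ValueError("signal lists must be non-empty and of equal length")
    changed = 0
    for a, b in zip(signals_a, signals_b):
        if max(a, b) / min(a, b) >= fold_cutoff:
            changed += 1
    return changed / len(signals_a)


# Illustrative (made-up) values for the same sample on two arrays.
calls_1 = ["P", "P", "A", "M", "P"]
calls_2 = ["P", "A", "A", "M", "P"]
print(call_concordance(calls_1, calls_2))
print(false_change_rate([100.0, 200.0, 50.0, 400.0],
                        [110.0, 190.0, 120.0, 410.0]))
```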
Normalisation:
Affymetrix GCOSv4 software allows either 'Normalisation' or 'Scaling' of your array data to compare between arrays.
Scaling data: an arbitrary target signal is designated and GCOSv4 scales the average intensity of the selected normalisation genes to the specified target intensity. This enables you to compare multiple arrays within an experiment. You should use the same target signal across all arrays being compared (we recommend 100). Scaling can be performed in absolute analysis of the data, independently of the comparison analysis.
Normalisation can only be performed in comparative data analysis in GCOSv4. During a comparative analysis, the software normalises the average intensity of the experimental array to the average intensity of the baseline array. You can select the set of genes used to calculate the average intensity in the experimental and baseline arrays. The normalisation factor for a particular array changes when you change the comparison baseline array.
The scaling option is preferable as it does not require reference to a baseline array.
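The difference between scaling and normalisation can be illustrated with a short Python sketch. It assumes a plain arithmetic mean of the signal values; the GCOSv4 software may trim extreme values before averaging, so treat this as an illustration of the idea rather than a reproduction of the software. The function names are hypothetical.

```python
from statistics import mean


def scaling_factor(signals, target=100.0):
    """Factor that brings the average signal of one array to the target."""
    return target / mean(signals)


def scale_array(signals, target=100.0):
    """Scale one array on its own; no baseline array is needed."""
    sf = scaling_factor(signals, target)
    return [s * sf for s in signals]


def normalise_to_baseline(experiment, baseline):
    """Normalise an experimental array to the average of a baseline array.

    The factor depends on the baseline, so it changes whenever a different
    baseline array is chosen.
    """
    nf = mean(baseline) / mean(experiment)
    return [s * nf for s in experiment]


# Illustrative (made-up) signal values.
print(scale_array([50.0, 150.0, 400.0]))                   # average becomes 100
print(normalise_to_baseline([10.0, 30.0], [40.0, 40.0]))   # average becomes 40
```

Because every array is scaled to the same fixed target, scaled arrays remain comparable whichever pairs are later chosen for comparison.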
To perform normalisation/ scaling of your data: Tools> Analysis Settings> Expression (or relevant choice if not running an expression array) in GCOSv4 toolbar.
Several different methods are available for normalisation/ scaling of your data using Affymetrix software. These involve selecting which set of genes you wish to use to normalise/ scale your data:
1. 'User defined': If you wish to export raw data (not normalised) into third party software for normalisation and analysis, select 'User defined' with a 'Scaling Factor/ Normalisation Value' of 1. This option does not scale or normalise the data.
2. 'All gene sets': This is a global method, which allows each array to be scaled to a specified target intensity (arbitrary, but we recommend 100 as this usually results in a moderate scaling factor), based on the average signal of all the probe sets on the array. This option is only available with scaling. If using this method, be aware of background differences between arrays, as these will shift the baseline. The average log ratio for all genes can be used for 'global normalisation' instead of intensity, which avoids the problem of baseline shifts. It is important to check background and average intensity levels between arrays to identify outliers.
3. '100 maintenance gene set': A set of 100 maintenance genes can be used to normalise or scale human, mouse or rat expression arrays. The genes in these 'maintenance gene subsets' are present on both the A and B arrays in a set, thereby allowing normalisation of your gene expression data across all probe sets available for that genome. These probe sets represent 100 constitutively expressed transcripts, having a range of signal values from low to high and found to be expressed in a range of tissues and cell lines. Affymetrix states that the variation in signal for these probe sets in the tissue panel was minimal (%CV less than 40%; this is relatively high, so check the validity of this set under your conditions).
4. Your 'own normalisation sub-set': If you have run a large number of arrays, it is possible to analyse the gene expression data and select a subset of genes whose expression does not change across your experimental conditions. Make sure you have sufficient genes in your set to be a valid subset for normalising your data. Previously published data may also provide you with this information, but be wary as it is often difficult to assess whether exactly the same conditions applied. You could check the putative set by first normalising/ scaling using another method, e.g. global scaling.
To normalise/ scale using options 3 or 4 above, select the 'Selected probe sets' button, set a target intensity (100) and browse for the relevant mask file (.msk). The 100 maintenance gene set mask files for human, mouse or rat can be downloaded from the Affymetrix website and saved to disc. The *.msk file can be opened in Notepad or equivalent to view the list of probe sets (the normalisation probe sets are the first 100 probe sets found after probe sets with an "AFFX" prefix and are represented by probe sets 1415670 through to 1415769 in your GeneChip expression data file, i.e. CHP file). A new mask file can be created from a subset of any of the probe sets present on the GeneChip array being analysed.
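Before relying on a maintenance gene set or your own sub-set, it can help to check how variable the candidate probe sets are across your arrays. The sketch below is an illustration in Python, not part of GCOSv4: it computes the %CV of each probe set across arrays (ideally on data already scaled by another method), keeps those at or below a chosen limit, and derives a scaling factor from that subset only. The probe set IDs, the 40% default and the function names are assumptions for the example, and no mask file is read or written.

```python
from statistics import mean, stdev


def percent_cv(values):
    """Coefficient of variation (%) of one probe set across arrays."""
    return 100.0 * stdev(values) / mean(values)


def stable_probe_sets(signals_by_probe, max_cv=40.0):
    """Probe set IDs whose signal varies by no more than max_cv % across arrays.

    signals_by_probe maps a probe set ID to its signals, one per array.
    """
    return [probe_id for probe_id, values in signals_by_probe.items()
            if percent_cv(values) <= max_cv]


def subset_scaling_factor(array_signals, subset_ids, target=100.0):
    """Scaling factor for one array based only on the selected probe sets.

    array_signals maps a probe set ID to its signal on that array.
    """
    subset_values = [array_signals[probe_id] for probe_id in subset_ids]
    return target / mean(subset_values)


# Hypothetical probe sets measured on three arrays.
across_arrays = {
    "probe_1": [100.0, 110.0, 90.0],   # stable
    "probe_2": [10.0, 200.0, 50.0],    # highly variable
}
subset = stable_probe_sets(across_arrays)
print(subset)
print(subset_scaling_factor({"probe_1": 50.0, "probe_2": 500.0}, subset))
```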
The preliminary analysis of your data performed in the Affymetrix Core Facility uses Global (All probe sets) Scaling with a target intensity of 100.
There is still no consensus on the best method for standardising data to compare between arrays. It is advisable to use a number of different methods for normalisation/ scaling to check the robustness of your data. Please note, other gene expression analysis software packages (e.g. GeneSpring, Bioconductor, etc.) offer additional/ different normalisation options.
Before carrying out additional analysis of your gene expression data, examine your data set closely for outliers which may skew your data. GCOSv4 report scanner software is available on the Bioinformatics Suite computers. This program transfers important data from GCOSv4 report files into Excel, where it can be analysed easily using graphical displays.
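Once report values are in a spreadsheet or text file, a simple screen for outlying arrays can be scripted. The Python sketch below assumes the scaling factor of each array has already been read from its report file, and flags arrays whose scaling factor is more than a chosen fold away from the median. The 3-fold default and the function name are assumptions for illustration; choose a limit appropriate to your experiment.

```python
from statistics import median


def flag_outlier_arrays(scaling_factors, max_fold=3.0):
    """Return IDs of arrays whose scaling factor is far from the median.

    scaling_factors maps an array ID to its (positive) scaling factor.
    """
    centre = median(scaling_factors.values())
    flagged = []
    for array_id, sf in scaling_factors.items():
        fold = max(sf, centre) / min(sf, centre)
        if fold > max_fold:
            flagged.append(array_id)
    return flagged


# Made-up scaling factors for three arrays.
print(flag_outlier_arrays({"array_1": 1.0, "array_2": 1.2, "array_3": 4.5}))
```

The same pattern can be applied to background or average intensity values taken from the reports.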
MicroDBv3.0
To be able to use Affymetrix DMTv3.0 data mining software, you must first create a database in MicroDBv3.0. This changes the format of your data to produce a relational database instead of independent CHP files. These databases are compatible with DMTv3.0 as well as third party data mining programs. Please follow the instructions for using MicroDBv3.0.
DMTv3.0
Affymetrix DMTv3.0 can be used to analyse data from up to 120 GeneChip arrays. This data analysis software is useful for filtering data (e.g. removing transcripts called as 'absent' in all samples from the analyses, selecting highly expressed genes or selecting genes with the strongest change in expression in response to your experimental condition). Some basic statistical analyses can be carried out in DMTv3.0 (e.g. count/percentage, average/SD, median/IQR, Student's t-test, Mann-Whitney). DMTv3.0 software is fairly limited in its clustering capabilities: non-hierarchical (SOM; self-organising map) and hierarchical (correlation co-efficient clustering) clustering algorithms are available. DMTv3.0 also has a matrix analysis function, which assesses the overlap or non-overlap (uniqueness) between different probe sets. The best features of DMTv3.0 are its interface with NetAffx, which is an excellent tool (see below), and its ability to filter your data prior to advanced data analysis using other data mining software packages (see below). A DMTv3.0 demonstration and tutorial are available on the computers in the Bioinformatics Suite. Please refer to the 'getting started' leaflet before you begin.
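The same kind of filtering can be reproduced outside DMTv3.0 on exported data. As a minimal Python sketch (the data layout and function name are assumptions, not a DMTv3.0 export format), the following removes probe sets called 'absent' on every array:

```python
def drop_all_absent(detection_calls):
    """Keep probe sets with at least one call other than 'A' (absent).

    detection_calls maps a probe set ID to its detection calls
    ('P', 'M' or 'A'), one per array.
    """
    return {probe_id: calls
            for probe_id, calls in detection_calls.items()
            if any(call != "A" for call in calls)}


# Hypothetical probe sets on three arrays.
calls = {
    "probe_1": ["A", "A", "A"],   # absent everywhere: removed
    "probe_2": ["P", "A", "M"],
    "probe_3": ["A", "A", "P"],
}
print(list(drop_all_absent(calls)))
```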
NetAffx
NetAffx Analysis Centre: This analysis tool can be used to obtain extensive probe/ target sequence, design, annotation and ontological information to correlate with your data, thereby providing a source of all publicly available data for your gene/s of interest on the Affymetrix GeneChip array and facilitating data filtering, supervised hierarchical clustering analysis and validation studies. Annotation files associated with all GeneChip catalogue arrays can be downloaded from the NetAffx analysis centre, as CSV tabular or MAGE-ML XML files. Updated CSV tabular files are due for release in October 2003. A NetAffx demo is available on the Bioinformatics Suite computers.
The following functions are available in NetAffx:
Interactive Query: This facility allows you to search array contents for each GeneChip array (e.g. search your selected GeneChip for your gene or pathway of interest; this information can then be downloaded into DMTv3.0 or other data mining tools for advanced data analysis), access extensive annotation information from both Affymetrix and the public domain (including GenBank, UniGene, RefSeq, dbEST, SWALL, SWISS-PROT, etc.; the annotation information from each of these databases can be integrated into a single file for each gene (i.e. probe set) of interest), and visualise probe and target alignments (full sequence information is available for all probe sets used on each GeneChip array as well as for all cluster members/ 'targets' recognised by each probe set).
Batch Query: This facility can be used to query GeneChip arrays using up to 500 gene names, probe set IDs or accession numbers. For example, this could be used to obtain full annotation information on your probe sets of interest (e.g. 100 genes in your predictive gene set).
Gene Ontology Mining Tool: This facility allows separation of probe sets into gene groups based on gene ontology (biological process, molecular function or cellular component). The gene groups are visually presented as a dendrogram with each node colour coded according to number or percent of probe sets in the group or by direction of change (increase or decrease) under the condition being analysed. The data from each node/ gene ontology sub-group can be downloaded to DMTv3.0 for further analysis.
UCSC Genomic Query: This facility, released in June 2003, allows you to view the alignment of Affymetrix probe sets relative to the public genome using the UCSC genome browser.
BLAST Status: This facility allows you to view the progress and results of your BLAST queries.
Probe Match Tool: This facility allows you to search for perfect matches between your query sequence and the probe sequences on GeneChip arrays. For example, you can use this function to check if your newly cloned sequence is represented on a particular GeneChip array.
Download Center: GeneChip array sequence files can be downloaded to disc from this site. This includes both probe and target sequences which can then be used in data analysis and validation studies.
Advanced Data Analysis
The goal of clustering analysis is to identify underlying patterns in data sets containing expression levels for thousands of transcripts and to present the data in a user friendly/ 'visually pleasing' format to facilitate interpretation of the data. Threshold and probability filtering can be used to refine your gene set prior to clustering analysis. There are 3 main types of clustering analysis:
i) Non-hierarchical using unsupervised methodologies:
Principal component analysis (PCA) is used to reduce the complexity of the gene set and indicate the number of clusters (k) that can be used in K-means clustering analysis. PCA is not itself a clustering method; it allows visualisation of the data by reducing its dimensionality. PCA ignores variance between replicates, so it is better to perform PCA on data that have already been filtered for significance between replicates (e.g. t-test or Mann-Whitney).
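To make the idea of dimensionality reduction concrete, the Python sketch below finds the first principal component of a small data matrix by power iteration on the covariance matrix, and reports the fraction of total variance it explains. It is a teaching sketch under stated assumptions (rows are arrays/samples, columns are genes, the matrix is small, and the start vector is not orthogonal to the first component); for real datasets a dedicated statistics package would normally be used. The function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction, scores) for the first PC."""
    n, d = len(rows), len(rows[0])
    # Centre each column (gene) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the start vector")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    # Projection of each sample onto the first component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, eigenvalue / total_variance, scores


# Two perfectly correlated 'genes' measured in four samples (made-up data).
direction, explained, scores = first_principal_component(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
print(direction, explained, scores)
```

When a few components explain most of the variance, plotting the samples on those components gives a rough visual impression of how many groups are present.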
Non-hierarchical means that it doesn't look at the relationships between different clusters and k-means clustering is one commonly used method of non-hierarchical clustering. K-means clustering analysis partitions a large dataset into smaller clusters without trying to specify relationships between the clusters. It reduces multi-dimensional data into a 2D x-y graph to identify trends in data. The distance between each gene and the centre of each cluster (centroid) is measured. If the gene is closer to the centre of another cluster than the one it is currently assigned to, then it is reassigned. After reassigning all genes to the closest cluster, the centroids are recalculated. It is a very fast algorithm, so is good for large datasets, but it will only give the number of clusters that you ask for and does not show their relationship to each other as in hierarchical clustering. It can be useful to try different k-values (see PCA above).
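The assign-and-recalculate loop described above can be written in a few lines of Python. This is a minimal sketch assuming Euclidean distance and starting centroids supplied by the user (real packages usually pick them at random or by a seeding heuristic); the function name is hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Partition points into len(centroids) clusters.

    Returns (assignment, centroids), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every gene (point) to its closest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing was reassigned, so the clusters are stable
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids


# Four made-up expression profiles (two conditions each), k = 2.
profiles = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
print(kmeans(profiles, centroids=[profiles[0], profiles[2]]))
```

Running the function with different numbers of starting centroids is one way to try different k-values.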
Self-organising maps (SOM) are another type of non-hierarchical clustering. The method is similar to k-means clustering, but the centroids don't move freely in multi-dimensional space; they are arranged on a 2D grid (e.g. 2 x 2 or 4 x 4). Unlike k-means clustering, there is a relationship both within a cluster and between clusters, as the relationship between neighbouring nodes is maintained. SOM is better for large, challenging datasets. Affymetrix DMTv3.0 can perform SOM analysis (this is the only non-hierarchical clustering available in DMTv3.0).
ii) Hierarchical clustering:
Agglomerative hierarchical clustering (unsupervised) "is the most frequently applied technique for processing array data for inspection". This method carries out pairwise comparison of expression levels between experiments or between genes. A similarity measure based on correlation (r = -1 to 1) [e.g. Standard, Pearson or Spearman correlation], confidence (p = 0 to 1) [e.g. Spearman or 2-sided Spearman confidence] or distance (D = 0 to infinity) is used to cluster genes or samples and the data are displayed in 2 dimensions as a hierarchical dendrogram or as a colour-plot/ colour-coded matrix. This provides easily visualised clusters and the relationship/similarity between clusters is shown. This can be used to cluster either samples with closely related patterns of gene expression or genes with similar expression patterns. (Weinstein et al. Science 1997, 275: 343-349; Eisen et al. PNAS 1998, 95: 14863-8; Khan et al. Cancer Research 1998, 58: 5009-5013).
Affymetrix DMTv3.0 can perform correlation co-efficient hierarchical clustering but other software packages have more options and better visual output.
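For readers who want to see what correlation-based agglomerative clustering does, here is a small Python sketch. It is not the DMTv3.0 implementation: it assumes Pearson correlation as the similarity measure, uses 1 - r as the distance, and merges clusters by average linkage. It returns the order of merges, which is the information a dendrogram displays. The function names and gene names are hypothetical, and profiles with zero variance are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hierarchical_clustering(profiles):
    """Average-linkage agglomerative clustering with distance = 1 - r.

    profiles maps a gene (or sample) name to its expression profile.
    Returns a list of merges: (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1.0 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        d, i, j = min(
            ((distance(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up profiles: A and B rise together, C and D fall together.
example = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
for merge in hierarchical_clustering(example):
    print(merge)
```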
iii) Supervised hierarchical clustering:
This is a direct methodology in which data are clustered according to scientific knowledge/ gene ontology (e.g. molecular function, metabolic pathway, biological process, cellular expression, genomic/ subcellular location, etc.).
Affymetrix software currently does not have the capability of doing supervised hierarchical clustering. However, NetAffx can be used to obtain extensive, integrated annotations for GeneChip probe sets which can be used to cluster data according to ontology. These data can then be exported to DMTv3.0 or other data mining packages.
No single method is recommended to date. It depends on the question being asked, availability of software and web-based resources.
'Predictive gene set'
The above methodologies can all be used to identify 'interesting' genes/ pathways for the 'condition' you are studying.
Nearest neighbour with majority voting is commonly used to select a 'predictive gene set' for the 'condition' you are evaluating. One gene is selected that is significantly different between the control and experimental groups and other genes with similar profiles are added to the set. Genes with similar profiles can be identified using the hierarchical and non-hierarchical clustering methodologies outlined above. This approach can also be used to group similar conditions, thereby allowing class prediction to be performed on each sample using a 'predictive gene set'.
The 'leave-one-out cross validation' method can be used to confirm and refine your 'predictive gene set'. The more genes in the set, the noisier and less precise it becomes, as genes not involved in the 'condition' are included. Leave-one-out analysis removes one gene at a time from the set and determines whether the predictive power of the 'gene set' persists, is improved or is lost/ reduced. This approach can also be used to test the predictive power and precision of the 'predictive gene set' when samples independent of those used to build the molecular classifier are not available. 'Leave-one-out cross validation' is used on training and test sets set up with the samples used to develop the molecular classifier. Where available, 'blinded' samples not used for generation of the 'predictive gene set' can be used to test the molecular classifier.
Plot number of genes (x-axis) vs. errors in classification (y-axis) to identify the correct number of genes for best predictive power/predictive precision.
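The procedure can be sketched in Python as follows. Note the assumptions: this sketch holds out one sample at a time (the usual form of leave-one-out cross validation), classifies it by majority vote among its k nearest neighbours using Euclidean distance, and repeats this for gene sets of increasing size taken from a gene ranking you supply. The function names, the value k = 3 and the example data are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, train_labels, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def loocv_errors(samples, labels, gene_indices, k=3):
    """Number of held-out samples misclassified using the chosen genes."""
    errors = 0
    for held_out in range(len(samples)):
        train = [[s[g] for g in gene_indices]
                 for i, s in enumerate(samples) if i != held_out]
        train_labels = [l for i, l in enumerate(labels) if i != held_out]
        query = [samples[held_out][g] for g in gene_indices]
        if knn_predict(train, train_labels, query, k) != labels[held_out]:
            errors += 1
    return errors


def errors_by_gene_count(samples, labels, ranked_genes, k=3):
    """(number of genes, errors) pairs for the top 1, 2, ... ranked genes."""
    return [(n, loocv_errors(samples, labels, ranked_genes[:n], k))
            for n in range(1, len(ranked_genes) + 1)]


# Made-up data: rows are samples, columns are three genes.
samples = [
    [1.0, 5.0, 9.0], [1.2, 5.5, 1.0], [0.9, 4.8, 5.0],   # control
    [3.0, 8.0, 8.5], [3.2, 8.4, 1.5], [2.9, 7.9, 5.5],   # treated
]
labels = ["control"] * 3 + ["treated"] * 3
for n_genes, n_errors in errors_by_gene_count(samples, labels, [0, 1, 2]):
    print(n_genes, n_errors)
```

The pairs returned by errors_by_gene_count are the values to plot (genes on the x-axis, errors on the y-axis). If the gene ranking was derived from the same samples, the error estimate will be optimistic, which is why 'blinded' samples are preferable where available.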
Software for Data Analysis:
Commercial
Many packages are available; some commonly used for analysis of gene expression data are listed below:
- DMTv3.0, Affymetrix UK
- GeneSpring, Silicon Genetics
- Spotfire DecisionSite, SpotFire Inc.
Free Web Resources
SAM Significance Analysis of Microarrays. This is supervised learning software for genomic expression data mining (cDNA and oligo microarrays, and can also be applied to protein expression and SNP data). Correlates gene expression data to a wide variety of clinical parameters including treatment, diagnosis, survival and time trends. Provides estimate of False Discovery Rate for multiple testing. Convenient Excel Add-in.
NetAffx See above for overview
Bioconductor: This is an open source and open development software project for the analysis of genomic data. It initially focussed on analysis of microarray data, but can be used broadly in the analysis of genomic data generated using other technologies. This is a useful tool for bioinformaticians who may wish to further develop their own algorithms/ genomic analysis software. R is the language and environment for statistical computing used in the Bioconductor project. Bioconductor v1.2 was released in May 2003 and contains 30 different packages, including general, annotation, graphics, pre-processing and differential gene expression based packages. Bioconductor also provides metadata and experimental data submitted by various contributors. The affy and affydata packages provide example datasets for software development.
RMA: This is a stand-alone graphical user interface (GUI) program for computing gene expression data from Affymetrix GeneChip arrays. This program uses the same algorithms as affy/Bioconductor (Robust Multichip Average expression summary) but is a Windows-based program, targeted more at the biologist than the bioinformatician. It does not require R and is not dependent on the Bioconductor project.
GenMAPP: This denotes GENe Microarray Pathway Profiler and is a graphical interface allowing gene expression data to be grouped according to gene ontology and displayed on existing or user generated maps representing biological pathways. Instructions for using GenMAPP with Affymetrix gene expression data are available on the Bioinformatics Suite computers. Metabolic and regulatory pathways are also available in KEGG.
There are several free web-based packages available for identifying transcription factor motifs in your probe sets of interest and clustering your data according to putative common transcriptional regulatory mechanisms. These include ATLAS, ConSite, Genomatix, Improbizer, MEME/MAST and TESS.
Considerations
You should assess the utility of several statistical tests/ clustering algorithms to accurately analyse your data (determine the false call rate of each method). There are large datasets of gene expression data available from Affymetrix (latin square) to prospectively assess the power of your statistical test or gene expression analysis software/ algorithm. Current scientific knowledge can be used to assess the accuracy of your analysis on your own dataset. The accuracy will also be assessed in your subsequent validation studies.
Some packages use raw Affymetrix data (DAT or CEL file- therefore analysing the probe pair data) and some packages analyse Affymetrix processed data (CHP files - therefore analysing the probe set data only).
'Proof of principle': do you see higher expression for transcripts known to be markers for your disease/disorder/treatment?
You should have a higher correlation (r) between replicates than between your biological groups. If not, you have a problem!
Be aware of outliers.
Is it real? Look at replicates and absolute signals - you may have a 10 fold change but if the absolute signal changes from 1 to 10, this is unlikely to be relevant as both signals are below the level of detection of the system and are within the noise/ background.
Can you predict which group each sample belongs to, using your 'predictive gene set'?
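Two of the checks above lend themselves to a quick script. The Python sketch below compares the mean Pearson correlation within a group of replicates with the mean correlation between groups, and tests whether a fold change is credible given a noise floor. The noise floor of 50 and the 2-fold cut-off are placeholder assumptions, not recommended values; use the background/ detection level observed on your own arrays. The function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two arrays' signals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_r(group_a, group_b=None):
    """Mean r within one group, or between two groups if both are given."""
    if group_b is None:
        pairs = list(combinations(group_a, 2))
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def credible_change(baseline, experiment, min_fold=2.0, noise_floor=50.0):
    """True if the fold change is large AND at least one signal is above noise."""
    if baseline <= 0 or experiment <= 0:
        raise ValueError("signals must be positive")
    if max(baseline, experiment) < noise_floor:
        return False  # both values lie within the noise/ background
    return max(baseline, experiment) / min(baseline, experiment) >= min_fold


# Made-up replicate arrays (four probe sets each).
control = [[10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 31.0, 42.0]]
treated = [[40.0, 30.0, 20.0, 10.0], [41.0, 28.0, 22.0, 9.0]]
print(mean_pairwise_r(control), mean_pairwise_r(control, treated))
print(credible_change(1.0, 10.0), credible_change(100.0, 1000.0))
```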
Publishing your Data
You should validate a subset of the gene expression changes you have included in your final set of 'interesting genes'. This can be done using Northerns, real-time PCR or qRT-PCR. Correlate some of your findings with existing scientific knowledge in the literature. Additional studies assessing the protein expression levels of key transcripts in important pathways as well as mechanistic studies will add to the strength of your study.
Nature, The Lancet, Cell, Science and other leading journals require the microarray gene expression data that you are publishing to be available in the public domain. In order to publish your gene expression data it must be MIAME (Minimum Information About a Microarray Experiment)-compliant. The Microarray Gene Expression Data (MGED) Society is actively involved in establishing standards for microarray (as well as other functional genomic and proteomic) data annotation and release into the public domain. MGED has a microarray annotations working group that provides guidelines for generating MIAME-compliant data (including a MIAME checklist, which has been developed to assist authors, editors and reviewers of microarray papers) and MAGE-compliant data, to assist in the development of microarray repositories and data analysis tools. There are also some useful documents from PICR explaining the latest MIAME guidelines and how to comply with them. MIAME is under continuous development to contend with new microarray technologies and applications.
The MicroArray and Gene Expression (MAGE) group within the MGED Society facilitates representation of microarray gene expression data utilising established standards (MAGE and MIAME standards). The MAGE group has developed a data exchange object model (MAGE-OM) and a data exchange format (MAGE-ML) for microarray studies, as well as a software toolkit (MAGE-stk) to allow conversion between MAGE-OM and MAGE-ML using various programming platforms.
There are currently three main repositories for submission of MIAME-compliant microarray expression data: ArrayExpress at EBI, GEO at the NIH and CIBEX at DDBJ. The Stanford Microarray Database (SMD) is currently the largest repository. Although submission is restricted to researchers linked to Stanford University, a significant amount of data is publicly available.

