Mol. Hum. Reprod. Advance Access originally published online on September 5, 2007
Molecular Human Reproduction 2007 13(10):713-720; doi:10.1093/molehr/gam050

© The Author 2007. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Building comparative gene expression databases for the mouse preimplantation embryo using a pipeline approach to UniGene

J.-A.L. Stanton1,3, A.B. Macgregor2, C. Mason1, M. Dameh1 and D.P.L. Green1

1Anatomy and Structural Biology, University of Otago, 270 Great King Street, Dunedin 9001, New Zealand
2Centre for Comparative Genomics, Murdoch University, Murdoch 6150, Western Australia, Australia

3 Correspondence address. Tel: +64-3-479-7483; Fax: +64-3-479-7254; E-mail: jo.stanton{at}anatomy.otago.ac.nz


    Abstract
To understand early mammalian development there is a need to compare profiles of gene expression from different stages of the preimplantation mouse embryo. We describe here a method that uses gene expression data held in the UniGene database of the National Institutes of Health (NIH). The full mouse UniGene database (build #151) contains 43 104 gene clusters generated from ~4.1 million sequences. The Expressed Sequence Tags (EST) used to build UniGene are derived from cDNA libraries that are archived separately, with their own catalogue numbers, in the database of Expressed Sequence Tags (dbEST). The mouse dbEST database contains 32 non-normalized libraries constructed from preimplantation stages (unfertilized oocyte, fertilized oocyte, 2-, 4-, 8- and 16-cell embryo and blastocyst). These libraries contain 219 852 EST sequences mapping to 15 731 UniGene clusters. We have developed a computational pipeline approach that imports and aggregates the inventories of gene expression contained in these dbEST libraries. It uses these data to build an annotated web-based database of preimplantation gene expression with an in-built capacity for comparison of expression profiles. Comparison of the gene expression profiles obtained for each developmental stage shows statistically significant changes in gene expression during preimplantation development. These in silico-generated profiles were validated using RT–PCR.

Key words: database/EST/expression profiling/global gene expression/UniGene


    Introduction
There is a need for methods of developing large, curated profiles of gene expression for specific cells, tissues and organs, and determining the way in which these profiles change in response to disease, drugs, development, external stimuli and pathogens. Ideally, these inventories would represent standardized benchmarks, accessible to all. With this in mind, we have developed a database of gene expression for the mouse preimplantation embryo.

Profiles of gene transcript expression can be obtained in a number of ways but all methods rely on the representative capture and stabilization of mRNA transcripts. The choice of approach to inventory construction and profiling is usually a compromise between information content and the resources needed to acquire it. For example, random sequencing of high-quality cDNA libraries provides information on the population distribution of different transcript species and a large amount of data about specific sequences. When imported into a database such as UniGene, the clustering of Expressed Sequence Tags (EST) provides information about intron/exon boundaries, splice variants, polymorphisms, deletions and the location of 5'-termini. However, cDNA libraries have almost never been sequenced exhaustively enough to acquire complete transcriptomes, and it is increasingly clear that the size-selection of mRNA transcripts that occurs in cDNA library construction eliminates many legitimate small RNA species. An alternative approach to inventory construction is through use of high-throughput techniques such as serial analysis of gene expression (SAGE) (Velculescu et al., 1995) and massively parallel signature sequencing (MPSS) (Brenner et al., 2000). These generate large transcriptomes but suffer from a number of problems, including false positives and redundancy in gene assignment. This is a direct result of sacrificing EST length for a high number of ESTs.

We have been interested in developing methods for building gene expression inventories for the preimplantation embryo from public-domain databases, and making these available to the research community. To this end, we have developed a computational pipeline that starts with the National Institutes of Health (NIH) dbEST and UniGene databases and allows us to build annotated web-based databases of comparative gene expression. The input into the pipeline is a simple, user-created list of dbEST libraries, specified by their identification number (ID). Libraries are aggregated and different aggregates compared statistically. The process is automated using a master script. The pipeline approach is of general applicability to any species in UniGene and any set of dbEST libraries within those species. We have used the approach here to construct a gene expression database for mouse preimplantation embryos. The mouse UniGene database contains 32 preimplantation dbEST libraries ranging over seven developmental stages from the unfertilized oocyte to blastocyst. None of the libraries was normalized, making it possible to establish profiles of gene expression in silico as well as gene inventories. These profiles were tested by RT–PCR of candidate genes.


    Materials and Methods
Computation
The method employs readily available open source products. In its current form, the application consists of three parts. The first is a Perl script that handles the downloading of new UniGene builds and other related files from NCBI. The second module is written in the C programming language and acts as a wrapper application, controlling the execution of data parsing and normalization as well as the annotation pipeline itself. The third module is the annotation pipeline, and is written in Perl.

Scripts written in Perl incorporate a variety of modules readily available from CPAN (http://www.cpan.org). The application’s back end is MySQL, with a web-based front end powered by an Apache web server that is served data from an Apache Tomcat Java™ web application server. Ongoing development work will see the application being ported almost entirely to the C programming language. The pipeline is shown diagrammatically in Fig. 1. The subscriber database and further information are available from http://www.bio-informatix.com/annotated.


Figure 1: Overview of UniGene pipeline process

 
For any given organism, the application’s initial input is derived from two configuration files. The first of these is a file containing general information applicable to all annotation runs, irrespective of organism. This file contains information such as where downloaded data can be found, where it needs to be parsed to, what the organisms of interest are, etc. The second configuration file is a per-organism specification of dbEST libraries, tissue types and other organism-specific information. This file contains, among other things, the set of dbEST libraries used to build the comparative gene expression databases. Each library is associated with a tissue type into which it is to be amalgamated. The per-organism configuration divides the library and tissue declarations into ‘views’. This allows gene expression profiles to be generated for different classes of tissue in the same organism using possibly overlapping dbEST libraries. It also makes it easy to add or subtract libraries and tissues if required. The dbEST library ID numbers and details of library construction can be obtained via the UniGene library browser (http://www.ncbi.nlm.nih.gov/UniGene).
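The ‘views’ arrangement described above can be pictured with a minimal sketch. The structure, key names and library ID numbers below are all hypothetical placeholders chosen for illustration; the actual configuration file format used by the pipeline is not given in the text.

```python
# Hypothetical per-organism configuration. Every key name and library ID
# here is a placeholder, not taken from the pipeline's real files.
PER_ORGANISM_CONFIG = {
    "organism": "Mus musculus",
    "views": {
        "preimplantation": {
            # tissue name -> dbEST library IDs to amalgamate (placeholders)
            "unfertilized_oocyte": [1001, 1002],
            "fertilized_oocyte": [1003],
            "blastocyst": [1004, 1005],
        },
        "oocytes_only": {
            # A second view may reuse libraries that appear in the first.
            "oocyte": [1001, 1002, 1003],
        },
    },
}


def libraries_by_tissue(config, view):
    """Return {tissue: sorted unique library IDs} for one view."""
    tissues = config["views"][view]
    return {tissue: sorted(set(ids)) for tissue, ids in tissues.items()}


def shared_libraries(config, view_a, view_b):
    """Return the set of library IDs that two views have in common."""
    def all_ids(view):
        return {i for ids in config["views"][view].values() for i in ids}
    return all_ids(view_a) & all_ids(view_b)
```

Keeping the library-to-tissue mapping inside each view means that adding or removing a library touches only one list, which is the convenience the text attributes to this design.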

Currently, execution of the application is command line-based. The script ‘buildcheck’ is responsible for updating organism data from NCBI. It is a stand-alone application, responsible only for downloading data and keeping track of current build numbers. The annotation process, as mentioned above, is controlled by a wrapper application, called ‘updatedbs’, that automates the execution of all annotation subprocesses. Each subprocess generally depends on the previous one for its input. The overall structure and flow of the application are depicted diagrammatically in Fig. 1.

The UniGene database is updated from time to time with new builds, so the application responsible for updating organism data for annotation must be run periodically in order to ensure the annotated databases contain the most up-to-date information. As a result, a number of more general programs must be run before the pipeline itself is executed. These include modules to parse ‘flat files’ from the UniGene, HomoloGene and Entrez Gene data repositories, accessed via NCBI’s ftp server, as well as dbEST libraries, downloaded from the NCBI UniGene website (http://www.ncbi.nlm.nih.gov/UniGene). The parsing process extracts the required information from these files and stores it in convenience tables to be used by subsequent scripts. Some of this information is accessible later by way of search functionality built into the annotated web application. Once these initial steps have been completed, the pipeline portion of the application is executed, the individual steps of which we now outline:

(i) Each parsed dbEST library is merged into the relevant tissue-specific file as declared in the per-organism configuration file. These are saved as tab-delimited text files.
(ii) The second subprocess of the pipeline is executed in four steps. The first step reads the files created by step 1 above and removes from each one those records that make redundant reference to individual EST sequences as identified by clone identification tags. The second step takes the remaining records and loads them into temporary, tissue-specific database tables. A series of SQL queries is then executed on each table to determine:
(a) the total number of records (ESTs),
(b) the total number of ESTs ‘with’ an associated UniGene number,
(c) the total number of ESTs ‘without’ an associated UniGene number,
(d) the number of distinct UniGenes referenced, and finally
(e) the number of ESTs referenced for each distinct UniGene number.
The next step in the process takes the above outcomes and calculates for each UniGene a mathematically-normalized abundance (number of ESTs referenced for the UniGene (e) / total number of ESTs (a) × 100 000), along with the percentage of records that each UniGene in any given tissue file contributes to the total number of records in that file. The final step in the process uploads the normalized data into tissue-specific database tables.

(iii) The next script, ‘unigene_annotate’, gathers together the data making up the annotation, and stores these in the annotated database where they can be viewed and queried via the web-based interface. First, it uses Perl’s DBI and file modules to query the tables of mathematically-normalized data created in step 2 in order to generate a list of all UniGenes associated with each tissue. Working through each tissue in turn (e.g. seven stages of preimplantation development), the script queries the mathematically-normalized data to find the abundance and percentage values for each UniGene (as calculated in step 2). Next it queries the database table containing data extracted from the NCBI UniGene flat file to find (a) the gene name, (b) cytoband, (c) Gene ID, (d) HomoloGene group ID and (e) title data for each UniGene cluster. Having gathered these data, the ‘unigene_annotate’ script stores them in tissue-specific database tables.
(iv) The script ‘unigene_edd’ is responsible for determining the statistical significance of the computed gene expression profile in a process we call electronic differential display (edd). This process involves performing pairwise comparisons between one tissue file and any other tissue file of interest. These comparisons determine whether the expression of a given UniGene differs significantly between the first tissue file (reference tissue) and another tissue file. The calculation is performed using an algorithm developed by Audic and Claverie (1997).
The expression data gathered by ‘unigene_edd’ are the non-normalized data generated during the first step of step 2 above. As they are not normalized, the data reflect the different sizes of the aggregate files. Computing time can be saved by setting a threshold below which the pairwise calculation is not performed. This threshold is based on the percentage expression of each UniGene in the two tissue files being compared. The calculation is not performed if both tissue files show expression levels for the UniGene below the threshold value. The script takes the results of these calculations and stores them in appropriate, temporary database tables.
(v) The final step in preparing the data for display involves flagging each record generated in the above process according to how expression levels of a given UniGene differ between two tissue types. A record is also flagged if no calculation was performed due to low expression levels.
(vi) Because the calculations performed in step 4 can take some time to perform, they are not suitable for creating on-the-fly web pages. For this reason, the final script, ‘unigene_annotate_with_edd’, stores the data in another MySQL database, ready for display on the web. This means that an investigator views pre-computed results on the website. The script ‘unigene_annotate_with_edd’ performs the same process as ‘unigene_annotate’ in step 3 but adds in the electronic differential display data resulting from steps 4 and 5.
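
The counting and normalization in step 2 can be sketched in a few lines. This is an illustration only, written in Python rather than the Perl and SQL the pipeline uses; the function name, the record layout and the rule of keeping the first record seen per clone are assumptions. Abundance is scaled to a library of 100 000 ESTs, the figure quoted for the values displayed on the web site.

```python
from collections import Counter

SCALE = 100_000  # library size to which abundances are scaled


def summarise_tissue(records):
    """Summarise one merged tissue file (hypothetical record layout).

    `records` is an iterable of (clone_id, unigene_id) pairs, where
    unigene_id is None for an EST with no UniGene assignment.
    """
    # Drop redundant references to the same clone, keeping the first seen.
    seen = {}
    for clone_id, unigene_id in records:
        seen.setdefault(clone_id, unigene_id)

    total = len(seen)                                    # quantity (a)
    assigned = [u for u in seen.values() if u is not None]
    counts = Counter(assigned)                           # quantity (e)

    per_unigene = {
        u: {
            "ests": n,
            "abundance": n / total * SCALE,  # ESTs per 100 000
            "percent": n / total * 100,      # share of all records
        }
        for u, n in counts.items()
    }
    return {
        "total_ests": total,                       # (a)
        "with_unigene": len(assigned),             # (b)
        "without_unigene": total - len(assigned),  # (c)
        "distinct_unigenes": len(counts),          # (d)
        "per_unigene": per_unigene,
    }


# Toy input: clone c1 appears twice and is counted once.
example = summarise_tissue([
    ("c1", "Mm.1"), ("c1", "Mm.1"), ("c2", "Mm.1"), ("c3", "Mm.2"), ("c4", None),
])
```

In the toy input, four distinct clones remain after deduplication, three of them carry a UniGene number, and Mm.1 accounts for two of the four records.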

Web site display
At this point, all data needed for display have been computed and all required information is loaded into the backend database tables ready for querying via the annotated database web application. Figs 2 and 3 provide examples of the two views available to an investigator accessing the web site. The first view displays a tissue-specific catalogue with abundance and percentage expression data as well as other information incorporating external links to specific locations within the NCBI web site. The second view incorporates the same information as the first, but includes the expression analysis data along with a comparative visual representation of the data generated by the edd phase of the pipeline, from calculations of steps 4–6. Each of the tissues is displayed across the line with the user-selected reference tissue highlighted. The background colour of the table cells reflects the UniGene abundance in each tissue and other data derived in step 5. First, a colour scale runs from dark blue to light blue to indicate greater or lesser abundance. Next, if the expression level of a UniGene in another tissue is calculated to show no statistically significant difference from the reference tissue (denoted by a red column header), the table cell is coloured yellow. If the percentage of the UniGene in question is too low to be considered (see step 4), the cell is coloured grey. The number in each cell is the mathematically-normalized abundance for a library of 100 000 ESTs. It is this ordering of UniGenes by abundance, colouring of table cells and collation of data from distributed sources that provides an immediate visualization tool for investigators. All databases are fully searchable for UniGene ID, gene names, Gene ID and gene ontology terms.
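
The colouring rules just described amount to a small decision procedure, sketched below. The function name, the percentage threshold (0.01) and the significance level (0.05) are assumptions made for illustration; the text does not state the values used on the web site, and the graded blue scale is reduced here to a single 'blue' label.

```python
def cell_colour(percent, ref_percent, p_value, min_percent=0.01, alpha=0.05):
    """Classify one table cell as 'grey', 'yellow' or 'blue'.

    percent      -- percentage expression of the UniGene in this tissue
    ref_percent  -- percentage expression in the reference tissue
    p_value      -- result of the pairwise test, or None if not computed
    min_percent and alpha are assumed values, not taken from the article.
    """
    # Both tissues below the threshold: no calculation was performed.
    if percent < min_percent and ref_percent < min_percent:
        return "grey"
    # A test was performed and found no significant difference.
    if p_value is not None and p_value >= alpha:
        return "yellow"
    # Significant difference (or the reference column itself):
    # the cell keeps its abundance-based blue shade.
    return "blue"
```

For example, `cell_colour(0.5, 0.4, 0.3)` gives 'yellow', while `cell_colour(0.5, 0.05, 0.001)` gives 'blue'.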


Figure 2: Tissue-specific view of the 25 most abundant entries in the fertilized oocyte dataset

Each row refers to a single UniGene. The columns contain the UniGene cluster number, Gene ID, normalized abundance of ESTs in the tissue dataset, abundance as a percentage of the total dataset, the gene symbol, cytoband if known, HomoloGene ID and full gene name.

 

Figure 3: The 10 most abundant entries in the fertilized oocyte dataset

The reference stage is labelled in red. Table cell colour indicates increasing normalized EST count (dark blue more, light blue fewer ESTs). Yellow cells indicate no statistically significant difference between the EST count in that cell compared with the cell in the reference stage column. u, unfertilized oocyte; f, fertilized oocyte; twocell, 2-cell embryo; four, 4-cell embryo; eight, 8-cell embryo; sixteen, 16-cell embryo and blast, blastocyst

 
RT–PCR protocol
All animal experiments were approved by the Animal Ethics Committee, Otago University (approval number 2/02). Female mice were superovulated by injection with Folligon (5 units IP, Southern Vet Supplies, Christchurch, New Zealand) at 2 p.m. on experimental Day 1 followed by Chorulon injection (5 units IP, Southern Vet Supplies) at 12 noon on Day 3. To obtain embryos, superovulated females were naturally mated by placing them with male mice at a ratio of one male per female on Day 1 of the experiment. Oocytes and embryos were collected on the morning of experimental Day 4. Animals were killed by CO2 exposure and the reproductive tract was dissected out immediately. Cumulus masses were harvested by nicking the swollen oviduct. These were transferred to FHM + hyaluronidase media (Specialty Media, Phillipsburg, NJ, USA) to remove granulosa cells. Oocytes and embryos were then placed in FHM media (Specialty Media). Unfertilized oocytes were then processed for RNA isolation. Embryos were incubated at 37°C/5% CO2 for 45 h after which they were sorted into categories. For this work these were: 2-, 4-, 8-cell and blastocyst. Embryos with an abnormal phenotype were discarded.

Total RNA was extracted from oocytes and embryos by placing them directly into 100 µl of Trizol reagent (Invitrogen, Carlsbad, CA, USA). These were frozen and thawed twice at –80°C to lyse cells. RNA was isolated according to the manufacturer's protocol with 20 ng of glycogen (Invitrogen) added to the final isopropanol precipitation. RNA was precipitated at –20°C overnight and the pellet was resuspended in 10 µl 1× TE (10 mM Tris, 1 mM EDTA, pH 8) buffer. cDNA synthesis and amplification were performed using the SMART cDNA synthesis system (Clontech, Mountain View, CA, USA) as per the manufacturer's instructions. Oocyte and embryo cDNA was amplified with 23 cycles of PCR. A 1 in 100 dilution of amplified cDNAs in PCR grade H2O was used in all PCR experiments. Total RNA from a number of adult mouse tissues snap frozen in liquid nitrogen was extracted using Trizol reagent (Invitrogen) according to the manufacturer's recommended protocol.

A total of 109 PCR primer pairs were designed to span intron/exon boundaries where possible, to select for mRNA derived amplicons that were between 300 and 600 base pairs in length. High-throughput PCRs were performed using PCR Supermix (Invitrogen) pre-aliquoted into 96-well plates. After an initial incubation of 94°C for 2 min, needed to activate the enzyme, PCR cycling conditions were 94°C for 15 s, 58°C for 30 s, 72°C for 1 min for a total of 35 cycles. PCR amplicons were fractionated using 96-well formatted 2% agarose E-gels (Invitrogen). Results were scored for each tissue or embryonic stage by the presence or absence of a PCR product.


    Results
The result of the computational pipeline for the set of preimplantation dbEST libraries is a searchable database of gene expression, in which EST clusters are ranked and displayed on a web interface. Two views of the database are given in Figs 2 and 3. Fig. 2 shows the display generated for a single EST data set. Fig. 3 shows the display when comparing across several EST data sets. Each line contains data for one UniGene, and each column represents a different stage of preimplantation development. Columns illustrate gene transcript expression profiles, and the relationship between expression of individual genes in one stage and over many stages.

Fig. 3 shows the first 10 entries for the fertilized oocyte stage of mouse preimplantation development computed from the dbEST libraries constructed from preimplantation mouse embryos and oocytes. The first column (unigene_id) indicates the UniGene cluster number. The next seven columns cover the seven developmental stages: unfertilized oocyte (u), fertilized oocyte (f), 2- (twocell), 4- (fourcell), 8- (eightcell), 16-cell embryo (sixteencell) and blastocysts (blast), each of which can be ranked in descending order of expression. Any one of the stages can be chosen, in turn, as the reference stage. In Fig. 3, the reference is the fertilized oocyte. The background colour of each table cell reflects the predicted abundance in the embryo and other data derived in step 6 of the computational method. The colour scale runs from dark blue to light blue to provide a visual indication of greater or lesser abundance. A table cell coloured yellow denotes an expression level of a UniGene that shows no statistically significant difference from the reference stage using the method of Audic and Claverie (1997). This computation takes library size into account, with the result that two stages may appear to have similar levels of expression for a particular gene but do, in fact, differ significantly. If the percentage of the UniGene in question is too low to be considered (see step 4 in Materials and Methods), the cell is coloured grey (not shown). The number given in each cell is the normalized abundance for a library of 100 000 ESTs. The unit size of this value is different for each stage, reflecting the total number of ESTs in the aggregate set (4.32; 2.78; 3.17; 5.92; 4.24; 8.50 and 1.30, respectively). A summary of EST data for each aggregate set and an indication of EST assignment to UniGene clusters is given in Table 1.
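
The unit sizes quoted above can be read as the normalized value contributed by a single EST, i.e. 100 000 divided by the number of ESTs in the aggregate set. The short sketch below works this backwards, under that assumption and assuming the seven values are listed in the stage order used for the columns; the stage keys are labels chosen here for illustration.

```python
SCALE = 100_000  # normalized library size used for the displayed values

# Unit sizes as quoted in the text, assumed to follow the column order.
unit_sizes = {
    "u": 4.32, "f": 2.78, "twocell": 3.17, "four": 5.92,
    "eight": 4.24, "sixteen": 8.50, "blast": 1.30,
}

# One EST contributes SCALE / N, so N is approximately SCALE / unit size.
implied_totals = {stage: round(SCALE / unit) for stage, unit in unit_sizes.items()}
implied_sum = sum(implied_totals.values())
```

On this reading the implied aggregate sizes sum to roughly 219 830 ESTs, close to the 219 852 ESTs reported for the 32 libraries, with the blastocyst set the largest and the 16-cell set the smallest. The small gap is what rounding the unit sizes to two decimal places would produce.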


Table 1: Number of ESTs and UniGenes in each aggregate data set

 
In addition to these data, each line of the table also contains the current gene abbreviation (gene), the chromosomal location of each UniGene cluster (cytoband), the Gene number of the UniGene (gene), the HomoloGene group ID of the UniGene cluster (homolog) and the title or descriptor of each gene (title). This information and its colour coding are all computed automatically after each UniGene database update (about every 3 weeks). The data presented here are based on UniGene Build 151.

Fig. 3 also shows the ease with which genes can be tracked, particularly those with high expression levels. Thus, there are a large number of genes that show high transcript expression in the fertilized oocyte and decline sharply during other developmental stages, consistent with the transition from zygotic genome activation to the setting up of embryo compaction and the first differentiation decision. Data for the blastocyst show new genes turning on, consistent with separation of trophectoderm and inner cell mass lineages, and preparation of the blastocyst for implantation (data not shown).

RT–PCR of selected genes
The computational pipeline makes a wealth of predictions about the expression of genes and significant changes in gene expression over the course of development. A sample of these predictions was tested by RT–PCR. cDNA templates, constructed from unfertilized oocyte, 2-, 4-, 8-cell embryo and blastocyst RNA, were used to test expression of 109 genes identified from the database. This analysis included seven genes not represented in the embryo databases. A comparison of RT–PCR measurements and database predictions for these genes is given in Fig. 4. Fig. 5 summarizes RT–PCR results from a range of adult mouse tissues amplified using the same set of 109 PCR primer pairs, as positive controls. We found a 96% concordance between predicted gene expression (both presences and complete absences) and our RT–PCR results when preimplantation development was taken as a whole. The discrepancy was accounted for by the depth of EST sampling at each developmental stage. This was confirmed in a more detailed, stage-by-stage comparison of RT–PCR results with database predictions, where there was a simple relationship between aggregated library sizes and the percentage concordance. The smallest EST dataset in the catalogue tested by RT–PCR was the 4-cell stage (Table 1), where there was an approximate 54% concordance between RT–PCR results and database predictions. By comparison, the blastocyst stage, which has produced the largest EST dataset, demonstrated a concordance between RT–PCR and database prediction of ~77%.


Figure 4: Summary of RT–PCR results and database predictions for preimplantation gene expression

Dark shades indicate either the presence of an RT–PCR product or a positive prediction by the database. Light shades indicate no RT–PCR product or an absence of ESTs for the gene from the database. Each column represents a developmental stage and each row is identified by a Gene ID number. u, unfertilized oocyte-; two, 2-; four, 4-; eight, 8-cell embryo-; blast, blastocyst RT–PCR; uDB, unfertilized oocyte database prediction; 2 DB, 2-; 4 DB, 4-; 8 DB, 8-cell embryo database prediction and blast DB, blastocyst database prediction

 

Figure 5: Summary of RT–PCR results from positive control tissues

Darker shades indicate the presence of an RT–PCR product, while light shades indicate no RT–PCR product. Each column represents a developmental stage or tissue and each row is identified by a Gene ID number. u, unfertilized oocyte RT–PCR; two, 2-; four, 4-; eight, 8-cell embryo-; blast, blastocyst RT–PCR; brain, adult mouse brain; kidney, adult mouse kidney; liver, adult mouse liver; lung, adult mouse lung and ovary, adult mouse ovary

 

    Discussion
This discussion is in three parts: (i) the computational pipeline; (ii) the data generated on mouse preimplantation development in silico by the pipeline and a comparison of these data with our own RT–PCR results and (iii) the opportunities for developing UniGene.

The computational pipeline
The computational pipeline is designed to assemble an annotated in-house database of gene expression, tailored to an investigator’s area of interest. The assembly uses a number of databases, but it starts with the NIH dbEST and UniGene databases. These two databases possess important features that give the pipeline its structure. For dbEST, this feature is an archived set of cDNA libraries, each with a catalogue number and a tissue of origin. These libraries contain ESTs, each with their own GenBank accession number. For UniGene, the unique feature is the clustering of ESTs imported from dbEST. These clusters are assembled from transcripts that share sequence similarity and nominally represent the products of single genes (Boguski and Schuler, 1995). By clustering ESTs from different tissues and organs and aligning the clusters to the genome, it is frequently possible to identify tissue-related splice variants, and track these variants back to specific tissues.

At its simplest, our computational pipeline takes a dbEST library, identifies the UniGene cluster in which an EST is binned, sums occurrences of each UniGene cluster number in the dbEST library, ranks the sums and adds annotation from a variety of other NIH databases. The pipeline has two additional features. It amalgamates dbEST libraries chosen by the investigator into aggregate sets, and it compares (single or aggregate) libraries mathematically and computes statistically significant differences between them.

The principal task of an investigator is to choose those dbEST libraries that are of biological interest. The pipeline will process any libraries, but, to make biological sense, the choices should be consistent with the biological question under investigation. Where the software aggregates dbEST libraries to produce larger data sets, the key assumption is that, if the amalgamated dbEST libraries are non-normalized, an aggregate is equivalent to amalgamating data from replicate tissue samples (Stanton and Green, 2001). One outcome of amalgamation is larger inventories. Amalgamated libraries also potentially provide a more robust gene expression profile because variation between nominally-identical libraries is smoothed, i.e. aggregation potentially smoothes ‘snap-shot’ effects. Factors that could affect the choice of libraries for amalgamation include normalization, unwanted pooling of tissues, developmental staging and animal strain. The computational outcome is automatically determined once the choice of libraries is made.

The pipeline approach also compares gene expression levels between dbEST libraries (either single libraries or aggregated sets), with automatic identification of those genes that differ significantly. This process is carried out at step 4 of the computational pipeline. The statistical test used is one of several methods that are available (Fisher’s Exact test, Z test (equivalent to the chi-squared test)) (Audic and Claverie, 1997). These give similar results.
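
For readers unfamiliar with the Audic and Claverie (1997) test, a minimal sketch follows. It is not the ‘unigene_edd’ script: the function names are hypothetical, and the two-sided p-value convention (twice the smaller tail, capped at 1) is one common choice assumed here. Given x ESTs for a gene among N1 in the reference set, the probability of observing y ESTs among N2 in the other set is p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)].

```python
import math


def ac_prob(y, x, n1, n2):
    """Audic-Claverie probability p(y | x) for library sizes n1 and n2."""
    r = n2 / n1
    # Work in log space so that large counts do not overflow the factorials.
    log_p = (
        y * math.log(r)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log1p(r)
    )
    return math.exp(log_p)


def ac_p_value(x, y, n1, n2):
    """Two-sided p-value for observing y given x (assumed convention)."""
    lower = sum(ac_prob(k, x, n1, n2) for k in range(y + 1))  # P(Y <= y)
    upper = 1.0 - lower + ac_prob(y, x, n1, n2)               # P(Y >= y)
    return min(1.0, 2.0 * min(lower, upper))
```

Because the library sizes enter through the ratio N2/N1, equal raw counts in two sets of very different size can still yield a small p-value, which is why the test is applied to non-normalized counts.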

The pipeline integrates data from other NIH databases as it runs. For example, it identifies orthologues of mouse genes, where these are available, using HomoloGene. Most of the available orthologues turn out to be for genes expressed in adults of the other species. This is not entirely unexpected. Analysis of transcript expression data from mouse embryos, neonates and adults shows clearly that there are genes expressed in embryos and neonates that are not expressed in adults (Stanton and Green, unpublished data). Embryo and neonate cDNA libraries are almost entirely absent from human and other mammalian dbEST catalogues, and hence from HomoloGene.

Predictions about gene expression in mouse preimplantation development
We used our pipeline approach to obtain expression data from a set of mouse preimplantation dbEST libraries. These libraries have at least two important properties. First, they are obtained from uncontaminated cell populations. Preimplantation development is largely cell autonomous, and the embryo makes no direct contact with other cell types, being sequestered within the zona pellucida. Preimplantation development is also a highly ordered set of cell divisions with clear, visually defined stages (unfertilized oocyte, fertilized oocyte, 2-, 4-, 8- and 16-cell embryo and blastocyst), greatly reducing temporal mixing. Second, the libraries are not normalized, making it more likely that their random sampling will lead to an accurate gene expression profile.

The computational pipeline makes a number of predictions about the presence or absence of gene transcripts and their changes with development. We have examined a sample of 109 genes by RT–PCR. Of the genes predicted by our pipeline system to show presence or absence during preimplantation development, 96% showed concordance with expression detected by RT–PCR. This means four UniGenes did not behave as predicted by the database. UniGene Mm.38878 (14064) was predicted by the database to be present at the unfertilized, fertilized and 2-cell stages (normalized EST counts of 12, 16 and 3, respectively), but was not detected by RT–PCR. Primer controls for this assay were positive. Given the low number of ESTs for this transcript in the database, it is probable that the failure of RT–PCR to detect the message was a sampling artifact. Optimization of the primers and further experiments may well verify expression by the preimplantation embryo. This is also likely to be true for UniGene Mm.287425 (19270).

Two UniGenes that were detected by RT–PCR were not in our database (Mm.40740 (11541) and Mm.4831 (15561)), suggesting the database suffers from an underestimation of gene expression. This is confirmed by a more detailed analysis of concordance between pipeline and RT–PCR data, where the smallest aggregate library examined (4-cell embryo) had the smallest number of concordant hits by RT–PCR. It is clear that the 4-cell library is too small to capture the full complexity of low-copy-number transcription, and that the same applies with less force to all the preimplantation embryonic libraries.
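
The effect of library size on the capture of rare transcripts can be sketched with a simple sampling model. The Python below is an illustration only: it assumes that ESTs are drawn independently and at random from the mRNA pool, as is plausible for a non-normalized library, and the function names and the transcript fraction used in the example are hypothetical rather than taken from the preimplantation data.

```python
import math


def detection_probability(fraction, n_ests):
    """Probability that a transcript making up `fraction` of the mRNA pool
    appears at least once among `n_ests` randomly sampled ESTs.

    Equivalent to 1 - (1 - fraction) ** n_ests, written with log1p/expm1
    so that very small fractions are handled accurately.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -math.expm1(n_ests * math.log1p(-fraction))


def ests_needed(fraction, target=0.95):
    """Smallest number of ESTs giving at least `target` detection probability."""
    if not 0.0 < fraction < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("fraction and target must lie strictly between 0 and 1")
    return math.ceil(math.log1p(-target) / math.log1p(-fraction))


# Hypothetical rare transcript: 1 in 100 000 mRNA molecules.
rare_fraction = 1e-5
depth_for_95_percent = ests_needed(rare_fraction, target=0.95)
chance_in_small_library = detection_probability(rare_fraction, 5000)
```

Under this model the chance of missing a transcript falls exponentially with sequencing depth, which is consistent with the smallest aggregate library showing the fewest concordant hits.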

Further opportunities for developing UniGene
UniGene was intended by its founders to be a vehicle for identifying the transcript map of an organism (Boguski and Schuler, 1995). It was started long before mammalian genome sequences were available and it focused on clustering transcripts as the primary tool for identifying products of gene expression. In recent years, it has been possible to map transcripts to genomes. This mapping initially produced a sharp contraction in the number of UniGene clusters, mainly because isolated 3'- and 5'-clusters of the same gene could now be binned into a single new cluster. However, UniGene databases are again increasing in size, suggesting that complete transcriptomes have yet to be fully identified. A valuable outcome of UniGene has been an increasing documentation of alternative splicing sites, as transcripts from different tissues are mapped to the genome.

Despite the considerable power of the UniGene approach, it has the potential to be a still more powerful bioinformatics tool. Arguably, the kinds of biological questions we now wish to ask include: (i) what is the regional variation of gene expression and alternative splicing, not just at organ level but at the level of tissues and, increasingly, individual cells? (ii) what are the stochastic variations in transcript levels over time? and (iii) what is the role of small RNA species in regulation of transcript expression and translation? Ideally, we need firm expression landmarks in exploring this territory, and the firmest landmarks are likely to be inventories of gene expression at the level of individual cells, with rigorous characterization of transcript sequence.

UniGene, as currently constructed, allows us to begin this endeavour. However, it suffers from four major drawbacks: (i) a lack of some archival detail about the cDNA libraries deposited in dbEST, (ii) a lack of sequencing depth in individual libraries, (iii) a paucity of cDNA libraries derived from single cell types and (iv) size selection of the mRNAs from which cDNA libraries in dbEST are constructed.

Our reasoning is as follows. First, the archival descriptions of dbEST library construction are lacking some detail. Data deposition in dbEST and UniGene is the responsibility of investigators and they frequently provide insufficient information (e.g. whether the library is normalized or not). An appropriate archival standard needs to be set by NIH. Second, it is doubtful whether any of the cDNA libraries in dbEST has been sequenced to the depth needed to retrieve all rare transcripts. The advent of cheap, high-throughput sequencing makes this goal feasible. Third, we need increasingly to understand the factors that control cell phenotype. There are between 300 and 1000 cell types in the mammalian body, yet neither the mouse nor human UniGene databases possess cDNA libraries from more than a handful of well-defined cell types. Finally, there is ample evidence that size selection of mRNAs has taken place during construction of most dbEST libraries. This is likely to have eliminated many small regulatory RNAs, and constrained our view of the transcriptome.

The mouse preimplantation data set, studied in this paper, is unusual, in the context of UniGene, because the cDNA libraries are from single cell types (early preimplantation embryo) or two cell types (blastocyst), the developmental stage is accurately identified, and the libraries are known to be non-normalized (although there is an underlying drive for full-length transcripts in the RIKEN libraries that may distort the distribution of transcript species).

The value of future UniGene databases in which transcriptomes of all major cell types of a complex body (plant or animal) have been obtained cannot be overstated. They would provide a firm underpinning for SAGE, MPSS and DNA microarrays and, if rigorously built at the cellular level, would be the benchmark against which to triangulate changes in disease, infection, ageing, etc. It would be possible to build the transcriptomes of more complex systems, such as organs, from the ground up using component cellular transcriptomes and, because it is easy to track alternative splicing, this would provide a route to the tissue- and development-related instantiations of genes. If size selection were avoided, it would also provide access to small RNAs.


    Conclusions
We have shown that, by applying a simple data pipeline to UniGene, we can create gene expression catalogues tailored to the preimplantation embryo. With this publication, we have made the database available to the research community. These catalogues are robust: our work showed an overall 96% concordance between database prediction and measured gene expression. The nature of our pipeline system means our approach has general application in building and comparing annotated gene expression profiles in the animal, insect and plant species for which there are UniGene databases. The need for flexible, user-defined gene expression warehousing systems will increase as such data become easier and more cost-effective to generate. The system we describe here can accommodate new information as UniGene expands, and has the programmed flexibility to incorporate expressed sequence data generated by other techniques such as SAGE or independent high-throughput transcriptome sequencing.


    Acknowledgements
This work was supported by a New Economy Research Fund grant from the New Zealand Foundation for Research, Science and Technology.


    References
Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res (1997) 7:986–995.

Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol (2000) 18:630–634. Erratum in Nat Biotechnol 18, 1021.

Boguski MS, Schuler GD. ESTablishing a human transcript map. Nat Genet (1995) 10:369–371.

Stanton J-AL, Green DPL. Meta-analysis of gene expression in mouse preimplantation embryo development. Mol Hum Reprod (2001) 7:545–552.

Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science (1995) 270:484–487.

Submitted on May 9, 2007; resubmitted on June 28, 2007; accepted on July 10, 2007.

