US 20080228589 A1
Methods and systems for ordering assays which detect SNPs or gene expression are provided. The methods use PCR and RT-PCR procedures. Collections of stock assays are assembled using pre- and post-manufacturing quality control procedures and made available to consumers via the Internet. In addition, custom assays are prepared upon order from the consumer and these assays are also prepared using pre- and post-manufacturing quality control procedures. The assays are then delivered to the consumer.
1. A method for providing to a consumer assays configured to detect presence or expression of genetic material, said method comprising:
providing a web-based user interface configured to receive an order for one or more stock assays;
providing a web-based user interface configured to receive a request for design of one or more custom assays and an order for said custom assays; and
delivering to the consumer at least one custom or stock assay in response to an order for said one or more custom or stock assays placed by the consumer.
2. The method according to
3. The method according to
4. The method according to
5. The method according to
6. The method according to
7. The method according to
8. The method according to
9. The method according to
10. The method according to
11. The method according to
12. The method according to
13. The method according to
14. The method according to
15. The method according to
16. The method according to
17. The method according to
providing a graphical user interface configured for the consumer to perform at least one search for at least one information item used to identify genetic material for a stock assay.
18. The method according to
19. The method according to
20. The method according to
21. The method according to
providing a submission file builder configured to assist the consumer in preparing said submission file for ordering custom assays.
22. The method according to
This application is a continuation of U.S. patent application Ser. No. 10/334,793 filed on Jan. 2, 2003, which claims the benefit of U.S. Provisional Application No. 60/352,039, filed on Jan. 25, 2002, U.S. Provisional Application No. 60/352,356, filed on Jan. 28, 2002, U.S. Provisional Application No. 60/369,127, filed on Apr. 1, 2002, U.S. Provisional Application No. 60/369,657, filed on Apr. 3, 2002, U.S. Provisional Application No. 60/370,921, filed on Apr. 9, 2002, U.S. Provisional Application No. 60/376,171, filed on Apr. 26, 2002, U.S. Provisional Application No. 60/380,057, filed on May 6, 2002, U.S. Provisional Application No. 60/383,627, filed on May 28, 2002, U.S. Provisional Application No. 60/383,954, filed on May 29, 2002, U.S. Provisional Application No. 60/390,708, filed on Jun. 21, 2002, U.S. Provisional Application No. 60/394,115, filed on Jul. 5, 2002, and U.S. Provisional Application No. 60/399,860, filed on Jul. 31, 2002, all of which are hereby incorporated in their entirety by reference.
This application relates to methods for distributing products and services, and more particularly, to methods for placing, accepting, and filling orders for products and services, especially biotechnological products and services.
With the completion of the first draft of the human genome along with the sequencing of the genomes of other species, an enormous amount of genomic resource data has become available. This data has permitted extensive studies of gene expression as well as studies of single nucleotide polymorphisms and their linkage to disease conditions. However, these and other studies have been limited by the need of researchers to spend substantial time, money, and manual labor in the design of probes and primers for experimental assays. Once designed, the researcher can synthesize the probes and primers or order them from an oligonucleotide synthesis facility or service. Only a limited number of studies can be done given time constraints required for the individual researcher to complete each of the tasks leading up to a particular experiment, and, therefore, an overall provider of design, manufacturing, and validation services for probes and primers would be of significant value to the researcher.
Accordingly, the present inventors have succeeded in developing web-based systems for ordering assays, which, in various embodiments, can comprise probes and primers. Included among various of these systems are systems for ordering probes and primers that have undergone design, manufacturing, and validation procedures. In some of these various systems, the ordered probes and primers are delivered to the researcher along with information detailing various parameters associated with production of the assay delivered.
Thus, in various configurations of the present invention, there can be provided a method for supplying to a consumer assays useful in obtaining structural genomic information, such as the presence or absence of one or more single nucleotide polymorphisms (SNPs), and functional genomic information, such as the expression or amount of expression of one or more genes. As such, the assays can be configured to detect the presence or expression of genetic material in a biological sample. The method includes providing a web-based user interface configured for receiving orders for stock assays, providing a web-based user interface configured for receiving requests for design of custom assays and for ordering said assays, and delivering to the consumer at least one custom or stock assay in response to an order for the one custom or stock assay placed by the consumer. In certain other aspects, the present invention can also be directed to a system and to methods for constructing a system for providing to a consumer assays configured to detect presence or expression of genetic material.
In various configurations of the invention as described above, the method can further include providing a web-based gene exploration platform configured to provide information to assist a consumer in selecting one or both of a stock assay and a custom assay.
The present invention, in various configurations, can also include a search resource provided to identify genetic material. The search resource may provide one or more parameters identifying gene structure or function for selection by the consumer. Assays that detect the presence or expression of genetic material may include assays for detecting SNPs or for detecting expressed genes. In various configurations, the ordering interface can be configured to receive criteria related to the SNP or to the expressed transcript for which an assay is ordered.
Stock SNP assays provided by the web-based user interface can include, in some configurations, a large number of SNP assays, for example, at least 40,000 SNP assays for detecting at least 40,000 pairs of SNP alleles, or at least 100,000 SNP assays for detecting at least 100,000 pairs of SNP alleles. In some configurations, SNP assays that can be ordered can be assays for SNPs that are known to be located in gene regions. In some configurations, the detectable SNPs may be located at intervals of about 10 kilobases (kb). Also in some configurations, the SNPs have a minor allele frequency of about 10% in a population (which may be, but is not necessarily, a human population).
Stock gene expression assays provided by the web-based user interface can include, in some configurations, assays for at least about 10,000 or more expressed genes. In certain configurations, gene expression assays for multi-exon genes can be made up of probes and primers designed to lie on exon-exon boundaries to preclude amplification of genomic DNA.
For SNP assays and gene expression assays, either or both of pre-manufacturing quality control and post-manufacturing quality control can be provided in some configurations of the present invention. Pre-manufacturing quality control may include one or more of pre-processing selection, designing primers and probes, and performing in silico quality control. In the case of SNP assays, pre-manufacturing controls may include identifying optimal sequence regions which may not contain any SNPs or repeat sequences. In the case of gene expression assays, the optimal sequence regions in some configurations may not contain any SNPs other than the SNP that the assay is designed to detect, and also do not contain any repeat sequences. The designing of primers and probes may comprise, in some configurations, avoidance of non-optimal regions as defined above as well as the use of specifications that optimize PCR reaction conditions for the designed assay. Such specifications include assay values for T_m, GC content, buffer and salt conditions, oligonucleotide concentration in the assay, low secondary structure of the oligonucleotide, amplicon size and low incidence of primer-dimer formation. In silico quality control can ensure that probes and primers match target sequences but do not match other sequences in the genome or other transcripts.
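By way of illustration only, two of the design specifications named above (GC content and T_m) can be screened in silico along the following lines. This is a minimal sketch and not the design pipeline of the present invention: the Wallace rule used for T_m is a common rough approximation chosen here for brevity, and the function names and all threshold values are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C).

    The Wallace rule is an assumption of this sketch; it is only a
    coarse estimate and ignores salt and oligonucleotide concentration.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def passes_spec(seq: str, gc_min: float = 0.3, gc_max: float = 0.8,
                tm_min: int = 50, tm_max: int = 65) -> bool:
    """Check a candidate primer against hypothetical GC and Tm windows."""
    return (gc_min <= gc_content(seq) <= gc_max
            and tm_min <= wallace_tm(seq) <= tm_max)
```

A candidate that fails either window would simply be discarded before the remaining checks (secondary structure, primer-dimer formation, genome-wide specificity), which are not modeled here.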
Post-manufacturing quality control provided in some configurations includes one or more of synthesis yield testing, analytical quality control testing, functional testing, and validation testing.
In some configurations, assays can be shipped with a data sheet which may be a hard-copy datasheet or an electronic datasheet, or both. The electronic datasheet may be in the form of a CD-ROM or other suitable machine readable form. Assays that are shipped can be identified, in some configurations, by identifiers which can include a two-dimensional (2-D) barcode, and an assay identification number. The assay components in certain configurations include, in a single tube, two primers and a TaqMan® probe. In the case of SNP assays, two primers and two TaqMan® probes can be included, i.e., one TaqMan® probe for each allele. In some configurations, the tubes also contain PCR reagents for performing assays.
In certain configurations, the present invention provides an assay kit. The kit contains at least one assay for detecting presence or expression of genomic material. The kit also contains an information source comprising an E-datasheet, an assay information file, or at least one printed-copy datasheet or combinations thereof.
Various configurations of the present invention also provide a method for building a submission file useful for ordering at least one of SNP genotyping assays and gene expression assays. The method includes providing a graphical user interface configured to accept, from a user, information relating to (a) recipient identification, (b) assay amount, and (c) at least one target sequence, electronically validating at least a portion of the information relating to the target sequence; and saving the information relating to recipient information, assay amount, and target sequence to a file, wherein the information relating to target sequence includes the validated information.
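For illustration, the three steps of the submission file method (accepting the information, electronically validating the target sequence, and saving to a file) might be sketched as follows. The JSON layout, the field names, and the rule that a target sequence may contain only A, C, G, T and N are assumptions of this sketch; the actual submission file format is not specified here.

```python
import json

# Assumed alphabet for a target sequence; N marks an unknown base.
ALLOWED_BASES = set("ACGTN")


def validate_target(seq: str) -> str:
    """Strip whitespace, upper-case, and reject unexpected characters."""
    cleaned = "".join(seq.split()).upper()
    if not cleaned:
        raise ValueError("target sequence is empty")
    bad = sorted(set(cleaned) - ALLOWED_BASES)
    if bad:
        raise ValueError(f"invalid characters in target sequence: {bad}")
    return cleaned


def build_record(recipient: str, assay_amount: int, target: str) -> dict:
    """Collect (a) recipient, (b) assay amount, (c) validated target."""
    if assay_amount <= 0:
        raise ValueError("assay amount must be positive")
    return {
        "recipient": recipient,
        "assay_amount": assay_amount,
        "target_sequence": validate_target(target),
    }


def save_record(record: dict, handle) -> None:
    """Write the record to an open text file handle (hypothetical JSON layout)."""
    json.dump(record, handle, indent=2)
```

In use, a record built with `build_record("lab-1", 2, "acgt nacg")` would carry the normalized sequence `ACGTNACG`, so the saved file contains only the validated form.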
Various configurations of the present invention also provide a system for supplying genomic products and services to a consumer. The products and services provided can be used to detect presence or expression of genetic material in a biological sample. The system comprises a first source of information regarding at least one of presence or expression of genetic material in biological samples, a second source of information regarding products and services for analyzing genetic material and an interface system communicating with the first source of information and the second source of information. The system is able to recommend to the consumer certain processes and services in response to inquiries to said first source of information by the consumer.
In various configurations, the present invention can provide a web-based user interface configured to receive a request for design of one or more custom assays and an order for said custom assays; and deliver to the consumer at least one custom assay in a single tube in response to an order for said at least one custom assay placed by the consumer, wherein said assay comprises at least one probe, a forward primer and a reverse primer.
In various configurations, the present invention can also provide a web-based user interface configured to receive an order for one or more stock assays; and deliver to the consumer at least one stock assay in a single tube in response to an order for said at least one stock assay placed by the consumer, wherein said assay comprises at least one probe, a forward primer and a reverse primer.
The present invention also provides a web portal configured to provide an interface configured to accept orders for one or more stock assays; an interface configured to accept orders for one or more custom assays; and a gene exploration platform configured to provide information to assist a user in selecting one or both of a stock assay and a custom assay.
The present invention will become more fully understood from the detailed description and the accompanying drawings, wherein:
Allele. One of several alternative forms of a gene or DNA sequence at a specific chromosomal location (locus). At each autosomal locus an individual possesses two alleles, one inherited from the father and one from the mother.
Allele-specific Oligonucleotide (ASO). A synthetic oligonucleotide, often about 20 bases long, which hybridizes to a specific target sequence and whose hybridization can be disrupted by a single base pair mismatch under carefully controlled conditions. ASOs are often labeled and used as allele-specific hybridization probes. They can also be designed to act as allele-specific primers in certain PCR applications.
Allelic association. Any significant association between specific alleles at two or more neighboring loci.
Alternative splicing. The natural usage of different sets of exons, to produce more than one product from a single gene.
Assay. Any of a number of nucleic acid assay systems (for review see Kricka, Ann Clin Biochem. 39:114-129, 2002; Shi, Clin. Chem. 47:164-172, 2001; Baner et al., Curr. Opin. Biotechnol. 12:11-15, 2001; Wittwer et al., U.S. Pat. No. 6,174,670, 2001). In various embodiments an assay can comprise nucleobase polymers, such as, for example, oligonucleotides, which constitute one or more probes and/or a forward and reverse primer. The assays can be configured to detect the presence of a SNP, the expression of a gene or the expression level of a gene. When using a TaqMan® procedure, the assay includes a TaqMan® probe, a forward primer and a reverse primer. See also “custom assay” and “stock assay.”
Alu repeat (or sequence). One of a family of about 750,000 interspersed sequences in the human genome that are thought to have originated from the 7SL RNA gene.
Amplicon. A region defined by pairing of forward and reverse primers around a target site.
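As an illustration of the amplicon definition, the region bounded by a primer pair can be located on a single-stranded representation of the template as sketched below. The sketch assumes exact matches only; the function names are hypothetical, and real primer placement would also tolerate mismatches and consider both strands.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_amplicon(template: str, forward: str, reverse: str):
    """Return the amplicon defined by a primer pair, or None if not found.

    The forward primer is searched as written; the reverse primer binds
    the opposite strand, so its reverse complement is searched downstream
    of the forward primer site.
    """
    template = template.upper()
    fwd = forward.upper()
    start = template.find(fwd)
    if start == -1:
        return None
    rev_site = reverse_complement(reverse)
    rev_start = template.find(rev_site, start + len(fwd))
    if rev_start == -1:
        return None
    return template[start:rev_start + len(rev_site)]
```

The returned string includes both primer sites, so its length is the amplicon size referred to in the design specifications.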
Anticodon. A sequence of three consecutive bases in a tRNA molecule that specifically binds to a complementary codon sequence in mRNA.
Autocalling. The use of an automated system to make a determination of genotype.
Bioinformatics. The collection, organization and analysis of large amounts of biological data, using networks of computers and databases.
BLAST. Basic Local Alignment Search Tool: algorithms for sequence searching. A fast technique for detecting subsequences that match a given query sequence. BLAST is a heuristic search algorithm employed by computer programs to ascribe significance to sequence findings using well-known statistical methods, for example, a fast search algorithm to search DNA databases based upon sequence similarities. (See, for example, Altschul et al., J Mol Biol 215:403-10, 1990; Karlin et al., Proc. Nat'l Acad. Sci. USA 87: 2264-2268, 1990; Karlin et al., Proc. Nat'l Acad. Sci. USA 90: 5873-5877, 1993; and Altschul et al., Nat. Genet. 6: 119-129, 1994.) A BLAST analysis, in this context, refers to comparing sequences using a BLAST program such as blastp, blastn, blastx, tblastn, tblastx or MPBLAST (Korf et al., Bioinformatics 16: 1052-1053, 2000). “BLASTING,” in this context, refers to comparing a sequence to sequences in a database, and identifying sequences contained in the database that are similar or identical to the sequence or its complement.
BLASTn. Search of a DNA sequence against a DNA sequence database.
Calling. The process of determining a genotype.
cDNA. Complementary DNA—a single stranded DNA sequence that was generated from and complementary to an mRNA sequence by reverse transcription. cDNA sequences contain only genes that code for protein (no non-coding DNA is included).
cDNA Library. A collection of single stranded DNA sequences that represent DNA that is translated into protein. cDNA libraries are generated from mRNA. They are designed to represent the portion of the genome that is present as mRNA in a given cell on its way to synthesizing the proteins represented in that cell.
Centimorgan (cM). A unit of measure of recombination frequency. One centimorgan is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus due to crossing over in a single generation. In human beings, 1 centimorgan is equivalent, on average, to 1 million base pairs.
Common SNPs. SNPs which have a minor allele frequency equal to or greater than a minimum percent of occurrence in an overall population, e.g. a population of humans, or in certain subsets of the overall population. Such subsets can include ethnically defined subset populations. This can be assessed using samples from mixed populations or from specific populations such as Caucasian populations or African American populations as are available from repositories such as, for example, the Coriell Cell Repositories (Coriell Institute for Medical Research, Camden, N.J.).
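By way of example, the minor allele frequency test in the definition of common SNPs can be computed from a set of sampled genotypes as sketched below. The representation of a genotype as a two-character string and the default threshold of 10% are assumptions of this sketch; the definition itself leaves the minimum percent open.

```python
from collections import Counter


def minor_allele_frequency(genotypes) -> float:
    """Return the frequency of the rarer allele at a biallelic SNP.

    genotypes is an iterable of two-character strings such as "AG",
    one per sampled individual (a representation assumed here).
    """
    counts = Counter(allele for g in genotypes for allele in g.upper())
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles in the sample")
    return min(counts.values()) / sum(counts.values())


def is_common_snp(genotypes, threshold: float = 0.10) -> bool:
    """True if the minor allele frequency meets the chosen threshold."""
    return minor_allele_frequency(genotypes) >= threshold
```

A sample in which only one allele is observed raises an error rather than returning zero, since such a site cannot be assessed as a SNP from that sample alone.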
Conserved sequence. A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has remained essentially unchanged throughout evolution.
Consumer. Encompasses customers and other users of the products and services provided in configurations of the present invention. Unless explicitly stated otherwise, it is permitted but not required that configurations of the present invention precondition distribution on receipt of a payment or a promise to pay from the consumer for the distributed products or services. The terms “consumer,” “requester,” “user” and “investigator” refer to entities different from the supplier and distributor. The terms “consumer,” “requester,” “user” and “investigator” are often used interchangeably herein. However, in any given situation, it is possible that the consumer, the requester, the user and/or the investigator are different entities or individuals, which themselves may (or may not) be related by agency. For example, the consumer, requester, user and investigator in one instance may be a single individual engaged in research, such as at a college or university. As another example, the consumer may be a medical institution, the investigator may be a physician or researcher employed by the medical institution, and the requester may be an assistant of the investigator. Also herein, the term “user” is frequently used to refer to an entity (such as a consumer, a requester, or an investigator) who can be accessing a computer system.
Contig display name. The contig display name is the genome assembly (GA) name as used in some configurations of gene exploration systems.
Cryptic splice site. A sequence that resembles an authentic splice junction site and which can, under certain circumstances, participate in an RNA splicing reaction.
Custom assay. An assay that is designed from specifications that are generally related to the target sequence, but that do not contain information on the specific sequence of the probe or probes and primers.
dbSNP rs#ID. A specific field for searching for a SNP according to a dbSNP reference cluster ID.
dbSNP ss#ID. A specific field for searching for a SNP according to a dbSNP assay ID.
Deletions. Generated by removal of a sequence of DNA, the regions on either side being joined together.
Discriminator. A procedure in which the “A-statistic” is used to screen out assemblies that are likely to be stacked regions of repetitive sequence that can be from more than one area of the genome.
Distribute. As used herein, the terms “distribute” and “provide” may be used synonymously, and are intended to encompass selling, marketing, or otherwise providing a product or service.
Distributor. As used herein the terms “distributor,” “provider” and “supplier” are used to refer to an entity or entities that distributes and/or supplies products and/or services. The terms “distributor,” “provider,” and “supplier” can encompass sellers, marketers, and other providers of such products and services. The distributor, supplier, and provider may refer to the same entity, to two different entities, or to three different entities. In the description herein, it may be generally assumed that the manufacturer can be the supplier and distributor of the assay-related products and services described herein. However, in some configurations of the present invention, the distribution of the assay-related products and services described herein may be performed by an entity other than the manufacturer who supplies them.
DNA sequence. The relative order of base pairs, whether in a fragment of DNA, a gene, a chromosome, or an entire genome. See base sequence analysis.
Domain. A discrete portion of a protein with its own function and structure. The combination of domains in a single protein determines its overall function. The domain of a chromosome may refer either to a discrete structural entity defined as a region within which a supercoiling can be independent of other domains; or to an extensive region including an expressed gene that can have a heightened sensitivity to degradation by the enzyme DNAase I.
ENTREZ. NCBI's (National Center for Biotechnology Information) search and retrieval system for their data sets. It organizes GenBank sequences and links them to the literature sources in which they originally appeared.
EST. Expressed Sequence Tag. A sampling of sequence from a cDNA library. A short sequence of a cDNA clone for which a PCR assay is available.
Euchromatin. The fraction of the nuclear genome that contains transcriptionally active DNA and which, unlike heterochromatin, adopts a relatively extended conformation.
Exon(s). The protein-coding sequences of genes. Exons only comprise about 10% of the human genome. A segment of a gene that is decoded to give a mRNA product or a mature RNA product. Individual exons may contain coding DNA and/or noncoding DNA (untranslated sequences). See introns.
FASTA (file or format). A DNA sequence format that begins with a single line of text description that is less than 80 characters in length, followed by the DNA sequence file.
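For illustration, a file in this format can be read with a few lines of code. The sketch below relies on the usual FASTA convention that each description line begins with a ">" character, and the function name is hypothetical.

```python
def parse_fasta(text: str):
    """Parse FASTA-formatted text into a list of (description, sequence).

    Each record starts with a description line beginning with ">";
    the following lines, up to the next ">", are joined into one sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```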
FASTA Search. A database search tool used to compare a nucleotide or peptide sequence to a sequence database. The program is based on the rapid sequence algorithm described by Lipman and Pearson.
Fragments. Small sections of DNA.
Frameshift mutation. A mutation that alters the normal translational reading frame of a DNA sequence.
GenBank. The public DNA sequence database maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine.
Gene Exploration Platform (also referred to as Gene Exploration System). A web-based user interface configured to provide searchable information related to one or more genomes and/or transcriptomes and/or proteomes.
Gene families. Groups of closely related genes that make similar products.
Gene Ontology (GO). A controlled vocabulary for the description of the molecular function, biological process and cellular component of gene products which can be applied to all eukaryotes. The GO terms can be used as search identifiers.
Gene prediction. The process of using computational methods that search for known indicators of coding regions in the raw genomic sequence. These indicators include codon use bias, lack of stop codons, similarity of the translated protein sequence to known proteins, upstream regulators, splice sites, start codon. The outcome can be a set of exons that define a predicted gene.
Gene region. A linear stretch of genomic DNA which serves as a functional gene region consisting of cis-acting regulatory regions, transcribed regions, and intervening sequences as well as 10 kilobase pairs of 5′ flanking sequence and 10 kilobase pairs of 3′ flanking sequence.
Genomics. The study of the genetic material of an organism; the sequencing and characterization of the genome and analysis of the relationship between gene activity and cell function. The genetic material includes exons, introns, regulatory sequences, repeat elements and all other unidentified regions of the genome.
GI. GenBank Identifier, a unique number assigned to protein and nucleotide sequences in the GenBank database.
GT-AG rule. Rule that describes the presence of these constant dinucleotides at the first two and last two positions of introns of nuclear genes.
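The rule lends itself to a direct check on a candidate intron sequence, sketched below for illustration. The sketch assumes the intron is given in the sense orientation as DNA, and the function name is hypothetical.

```python
def follows_gt_ag_rule(intron: str) -> bool:
    """True if the intron begins with GT and ends with AG.

    A minimum length of four bases is required so that the two
    dinucleotides do not overlap.
    """
    intron = intron.upper()
    return (len(intron) >= 4
            and intron.startswith("GT")
            and intron.endswith("AG"))
```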
Haplotype. A series of alleles found at linked loci on a single (paternal or maternal) chromosome.
Heterochromatin. A region of the genome, which remains highly condensed throughout the cell cycle and shows little or no evidence of active gene expression.
Homologies. Similarities in DNA or protein sequences between individuals of the same species or among different species. Homologous chromosomes: a pair of chromosomes containing the same linear gene sequences, each derived from one parent. Homologous chromosomes (homologs): two copies of the same type of chromosome found in a diploid cell, one having been inherited from the father and the other from the mother. Homologous genes (homologs): two or more genes whose sequences are significantly related because of a close evolutionary relationship, either between species (orthologs) or within a species (paralogs).
HSPs. High-scoring Segment Pairs; two sequence fragments of arbitrary but equal length with an alignment that can be locally maximal and for which the alignment score meets or exceeds a threshold (cutoff) score. These can be generated by BLAST.
Informatics. The study of the application of computer and statistical techniques to the management of information. In genome projects, informatics includes the development of methods to search databases quickly, to analyze DNA sequence information, and to predict protein sequence and structure from DNA sequence data.
Introns. DNA sequences in genes, which have no protein-coding function. Other non-coding regions include control or regulatory sequences and intergenic regions whose functions are unknown. Noncoding DNA separates neighboring exons in eukaryotic genes. During gene expression, introns, like exons, can be transcribed into RNA, but the transcribed intron sequences can be subsequently removed by RNA splicing and are not present in mRNA.
Investigator. See “consumer.”
Linkage map. A map of the relative positions of genetic loci on a chromosome, determined on the basis of how often the loci are inherited together. Distance is measured in centimorgans (cM).
Linker (or adaptor oligonucleotide). A double-stranded oligonucleotide that can be ligated to a cloned DNA of interest in order, for example, to facilitate its ability to be cloned.
Marker. An identifiable physical location on a chromosome (e.g., restriction enzyme cutting site, gene) whose inheritance can be monitored. Markers can be expressed regions of DNA (genes) or some segment of DNA with no known coding function but whose pattern of inheritance can be determined. See RFLP, restriction fragment length polymorphism.
Master cluster. A “super cluster” that can be formed by joining clusters and singletons that have representative clones with significant matches (a Product Score of 40 or more) to the same gene. The master cluster is named after the cluster (or singleton) with the highest Product Score.
Mate pairs. A pair of reads that are in opposite orientations and at a distance from each other approximately equal to the insert length.
Messenger RNA (mRNA). RNA that serves as a template for protein synthesis. See genetic code.
Missense mutation. A nucleotide substitution that results in an amino acid change.
mRNA (Messenger RNA). The nucleic acid intermediate that can be used to synthesize a protein. The mRNA corresponds to one strand of the DNA and the sequence of the mRNA can be identical to the sequence of the DNA, except for the replacement of a T (thymine) with U (uracil).
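The correspondence between mRNA and DNA described above can be illustrated with the short sketch below. It covers both cases: starting from the strand whose sequence the mRNA matches (often called the coding strand) and starting from the complementary template strand. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def coding_strand_to_mrna(dna: str) -> str:
    """mRNA has the coding-strand sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def template_strand_to_mrna(dna: str) -> str:
    """Derive the mRNA from the template strand, written 5' to 3'.

    The template is reverse-complemented to recover the coding strand,
    then T is replaced by U.
    """
    coding = dna.upper().translate(_COMPLEMENT)[::-1]
    return coding_strand_to_mrna(coding)
```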
Mutation frequency. The frequency at which a particular mutant is found in a population.
NCBI. The National Center for Biotechnology Information.
Nonsense mutation. A mutation that occurs within a codon and changes it to a stop codon.
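As an illustration of the missense and nonsense definitions, a single-base change within a codon can be classified against the standard genetic code as sketched below. The labels "silent" and "stop-loss" are added by this sketch for completeness and are not terms defined in this glossary; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G;
# "*" marks a stop codon.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def classify_point_mutation(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-nucleotide substitution within one codon."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if ref not in CODON_TABLE or alt not in CODON_TABLE:
        raise ValueError("codons must be three bases from A, C, G, T")
    if sum(a != b for a, b in zip(ref, alt)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid (label added by this sketch)
    if alt_aa == "*":
        return "nonsense"    # codon changed to a stop codon
    if ref_aa == "*":
        return "stop-loss"   # stop codon lost (label added by this sketch)
    return "missense"        # amino acid changed
```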
Normalized library. A cDNA library from which most of the highly expressed sequences have been removed in order to represent a greater proportion of low-abundance messenger RNAs. Normalized libraries are not an accurate reflection of a tissue's gene-expression profile.
Nucleobase. Any nitrogen-containing heterocyclic moiety capable of forming Watson-Crick hydrogen bonds in pairing with a complementary nucleobase or nucleobase analog, e.g. a purine, a 7-deazapurine, or a pyrimidine. The present invention in some configurations uses assays based upon probes that can be polynucleotides or polymeric forms of other nucleobases such as nucleic acid analogs. Typical nucleobases can be the naturally occurring nucleobases adenine, guanine, cytosine, uracil, thymine, and analogs (Seela, U.S. Pat. No. 5,446,139) of the naturally occurring nucleobases, e.g. 7-deazaadenine, 7-deazaguanine, 7-deaza-8-azaguanine, 7-deaza-8-azaadenine, inosine, nebularine, nitropyrrole (Bergstrom, (1995) J. Amer. Chem. Soc. 117:1201-09), nitroindole, 2-aminopurine, 2-amino-6-chloropurine, 2,6-diaminopurine, hypoxanthine, pseudouridine, pseudocytosine, pseudoisocytosine, 5-propynylcytosine, isocytosine, isoguanine (Seela, U.S. Pat. No. 6,147,199), 7-deazaguanine (Seela, U.S. Pat. No. 5,990,303), 2-azapurine (Seela, WO 01/16149), 2-thiopyrimidine, 6-thioguanine, 4-thiothymine, 4-thiouracil, O6-methylguanine, N6-methyladenine, O4-methylthymine, 5,6-dihydrothymine, 5,6-dihydrouracil, 4-methylindole, pyrazolo[3,4-D]pyrimidines, “PPG” (Meyer, U.S. Pat. Nos. 6,143,877 and 6,127,121; Gall, WO 01/38584), and ethenoadenine (Fasman (1989) in Practical Handbook of Biochemistry and Molecular Biology, pp. 385-394, CRC Press, Boca Raton, Fla.). Nucleobases that are nucleic acid analogs include peptide nucleic acids in which the sugar/phosphate backbone of DNA or RNA has been replaced with acyclic, achiral, and neutral polyamide linkages. The 2-aminoethylglycine polyamide linkage with nucleobases attached to the linkage through an amide bond has been reported (see, for example, Buchardt, WO 92/20702; Nielsen (1991) Science 254:1497-1500; Egholm (1993) Nature 365:566-68).
Open Reading Frame (ORF). A stretch of nucleotide sequence with an initiation codon at one end, a series of triplet codons and a termination codon at the other end: potentially capable of coding for an as yet unidentified peptide or protein.
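For illustration, open reading frames as defined above can be located on one strand of a DNA sequence with the sketch below. It assumes ATG as the initiation codon and TAA, TAG and TGA as termination codons, scans the forward strand only, and reports every ATG that reaches an in-frame stop, so nested ORFs sharing a stop codon are listed separately. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) index pairs of ORFs on the given strand.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame, until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs
```

An ATG with no in-frame stop codon before the end of the sequence is not reported, since the definition requires a termination codon at the other end.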
Ortholog. One of a set of homologous genes in different species (e.g. SRY in humans and Sry in mice).
Panther. Celera Genomics's proprietary protein classification software that allows hierarchical classification of protein families and subfamilies to further aid in identifying probable protein function. Panther facilitates target identification and prioritization by allowing more accurate predictions of protein function.
Paralog. One of a set of homologous genes within a single species.
Pharmacogenomics. The study of the stratification of the pharmacological response to a drug by a population based on the genetic variation of that population.
Phrap. Developed by Phil Green at the University of Washington, “Phil's Revised Assembly Program” is a tool for assembling shotgun-sequenced DNA fragments.
PHYLIP. A program package created by J. Felsenstein for inferring phylogenies.
Physical map. A map of the locations of identifiable landmarks on DNA (e.g., restriction enzyme cutting sites, genes), regardless of inheritance. Distance can be measured in base pairs. The relative positions of regions can be determined by physical measurements, such as by electron microscopy, restriction analysis, or sequence determination. For the human genome, the lowest-resolution physical map is the banding patterns on the 24 different chromosomes; the highest-resolution map would be the complete nucleotide sequence of the chromosomes.
Point mutation. A mutation causing a small alteration in the DNA sequence at a locus, often a single nucleotide change.
Polygenic character. A character determined by the combined action of a number of genetic loci. Mathematical polygenic theory assumes there can be very many loci, each with a small effect.
Polygenic disorders. Genetic disorders resulting from the combined action of alleles of more than one gene (e.g., heart disease, diabetes, and some cancers). Although such disorders can be inherited, they depend on the simultaneous presence of several alleles; thus the hereditary patterns are usually more complex than those of single-gene disorders.
Polymorphism. Difference in DNA sequence among individuals. Genetic variations occurring in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis.
Precomputes. A series of computational analyses comparing Celera Genomics data to public data. The analyses used include gene prediction (GRAIL, Genscan, FgenesH), BLAST computes using several public and proprietary datasets (nraa, CHGD, RefSeq) to show similarity, and polishing of the BLAST results to find consensus splice sites using SIM4 or Genewise with sequences that are highly similar to the genomic sequence.
Primer. A primer comprises a polymer of nucleobases, such as, for example, an oligonucleotide, the sequence of which is complementary to a target sequence, or to the complement of a target sequence. In certain aspects, the 3′ end of an oligonucleotide primer can be extended by a DNA polymerase. The primer is short relative to the target nucleic acid. A primer sequence in some configurations comprises from about ten to about fifty nucleotides, and in some configurations comprises from about six, about eight, about ten, about thirteen up to about thirty nucleotides and any length there between. In most cases, PCR involves a forward primer and a reverse primer, which hybridize to opposite strands in a target sequence.
Probe. A “probe” comprises an oligonucleotide that hybridizes to a target sequence. In the TaqMan® assay procedure, the probe hybridizes to a portion of the target situated between the binding sites of the two primers. A probe can further comprise a reporter group moiety. In some configurations, the reporter group moiety can be a fluorophore moiety. The reporter group can be covalently attached directly to the probe oligonucleotide, in some configurations to a base located at the probe's 5′ end or at the probe's 3′ end. The reporter group may also be attached to a minor groove binder (MGB), which is itself covalently attached to the probe (Afonina et al., Nucleic Acids Research 25: 2657-2660 (1997); Kutyavin et al., Nucleic Acids Research 28: 655-661 (2000)). The MGB is, in some configurations, attached to the 3′ end of the probe, either directly to the oligonucleotide or else to the fluorophore moiety or to the quencher moiety. A probe comprising a fluorophore moiety may also further comprise a quencher moiety. The quencher moiety is, in some configurations, a non-fluorescent quencher (NFQ). In some configurations, in probes designed for SNP detection, the fluorophore and the quencher can be attached to the oligonucleotide on opposite sides of the SNP nucleotide. A probe comprises about eight nucleotides, about ten nucleotides, about fifteen nucleotides, about twenty nucleotides, about thirty nucleotides, about forty nucleotides, or about fifty nucleotides. In some configurations, a probe comprises from about eight nucleotides to about fifteen nucleotides. As used herein, the use of the term “a probe” (singular) is intended to include or refer to two bi-allelic probes in the case of SNP assays, unless stated otherwise.
Proteome. The full set of proteins encoded by a genome.
Provide. See “Distribute.”
Provider. See “Distributor.”
Query. The DNA sequence used to search a database.
Radiation hybrid. A type of somatic cell hybrid in which fragments of chromosomes of one cell type can be generated by exposure to X-rays, and are subsequently allowed to integrate into the chromosomes of a second cell type.
Real time. The term “real time” is always spelled out in full. The abbreviation “RT,” as used herein, always refers to “reverse transcriptase.”
Receptor. A molecule (usually a protein) that spans a cell membrane, receives extracellular signals, and transmits them into the cell.
Regional overlay. Celera regional overlays can be created from Celera fragments and mate pair links, and external finished clones and unordered contigs from unfinished clones, which are referred to as BACs. The Celera Regional Assembler takes the external data and uses Celera fragments and mate pairs to order and orient the contigs within BACs, filling in gaps where possible.
Regulatory regions or sequences. A DNA base sequence that controls gene expression.
Repetitive DNA. A set of nonallelic DNA sequences which show considerable sequence homology.
Requestor. See “consumer.”
Reverse transcriptase (RT). The abbreviation “RT” is used herein exclusively as an abbreviation for “reverse transcriptase.” The term “real time” is always spelled out in full.
Scaffolds. Sets of contigs that can be ordered and oriented using enforcing mate pairs.
Sequence homology. A measure of the similarity in the sequence of two nucleic acids or two polypeptides.
Sequence tagged site (STS). Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs can be useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks on the developing physical map of the human genome. Expressed sequence tags (ESTs) can be STSs derived from cDNAs.
Significant complementarity. Includes complementarity sufficient to interfere with the analysis of a target sequence. Significant complementarity can comprise, in non-limiting example, at least about 40% or greater sequence identity with the complement of a target sequence.
Single Nucleotide Polymorphism (SNP). Replacement, loss, or addition of one nucleotide (either A, C, G or T) in the DNA sequence. There are probably several million SNPs throughout the genome, and these alleles account for much of the variation seen in the human population. These predominately biallelic polymorphisms may exist in varying ratios in the population ranging from very rare alleles (1-5% frequency) to common alleles (20-50% frequency).
Splice acceptor site. The junction between the end of an intron terminating in the dinucleotide AG, and the start of the next exon.
Splice donor site. The junction between the end of an exon and the start of the downstream intron, commencing with the dinucleotide GT.
Stock assay. A pre-designed assay that does not require custom design. In some configurations of the present invention, an inventory of stock assays may be maintained from which users may place orders.
Stringency. A parameter for filtering the results of a query based on how closely related the sequences in a cluster must be.
Subject. A DNA sequence that produces a match in a blast search.
Supplier. See “Distributor.”
SWISSPROT. European annotated non-redundant protein sequence database; most highly annotated protein database.
TA. Transcript assembly. Celera assembly of public EST.
Tandem repeat sequences. Multiple copies of the same base sequence on a chromosome; used as a marker in physical mapping.
Target. A biological sample comprising a nucleic acid. A target can comprise a single-stranded or double-stranded nucleic acid, and can comprise an RNA or a DNA. An RNA can be, in non-limiting example, a messenger RNA (mRNA), a primary transcript, a viral RNA, or a ribosomal RNA. A DNA can be, in non-limiting example, a single-stranded DNA, a double-stranded DNA, a cDNA, a viral DNA, an extrachromosomal DNA, or a mitochondrial DNA. A skilled artisan will recognize from the context of usage whether a target nucleic acid is single-stranded or double-stranded.
TBLASTn. A BLAST search of a protein sequence against a nucleotide sequence database that has been translated in all six frames.
Trace Files. The product of sequencing completed by the ABI 3700 Prism. After going through stringent quality control processes, trace files can then be used as data input for assembly.
Transcriptome. The full complement of activated genes, mRNAs, or transcripts expressed from a genome.
TREMBL. Translated EMBL, a compilation of the EMBL DNA data library.
UniGene database. A public database, maintained by NCBI, which brings together sets of GenBank sequences that represent the transcription products of distinct genes.
Unique clone. A sequence that has no match in GenBank or other public databases.
Unique singleton. A clone that does not cluster and has no match in the public databases.
UTR (untranslated region). Noncoding region found at the 5′ or 3′ termini of mRNA.
Untranslated sequences. Noncoding sequences found at the 5′ and 3′ termini of mRNA.
User. See “consumer.”
Overview of Assays:
SNP Genotyping Assays:
In some configurations, the present invention includes methods of providing investigators with assays useful for detecting the presence of SNP alleles as well as assays useful for detection or quantification of gene expression. The elucidation and cataloguing of the sequences of genomes of various species, particularly the human genome, including the identification in public and/or private databases of more than 4,000,000 SNPs distributed throughout the genome, as well as the identification and cataloguing of a significant fraction of the approximately 30,000 expressed genes, provides the basis for establishing a collection of validated assays for SNPs or gene expression. Such assays can provide investigators with analytical tools for investigating virtually any gene in a mapped genome.
In some configurations, SNP databases can be used to develop assays that provide an investigator with the ability to analyze samples for the presence of identified SNP alleles. Testing samples from a particular individual allows SNP genotyping of that individual. SNPs from public and/or private databases can be selected for assay development. A number of approaches can be used in constructing SNP databases that can be useful in SNP genotyping (for review of SNP databases, see McCarthy et al., Nat. Biotechnol. 18:505-508, 2000; Judson et al., Pharmacogenomics 3:379-391, 2002; Miller et al., Hum. Mol. Genet. 10:2195-2198, 2001). In certain aspects, a gene-based approach can be used. In a gene-based approach, SNPs can be selected that reside on “gene regions.” For example, a gene region comprising a 60 kb sequence including 10 kb upstream and 10 kb downstream from known functional sequences, can in certain instances have at least seven identified SNPs associated with it including at least one identified SNP that maps to a location approximately 10 kb upstream to its 5′-most cis-acting regulatory sequence, another identified SNP that maps to a location approximately 10 kb downstream to its 3′-most cis-acting regulatory or transcribed sequence, and at least 5 more identified SNPs mapping therebetween. Within the gene region, the SNPs can be selected such that they can be distributed across the gene region. As such, the selected SNPs can be located about 5 kb apart, about 10 kb apart, about 15 kb apart, or more, or at any selected separation distance between 5000 and 15000 bases, or at any selected separation distance without limitation. The availability of assays for SNP markers that can be spaced at intervals of approximately 10 kb for a gene affords an investigator the opportunity to obtain at least one SNP allele that can be used as a marker for the gene.
SNP markers can serve as markers for genotypes or haplotypes and can be of value in investigating gene structure, haplotype structure, inheritance studies, and the like.
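The spacing of SNPs across a gene region described above can be illustrated with a minimal sketch. It assumes a simple greedy scan from the 5′ end of the region; the function name `select_spaced_snps`, the 10 kb default, and the candidate coordinates are hypothetical and are not taken from the present disclosure.

```python
def select_spaced_snps(positions, region_start, region_end, min_spacing=10_000):
    """Greedily pick SNP positions inside [region_start, region_end] that are
    at least min_spacing bases apart, scanning from the 5' end of the region."""
    selected = []
    for pos in sorted(positions):
        if pos < region_start or pos > region_end:
            continue  # candidate lies outside the gene region
        if not selected or pos - selected[-1] >= min_spacing:
            selected.append(pos)
    return selected


# Hypothetical candidate SNP coordinates (in bases) within a 60 kb gene region.
candidates = [500, 4_000, 10_800, 12_000, 21_000, 29_500,
              31_500, 42_000, 55_000, 59_900]
print(select_spaced_snps(candidates, 0, 60_000))
# → [500, 10800, 21000, 31500, 42000, 55000]
```

A greedy scan is only one possible way to distribute markers; other separation distances can be chosen by changing `min_spacing`.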
In certain aspects of the present invention, the inventors have focused on the selection of “common” SNPs. The minimum percent of occurrence in a population or population subset depends upon the requirements of a particular test and can be selected to be, in certain instances, about 8%, about 10%, about 15%, about 20% or greater or any value therebetween, or any selected frequency without limitation, depending upon the assay requirements. Particular minimum percent of occurrence values that can be considered to be generally applicable can be in certain embodiments, about 10% and in other embodiments, about 15%. In certain configurations, known SNPs that have been cataloged in at least one database can be subjected to a triage procedure to produce a reduced set of SNPs. In addition, SNPs can be selected whose minor alleles were observed in at least two distinct donors. Unless a minor allele is reported in at least two individuals, a SNP may be eliminated from further consideration for inclusion in the set of SNPs. Sequences comprising the selected SNPs, as well as sequences upstream and downstream from the SNPs, can then be analyzed to determine their suitability for use in SNP assays. SNPs deemed non-usable for assay development can be eliminated from further consideration for inclusion in the set. In a subsequent triage step, semi-empirical design quality control (QC) criteria can be used to reduce the SNPs included in the set.
The following properties of a candidate SNP may be considered in determining whether a candidate SNP is selected for inclusion in the reduced set of SNPs: 1) the SNP maps within or close to an annotated gene in a gene library, for example, within one of about 30,000 Celera-annotated genes or within 10 kb of an annotated gene; and 2) the SNP is spaced with respect to nearest neighbor to provide at least three SNPs per gene on intervals between SNPs of about 10 kb. Remaining gaps greater than about 10 kb in a gene region can be filled with at least two unscreened SNPs per 10 kb.
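The triage criteria above (a minor allele observed in at least two distinct donors, and a location within or within about 10 kb of an annotated gene) can be sketched as a simple filter. The record type `CandidateSnp`, its field names, and the example coordinates are hypothetical illustrations, not a description of any actual database schema.

```python
from dataclasses import dataclass


@dataclass
class CandidateSnp:
    snp_id: str
    position: int             # genomic coordinate in bases
    minor_allele_donors: int  # distinct donors in which the minor allele was seen


def within_gene_window(position, gene_start, gene_end, flank=10_000):
    """True if the position lies inside the gene or within `flank` bases of it."""
    return gene_start - flank <= position <= gene_end + flank


def triage(candidates, gene_start, gene_end, min_donors=2, flank=10_000):
    """Keep candidates that meet both the donor-count and the location criteria."""
    return [
        c for c in candidates
        if c.minor_allele_donors >= min_donors
        and within_gene_window(c.position, gene_start, gene_end, flank)
    ]


# Hypothetical gene spanning 100,000 to 140,000 and four candidate SNPs.
snps = [
    CandidateSnp("A", 95_000, 3),   # kept: inside the 10 kb upstream flank
    CandidateSnp("B", 120_000, 1),  # dropped: minor allele seen in one donor only
    CandidateSnp("C", 155_000, 4),  # dropped: more than 10 kb downstream
    CandidateSnp("D", 150_000, 2),  # kept: exactly at the edge of the flank
]
print([s.snp_id for s in triage(snps, 100_000, 140_000)])
# → ['A', 'D']
```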
Assays that pass this selection procedure can then be validated, in some configurations, based upon laboratory genotyping results using a panel of genomic DNA from, for example, about 90 individuals. For example, the DNA panel comprises genomic DNA from about 90 individuals representing a subset population of Caucasian individuals and a subset population of individuals of African American ancestry. Selected SNPs have a minor allele frequency of at least 10% or at least 15% in at least one population, or any selected minor allele frequency without limitation.
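The minor allele frequency criterion can be made concrete with a short calculation. The sketch below assumes biallelic SNPs and diploid genotype calls written as two-letter strings; the function names, the even split of the panel into two groups of 45, and the genotype counts are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for diploid calls such as 'AG'.
    Assumes a biallelic SNP; returns (None, 0.0) if only one allele is seen."""
    counts = Counter(allele for call in genotypes for allele in call)
    if len(counts) < 2:
        return None, 0.0
    total = sum(counts.values())
    allele, n = min(counts.items(), key=lambda item: item[1])
    return allele, n / total


def passes_maf(genotypes_by_population, threshold=0.10):
    """True if the minor allele frequency reaches the threshold in at least one population."""
    return any(
        minor_allele_frequency(calls)[1] >= threshold
        for calls in genotypes_by_population.values()
    )


# Hypothetical panel of 90 individuals split into two populations of 45.
panel = {
    "population_1": ["AA"] * 40 + ["AG"] * 4 + ["GG"] * 1,   # G: 6 of 90 alleles
    "population_2": ["AA"] * 30 + ["AG"] * 12 + ["GG"] * 3,  # G: 18 of 90 alleles
}
print(minor_allele_frequency(panel["population_2"]))  # → ('G', 0.2)
print(passes_maf(panel))                              # → True
```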
SNP assays can include any SNP assay known in the art. Methods for SNP detection include, in non-limiting example, variations of the INVADER™ method of Third Wave Technologies, and the TaqMan® method. In some configurations, assays can be developed for use in a TaqMan® method for identifying a SNP allele in a target sequence. The TaqMan® method uses two primer oligonucleotides and a DNA polymerase for PCR sequence amplification, as well as one or two probe oligonucleotides. For SNP detection using a TaqMan® method, one primer oligonucleotide sequence maps to a site upstream from a target SNP sequence and a second primer oligonucleotide sequence maps to a site downstream from the target SNP sequence. A probe oligonucleotide sequence maps to the SNP, and comprises one allele of the target SNP, a reporter group moiety, which in some embodiments can be a fluorophore moiety, a fluorescence quencher moiety, which can be in some embodiments an NFQ moiety, and can also comprise an MGB moiety. In TaqMan® analyses using two probes, the second probe sequence also maps to the SNP, and comprises an alternative allele of the target SNP, a second reporter moiety (for example, a second fluorophore moiety), a fluorescence quencher, and can also comprise an MGB. When two probes are used, the fluorophores can be selected to be distinguishable by virtue of their absorption or emission spectra. In non-limiting example, the fluorophores VIC™ and FAM™ as provided in kits by Applied Biosystems can be used as reporter fluorophore moieties in a SNP assay. The probe can further comprise an MGB. An MGB increases the melting temperature of a probe/target hybrid without increasing probe length, thereby allowing shorter probes to be used (Afonina et al., Nucleic Acids Research 25: 2657-2660 (1997); Kutyavin et al., Nucleic Acids Research 28: 655-661 (2000)). In some configurations, the MGB moiety can be covalently attached to the 3′ end of the probe.
The structure of the MGB can be, in non-limiting example, a trimer of 1,2-dihydro-(3H)-pyrrolo[3,2-e]indole-7-carboxylate. This oligopeptide binds double-stranded DNA in the minor groove, with a high affinity for A-T-rich sequences in double stranded DNA. Because the presence of an MGB increases the stability of hybrid nucleic acids, oligonucleotide-MGB conjugates as short as 8-mers, or G-C-rich 6-mers are able to form stable hybrids with complementary sequences. These properties allow the use of probes as short as six nucleotides. MGBs furthermore increase the specificity of probe-target hybridization.
In the TaqMan® assay, each probe can be non-fluorescent or poorly fluorescent in spite of the presence of a fluorophore moiety, by virtue of the presence of the NFQ. However, during PCR amplification of the TaqMan® assay, a probe bound to a target SNP can be digested by the polymerase, because of the enzyme's 5′ exonuclease activity. Because the PCR conditions can be selected for high stringency hybridization, whereby a single nucleotide mismatch between probe and target does not permit stable hybridization, only probes perfectly complementary to the target are digested by the polymerase. Thus, if two probes representing alternative alleles of a SNP are used, only one probe will be subject to digestion by the polymerase. Because digestion of a probe releases a fluorophore from quenching by the quencher, measurement of the absorption or emission wavelength of a sample reveals which probe is digested by the polymerase, and hence, which SNP allele is present in the sample. Because SNPs can be heterozygous or homozygous, detection of absorption or emission spectra of one or both fluorophores in a sample during or following PCR amplification will reveal if the target sample is heterozygous or homozygous. Fluorophore released from a quenched probe can be quantified by any method known in the art. In some configurations, a fluorimeter can be used. In some configurations, the fluorimeter comprises a component of an integrated nucleic acid analysis system, in non-limiting example, an ABI PRISM® 7900HT Sequence Detection System.
In a SNP genotyping assay, two probes comprising identical sequences except for the SNP allele nucleotide, different fluorophores, and identical MGBs and NFQs can be used in various embodiments. For a biallelic SNP assay, any two spectrally distinguishable fluorophores for which the fluorescent signals can be quenched by the non-fluorescent quencher are used. In a non-limiting example, commercially available fluorophores, for example VIC™ and FAM™ from Applied Biosystems, can be used as probe labels in biallelic SNP genotyping.
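The way signals from two spectrally distinguishable fluorophores map to a genotype can be illustrated with a minimal sketch. It assumes background-normalized endpoint signals, a VIC™-labeled probe for allele 1, a FAM™-labeled probe for allele 2, and a single arbitrary threshold; these assignments and the function name `call_genotype` are hypothetical, and an actual instrument may use a different calling procedure.

```python
def call_genotype(vic_signal, fam_signal, threshold=1.0):
    """Classify a sample from two normalized fluorescence signals.
    Assumes VIC reports allele 1 and FAM reports allele 2 (illustrative only)."""
    vic_on = vic_signal >= threshold
    fam_on = fam_signal >= threshold
    if vic_on and fam_on:
        return "heterozygous"          # both probes were digested
    if vic_on:
        return "homozygous allele 1"   # only the allele 1 probe was digested
    if fam_on:
        return "homozygous allele 2"   # only the allele 2 probe was digested
    return "undetermined"              # neither signal rose above the threshold


print(call_genotype(2.4, 0.1))  # → homozygous allele 1
print(call_genotype(1.8, 1.9))  # → heterozygous
```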
In the design of an assay in various embodiments, at least one potential probe oligonucleotide sequence, as well as potential primer oligonucleotide sequences, can be analyzed in silico for suitability in a PCR assay. An in silico analysis of an oligonucleotide sequence can consider several criteria, such as, in non-limiting example, the predicted melting temperature of a duplex comprising the oligonucleotide sequence and its complement, the absence of significant self-complementarity (e.g., the absence of “hairpin loops”), the absence of significant complementarity with any other oligonucleotide expected to be used in the assay (e.g., “primer-primer dimerization”), and the absence of significant complementarity with a genomic sequence outside of the target site. In certain embodiments, a candidate oligonucleotide sequence can be validated by “blasting” against the genome, and a candidate sequence is selected for further development for use in an assay only if its sequence appears no more than once in the genome.
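Three of the in silico criteria named above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the melting temperature uses the Wallace rule (2 °C per A/T, 4 °C per G/C), which is a rough stand-in and not the prediction method of the present disclosure; the hairpin check looks only for short exact stems; and the uniqueness check is an exact string search on both strands rather than a BLAST search. All function names and default values are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(seq):
    """Rough melting temperature in deg C: 2 per A/T plus 4 per G/C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def has_hairpin(seq, stem=4, min_loop=3):
    """True if a stem-length window can pair with a downstream stretch
    of the same oligonucleotide, leaving at least min_loop unpaired bases."""
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        partner = reverse_complement(seq[i:i + stem])
        if partner in seq[i + stem + min_loop:]:
            return True
    return False


def count_occurrences(oligo, genome):
    """Count exact matches of the oligo on either strand of the genome string."""
    hits = 0
    for query in {oligo, reverse_complement(oligo)}:  # set avoids double-counting palindromes
        start = genome.find(query)
        while start != -1:
            hits += 1
            start = genome.find(query, start + 1)
    return hits


def passes_in_silico_checks(oligo, genome, tm_range=(50, 70)):
    """Combine the three checks; tm_range is an arbitrary illustrative window."""
    return (tm_range[0] <= wallace_tm(oligo) <= tm_range[1]
            and not has_hairpin(oligo)
            and count_occurrences(oligo, genome) == 1)


print(wallace_tm("ACGT"))                                # → 12
print(has_hairpin("GGGGAAACCCC"))                        # → True
print(count_occurrences("ACGGA", "TTACGGATTTCCGTTT"))    # → 2
```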
Following in silico validation, each oligonucleotide designed for an assay can be synthesized using organic synthesis methods known in the art. The synthesis of probe oligonucleotides also includes the covalent attachment of a reporter group, a fluorescence quencher, and a minor groove binder.
Gene Expression Assays:
In some configurations of the present invention, information in databases on expressed sequences can be used to develop assays that provide an investigator with the ability to analyze a sample for the presence and quantity of expressed RNA. In certain configurations, a method is provided that permits an investigator to obtain a validated assay to a known expressed gene. In some configurations, assays can be designed for measuring gene expression levels using reverse transcription coupled to the polymerase chain reaction (Reverse Transcription-Polymerase Chain Reaction, RT-PCR) (Sambrook et al., 2d Edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (1989)). In these configurations, primer or probe oligonucleotides comprising DNA sequences corresponding to mRNA sequences (or the complement thereof for a “reverse” primer sequence) can be designed and validated. In some configurations, at least one probe or primer spans an exon-exon boundary within a target mRNA (or cDNA) sequence to diminish any contribution from genomic nucleic acids.
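Whether a candidate probe or primer spans an exon-exon boundary can be checked from exon lengths alone. The sketch below assumes transcript coordinates that start at 0, and it requires a minimum number of bases on each side of the junction; that overhang requirement, the function names, and the exon lengths are hypothetical.

```python
from itertools import accumulate


def junction_positions(exon_lengths):
    """Transcript coordinates (0-based) of the first base of each exon after the first."""
    return list(accumulate(exon_lengths))[:-1]


def spans_junction(oligo_start, oligo_length, exon_lengths, min_overhang=3):
    """True if the oligo covers at least min_overhang bases on both sides
    of some exon-exon junction in the spliced transcript."""
    oligo_end = oligo_start + oligo_length  # exclusive end
    return any(
        oligo_start + min_overhang <= junction <= oligo_end - min_overhang
        for junction in junction_positions(exon_lengths)
    )


# Hypothetical transcript with three exons of 120, 85, and 200 bases.
exons = [120, 85, 200]
print(junction_positions(exons))        # → [120, 205]
print(spans_junction(110, 20, exons))   # → True  (covers bases 110 to 129)
print(spans_junction(125, 20, exons))   # → False (lies entirely within exon 2)
```

Because intron sequence separates the two exons in genomic DNA, an oligonucleotide that passes this check would not be expected to find a contiguous match on a genomic template.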
Once a target expressed gene has been determined or designated, gene expression can be detected and quantified by the investigator using an assay designed using any of a number of methods. Thus, in some configurations, assays can be developed for use in an RT-PCR analysis using the TaqMan® method for quantifying a PCR-amplified cDNA of a target expressed mRNA. A TaqMan® gene expression assay utilizes a pair of oligonucleotide primers for PCR, as well as a probe oligonucleotide. The primer oligonucleotides hybridize to different sites within a double-stranded cDNA of an mRNA, in opposite orientations. The probe oligonucleotide comprises a sequence that hybridizes to a site between the primer hybridization sites. The hybridization stringency conditions can be selected such that at least one of the probe and primer oligonucleotides hybridizes uniquely to the genome. In some configurations, at least one of the probe and primer oligonucleotides comprises a sequence that spans an exon-exon boundary, in order to minimize spurious signal generated by contaminating genomic DNA acting as template. In some configurations, the probe comprises a sequence that spans an exon-exon boundary. The probe oligonucleotide further comprises a reporter moiety, in some configurations a fluorophore, as well as a fluorescence quencher, in some configurations an NFQ. Any fluorophore which can be subject to quenching by a quencher may be used as the reporter moiety. In non-limiting example, the fluorophore VIC™, as provided in kits by Applied Biosystems, can be used as a reporter fluorophore moiety in an RT-PCR gene expression assay. The probe can further comprise an MGB. In some configurations, the MGB moiety can be covalently attached to the 3′ end of the probe. The structure of the MGB can be, in non-limiting example, a trimer of 1,2-dihydro-(3H)-pyrrolo[3,2-e]indole-7-carboxylate.
Because the presence of an MGB increases the stability of hybrid nucleic acids, oligonucleotide-MGB conjugates as short as 8-mers, or G-C-rich 6-mers, are able to form stable hybrids with complementary sequences, and therefore allow the use of probes as short as six nucleotides. MGBs furthermore increase the specificity of probe-target hybridization.
Either a one-step or two-step process configuration can be used to analyze a sample for the presence or quantity of an RNA. In some configurations, a one-step process configuration can be used to detect and quantify an mRNA. In one-step process configurations, a thermostable polymerase that exhibits reverse transcription, DNA synthesis utilizing a DNA template, and 5′-to-3′ exonuclease activity, in non-limiting example recombinant Thermus thermophilus DNA polymerase (rTth polymerase), can be used in a TaqMan® analysis. Because rTth polymerase exhibits all enzyme activities involving nucleic acids needed for an RT-PCR expression analysis, an assay can be provided to an investigator comprising all of the components for an RT-PCR analysis except for the target sample. Thus, following an investigator's request, a pre-validated assay can be sent to the investigator as a mixture in a single tube. The investigator need only add a target sample to the mixture, then subject the mixture to a standard thermal cycling protocol. In certain alternative configurations, the oligonucleotides of an assay can be supplied in a single tube, and the buffers, salts, and thermostable polymerase can be supplied separately. As a result of thermal cycling, fluorophore can be released if probes and primers are hybridized to a cDNA target. Measurement of released fluorophore provides a quantifiable signal, wherein fluorescence intensity can be monotonically related to RNA concentration in the target sample. Fluorophore released from a quenched probe can be quantified by any method known in the art. In some configurations, a fluorimeter can be used. In some configurations, the fluorimeter comprises a component of an integrated nucleic acid analysis system, in non-limiting example, an ABI PRISM® 7900HT Sequence Detection System.
In yet other, “two-step” RT-PCR analysis configurations, reverse transcription and PCR amplification can be conducted separately. Reverse transcription can be catalyzed using a reverse transcriptase, such as, in non-limiting example, a reverse transcriptase from Avian Myeloblastosis Virus or Moloney Murine Leukemia Virus. Second-strand synthesis, and amplification of cDNA can be subsequently effected in a second step using a DNA polymerase, such as, in non-limiting example, a heat-stable polymerase such as Taq polymerase. The Taq polymerase can be, in some configurations, a Taq polymerase that can be supplied complexed with a heat-denaturable blocking agent, for example, an antibody directed against the Taq polymerase, in order to prevent elongation of an oligonucleotide prior to an initial heat denaturation step at the start of a thermal cycling protocol.
In both SNP and gene expression assays, the assays can be run under uniform conditions to allow high-throughput analyses of samples. High-throughput capability lends itself to automation and robotics, wherein hundreds or thousands of individual gene expression analyses can be conducted within a single day. For example, 384 samples can be analyzed simultaneously by setting up 384 separate RT-PCR assays on a single 384-well tray, and conducting the reactions in a single thermal cycler apparatus. Robotics can be used to facilitate the rapid and accurate handling of the samples.
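Setting up 384 reactions on a single tray can be illustrated by mapping sample indices to well identifiers. The sketch assumes the common 16-row by 24-column layout (rows A through P, columns 1 through 24) and a row-by-row fill order; the fill order and the function name `well_for_sample` are hypothetical choices.

```python
import string

ROWS = string.ascii_uppercase[:16]  # rows A through P
COLUMNS = 24                        # 16 x 24 = 384 wells


def well_for_sample(index):
    """Map a 0-based sample index to a well label, filling row by row."""
    if not 0 <= index < len(ROWS) * COLUMNS:
        raise ValueError("index must be between 0 and 383")
    row, column = divmod(index, COLUMNS)
    return f"{ROWS[row]}{column + 1}"


print(well_for_sample(0))    # → A1
print(well_for_sample(24))   # → B1
print(well_for_sample(383))  # → P24
```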
In various configurations, the invention includes provision of an assay for analysis of a SNP or an expressed gene using PCR or a variant or modification thereof. Variations of PCR include, for example, the TaqMan® assay, in which a pair of primer oligonucleotides and at least one probe oligonucleotide can be hybridized to a target nucleic acid. The DNA polymerase, in particular a heat stable DNA polymerase such as a Taq polymerase, catalyzes the hydrolysis of the probe as a result of the polymerase's 5′ to 3′ exonuclease activity. If the probe comprises both a fluorophore moiety as a reporter group and a quencher moiety, such as a non-fluorescent quencher, hydrolysis of the probe results in separation of the fluorophore and the quencher, leading to an increase in the fluorescent signal obtainable from the reporter group.
Various configurations of the present invention make available to an investigator a system for obtaining validated assays and protocols for studying SNPs and their connection with disease or conditions as well as for studying the expression of genes. The assays can be made available in large number and in a standard format for performing tests involving SNPs or gene expression. The present invention also provides a system for rapid development of new assays, which can be based upon a specified target sequence or gene region provided by the investigator.
In some configurations of the present invention, stock gene expression products can be off-the-shelf quantitative gene expression assays that have been built on the 5′ nuclease chemistry and that have been designed utilizing a bioinformatics pipeline that performs BLAST and other sequence analysis using, for example, either public or private data. An example of a database suitable for use with some configurations of the present invention can be the Celera Discovery System (CDS™), which is an example of a gene exploration system 19 (see
In contrast, in configurations supplying custom assays, requesters can perform upfront BLAST or sequence analysis themselves, if desired, and then provide a target sequence and desired location or locations of a TaqMan® MGB probe to the supplier. Configurations of the present invention then utilize a suitable program such as, for example, Primer Express or a modified version thereof (which may, for example, execute in batch mode) to design the TaqMan® MGB probe and primer set. The primers and probes can be quality control-tested by the supplier and then formulated into a single-tube mix having, for example, concentrations of 20×, 60×, or other concentrations. In some configurations, requesters may select a concentration by ordering specific part numbers. The supplier supplies requesters with primer and TaqMan® MGB probe sequences.
Some configurations of the present invention provide both “custom” and “stock” options, and provide one or more predesigned, preformulated, quality control-tested assays in a single tube.
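The practical meaning of a 20× or 60× single-tube mix can be shown with a short dilution calculation. The reaction volumes below are hypothetical examples, and the function name `assay_mix_volume` is illustrative; the arithmetic is simply final concentration times reaction volume divided by stock concentration.

```python
def assay_mix_volume(reaction_volume_ul, stock_concentration_x, final_concentration_x=1.0):
    """Volume of concentrated assay mix (in microliters) needed to reach the
    final concentration in a reaction of the given total volume."""
    if stock_concentration_x <= 0 or final_concentration_x > stock_concentration_x:
        raise ValueError("stock must be positive and at least as concentrated as the final mix")
    return reaction_volume_ul * final_concentration_x / stock_concentration_x


print(assay_mix_volume(20, 20))  # → 1.0  (20× mix in a 20 µL reaction)
print(assay_mix_volume(30, 60))  # → 0.5  (60× mix in a 30 µL reaction)
```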
Web Based Portal System
According to various aspects of the system disclosed herein, the user may be able to use a web based portal to order products associated with conducting assays. The web based portal may be used to order custom assays and/or stock assays. In this regard, the user may initially navigate to the portal as shown in block 10 of
Depending upon the type of assay which the user desires, the processing may differ. For example, if the user desires to obtain a custom assay, the system proceeds to obtain from the user information which may be useful to deliver the custom assay to the user as indicated by block 14. Similarly, if the user desires to obtain an assay for gene expression experimentation, the system proceeds to obtain the information which may be useful to generate such an assay as represented by block 16. In addition, if the user desires to obtain an assay for SNP genotyping, the system proceeds to collect information useful to providing such an assay as represented by block 18. Further, the user may desire to use the gene exploration system as indicated by block 19. The gene exploration system will be described below.
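The branching described above can be pictured as a lookup from the type of request to the block that handles it. This is only a schematic sketch: the block numbers come from the description, while the enumeration, the string keys, and the function name `route` are hypothetical and do not describe any actual portal implementation.

```python
from enum import Enum


class Request(Enum):
    CUSTOM_ASSAY = "custom"
    GENE_EXPRESSION_ASSAY = "gene_expression"
    SNP_GENOTYPING_ASSAY = "snp_genotyping"
    GENE_EXPLORATION = "gene_exploration"


# Block numbers as referenced in the description; the mapping itself is illustrative.
BLOCK_FOR_REQUEST = {
    Request.CUSTOM_ASSAY: 14,
    Request.GENE_EXPRESSION_ASSAY: 16,
    Request.SNP_GENOTYPING_ASSAY: 18,
    Request.GENE_EXPLORATION: 19,
}


def route(request_type):
    """Return the block that handles the given request type.
    Raises ValueError for an unknown request type."""
    return BLOCK_FOR_REQUEST[Request(request_type)]


print(route("custom"))          # → 14
print(route("snp_genotyping"))  # → 18
```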
Gene Exploration System:
Some configurations of the present invention provide a gene exploration system or platform 19 that allows the user to perform in silico research which can assist the user in the process of assay selection. Gene exploration system 19 can be accessed directly from the portal 10 or from selection screens from custom assay and/or stock assay blocks 14, 16, and 18. For example, if a user has entered a custom or stock assay screen and wants to obtain further genomic information about a given assay, or if a user decides to perform further research prior to ordering a gene, an appropriate entry link to the gene exploration system can be accessed.
Gene exploration platform 19 can provide access to a set of genomic and biomedical data from public and/or private sources. Some configurations provide integrated access to such data from Celera, GenBank, and other public and private data sources. Computational tools can also be provided to facilitate the viewing and analyzing of gene structure and function, genome structure and physical maps, and/or proteins classified by family, function, process, and/or cellular location. An intuitive user interface can be provided that organizes information for easy navigation and analysis.
In certain configurations the gene exploration system, block 19, can provide the user with a link to a genome navigation page such as that illustrated in
Protein classification option allows the user to browse and/or search one or more protein information databases. Database capabilities may include, for example, browsing and text searching Celera PANTHER™ families and gene ontology classification data.
The pharmacogenomics option available in some configurations can provide the user with the ability to search against one or more SNP databases, for example, the Celera Human SNP Reference Data database.
A navigation bar can be provided in some configurations of the present invention. The navigation bar provides access to one or more features, such as a biomolecule library; a text search (allowing the user to launch sequence analysis applications); a sequence analysis (allowing the user to launch sequence analysis applications); a workspace (allowing a user to start a new session and delete, rename, import, and/or export sessions, and/or select queries to delete and/or link with other queries, and perform complex queries); a queue display (permitting the user to display the status of his or her sequence analysis jobs and to retrieve the results); an options display (providing, for example, a display of user account information and/or display options); online help; and logoff. Some configurations can limit the number of sessions allowed to a user.
Some configurations of the present invention can provide a research facility based upon genome assembly and annotation data from one or more public and/or private databases. One or more of these databases may be Celera databases, from which chromosome map reports, scaffold reports, sequence reports, gene lists, chromosome map displays and/or biomolecule reports are available.
A representative example of a chromosome map report as provided in some configurations of the present invention is shown in
In some configurations of the present invention, to retrieve a chromosome map display, a user first searches a genome assembly (for example, the Celera genome assembly) to retrieve all scaffolds on a single chromosome. The user then clicks on a link to a chromosome map report from a scaffold report. A representative example of a scaffold report as provided in some configurations of the present invention is shown in
In some configurations of the present invention, sequence reports can be provided. A representative example of a sequence report as provided by some configurations of the present invention is shown in
Various configurations of the present invention make gene lists available to users. A representative example of a gene list as provided in some configurations of the present invention is shown in
Gene list information in some configurations can include, for example, one or more of the following items: gene ID, transcript ID, protein ID, gene name (if assigned), gene symbol (if assigned), gene alias (if assigned), reference sequence ID (if present).
Some configurations of the present invention provide a chromosome map display, as shown in
In some configurations a biomolecule report is provided, as illustrated in
The mRNA view (a representative example of which is shown in
The chromosome view (of which a representative example is shown in
In some configurations of the present invention, a human gene mutation database (HGMD) report can be provided, as shown in
Some configurations of gene exploration platform 19 allow navigation of a genome by searching a genome map and/or by searching a genome assembly. For example, to search by chromosome number, some configurations allow a user to click on a “genome map” link (shown in
In some configurations, the user can search by gene ID, gene symbol, and/or RefSeq ID. To do so from the web page shown in
Some configurations permit a user to perform a search by cytogenetic band. In some of these configurations, the user can be presented with a “Search Genome Maps” page such as that shown in
Some configurations permit a user to search by position on a chromosome. For example, in the “Search Genome Maps” page shown in
Some configurations allow a user to search for STS markers from, e.g., a radiation hybrid database (RHdb) or a database of sequence tagged sites (dbSTS). To search for a region bounded by two markers in some configurations, a user clicks on “genome maps” (see
Some configurations of the present invention allow a user to search for a region between two BACs. For example, in the “Search Genome Maps” page shown in
Some configurations of the present invention provide a capability that allows a user to search a genome assembly by chromosome number or by genome assembly number to retrieve one or more of the following: a chromosome map report that can display all scaffolds on a single chromosome; a scaffold report that can display all genomic assembly segments associated with a single scaffold; and/or a sequence report that can display a single genomic assembly sequence segment.
For example, in some configurations, to retrieve a list of all scaffolds on a single chromosome, the user can search by chromosome number to generate a chromosome map report by clicking on “genome assembly” on a page such as that illustrated in
Some configurations allow the user to search by genome assembly number to generate a scaffold report. For example, in the “search genome assembly” web page of
Some configurations allow the user to search by genome assembly number segment to generate a sequence report. For example, in the “search genome assembly” of
Some configurations of the present invention provide the user with the capability of finding genes by Panther families protein classification. Thus, some configurations provide a Panther protein function-family browser, which allows a user to perform one or more of the following: browse functional categories and protein families/subfamilies; text search functional categories or protein families/subfamilies; create a gene list; view the Panther tree for a given family; view the Panther multiple sequence alignment (MSA) for a given family; and/or view the Panther “Partial” MSA for a given family.
In some configurations, a Panther protein function-family browser can be made available when the user clicks on “Panther families” on the web page illustrated in
In various configurations, the browser may also provide facilities for accepting text searches (for example, the user might search for the text “kinase”), so that folders can be opened and categories containing the search term can be made visible (and can be highlighted, in some configurations). Some configurations also provide a sub-family search.
Some configurations of the present invention provide a Panther gene list. For example, a user can browse or text search to select desired protein families/subfamilies in the families panel, and go to a gene list listing all proteins assigned to the selected families/subfamilies. Various sorting and modification options can be provided, and export facilities can be provided (e.g., exporting the list to the user's local disk in a format suitable for other uses).
A Panther tree viewer can be provided in some configurations of the present invention. Panther distance trees allow users to explore the relationships between sequences in a particular family, and may also show some of the information used to annotate the families and subfamilies. In various exemplary configurations, the tree viewer has two panels that can be mapped to each other. One panel graphically displays the relationship between the different sequences. An attribute table contains one row for each sequence in the tree, and each column displays a different attribute of the sequence, such as the GenBank accession number for the sequence; the brief definition line parsed out of, for example, a SwissProt or GenBank record; the organism from which the sequence was derived; and/or links to open relevant abstracts from PubMed. In some configurations, the page also links to MSA views, and/or highlights selected subfamilies.
Some configurations also provide the user with a Panther MSA viewer. This viewer can be useful because Panther MSAs are used in producing Panther distance trees, and therefore, the family/subfamily classification. In some configurations, there can be two viewer modes: full MSA, which can include all publicly available sequences in the family that are related closely enough to produce an informative multiple alignment; and partial MSA, which shows the alignment only for the currently selected subfamilies. In some configurations, the MSA view can be divided into subfamilies in the same ordering as in the tree, so that the most closely related sequences appear closest to one another in the alignment. Also, some configurations of MSA viewers have two panels: an information panel, and an MSA panel. The information panel can contain information about each subfamily and sequence. This information may include hyperlinks to more detailed information. The MSA panel can display the multi-sequence alignment, which can be generated by aligning the sequences to the family hidden Markov model (HMM).
A Panther HMM alignment view can be provided. This view shows the query sequence aligned to the consensus sequence for the HMM. Also, a Panther family/subfamily hits view can be provided that shows all the Panther family/subfamily HMMs that hit a query sequence with a score better than a certain threshold.
In some configurations, certain genes (e.g., Celera genes) can be found by gene ontology protein classification. These configurations may provide either or both of a text search or a “drill down” search, for example.
To perform a text search in some configurations, a user clicks on “gene ontology” on the page illustrated in
In some configurations, a user may drill down gene ontology classifications. For example, in some configurations, from the “ontology” page of
In some configurations, GenBank human nucleotide sequences can be mapped to the human genome assembly (e.g., the Celera human genome assembly) using a combination of BLASTN and a modified version of the SIM4 algorithm. Also, some configurations map public sequences using the categories of repetitive hits (e.g., a sequence that maps to more than 10 locations on the genome), orphans (e.g., a sequence that fails to map to the genome), and best hit (e.g., if a sequence maps to between 2 and 10 locations, an attempt can be made to identify the best mapping).
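As a non-limiting illustration, the mapping categories described above can be sketched in Python. The function name and the "unique" label for a sequence that maps to exactly one location are assumptions introduced here for illustration; only the orphan, best hit, and repetitive thresholds come from the description.

```python
def classify_mapping(num_locations: int) -> str:
    """Assign a mapping category from the number of genome locations hit.

    Thresholds follow the description: 0 locations is an orphan,
    2 to 10 locations is a best-hit candidate, more than 10 is repetitive.
    The "unique" label for exactly one location is an assumption.
    """
    if num_locations < 0:
        raise ValueError("number of locations cannot be negative")
    if num_locations == 0:
        return "orphan"
    if num_locations == 1:
        return "unique"
    if num_locations <= 10:
        return "best_hit_candidate"
    return "repetitive"
```

A sequence placed in the best-hit category would still require a separate step to choose among its candidate locations; that step is not sketched here.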
Some configurations of the present invention provide browsing capabilities that permit a user to map public IDs (e.g., GenBank accession) to a human genome project (e.g., the Celera human genome) by searching a mapping database. In some configurations, for example, an ID mapper provides searching capabilities for one or more mapping databases, which may include GenBank DNA, GenBank mRNA, dbEST, and/or RefSeq.
In some configurations, text searches of data may also be performed by a user. For example, both Celera and non-Celera data may be searched by text.
Various configurations of the present invention can also include facilities for performing sequence analysis. For example, one or more of the following protein analysis types may be provided and made accessible to the user's browser window: BLASTP; TBLASTN; TFASTA; FASTA; PSI-BLAST; and/or HMMPFAM. Also, one or more of the following nucleotide analysis types may be provided and made accessible to the user's browser window: BLASTN; BLASTX; and/or TBLASTX.
Some configurations of the present invention provide workspaces that allow a user to start a new session, delete an entire session and its results, delete selected query results, rename a session, import session results, export session results, copy query results from one session to a different session, and/or perform additional queries from existing queries. For example, results can be exported to the user's local hard disk memory and re-imported for use later.
In various configurations of the present invention and referring to
The web page displayed on consumer computer 26 may include various types of introductory and sales information, provide a login for authorized user/purchasers, and solicit the DNA (or RNA) sequence and other information, as is necessary or desirable. In some configurations, the initial web page can be one of several web pages provided by server 30 that interact with consumer 28 to obtain information. For example, in some configurations, the initial web page accessed by consumer 28 can be a corporate web site that provides information for consumer 28 as well as a form in which consumer 28 types identifying information using consumer computer 26. Distributor computer 22 receives the information entered by consumer 28 and sent by consumer computer 26 via computer network 24.
In some configurations, distributor computer 22 verifies the identity of consumer 28 and his or her qualifications to access a sales page and to purchase assays from the distributor. For example, this verification may be performed by a web application server 32 (for example, the IBM® WEBSPHERE® Application Server available from International Business Machines Corporation, Armonk N.Y.) running on distributor computer 22 with reference to a consumer database 34 of qualified consumers and consumer identifications. If consumer 28 cannot be verified or is not qualified to make a purchase, this information may be returned by web application server 32 and web page server 30 via computer network 24 to consumer 28, and consumer 28 will not be allowed to complete a purchase and/or to access additional information.
Upon obtaining information from consumer 28, various methods of the present invention provide, at 50, a forward primer sequence, a reverse primer sequence, and a probe sequence having specified characteristics.
The forward primer sequence and the reverse primer sequence together define an amplicon sequence. The amplicon lies within the target nucleic acid sequence. The probe sequence can be complementary to a portion of the amplicon sequence. Next, in various configurations, one or more of the forward primer sequence, the reverse primer sequence, and the probe sequence can be validated at 52, using, for example, a genome database such as database 40. Validation may include BLASTing of one or more of the sequences, as described above. At least one assay can be manufactured at 54. The manufactured assay comprises a forward primer in accordance with the forward primer sequence, a reverse primer in accordance with the reverse primer sequence, and a probe in accordance with the probe sequence. In some configurations, the forward primer sequence, the reverse primer sequence, and/or the probe sequence can be a validated sequence from 52. The assay can be shipped at 58 to consumer 28. Some configurations of the present invention ship the assay in a single tube format with a two-dimensional bar code. In some configurations, the probe in the manufactured assay comprises a fluorescence quencher. The fluorescence quencher can be a non-fluorescent dye. In some configurations, the fluorescence quencher can be configured to reduce background fluorescence and increase quenching efficiency. The assay itself can be suitable for use in a sequence detection system, such as, for example, a real-time PCR system.
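The relationship among the two primers, the amplicon, and the probe can be illustrated with a minimal Python sketch. It assumes exact string matching, a forward primer written as it appears on the target strand, and a reverse primer written as the reverse complement of the amplicon's 3′ end; the function names are hypothetical and this is not presented as the design method used by any configuration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_amplicon(target: str, forward: str, reverse: str):
    """Return the amplicon bounded by the two primers, or None if not found."""
    target = target.upper()
    start = target.find(forward.upper())
    if start < 0:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the target strand.
    reverse_site = reverse_complement(reverse)
    end = target.find(reverse_site, start + len(forward))
    if end < 0:
        return None
    return target[start:end + len(reverse_site)]


def probe_matches(amplicon: str, probe: str) -> bool:
    """True if the probe matches either strand of the amplicon."""
    probe = probe.upper()
    return probe in amplicon or reverse_complement(probe) in amplicon
```

For example, with target `TTTACGTACGGGGCCCCAATTGGTTT`, forward primer `ACGTAC`, and reverse primer `CCAATT`, the sketch returns the amplicon `ACGTACGGGGCCCCAATTGG`.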
Some configurations test, at 56, the manufactured forward primer, the manufactured reverse primer, and/or the manufactured probe before delivery to verify that the assay meets specified characteristics. Tests at 56 may include, for example, performing mass spectroscopy on the manufactured assay to determine that an oligonucleotide sequence is correct, and/or performing a functional test to determine that an amplification has occurred and at least one allelic discrimination can be confirmed.
According to the various embodiments, if the user selects to obtain a custom assay at block 14 as shown in
As shown in
With respect to target sequence length, in certain embodiments the length of the sequence can range from about 60 bases to about 5000 bases. However, larger and shorter sequences may also be used. Short sequences (e.g., fewer than 300 bases) may limit the number of potential assays that can be designed. For this reason, in some configurations, a sequence length of approximately 600 bases can be submitted, though increasing the sequence length may increase the number of possible assays. In addition, the sequence may be selected such that the target site can be directed towards the center of the submitted sequence.
In addition, a user can determine the quality of the sequence (62), e.g., to determine whether the sequence is unique in public databases when selecting the submission sequence. If there are similar versions of the sequence in a public database, how closely they agree can be a factor that can be used to determine the quality of the sequence. If other versions of the target sequence are different in public databases, it is possible to mask the ambiguous bases using N's as described below. Examples of databases with curated sequences include RefSeq, which contains mRNA sequences, and dbSNP which contains SNPs. The NCBI RefSeq project provides reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins.
When ambiguous bases are determined to exist, it may be desirable to annotate the submission sequence to avoid ambiguous bases in the regions of the sequence used for designing assays. When an ambiguous base occurs, the ambiguous base may be substituted with an N. For example, if the lowercase bases in the sequence
are ambiguous, the lowercase bases can be substituted as follows:
It may be desirable to minimize the number of substitutions of ambiguous bases with Ns. This is because the system does not include Ns in the primer or probe, and therefore sequences with Ns reduce the number of available primers and probes from which to select the optimal assay. In addition, it may also be desirable not to have Ns that are too close to the target site. In this regard, it may be desirable not to have Ns within five bases of the target site when submitting sequences for gene expression assays, or within 2 bases of the target site when submitting sequences for SNP assays. It will be understood, however, that a larger or smaller separation between the target site and the location of Ns may be used.
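By way of a hedged example, the masking and spacing guidelines above can be sketched in Python. The sketch assumes that ambiguous bases are marked in lowercase, as in the example above, and that the target site is given as a 0-based index; the function names are hypothetical.

```python
def mask_ambiguous(seq: str) -> str:
    """Replace every lowercase (ambiguous) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


def n_too_close(seq: str, target_index: int, min_distance: int) -> bool:
    """True if an N lies within min_distance bases of the target site.

    target_index is 0-based. Per the guidelines above, min_distance might be
    5 for gene expression submissions and 2 for SNP submissions.
    """
    low = max(0, target_index - min_distance)
    high = min(len(seq), target_index + min_distance + 1)
    return any(seq[i] == "N" for i in range(low, high) if i != target_index)
```

A submission tool might call `mask_ambiguous` first and then reject or flag any sequence for which `n_too_close` returns true.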
In various configurations, the user can, if desired, further assess the quality of the sequence (62) by determining whether unique primers and probes can be generated for the specific sequence. Various methods may be used to determine whether unique primers and probes may be manufactured. In one non-limiting example, a BLAST search tool can be used to determine whether a unique primer and probe can be generated for a DNA sequence, since such a tool can be useful for determining the uniqueness of the target region. A BLAST search can be performed using either the entire target sequence or a portion thereof, e.g., 50 bp upstream and downstream from a SNP. The BLAST search can detect regions with sequence similarities and repetitive elements.
After the sequence has undergone a BLAST search, the sequence can be run through a program such as Repeat Masker to detect common repetitive elements. Repeat Masker may be found at http://repeatmasker.genome.washington.edu. If many regions with similar sequences are located after running a program such as Repeat Masker, a filter may be used to limit the number of regions with similar sequences. For example, it may be useful to limit the search to human genomic DNA for SNPs or mRNA/cDNA for gene expression. It will be noted that the BLAT server at the University of California, Santa Cruz, located at http://genome.ucsc.edu/goldenPath/octTracks.html, carries out searches using assembled genomic sequence.
In another non-limiting example a user can assess whether useful probes and primers can be manufactured for a gene expression assay by performing a BLAST search of a target region which encompasses an exon-exon boundary. N's can be substituted for small regions of repeats, SNPs, and ambiguous sequences. If the target region is found not to be unique, a different exon-exon boundary can be selected and a BLAST search performed on a target region which encompasses the alternate exon-exon boundary.
After the target sequence has been selected but before the submission file is prepared, the sequence data again can be reviewed to determine whether sequence problems may cause failure in the assay design. As discussed above, these problems may occur if the sequences are too short, confidence in the sequence is low, the sequence is of poor quality, there are masked bases, too many Ns limit the design, or there are Ns too close to the target site for the probe. Each of these issues is discussed above.
After the user selects the target sequence at block 60, and assesses the quality of the sequence at block 62, the user can prepare the submission file which includes the relevant information for ordering the assay, such as the target sequence data from which the primers and probes can be designed. As a design choice, programs utilized in configurations of the present invention may impose formatting requirements on input data to simplify parsing of the input data. For example, a submission file in some configurations can contain a header line and one sequence record for each assay, and some configurations may require the submission file to be formatted in this manner. An example of a submission file for a SNP assay (showing SEQ ID NO: 1, SEQ ID NO: 3, and SEQ ID NO: 4) with the header line and sequence records formatting according to exemplary formatting requirements can be as follows:
Similarly, an example of a submission file for a gene expression assay, including a header line and sequence records, can be as follows (showing SEQ ID NO: 2, and SEQ ID NO: 5):
According to various embodiments of the present invention, user 28 may prepare the submission file manually. Alternatively, user 28 may use a file builder program (described below) which queries user 28 for relevant information, automatically constructs the sequence file, and allows user 28 to upload the sequence file through the portal. As shown in
Manual Preparation of Submission File:
If user 28 selects to prepare the submission file manually at 64, then user 28 prepares the submission file without using the file builder program. The structure of the submission file will now be described. The contents of the submission file may vary depending on whether the assay being designed is to be used for creating a SNP genotyping assay or an assay which will be used for gene expression.
As discussed above, the submission file may contain two components: a header line and one or more sequence records. The header line contains information regarding the individual ordering the assay, and may have the same contents if user 28 orders one or more SNP genotyping assays or one or more gene expression assays. The header line of a submission file may contain one or more of the following fields: a greater-than (>) symbol (or another symbol or token that can serve to identify the line as a header line of a submission file), a name field, a telephone number field, and a part number field. In some configurations, this formatting may be imposed as a requirement. In addition, also as a design choice, some configurations limit the header line to no more than 255 characters. The orientation of these fields in the header line is as shown in the
To create a header consistent with these formatting conventions, a standard text editor such as Microsoft® Notepad can be used. A greater-than symbol (>) can be entered as the first character, followed by the contact name and phone number. A part number can then be entered which is used to select the parameters of the resulting assay. In some configurations, part numbers can be assigned by the supplier that indicate a type of assay and a scale of synthesis. The supplier may, but need not, require separate submission files for each requested assay. In some configurations, SNP human assays, SNP non-human assays, and gene expression assays can be assigned different part numbers. Also in some configurations, different part numbers can also be assigned according to the scale of the assay. As a non-limiting example, in some configurations in which SNP human assays, SNP non-human assays, and gene expression assays are each supplied in three different scales, a total of nine different part numbers can be used.
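A header line following these conventions could be assembled programmatically, as in the Python sketch below. The field separator (a single space by default), the function name, and the placeholder part number used in the usage note are assumptions made for illustration; the description specifies only the leading greater-than symbol, the order of the fields, and the 255-character limit.

```python
MAX_HEADER_LENGTH = 255  # limit imposed by some configurations


def build_header(name: str, phone: str, part_number: str, sep: str = " ") -> str:
    """Build a header line: '>' followed by name, phone, and part number.

    The separator between fields is an assumption; adjust it to match the
    supplier's actual formatting requirements.
    """
    header = ">" + sep.join([name, phone, part_number])
    if len(header) > MAX_HEADER_LENGTH:
        raise ValueError("header line exceeds 255 characters")
    return header
```

For instance, `build_header("Jane Doe", "555-0100", "PN-EXAMPLE")` yields `>Jane Doe 555-0100 PN-EXAMPLE`, where `PN-EXAMPLE` is a placeholder rather than a real part number.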
A non-limiting example of part numbers and designations is shown in the tables reproduced below:
It will be noted that in this example there can be only one part number for each record. Accordingly, a separate submission file can be created for each assay type or each scale which is desired. A completed header line may be varied so long as the general rules here are satisfied.
The sequence record contains the sequence data for designing the primers and probes and may vary depending upon whether the assay being requested is a SNP assay or a gene expression assay. If the assay is a SNP assay, then the sequence record may have the following fields as shown in
Although other conventions may be used, configurations can be permitted to require that SNP target sites be indicated with square brackets around each site, with two sequences corresponding to the individual alleles separated by a forward slash. For example, ACAC[G/T]TCT can denote two alleles: ACACGTCT or ACACTTCT. Also, configurations can be permitted to require that indel target sites be indicated with square brackets around each site, and that, within the brackets, base(s) present be indicated, followed by a forward slash and an asterisk, wherein the asterisk indicates a deletion. For example, ACAC[GA/*]TC can denote two alleles: ACACGATC or ACACTC. It will be noted that the indel target sequence can in various embodiments contain 6 bases in addition to the insertion/deletion base or bases.
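The bracket notation described above can be expanded into its two alleles with a short Python sketch. The function name is hypothetical, and the sketch handles only the first bracketed site in a sequence, which is a simplifying assumption.

```python
import re

# [X/Y] marks a SNP; [X/*] marks an indel, where * denotes the deletion.
_SITE = re.compile(r"\[([ACGTN]+)/([ACGTN]+|\*)\]")


def expand_alleles(marked: str):
    """Return the two allele sequences encoded by the first bracketed site."""
    match = _SITE.search(marked)
    if match is None:
        raise ValueError("no bracketed target site found")
    left, right = marked[:match.start()], marked[match.end():]
    first, second = match.group(1), match.group(2)
    if second == "*":
        second = ""  # the asterisk stands for the deleted base(s)
    return left + first + right, left + second + right
```

Applied to the examples above, `expand_alleles("ACAC[G/T]TCT")` gives `("ACACGTCT", "ACACTTCT")` and `expand_alleles("ACAC[GA/*]TC")` gives `("ACACGATC", "ACACTC")`.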
Finally, the coordinate field identifies and names a marked target site. Configurations can be permitted to require that the target site be indicated in 5′ to 3′ order. Although other conventions may be used, configurations can be permitted to require that the coordinate field include the target site order position, an equal (=) sign, and an alphanumeric target site name of no more than four characters. Multiple coordinates may be specified in some configurations, and it can be permissible for these configurations to require that the coordinates be separated by commas without spaces. For example, in the sequence record shown in
In some configurations, only one assay will be synthesized for each record. The assay name associated with a particular assay that can be ultimately synthesized may be defined by the record name and coordinate. For example, in the sequence record shown in
As discussed above, configurations of the present invention can be permitted to require (or allow) that the format of a sequence record for a gene expression assay vary from the sequence record for a SNP genotyping assay. In this regard, the sequence record for a gene expression assay can include three fields: a record name field, a sequence field, and a coordinate field. The record name field may be a unique name that can be used to identify the sequence record. Configurations of the present invention can be permitted to impose restrictions on the unique name, for example, limiting it to no more than 10 characters. In the example of sequence record for a gene expression assay shown in
Sequence field information can be (and in some configurations, can be required to be) arranged in 5′ to 3′ orientation and it can be permitted in some configurations to limit the sequence field information to no more than about 5,000 characters. However, it is to be understood that the sequence field may have (or may be allowed to have) more than 5,000 characters in some configurations. By design choice, configurations may also require that there be no spaces or tabs between the characters, and that the only permissible characters be A, C, G, T, or N. Configurations can be permitted to automatically convert lowercase letters to uppercase, for example, for ease in processing.
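The sequence field rules above lend themselves to a simple check, sketched here in Python. The function name is hypothetical, and the 5,000-character limit is exposed as a parameter because the description notes that some configurations may allow longer sequences.

```python
def normalize_sequence(raw: str, max_length: int = 5000) -> str:
    """Uppercase a sequence field and check it against the rules above.

    Raises ValueError if the field contains anything other than
    A, C, G, T, or N (including spaces or tabs) or exceeds max_length.
    """
    seq = raw.upper()  # lowercase letters are converted, not rejected
    invalid = sorted(set(seq) - set("ACGTN"))
    if invalid:
        raise ValueError(f"invalid characters in sequence: {invalid!r}")
    if len(seq) > max_length:
        raise ValueError(f"sequence exceeds {max_length} characters")
    return seq
```

Note that a SNP sequence record carrying bracketed target sites would need the brackets handled before a check of this kind is applied.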
Although not required, at least one coordinate in the coordinate field of the sequence record can contain the target position, an “equals” sign, and a target site name for each site. It is permitted to require that the coordinate field contain no spaces, and that multiple sites be separated by commas. As discussed above, at least one coordinate can be required for each sequence record. If a specific target site is not present, multiple sites can be selected across the sequence.
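A coordinate field in the form described above (a position, an equals sign, and a site name, with multiple sites separated by commas and no spaces) can be parsed as in the following Python sketch. The function name and the example site names are hypothetical; the four-character name limit is the one stated for the coordinate field earlier in this description and is left adjustable.

```python
def parse_coordinates(field: str, max_name_length: int = 4):
    """Parse 'position=name[,position=name...]' into (position, name) pairs."""
    if not field or any(ch.isspace() for ch in field):
        raise ValueError("coordinate field must be non-empty and contain no spaces")
    sites = []
    for item in field.split(","):
        position_text, sep, name = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in coordinate {item!r}")
        if not (position_text.isascii() and position_text.isdigit()):
            raise ValueError(f"position is not a number in {item!r}")
        if not name.isalnum() or len(name) > max_name_length:
            raise ValueError(f"invalid target site name in {item!r}")
        sites.append((int(position_text), name))
    return sites
```

For example, `parse_coordinates("245=ex1,310=ex2")` returns `[(245, "ex1"), (310, "ex2")]`.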
When entering sequence records, the record name can be entered according to the guidelines set forth above. A single space or a tab may then be entered followed by the sequence data also according to the guidelines discussed above. Another space or tab can then be entered followed by the coordinate(s) also set forth above, then the enter key can be depressed. These steps can be repeated for each sequence record.
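The entry steps above amount to joining three fields with a space or tab, one record per line. A minimal Python sketch of writing and reading such a line follows; the function names are hypothetical, a tab is assumed as the default separator, and the parser tolerates any run of spaces or tabs between fields, which is more lenient than the single separator described.

```python
def format_record(name: str, sequence: str, coordinates: str, sep: str = "\t") -> str:
    """Join record name, sequence, and coordinate field into one record line."""
    return sep.join([name, sequence, coordinates])


def parse_record(line: str):
    """Split a record line into (name, sequence, coordinates)."""
    parts = line.split()  # splits on any whitespace, including tabs
    if len(parts) != 3:
        raise ValueError("expected record name, sequence, and coordinates")
    return tuple(parts)
```

A complete submission file could then be assembled by writing the header line followed by one such record line per assay.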
In some configurations, File>Save can be selected to save the file as a text (i.e., “.txt”) document. If Microsoft Notepad is being used on a Microsoft WINDOWS® 2000 operating system, ANSI encoding can be selected. Configurations of the present invention can be permitted to impose restrictions on the name selected for the saved file. For example, some configurations can require file names of no more than eight alphanumeric characters, and may require the extension .txt to be present.
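The example file name restriction above (no more than eight alphanumeric characters plus a .txt extension) can be checked with a regular expression, as in this Python sketch. The function name is hypothetical, and treating the extension as lowercase only is an assumption.

```python
import re

# One to eight alphanumeric characters followed by the .txt extension.
_FILENAME = re.compile(r"[A-Za-z0-9]{1,8}\.txt")


def is_valid_submission_filename(name: str) -> bool:
    """True if the name meets the example restriction described above."""
    return _FILENAME.fullmatch(name) is not None
```

Under these assumptions, `assay01.txt` passes, while `toolongname.txt` and `assay01.doc` do not.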
After the file has been saved, a further check may be performed to determine whether the submission file satisfies the format requirement set forth in
Once the submission file has been checked for errors and is ready for submission, an order message can be prepared as indicated by block 70. The order message contains order information which includes the submission file and the part number listed in the header of the submission file. If more than one submission file is being submitted, the submission file and the corresponding part number for each submission file can be present. In addition, the order message can include either a purchase order number or credit card information with the name as it appears on the card, the card number and the expiration date. The order message can also contain contact details such as the name, address, phone number, and e-mail address of a primary contact in case of difficulties with the submitted file. Shipping information can also be provided which can include identification of the person to receive shipment, for example, that person's name, address (including room number, building and department) and/or phone number. An invoice number and identification of a purchasing agent or person to receive invoice details may also be included, and such identification may include that person's name, address, e-mail address and/or phone number.
Once the submission files have been checked for errors, the submission file can be submitted to the system either by e-mail, by regular mail or by web access. If the order is to be sent by e-mail, the submission file can be attached to the order message and an indication of the processing can be placed in the subject line of the message. For example, the text “CA” may be placed in the subject line to indicate that the order can be processed as a custom assay. The e-mail message can then be sent to the facility conducting the design process.
If the order is being submitted by regular or express mail, a copy of the order message can be included. The submission file may be placed on a machine readable medium, for example, a 3.5 inch floppy disk or CD ROM in a format readable on (for example) Microsoft Windows operating systems. The order message and submission file can be then submitted to the service using the invention.
To assist user 28 in preparing a sequence for submission to the custom assay system, various embodiments of the present invention include a file builder program to prepare the submission file as represented by block 74. The file builder program can be used for submitting sequences for SNP genotyping assays and for submitting sequences for gene expression assays. File builder program configurations of the present invention can include a DNA sequence checker as well as a text editor to facilitate building, editing, and correcting new sequence submission files as well as validating imported ones. Once the submission files are created using the file builder program, the submission files can be uploaded over the Internet to the system for synthesis or otherwise submitted. A file builder program may be resident on consumer computer 26, or it may be a web-based application or resident on the host computer.
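A minimal sketch of what a DNA sequence checker of this kind might do is given below. The permitted alphabet (A, C, G, T and N) and the function name are assumptions made for illustration; the actual submission format may permit additional annotation characters.

```python
from typing import List, Tuple

# Assumed alphabet for illustration; not taken from the submission guidelines.
ALLOWED_BASES = frozenset("ACGTN")


def find_invalid_characters(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, character) for each character outside the alphabet."""
    return [
        (position, char)
        for position, char in enumerate(sequence.upper(), start=1)
        if char not in ALLOWED_BASES
    ]
```

An empty result indicates that no typographical errors of this kind were detected.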
Exemplary configurations of a file builder program will now be described in greater detail with reference to
In some configurations, user 28 can also select an option of viewing a file builder demonstration program at decision block 82. The file builder demonstration program shows how user 28 can complete the fields for preparing a submission file using the file builder program (as will be described below). In this regard, the file builder demonstration program provides step-by-step instructions regarding the use of the file builder program to format an assay request. If user 28 selects to view a file builder demonstration at decision block 82, the file builder demonstration program can be displayed for user 28 at block 84. As a design choice, some configurations of the file builder demonstration program may utilize Macromedia Flash. An exemplary window pane generated by the file builder demonstration program is shown in
User 28 may also select to view the submission guidelines for preparing the submission file as indicated by decision block 86. If user 28 selects to view the submission guidelines at decision block 86, the file builder program displays at block 88 a file containing the submission guidelines in a suitable display format, one example of which is portable document format (PDF). An exemplary window pane showing the submission guidelines displayed at block 88 is illustrated in
User 28 can also select to build a submission file at decision block 90. If user 28 selects to build a submission file at decision block 90, user 28 can be directed to a series of window panes at block 92 that allow user 28 to enter header line information of the type described above. In this regard, user 28 at block 100 of
After user 28 enters the relevant header file information at block 100, the file builder program requests entry at block 102 of a sequence name which can be the name given by user 28 to the specific sequence. In addition, user 28 can also be requested to provide a target sequence as indicated by block 104. Finally, user 28 also provides at block 106 the target coordinates. An exemplary window pane in which this information can be entered is shown in
After the sequence name, target sequence and target coordinates have been entered, user 28 is able to validate the sequence (i.e., check for formatting and typographical errors) at block 108. User 28 may instruct the file builder program to validate the sequence by clicking on a “validate” button, such as that shown in
If the file builder program detects typographical errors in the target sequence, some configurations generate a window pane that indicates to user 28 that typographical errors are present. In the example shown in
After the information from user 28 has been successfully validated, the information can be saved to disk as indicated by block 116 (see
By convention, files in some configurations can be saved with a file extension of “.txt”. After the file has been saved, user 28 is able to upload the submission file at block 118 to the system by clicking on or depressing an appropriate button. Before or after user 28 has requested sequence information be uploaded to the system, user 28 may be requested to provide appropriate identification and password information. Configurations of the present invention can be permitted to make such identification mandatory. A non-limiting example of a window pane requesting such identification information is shown in
In some configurations, if user 28 selects to process the order, the store provides stored contact and shipping information and asks that user 28 verify the information as well as provide any special instructions. User 28 can then verify payment information and place the order if all the information is correct.
Upon successful validation, oligo factory 42 accepts the order from consumer 28, manufactures at least one assay having components including a forward primer, a reverse primer and a probe and ships the manufactured assay to the consumer. The forward primer, reverse primer, and probe can be manufactured in accordance with the validated sequences.
In some configurations and referring to
In various configurations, input to assay design program 38 includes a parameter file 126 that specifies design rules and one or more sequence data files 124. Output includes a log file 132 that reports system settings and attributes describing each successful reagent design (including probe, primer, and amplicon sequences). Additional output indicating a system status can be reported to a display screen as the program is running, in some configurations.
Sequence input file 124 can contain formatted and annotated sequence data. Parameter file 126 can contain keyword-associated settings that govern rules and scoring applied during designs. Prior to attempting any designs, the format of supplied sequence data can be checked at 128 for errors. If errors are found at 130 in the sequence data from input file 124, they can be reported to an error log 140 and the process can be terminated.
In various configurations, assay design program 38 starts by parsing parameter file 126 to set up rules and scoring schemes. If initialization errors occur, they may be caused by conflicting options or incorrect file names or formats. If there are any errors encountered during the initialization phase, they can be reported to error log 140 and assay design program 38 can then stop. Following successful initialization, assay design program 38 sequentially attempts to design assay sets for each target site in each sequence listed in the input sequence data from parameter file 126. As designs are processed, they can be recorded in a design log file 132. Design attempts that fail can also be recorded in log file 132. Design failures can occur when no acceptable set of reagents satisfying all rules and scores can be found for a sequence target.
If, at 134, there are no valid designs present in design log file 132, this fact can be reported in error report 140. Otherwise, following the core design process, design log file 132 may be used to generate output sequence data in a number of different formats. Log pick program 136 can perform this post-processing of design log 132 data to produce formatted outputs 138. A script can be implemented utilizing the UNIX operating system to integrate the whole system by tying together all of the processes shown in
Separate design rules and constraints can be applied to potential probes, primers, and amplicons. All designs resulting from a given run share a common set of rules. Probe constraints include limits on size (i.e., probe length), Tm (target, minimum, and maximum temperatures), internal loops (total and contiguous matching bases in a “hairpin stem”), G+C content (i.e., combined G and C percentage), and runs of a given base, such as G. Analogous constraints can also be separately applied to primers, which have an additional limit on G+C at the 3′ end (5 bases) of the primers. Constraints applied to amplicons include length (including primers), G+C content, and the number of ambiguous bases (note that ambiguous bases are generally not allowed within probes or primers). In addition, the primers defining amplicons can be constrained to limit the maximal size of internal priming sites (i.e., the number of contiguous matching bases starting at the 3′ end of one primer that complements any part of the other primer).
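Several of the primer constraints listed above lend themselves to simple computations, sketched below for illustration only. The function names and all numeric limits are hypothetical placeholders rather than values used by assay design program 38; only the kinds of constraint (length, G+C content, runs of G, and G+C within the five bases at the 3′ end) are taken from the description.

```python
def gc_percent(seq):
    """Combined G and C content as a percentage of sequence length."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq, base):
    """Length of the longest uninterrupted run of the given base."""
    best = run = 0
    for current in seq.upper():
        run = run + 1 if current == base else 0
        best = max(best, run)
    return best


def three_prime_gc(seq, window=5):
    """Number of G or C bases within the last `window` bases (the 3' end)."""
    return sum(base in "GC" for base in seq.upper()[-window:])


def passes_primer_filters(seq, min_len=18, max_len=30, min_gc=30.0,
                          max_gc=80.0, max_g_run=3, max_3p_gc=2):
    """Apply the illustrative limits as pass/fail filters."""
    return (min_len <= len(seq) <= max_len
            and min_gc <= gc_percent(seq) <= max_gc
            and longest_run(seq, "G") <= max_g_run
            and three_prime_gc(seq) <= max_3p_gc)
```

Tm and hairpin constraints are omitted from this sketch because they depend on a thermodynamic model that is not specified here.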
For many of the constraints listed above, system 122 may apply either a filter or a score. When applied as a filter, a constraint will be either satisfied or not with the corresponding design being either accepted or rejected. When applied as a score, attributes may be given a graded value that reflects how “optimal” a given design is. For example, a design with all constrained attributes near optimum values will be favored over one with attributes deviating from the optimum values. Scoring provides finer tuning of the constraints that system 122 will use to evaluate and select designs.
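The difference between applying a constraint as a filter and applying it as a score can be sketched as follows, using Tm as the constrained attribute. The linear scoring function is an assumption chosen for simplicity; the description above does not state how graded values are computed.

```python
def tm_filter(tm, tm_min, tm_max):
    """Filter: the constraint is either satisfied or not."""
    return tm_min <= tm <= tm_max


def tm_score(tm, tm_target, tm_min, tm_max):
    """Score: 1.0 at the target, falling linearly to 0.0 at either limit.

    The linear shape is illustrative only.
    """
    if not tm_min <= tm <= tm_max:
        return 0.0
    span = (tm_target - tm_min) if tm < tm_target else (tm_max - tm_target)
    if span == 0:
        return 1.0 if tm == tm_target else 0.0
    return 1.0 - abs(tm - tm_target) / span
```

With a filter, designs at 59 and 60 degrees are treated identically when the limits are 58 and 62; with a score, the design at the target of 60 is favored.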
Logic flow representative of some configurations of assay design program 38 is shown in more detail in
Various configurations of assay design program 38 attempt to acceptably design assay sets for each target site at 156. These designs can be logged at 158. An attempt can be made to identify acceptable designs at 160 for each input sequence record from sequence data file 124. When records are exhausted at 152, assay design program 38 is done at 150. Otherwise, for each record, each target can be tried at 156 in the order listed. If no target information is supplied, the sequence midpoint (if the sequence contains no SNP annotations) or the first SNP (if annotated) can be used as a target. When no targets are left for a given record at 154, assay design program 38 progresses at 152 to the next record.
For target sites, some configurations of assay design program 38 identify, at 160, successful and unsuccessful designs, according to the design metrics and scoring metrics. If program 38 fails to design for a target, this fact along with the corresponding unsuccessful design can be reported to log file 132 and the program progresses at 154 to the next target associated with the record. If it succeeds in designing for a target, the details of the chosen record can be reported to log file 132. Normally, a single successful design causes assay design program 38 to move to the next record at 152. However, in some configurations, if an option to evaluate all targets listed for each record is enabled, assay design program 38 progresses, at 162, to the next target at 154 rather than to the next record at 152 following a successful design.
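The record and target iteration described above can be sketched as a pair of nested loops. The names run_designs, design_target and evaluate_all_targets are hypothetical; design_target stands in for procedure 156 and is assumed to return None on failure. Selection of a default target when none is supplied is omitted.

```python
def run_designs(records, design_target, evaluate_all_targets=False):
    """Try each target of each record in the order listed.

    records: iterable of (record_name, [targets]).
    design_target: callable returning a design, or None on failure.
    Returns a log of (record_name, target, design_or_None) entries.
    """
    log = []
    for name, targets in records:
        for target in targets:
            design = design_target(name, target)
            log.append((name, target, design))  # successes and failures are both logged
            if design is not None and not evaluate_all_targets:
                break  # first success moves on to the next record
    return log
```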
Representative logic for designing reagents for a simple target suitable for various configurations of procedure 156 is shown in more detail in
If no problems are encountered, placement of probes can be normally attempted next at 172, unless an option to design only primers is enabled at 170, in which case, execution continues at 176. (A primer-only option may be enabled, for example, by a command line option, such as “-op”.) Probe placement at 172 yields either one or two acceptable probes (non-SNP and SNP cases, respectively), or not. If acceptable probes are not identified at 174, target design process 156 fails at 188. Otherwise, bounds can be set for primers at 176.
In some configurations of the present invention, to set primer bounds at 176, three sub-regions within the design window can be defined. In cases in which probes are designed (i.e., cases in which the design is not limited to primers), a central mask region corresponding to coordinates of the probes can be defined. Bounds for the mask region may be explicitly designed relative to target site coordinates. For example, in some configurations, a command line option (such as “-pm”) can be used to specify that the mask region is to be designed relative to target site coordinates. In this case, the actual mask region can be the larger of the specified bounds or the mask formed by the probes. Fixing the central mask region determines the two sub-regions where primers may be designed. The “upstrand” sub-region begins at the start of the window and extends to the start of the mask region. The “downstrand” sub-region follows the mask and extends to the end of the window. The three sub-regions of the window (i.e., upstrand, mask, and downstrand) do not overlap.
With the upstrand and downstrand sub-regions determined, design procedure 156 attempts to collect a number of primers in each sub-region at 178. Forward primers can be taken from the upstrand region and reverse primers can be taken from the downstrand region. Potential primers can be evaluated at each nucleotide position starting from the coordinates closest to the mask (i.e., the end and start coordinates of the upstrand and downstrand regions, respectively). Such evaluation may, for example, determine whether a potential primer is acceptable according to standards known and recognized in the art. In some configurations, design procedure 156 collects up to ten forward and ten reverse primers, but by setting a command line option (such as “-np”), the limit of ten can be changed to another number.
If at least one potential forward primer and one potential reverse primer are not found at 180, design process 156 fails at 188. With two lists of primers, design process 156 next attempts to identify an acceptable forward/reverse pair at 182. If no acceptable primer pair is identified at 184, design process 156 fails at 188. Otherwise, a complete design has been found at 186.
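The partition of the design window into three non-overlapping sub-regions can be sketched as follows. Half-open (start, end) coordinates and the function name are assumptions, and “the larger of” the two masks is interpreted here as the smallest span covering both, which is one possible reading.

```python
def split_window(window, probe_span, requested_mask=None):
    """Split a design window into upstrand, mask and downstrand sub-regions.

    All spans are half-open (start, end) pairs on the same coordinate system.
    """
    w_start, w_end = window
    m_start, m_end = probe_span
    if requested_mask is not None:
        # Interpretation: take the span that covers both masks.
        m_start = min(m_start, requested_mask[0])
        m_end = max(m_end, requested_mask[1])
    # Keep the mask inside the window.
    m_start = max(m_start, w_start)
    m_end = min(m_end, w_end)
    return {
        "upstrand": (w_start, m_start),
        "mask": (m_start, m_end),
        "downstrand": (m_end, w_end),
    }
```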
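Primer collection outward from the mask can be sketched as below. For simplicity this sketch assumes a fixed primer length and reads all candidates from the same strand; a practical implementation would vary the length and reverse-complement the downstrand candidates. The function and parameter names are hypothetical, and is_acceptable stands in for whatever acceptance test is applied.

```python
def collect_primers(seq, region, nearest_mask_at_end, is_acceptable,
                    length=20, limit=10):
    """Collect up to `limit` acceptable candidates, nearest the mask first.

    region is a half-open (start, end) span. For the upstrand region the
    mask lies at the end of the span; for the downstrand region it lies
    at the start.
    """
    start, end = region
    if nearest_mask_at_end:
        positions = range(end - length, start - 1, -1)  # walk away from the mask, leftward
    else:
        positions = range(start, end - length + 1)      # walk away from the mask, rightward
    found = []
    for pos in positions:
        candidate = seq[pos:pos + length]
        if is_acceptable(candidate):
            found.append((pos, candidate))
            if len(found) == limit:
                break
    return found
```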
The logic of a representative configuration of procedure 172 for placing probes in various configurations of the present invention is shown in more detail in
For SNP target sites, sequences corresponding to both alleles (only bi-allelic SNP sites can be supported in some configurations) can be explicitly constructed and the best probes for both strands of both allele sequences can be identified as described above. An acceptable pair of SNP probes must target the same sequence strand. If acceptable probe pairs can be found for both strands, the strand yielding the pair with the largest total score can be selected. When input sequences have multiple SNP sites denoted, the non-targeted SNP sites can be masked (i.e., set to base N) when the sequences for each explicitly targeted allele are constructed.
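Construction of the two allele sequences for a targeted bi-allelic SNP, with non-targeted SNP sites masked, can be sketched as follows. The input representation (a template string plus a list of zero-based positions with two alleles each) is assumed for illustration and is not the submission file format.

```python
def allele_sequences(template, snps, target_index):
    """Return the two allele sequences for the targeted SNP.

    snps: list of (zero-based position, allele_1, allele_2).
    Every SNP other than the targeted one is set to N.
    """
    bases = list(template)
    for index, (position, _allele_1, _allele_2) in enumerate(snps):
        if index != target_index:
            bases[position] = "N"  # mask non-targeted SNP sites
    position, allele_1, allele_2 = snps[target_index]
    sequences = []
    for allele in (allele_1, allele_2):
        bases[position] = allele
        sequences.append("".join(bases))
    return tuple(sequences)
```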
If no acceptable probe (or, for SNPs, no acceptable probe pair) can be found for a given target, the system reports this fact and attempts to continue, depending upon the number and format of sequence targets supplied. If a single sequence is supplied as input, failure to select a probe (or pair) results in program termination. If multiple target coordinates (or SNPs) are listed for a given sequence, failure to place a probe at one target coordinate causes probe placement process 172 to consider the next listed coordinate until all listed targets are exhausted. For multiple sequence input, failure to place a probe at any target coordinate leads the program to address the next listed sequence until all input sequences are exhausted. If there are multiple targets for a given sequence, whether or not a probe can be placed on any one individual target, all targets will be tested and the best design chosen.
Once a probe (or probe pair) sequence is selected, a list of upstream (forward) and downstream (reverse) primers can be delineated starting immediately before and after the probe position. These can be delineated via Tm (in some configurations using a different algorithm than used for probe design), and filtered or scored. If SNP probe pairs are being designed, primers are delineated starting immediately before and after the footprint corresponding to both SNP-targeting probe positions. At least one forward and one reverse primer must be identified. By default, up to ten forward and ten reverse primers can be collected, but the number of upstream and downstream primers may be changed, such as by using a command line switch. Failure to identify any forward or any reverse primers results in probe placement process 172 reporting the problem and continuing with the next target coordinate or next sequence as described above.
Forward and reverse primers can be checked for pair-wise compatibility and the corresponding amplicons can be filtered or scored. The compatibility check can include screening the 3′ ends of the primers across the amplicon associated with a given primer pairing. If too great a 3′ match is identified, the primers may not be paired. The pair of primers with the best score (by default, the pair yielding the shortest amplicon) can be chosen in some configurations of the present invention. Failure to select an acceptable primer pair results in probe placement process 172 reporting the problem and continuing as described above.
Acceptable designs comprising one or two probe sequences (such as, for example, probe sequences that can be used to make TaqMan® probes) together with corresponding forward and reverse primer sequences can be recorded in the log file. Along with the sequences, the coordinates, Tm values, and scores may be reported for each probe and primer. Any associated auxiliary data (e.g., tracking information) loaded during sequence and target input may also be reported to the log file when a successful design is obtained. If no acceptable designs can be found for a target sequence, only the target name may be recorded in the log file.
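One way to sketch the pair-wise compatibility check is shown below. This simplified version screens the 3′ end of each primer against the other primer only, not across the whole amplicon, and selects the compatible pair with the shortest amplicon. The function names, the coordinate convention and the default limit of four matching bases are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def three_prime_match(primer, other):
    """Longest 3'-terminal stretch of `primer` that can anneal anywhere on `other`.

    Both primers are written 5' to 3'.
    """
    for k in range(len(primer), 0, -1):
        if reverse_complement(primer[-k:]) in other:
            return k
    return 0


def pick_pair(forwards, reverses, max_match=4):
    """Choose the compatible pair giving the shortest amplicon.

    forwards: list of (amplicon_start, sequence).
    reverses: list of (amplicon_end, sequence).
    Returns ((start, seq), (end, seq)), or None if no pair is compatible.
    """
    best = None
    for f_start, f_seq in forwards:
        for r_end, r_seq in reverses:
            worst = max(three_prime_match(f_seq, r_seq),
                        three_prime_match(r_seq, f_seq))
            if worst > max_match:
                continue  # too great a 3' match: do not pair
            length = r_end - f_start
            if best is None or length < best[0]:
                best = (length, (f_start, f_seq), (r_end, r_seq))
    return None if best is None else best[1:]
```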
Stock Assays: Gene Expression
In some configurations, custom gene expression products include off-the-shelf assays. In some configurations, assays can be provided for 15,000 genes based upon the NCBI Reference Sequence Database Project (RefSeq). In some configurations, off-the-shelf assays can be provided for about 30,000 genes (i.e., every human gene or almost every human gene). Various configurations use 5′ nuclease chemistry with TaqMan® MGB probes and/or operate with universal formulation and thermal cycling parameters (for example, in some of these configurations, 900 nM primers, 250 nM probe). Some configurations provide assays designed utilizing a bioinformatics pipeline that includes private and public data, such as a combination of Celera data and Public data, or either private data or public data alone.
Gene Expression Assay Preparation:
In some configurations, gene expression assays include two unlabeled oligonucleotide primers and a single TaqMan® probe (Livak et al., PCR Methods Appl 4:357-362) with an MGB moiety. Assay design can include transcript pre-processing, actual design of the primers and probe and in silico quality control prior to manufacturing the probe.
Pre-Processing: In some configurations, certain sequence regions within the transcript can be identified in the pre-processing step for designing the oligonucleotide primers and probe for a 5′ nuclease assay. For example, sequence regions may be selected that do not contain any known single nucleotide polymorphisms or repeat sequences. Also, 5′ nuclease assays for gene expression may be designed across exon-exon boundaries, and thus, in some configurations, the position of each of the exon boundaries within a multi-exon transcript can be determined prior to the design of each assay.
In some configurations, transcript pre-processing begins once a batch of transcripts is compiled into a multi-fasta file. Repetitive and low complexity regions in each transcript can be masked (i.e. nucleotides replaced by an N) in some configurations. Repetitive sequences that can be masked include, for example, simple repeats (di- and tri-nucleotide repeats), Alu restriction site repeats, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs).
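Masking of simple di- and tri-nucleotide repeats can be illustrated with the regular-expression sketch below. This is a toy stand-in, not the masking software used in any particular configuration; it does not detect Alu, LINE or SINE elements, and the minimum of five consecutive copies is an arbitrary assumption.

```python
import re

# A 2- or 3-base unit followed by at least four further copies of itself.
_SIMPLE_REPEAT = re.compile(r"([ACGT]{2,3}?)\1{4,}")


def mask_simple_repeats(seq):
    """Replace each simple repeat run with an equal-length run of N."""
    return _SIMPLE_REPEAT.sub(lambda match: "N" * len(match.group(0)), seq)
```

Because a unit such as “AA” also matches, long homopolymer runs are masked as low-complexity sequence by the same pattern.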
Exon-exon boundaries can be identified by mapping the masked transcripts to the human genome using alignment software. The positions of each exon-exon boundary can be marked for each multi-exon transcript, with single-exon transcripts being identified as such. Mapping may be performed against the Celera genome assembly, with supplemental mapping information provided by public sequence data. If sequence discrepancies are found between the public transcripts and the Celera genome during this step, the discrepant bases may be masked.
In some configurations, in the final pre-processing step, all known single nucleotide polymorphisms (SNPs) can be masked after performing a BLAST analysis against a genomic database using methods known in the art (see Altschul et al., J. Mol. Biol. 215:403-410, 1990). All of the known SNPs can be identified within each transcript. Both the SNP-masking and sequence discrepancy-masking steps can be useful in preventing oligonucleotide primer and probe assays from being designed over ambiguous or known variant nucleotide(s).
Assay Design: The gene expression assay design can be based upon specifications as described above including optimal Tm requirements, GC-content, buffer/salt conditions, oligonucleotide concentrations, secondary structure, optimal amplicon size, and reduction of primer-dimer formation. As noted above, each gene expression assay can include, in some configurations, two unlabeled oligonucleotide primers and a single TaqMan® probe. The TaqMan® probes incorporate both an MGB and an NFQ at the 3′ end of the oligonucleotide. The use of MGB probes increases the probability of designing an assay in traditionally difficult sequence regions (e.g., AT-rich sequences). Additionally, the relatively short MGB probes increase the probability that a probe can be designed over every exon-exon boundary of a multi-exon gene.
For transcripts from multi-exon genes, an assay target position can be selected at each exon-exon boundary. The probe, rather than one of the primers, is generally, but not always, placed over the exon-exon boundary to ensure that the primers bind in two distinct exons. Placing the probe there also ensures that fluorescent signal is generated only from amplicons to which the probe can specifically bind and be cleaved. Assays designed over exon-exon boundaries can be designated by Hs********_m*, where the “m” indicates multiple exons.
For single-exon genes, both the primers and probe must be placed within the exon. Any assays that have the primers and probe placed within a single exon can, therefore, be designated Hs********_s*, where the “s” indicates a single exon. This designation provides an indication to users that there is the potential to amplify contaminating genomic DNA in an RNA sample, and thus that appropriate experimental design controls can be implemented to avoid this problem.
For multi-exon genes, n−1 assays can be designed, where n is the number of exons. For transcripts from single-exon genes, multiple assays can also be designed by designating target positions that are dispersed across the entire length of the transcript. The design of multiple assays for each transcript provides two advantages: 1) it increases the probability that a successful assay will emerge at the end of the entire design and quality control process, and 2) having assays that are designed from the 5′ to the 3′ ends of every transcript provides great flexibility in the choice of a high-quality assay at any position on the transcript.
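Selection of target positions can be sketched as follows: one target at each of the n−1 exon-exon boundaries of a multi-exon transcript, or several evenly dispersed targets for a single-exon transcript. The function name, the use of exon lengths as input, and the default of three targets for single-exon transcripts are assumptions.

```python
from itertools import accumulate


def target_positions(exon_lengths, single_exon_targets=3):
    """Return target coordinates along a transcript.

    Multi-exon: the n-1 exon-exon boundary coordinates.
    Single-exon: evenly spaced positions across the transcript.
    """
    if len(exon_lengths) > 1:
        return list(accumulate(exon_lengths))[:-1]
    length = exon_lengths[0]
    step = length / (single_exon_targets + 1)
    return [round(step * i) for i in range(1, single_exon_targets + 1)]
```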
In Silico Quality Control: In some configurations, after design, primer and probe sets are processed through a quality control step. This process penalizes, and thus helps to screen out: 1) assay designs that are not highly specific for the gene of interest, and 2) assay designs that may not accurately report the quantitative expression results for a particular target (i.e., an accurate threshold cycle (Ct) value) in a 5′ nuclease assay.
In some configurations, the in silico quality control comprises three major parts, and each part generates a penalty score specific to a given assay design. A final penalty score for each assay design comprises the sum of the three individual penalty scores. The assay design with the lowest cumulative penalty score for each transcript is the assay chosen for manufacturing.
In some configurations, the three parts comprising the in silico quality control process include:
1) Transcript BLAST Scoring, which comprises determining the degree of homology, through BLAST, between the assay and other closely-related transcripts. A penalty can be assigned if an assay detects any closely homologous transcript(s) other than the intended target.
2) Genome BLAST Scoring, which comprises determining the degree of homology, through BLAST, between the assay and non-self regions of genomic DNA (e.g. homologous genes and pseudogenes). A penalty can be assigned if an assay hits a second (or greater number) physical location on the genome in addition to the location of the gene-of-interest.
3) Determining the size of the intron across which the probe spans (for assays to multi-exon genes). A penalty can be assigned when the assay is designed across an exon-exon boundary that spans a small intron (for example, <2 Kb).
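The combination of the three penalty scores can be sketched as a simple sum followed by selection of the minimum. The function name and the example assay identifiers are hypothetical, and the behavior on ties (the first design encountered wins) is an assumption.

```python
def choose_assay(penalties):
    """Sum the three penalties per design and pick the lowest total.

    penalties: {design_id: (transcript_penalty, genome_penalty, intron_penalty)}.
    Returns (best_design_id, {design_id: total_penalty}).
    """
    totals = {design: sum(parts) for design, parts in penalties.items()}
    best = min(totals, key=totals.get)
    return best, totals
```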
In various configurations, for all BLAST searches, a quality control query construct can be made by generating an amplicon sequence that includes each of the two primers and the intervening probe; the amplicon can be created by padding the specific number of nucleotides between the primers and the probe with N's (
1) Transcript BLAST Scoring: The quality control query construct for each 5′ nuclease assay can be BLASTed against transcript database(s) in some configurations to ensure that 1) each primers/probe trio in the quality control query sequence matches the target transcript sequence, and that 2) each assay can be specific for the gene of interest and will not amplify transcripts from other genes. Primers with homology to other genes (with an intervening homologous probe) can produce an unwanted fluorescent signal, and thus an artificially low Ct value. Primers to homologous genes (without an intervening homologous probe) may amplify homologous transcript(s) in addition to the target transcript and cause competition for reagents in the PCR reaction, resulting in an artificially high threshold cycle (Ct) value if the competing homologous transcript is expressed at high levels. These types of side reactions can skew the Ct for the gene of interest and thus produce an erroneous quantitative result for the target transcript. If homology exists, an assay can be assigned a penalty score based on the degree of homology to other transcripts. In some configurations, three sets of numbers can be reported in this transcript BLAST step as described below.
(a) BLAST Hit to Self (Transcript_SelfHSP):
The high scoring pair (HSP) from this BLAST can produce a match of 100% homology with self. This HSP represents the alignment of the quality control query construct to the target transcript in the transcript database, and shows a “0 0 0” (representing 0 mismatches in the forward primer sequence, 0 mismatches in the probe sequence, and 0 mismatches in the reverse primer sequence) result when BLASTed against the database from which the target transcript was retrieved (
(b) Continuous BLAST Hits to Non-Self Transcripts (Transcript_HomoHSP)
In this set of BLAST results the top non-self HSPs can be reported (i.e. BLAST results to homologous transcripts). The highest penalty can be assigned to the HSP that is the closest homolog but that is not a perfect match to the quality control query construct. If two HSPs have the same homology score to the query construct, then the one with the higher homology to the probe region can be chosen as the top hit.
This approach will skip all of the homologs that have a “0 0 0” match and will only report the top non-zero HSPs. Therefore, a primer/probe set that can amplify alternative splice variants for the same gene will not be penalized, since these alternatively-spliced transcripts may be present as unique transcripts within the database being queried. This step helps to ensure that assays are gene-specific, but not necessarily transcript-specific.
Two or more highly homologous genes may end up with identical assay designs in regions where the genes have identical sequence. In such a situation a transcript penalty cannot be assigned (because of the “0 0 0” match). Situations in which an assay could detect transcripts from more than one gene can be penalized in a downstream part of the in silico quality control process when BLASTing is done against the genome assembly (see below). Designing the process in this manner facilitates differentiation between an assay detecting an alternatively-spliced variant of the same gene versus an assay that detects a transcript from a different gene locus.
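Selection of the top non-self HSP can be sketched as below, with each hit represented by its three mismatch counts (forward primer, probe, reverse primer). Treating the hit with the fewest total mismatches as the “closest homolog” is an assumption; the tie-break on the probe region follows the description above.

```python
def top_non_self_hsp(hits):
    """Return the closest imperfect hit, or None if every hit is "0 0 0".

    hits: iterable of (forward_mismatches, probe_mismatches, reverse_mismatches).
    """
    candidates = [hit for hit in hits if tuple(hit) != (0, 0, 0)]
    if not candidates:
        return None
    # Fewest total mismatches first; ties go to fewer probe mismatches.
    return min(candidates, key=lambda hit: (sum(hit), hit[1]))
```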
(c) Non-Continuous BLAST Hits to Non-Self Transcripts (Transcript_HomoHIT):
In some configurations, a BLAST query can be performed to analyze any alignments with high homology to each of the two primers, but which come from non-continuous regions of a homologous transcript. The quality control query construct hits two different (non-contiguous) parts (HSPs) of a non-self transcript. This BLAST result is indicative of an amplicon from a homologous transcript being of a different size than the target amplicon. These BLAST results can be from two different HSPs (
2) Genome BLAST Scoring: In some configurations, the same quality control query construct that is BLASTed against the transcript databases can also be BLASTed against the human genome assembly and the output can be reported in a similar manner. This quality control step avoids missing homologous transcripts that may not yet be known in transcript databases, facilitates, via genomic alignment, the distinguishing of different genes from alternative splice variants of the same gene, reduces amplification of artifacts due to the possible presence of contaminating genomic DNA in a total RNA sample, and penalizes those primers/probe that would amplify pseudogenes in total RNA samples that contain contaminating genomic DNA.
(a) BLAST hit to self (Genome_SelfHIT). As with the BLAST search to align the primers and probe to the target sequence in the transcript databases, similar BLAST searches can be used to align the primers and probe to the unique gene in the genome to which they were designed. For multi-exon genes the match must be “0 X 0” for the primer/probe set to avoid a penalty. The two zeros represent no mismatches between the forward and reverse primer sequences and the genome sequence, and the fact that they come from two different HSPs indicates that the primers are on two different exons, separated by an intron. The non-zero value of X reflects the fact that the probe is interrupted by an intron, and thus does not align itself to contiguous sequence in genomic DNA. For single exon genes, the BLAST search alignment returns a value of “0 0 0” because there are no intronic regions to interrupt the probe sequences and lead to mismatches.
(b) Continuous BLAST hits to non-self gene(s) (Genome_HomoHSP): The Genome_HomoHSP BLAST results identify genomic regions that have high homology to the primers and probe, and can amplify a PCR product of similar size to the target transcript from contaminating genomic DNA present in an RNA template. This situation most often occurs because of the presence of a pseudogene in genomic DNA. This BLAST result identifies the HSP with the highest homology to the amplicon, with the focus primarily on the two primer regions. If two HSPs have the same degree of homology in the primer sequences, then the HSP with a higher homology to the probe region can be chosen as the top hit, and the degree of mismatch in the primers and probe can be used to generate the penalty. The higher the degree of homology between the primers and probe and the HSP, the greater the penalty. This, in effect, over-penalizes assays by assigning this genomic DNA penalty. However, this penalty can be applied in order to maximize the ability of an assay to accurately quantitate the target of interest in RNA preparations that may be contaminated with genomic DNA.
(c) Non-continuous BLAST hits to non-self gene(s) (Genome_HomoHIT): This genomic BLAST alignment identifies the genomic sequences that have the highest homology to each of the primers but come from two different HSPs. If the intervening sequence between the two HSPs is short, then the penalty can be high. This minimizes the chance of amplifying a non-target template in an RNA preparation with genomic DNA contamination. If the genomic interval between the two primers is large the penalty can be smaller because it is unlikely the primers can actually produce an amplicon from this type of secondary template.
As described above, there can be no penalty for non-self “0 0 0” hits in the transcript BLAST quality control step, and thus the Genome_HomoHIT BLAST results can be used to penalize assays that cannot discriminate between homologous genes. If two or more highly homologous genes have identical assays designed (for example, in a region where the two different genes have identical sequence) then the assays can be penalized at this step. If the Genome_HomoHIT results show “0 X 0” hits to at least one genomic location in addition to self, then the assay can be assigned a large penalty because it can be assumed that this second hit is to a separate and distinct gene.
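The mismatch notation used in the genome BLAST rules above can be illustrated with a minimal sketch. The function names and the string layout are assumptions made for illustration; "X" is treated as any non-zero probe mismatch value.

```python
def parse_mismatches(triple):
    """Split a mismatch string such as '0 0 0' or '0 X 0' into three fields."""
    fwd, probe, rev = triple.split()
    return fwd, probe, rev


def self_hit_ok(triple, multi_exon):
    """Return True when the genome self hit would avoid a penalty.

    Assumption: the probe field is '0' for a contiguous match and anything
    else (for example 'X') when the probe is interrupted by an intron.
    """
    fwd, probe, rev = parse_mismatches(triple)
    primers_perfect = fwd == "0" and rev == "0"
    if multi_exon:
        # Probe spans an exon-exon boundary, so it should not align contiguously.
        return primers_perfect and probe != "0"
    # Single-exon gene: no intron interrupts the probe.
    return primers_perfect and probe == "0"
```

Under these assumptions, `self_hit_ok("0 X 0", multi_exon=True)` and `self_hit_ok("0 0 0", multi_exon=False)` both indicate that no self-hit penalty is needed.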
3) Intron Size Scoring: The third part of the in silico quality control scoring process can be the determination of intron size for assays to multi-exon genes that have the probe spanning an exon-exon boundary. Although a penalty for small intron size can be integrated into the Genome_HomoHIT rule, a separate rule also penalizes primer/probe sets that span introns of small size. This reduces the possibility of competition for reagents in RNA samples contaminated with genomic DNA, and also decreases the chance of amplifying incompletely spliced transcripts. The intron penalty can be based on the size of the intron: the larger the intron, the smaller the penalty.
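One way to express a penalty that falls as intron size grows is sketched below. The functional form, the maximum penalty and the scale value are hypothetical; the text states only that larger introns receive smaller penalties.

```python
def intron_penalty(intron_bp, max_penalty=10.0, scale_bp=200.0):
    """Penalty that shrinks monotonically as the intron grows.

    max_penalty and scale_bp are illustrative values, not taken from the
    described process.
    """
    if intron_bp <= 0:
        raise ValueError("intron size must be positive")
    return max_penalty * scale_bp / (scale_bp + intron_bp)
```

With these illustrative defaults, an intron equal to the scale value receives half the maximum penalty, and very large introns receive a penalty approaching zero.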
Linking Assays to Transcripts:
A large number of BLAST searches against a variety of databases can be performed during the assay design process, as outlined above. In one non-limiting example, as many as about 100 BLAST results can be stored for each assay. The BLAST files that can be loaded into TaqDB contain the mismatch information resulting from the comparison of the primers and probe to these various databases. When there is a BLAST file showing a perfect match (0,0,0) to a transcript (this will, by definition, occur for the transcript from which the assay was originally designed) then a link can be created in the database between the assay and the accession ID of that transcript. When there are additional transcripts that perfectly match the primers and probe, they can also be added to the database and “virtually” linked to that particular assay. These links can be considered virtual because they can be links to transcripts that the assay was not originally designed to detect, but which it will detect. Alternative splice forms of a particular gene are the most common source of virtual links. Cross referencing all of the BLAST files with all of the assays in this manner allows the creation of many-to-many relationships between assays and transcripts, thus defining which transcripts an assay may amplify. As a result of this process, an assay can match multiple transcript accession IDs, for example, multiple RefSeq entries. In addition, other BLAST files that contain small mismatches can also be loaded into the database and linked to the assay as BLAST quality control data.
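The linking step described above can be sketched as follows. The row layout, the identifiers and the function name are assumptions for illustration and do not describe the actual TaqDB schema.

```python
from collections import defaultdict


def link_assays(blast_rows):
    """Build many-to-many assay/transcript links from perfect (0, 0, 0) hits.

    blast_rows: iterable of (assay_id, transcript_id, (fwd, probe, rev))
    tuples; this layout is assumed for illustration.
    """
    assay_to_tx = defaultdict(set)
    tx_to_assay = defaultdict(set)
    for assay_id, tx_id, mismatches in blast_rows:
        if tuple(mismatches) == (0, 0, 0):
            assay_to_tx[assay_id].add(tx_id)
            tx_to_assay[tx_id].add(assay_id)
    return assay_to_tx, tx_to_assay


# Hypothetical identifiers.
rows = [
    ("assay_A", "NM_0001", (0, 0, 0)),  # transcript the assay was designed from
    ("assay_A", "NM_0002", (0, 0, 0)),  # e.g. a splice variant: a "virtual" link
    ("assay_A", "NM_0003", (0, 2, 0)),  # mismatch: retained only as QC data
    ("assay_B", "NM_0002", (0, 0, 0)),
]
a2t, t2a = link_assays(rows)
```

Keeping both directions of the mapping makes it straightforward to list every transcript an assay detects, or every assay that detects a given transcript.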
The assay-to-many-transcripts relationships can be displayed on the website online ordering system so that a researcher will have information on all of the transcripts an assay will detect, prior to purchasing the assay.
Transcript databases change over time inasmuch as new transcripts are continually being discovered, and occasionally entries that were originally thought to be transcripts can be found to be faulty and can be purged. In certain configurations, in order to keep the collection of assays current, BLAST searching can be used to map the assays to the new set of transcripts after a new transcript database is released (e.g., RefSeq is updated approximately every four weeks). This process keeps the information current through the identification of every known transcript that a particular assay can amplify, and it also allows the removal of any assay in the collection that no longer maps to the up-to-date transcripts. An additional benefit of the remapping process is that it is not necessary to design assays for every sequence in every transcript database. Rather one can often find a link from an existing assay to new sequences, and thus save time in delivery of assay products to researchers.
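A minimal sketch of the remapping step follows, assuming that the perfect-match results for each release are available as a mapping from assay ID to a set of transcript IDs. The data structures and names are hypothetical.

```python
def remap(old_links, new_perfect_hits):
    """Compare assay-to-transcript links before and after a database release.

    Both arguments map assay ID -> set of transcript IDs with perfect matches.
    Returns (updated_links, retired_assays, newly_covered_transcripts).
    """
    # Keep only assays that still map to at least one current transcript.
    updated = {a: set(t) for a, t in new_perfect_hits.items() if t}
    retired = sorted(a for a in old_links if a not in updated)
    old_tx = set().union(*old_links.values()) if old_links else set()
    new_tx = set().union(*updated.values()) if updated else set()
    return updated, retired, new_tx - old_tx
```

The third return value lists transcripts that an existing assay already covers, for which no new design would be needed.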
From failure analyses, it can be possible to recognize oligonucleotide sequences that can be problematic so that subsequently assays can be designed to be robust 5′ nuclease assays. Thus, a database containing failed assay designs can provide a basis for improving the design process. For example, extracting the oligonucleotide sequences from assays that failed in the manufacturing process (e.g., quantitation, or analytical quality control) allows comparison of problematic sequences to identify commonalities. Certain types of sequences may tend to be difficult to manufacture, and oligonucleotides containing such problematic sequences can be assessed a penalty. This, in turn, decreases the failure rate in subsequent manufacturing, and results in better functional assays.
Evaluation of Designed Assays:
In a non-limiting example of an assay design process, over 16,000 RefSeq transcripts were run through the assay design process. From these transcripts, 13,633 assays were sent to manufacturing. There are ≈2000 transcripts for which no order was sent to manufacturing, and these assays fall into the following categories:
1. No assay designed
2. No designed assay passes the current penalty cut-off
a. intron size penalty (multi-exon genes only)
b. Transcript penalty
c. Genome penalty
Although many of the assays that do not pass the in silico quality control standards may be suitable assays under certain circumstances, especially rigorous standards can be used in certain embodiments, to avoid manufacturing assays that have the potential to produce difficult-to-interpret quantitative gene expression results. There can be a variety of reasons why a designed assay may not be a robust assay for quantitative determination of mRNA transcript levels in a particular RNA sample. Thus, not all of these in silico quality control steps may be important to all users of an assay, but it can nevertheless be desirable to provide the most robust quantitative assays that will fit the requirements of the entire spectrum of sample types and sample preparation methodologies utilized by the broad range of users of a particular assay.
Table 1 provides an example of how the process works, showing all of the assays designed across the exon-exon boundaries of the human plakophilin 4 (PKP4) mRNA (RefSeq ID NM_003628).
As shown in the table, seventeen assays were designed for this transcript. Of the 17 assays designed, only the top-scoring assay that had no design penalties assigned was sent to manufacturing. However, there are six other candidate assays that met the manufacturing quality control cut-off for this particular target that can be chosen if for some reason the top-scoring assay fails along the downstream manufacturing and functional testing processes. Of the assay designs that did not pass the in silico quality control cut-off, one had a mid-level score because it was designed over an intron shorter than 200 bp. The rationale for this penalty score is that if the assay was being used to detect the transcript in a total RNA sample contaminated with genomic DNA, then the contaminating genomic DNA could be co-amplified with the mRNA target, potentially leading to inaccurate quantitation of the mRNA template. The likelihood of this occurring is low, since the primers are at 900 nM each in the final reaction and the probe does not detect genomic DNA, but these assays can still be penalized to provide a robust assay to the customers. Co-amplifying targets that do not bind to the probe will not interfere with quantitation when present in small amounts. Such targets can often be spiked into a reaction to serve as Internal Quantitation Controls (IQC) for quantitation (Furtado et al., N. Engl. J. Med. 340:1614-1622, 1999; Mulder et al., J. Clin. Microbiol. 32:292-300, 1994). Ten of the assays designed to the PKP4 target received a low final score because the primers/probe sequences for these assays exhibited high homology to at least one other portion of the genome.
This penalty signals one of three possible situations: 1) that the domain which these exons encode is conserved and is present in other genes, 2) that there exists at least one pseudogene elsewhere in the genome, or 3) that there is random sequence at another site in the genome with very high homology to these particular exon sequences. Regardless of the reason, the potential exists for these low-scoring assays to generate less accurate quantitative results in a total RNA sample contaminated with genomic DNA than in a highly purified RNA sample. This points to the need for high-quality RNA template preparation upstream of any RT-PCR methodologies.
In some configurations, gene expression products ordered by a requester on demand can be available from the supplier with a FAM label and the TaqMan® MGB probe technology, which utilizes a nonfluorescent quencher for improved sensitivity and quantitation precision. Addressing of the whole collection of human genes can be facilitated by advantageously utilizing the design flexibility of the shorter MGB probes. Also, in some configurations, TAMRA TaqMan probes can be made available to requesters by the supplier of customized products.
PCR efficiency of a given assay (or PCR reaction) can be defined as follows. An assay that results in a doubling of the amplicon with each PCR cycle has an efficiency of 100%. Efficiency can be of interest when using the comparative Ct method of quantification. One assumption in the equations used to calculate fold-differences by the comparative Ct method is that the assays/genes being compared must be of equivalent efficiency. A test can be conducted in some configurations to find outliers, i.e., assays of clearly poor efficiency, which may result from design, as opposed to contamination. Subsets of genes designed and tested for high efficiency can be offered in some configurations.
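The efficiency definition and the comparative Ct calculation can be sketched as follows. The standard-curve relation used in `efficiency_from_slope` is a commonly used formula that is not stated in this description and is included as an assumption; the function names are illustrative.

```python
def efficiency_from_slope(slope):
    """Efficiency from a standard-curve slope (Ct versus log10 of input).

    An efficiency of 1.0 (100%) corresponds to a doubling of the amplicon
    with each cycle, i.e. a slope of about -3.32.
    """
    return 10 ** (-1.0 / slope) - 1.0


def fold_change(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Comparative Ct fold difference, 2 ** (-ddCt).

    Valid only under the stated assumption that the assays being compared
    have equivalent (approximately 100%) efficiency.
    """
    ddct = (ct_target_test - ct_ref_test) - (ct_target_ctrl - ct_ref_ctrl)
    return 2.0 ** (-ddct)
```

An assay whose measured efficiency departs clearly from 1.0 would be a candidate outlier in the kind of test mentioned above, since the base of 2 in the fold-difference formula would no longer hold for it.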
Ordering Gene Expression Assays:
As discussed above, if the user desires at block 12 (see
In some configurations, custom gene expression assays made available for purchase may be selected by accession number (NCBI RefSeq ID), gene name, gene family, and/or functional groups and categories. For example, “Oncogene” is a category comprising three groups. For each group, some configurations provide a list of assays that a requester can order as a set or individually. If a requester does not find their particular gene expression assay of interest, the requester can check back on a regular basis to determine if a new assay has become available for the gene expression of interest. Alternatively, a requester may use the by design service. In some configurations, stock assays and custom assay designs can be made available for key splice variants. In addition, other search options and information associated with assays can be made available as desired.
A non-limiting example of a window pane which initiates the collection of information for gene expression assays is shown in
The user may also be able to request documentation from the system as indicated in
Further, the user may also be able to request reference information at decision block 230. If the user requests reference information at decision block 230, the user can be provided at block 232 with reference information which may be links to publicly available databases. For example, the user at block 232 may be linked to the NCBI Reference Sequence Project (RefSeq) database. It is to be understood, however, that other suitable databases may be referenced.
The user may also decide to search gene expression assays as represented by block 234. If the user decides to search gene expression assays at block 234, the user can be requested to accept certain terms and conditions of use for the assay search at block 240 (see
If the user accepts the terms and conditions of use, the user can be directed at block 242 to a window pane which allows the user to search for stock assays for gene expression products. An exemplary window pane is shown in
If the user selects to perform a keyword search at block 242, the user may be able to perform either a basic or an advanced keyword search. If a basic keyword search is to be performed, the user is able to select the search field in which the search is to be conducted, as well as enter a specific search term. The specific fields which can be searched include the non-limiting examples:
AB Assay ID
Celera gene (hCG)
Celera transcript (hCT)
Celera protein (hCP)
GenBank Nucleotide ID
GenBank Protein ID
A non-limiting example of a window pane which permits the entry of information for basic keyword searching is shown in
If an advanced keyword search is selected by the user, the user can insert search criteria for all of the fields described above. A non-limiting example of a window pane which permits entry of information for advanced keyword searching is shown in
If the user determines that it is desirable to conduct a batch ID search at block 246, a batch ID search can be conducted at block 248. The batch ID search finds assays by using a list of identification numbers. In this regard, the user is able to search by identification numbers from a variety of sources such as:
RefSeq accession number
GenBank Protein (GenPept) accession number
GenBank GI number
LocusLink gene symbol
Celera Gene (hCG)
Celera Transcript (hCT)
Celera Protein (hCP)
AB Assay ID
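Identifier lists of this kind may be pasted or uploaded with a mix of separators such as tabs, line breaks, commas or spaces. A minimal sketch of normalizing such input is given below; the de-duplication step and the example identifiers are assumptions, not part of the described system.

```python
import re


def parse_batch_ids(text):
    """Split pasted or uploaded text into an ordered list of unique IDs.

    Tabs, carriage returns, line feeds, commas and spaces are all treated
    as separators; removing repeated IDs is an illustrative choice.
    """
    tokens = re.split(r"[,\s]+", text.strip())
    seen, ids = set(), []
    for tok in tokens:
        if tok and tok not in seen:
            seen.add(tok)
            ids.append(tok)
    return ids
```

For example, `parse_batch_ids("NM_003628, hCG12345\thCT67890")` yields three identifiers regardless of which separator was used between them.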
The information can be entered in a number of formats; for example, the identification numbers can be separated by a tab, carriage return, line return, comma or space. In addition, it is possible to upload a file containing the identification numbers, such as a file which was previously exported following a gene expression search. An exemplary window pane which allows the user to enter information for a batch ID search is shown in
Finally, the user may also be able to decide at block 250 whether a classification search, such as using the Celera Panther classification system, is to be conducted. The Celera Panther classification system is a system for classifying and predicting the functions of proteins in the context of sequence-relationships (see for example, U.S. patent application Ser. No. 60/[serial number not yet assigned] filed Dec. 14, 2002, Attorney Docket No. 9692-30USB, entitled “Methods for identifying, viewing, and analyzing syntenic and orthologous genomic regions between two or more species,” which is hereby incorporated by reference in its entirety). Assays can be assigned to a Panther category based upon a match to equivalently assigned Celera gene data. The Panther categories can be constructed up to three levels deep with assay assignments at any one of the three levels.
If the user desires to perform a classification search at block 250, a classification search can be conducted at block 252. The user is then able to search by molecular function categories involving a property of the protein or of a particular biochemical reaction performed by a protein, such as receptor, kinase or hydrolase. In addition, the user may also be able to search by biological process categories involving the biochemical reactions that work together towards a common biological objective. The process can be at the cellular level, such as glycolysis and signal transduction, or at the system level, such as immunity and defense, and sensory perception.
An example of the manner in which a classification search at block 252 can be conducted is shown in
Similarly, the user can select at block 258 to conduct a search based on biological processes. If the user makes the selection, the user selects one of a number of broad categories of biological processes which the system provides to the user at block 266. An exemplary window pane showing the biological processes from which the user may select is illustrated in
After selecting one of the broad categories of biological process at block 266, the user determines whether the search hierarchy has been completed at block 268. If the user has not completed the search hierarchy (i.e., the relevant biological process displayed to the user contains subcategories), the user then again selects one of the subcategories at block 266. If the user has completed this search hierarchy at block 268, the user then identifies and orders the assay at 270.
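The repeated select-and-check loop of the classification search can be sketched as a walk down a nested category tree. The tree contents, assay names and function name below are hypothetical and only illustrate a hierarchy of up to three levels with assays assignable at any level.

```python
# Hypothetical category tree; leaves hold assay IDs.
TREE = {
    "Signal transduction": {
        "Cell surface receptor signaling": {
            "Cytokine signaling": ["assay_1", "assay_2"],
        },
        "Intracellular signaling": ["assay_3"],
    },
    "Immunity and defense": ["assay_4"],
}


def drill_down(tree, choices):
    """Follow successive category choices until assays are reached.

    Returns (remaining_subcategories, assays). assays is None while the
    hierarchy is incomplete, i.e. while subcategories remain to be chosen.
    """
    node = tree
    for choice in choices:
        if not isinstance(node, dict):
            raise ValueError("hierarchy already completed before " + repr(choice))
        node = node[choice]
    if isinstance(node, dict):
        return sorted(node), None
    return [], node
```

A caller would keep presenting the returned subcategories until `assays` is no longer `None`, at which point the assays can be identified and ordered.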
A non-limiting example of a classification search relating to biological processes is shown in
After the user inputs the search information, the results of the search can be provided to the user. One non-limiting example of a window pane providing results to the user is shown in
If the user selects a given assay, information concerning this specific assay can be presented in a manner similar to that shown in
In some configurations, endogenous controls can be available for relative quantitation of gene expression. For easy identification and ordering, the controls can be highlighted in the ordering system in some configurations.
Stock Assays: SNP Genotyping
In some configurations, at least 40,000 stock SNP genotyping products can be available. In some of these configurations, at least 77,000 such products can be available. In some of these configurations, at least 150,000 such products can be available, and in some of these configurations, at least 200,000 stock SNP genotyping products can be available.
In various configurations, SNP genotyping products can include, for example, 2 primers and 2 probes, each probe having a different label such as VIC or FAM, in a single tube, with or without assay information which can be provided on CD or other media. Various configurations can include some or all of the above.
In various configurations, at least 40,000 assays can be available using TaqMan® MGB probe technology under universal assay conditions. In some of these configurations, at least 150,000 such assays can be available, and in some of these configurations at least 200,000 assays can be available using TaqMan® MGB probe technology under universal assay conditions.
SNP Genotyping Assay Preparation:
In some configurations, SNP detection products include off-the-shelf assays. Various configurations use 5′ nuclease chemistry with TaqMan® MGB probes and/or operate with universal formulation and thermal cycling parameters (for example, in some of these configurations, 900 nM primers, 250 nM probe). Some configurations provide assays designed utilizing a bioinformatics pipeline that includes private and public data, such as a combination of Celera data and public data, or either private data or public data alone.
The design of SNP genotyping assays can be similar to the design of gene expression assays in a number of aspects. In some configurations, each SNP assay can include two unlabeled oligonucleotide primers and two TaqMan® probes, each probe having a fluorophore, a fluorescence quencher, and a minor groove binder. Assay design can include selection of SNPs for assay design in a pre-processing selection process, design of the primers and probes, and in silico quality control prior to manufacture of the primers and probes.
Pre-Processing: In some configurations, certain sequence regions within the transcript can be identified in the pre-processing step for designing the oligonucleotide primers and probes for a 5′ nuclease assay as described above for gene expression assay design. Repetitive and low complexity regions can be masked (i.e., nucleotides replaced by an N) along with any SNP other than the SNP for which the assay is to be designed. Non-limiting examples of repetitive sequences which can be masked include simple repeats (di- and tri-nucleotide repeats), Alu restriction site repeats, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs).
SNPs can be identified in a gene region by performing a BLAST analysis against genomic databases using methods known in the art (see Altschul et al., J. Mol. Biol. 215:403-410, 1990), or can be identified in a SNP database. If discrepancies are discovered, discrepancy masking steps can be used to help ensure that no oligonucleotide primers or probes are designed over ambiguous nucleotide(s).
Assay Design: The SNP assay design can be based upon specifications such as optimal Tm requirements, GC-content, buffer/salt conditions, oligonucleotide concentrations, secondary structure, optimal amplicon size, and reduction of primer-dimer formation as described above for gene expression assays.
In silico Quality Control: In some configurations, after design, primer and probe sets can be processed through a quality control step. This process, although conceptually similar to that described above for gene expression assays, involves quality control steps applicable to SNP genotyping assays as described below. The quality control step penalizes an assay at each phase of testing to generate a penalty score specific to a given assay design. A final penalty score for each assay design comprises the sum of each of the individual penalty scores. The assay design with the lowest cumulative penalty score for each SNP can be the assay that is chosen for manufacturing.
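The scoring rule above, a sum of per-phase penalties followed by selection of the lowest total, can be sketched as follows. The phase names and penalty values are hypothetical.

```python
def total_penalty(phase_penalties):
    """Final score is the sum of the individual phase penalties."""
    return sum(phase_penalties.values())


def pick_design(designs):
    """Return the design ID with the lowest cumulative penalty."""
    return min(designs, key=lambda d: total_penalty(designs[d]))


# Hypothetical candidate designs for one SNP.
designs = {
    "design_1": {"genome_blast": 0, "self_hit": 0},
    "design_2": {"genome_blast": 5, "self_hit": 0},
    "design_3": {"genome_blast": 1, "self_hit": 50},
}
```

Here `pick_design(designs)` selects `design_1`, the only candidate with no penalties assigned.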
In some configurations, the in silico quality control process for SNP genotyping assays can involve genome BLAST scoring, which involves determining the degree of homology, through BLAST, between the assay and non-self regions of genomic DNA (e.g. homologous genes and pseudogenes). A penalty can be assigned if an assay hits a second (or greater number) physical location on the genome in addition to the location of the gene-of-interest.
For all BLAST searches, a quality control query construct can be made by generating an amplicon sequence that includes each of the two primers and the intervening probes; the amplicon can be created by padding the specific number of nucleotides between the primers and the probes with N's. The quality control query construct for each 5′ nuclease assay can be BLASTed against a genomic database to ensure that 1) each primer/probe set in the quality control query sequence matches perfectly to the target SNP sequence (except for the SNP alleles in the probes), and that 2) each assay is specific for the SNP of interest and will not detect SNPs from any other regions of the genome. Primers with homology to other genes (with an intervening homologous probe) can produce an unwanted fluorescent signal, and thus mask the analysis of a true SNP. Primers to homologous genes (without an intervening homologous probe) may amplify homologous genes in addition to the gene comprising the target SNP and cause competition for reagents in the PCR reaction, causing spurious results. If homology exists, an assay can be assigned a penalty score based on the degree of homology to other SNPs. Two sets of numbers can be reported in this SNP BLAST step and are described below.
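A minimal sketch of building such a quality control query construct is shown below. It assumes that the forward primer, the probe and the reverse primer binding site are all given on the same strand and that the gap lengths are known; the function and parameter names are illustrative.

```python
def build_qc_query(fwd_primer, probe, rev_primer_site, gap_fwd_probe, gap_probe_rev):
    """Join primer, probe and primer site into one amplicon-like query.

    The gaps are the numbers of nucleotides separating the oligonucleotides
    in the real amplicon; they are filled with N's so that only the primer
    and probe regions contribute to the BLAST alignment.
    """
    if gap_fwd_probe < 0 or gap_probe_rev < 0:
        raise ValueError("gaps must be non-negative")
    return (
        fwd_primer
        + "N" * gap_fwd_probe
        + probe
        + "N" * gap_probe_rev
        + rev_primer_site
    )
```

The resulting string has the same length and spacing as the amplicon, which is what allows a single query to report mismatches for the forward primer, probe and reverse primer together.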
(a) BLAST Hit to Self.
The high scoring pair (HSP) from this BLAST can produce a match of 100% homology with self. This HSP represents the alignment of the quality control query construct to the target SNP in a SNP database, and shows a “0 0 0” (representing 0 mismatches in the forward primer sequence, 0 mismatches in the probe sequence (except for the SNP allele), and 0 mismatches in the reverse primer sequence) when BLASTed against the database from which the target SNP was retrieved. If the quality control query construct has no hits against any SNP in a SNP database, then the mismatch can be reported as “50 50 50” (an artificially high mismatch value) and the assay can be flagged as being problematic.
(b) Continuous Blast Hits to Non-Self SNPs.
In this set of BLAST results the top non-self HSPs can be reported (i.e. BLAST results to homologous SNPs). The highest penalty can be assigned to the HSP that is the closest homolog but that is not a perfect match to the quality control query construct. If two HSPs have the same homology score to the query construct, then the one with the higher homology to the probe region can be chosen as the top hit.
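The selection of the top non-self hit can be sketched as below. Using the total mismatch count as the homology score is an assumption made for illustration; the text specifies only that ties are broken in favor of higher homology to the probe region.

```python
def top_non_self_hit(hsps):
    """Pick the closest non-perfect HSP; ties go to fewer probe mismatches.

    hsps: list of (name, fwd_mm, probe_mm, rev_mm) mismatch counts
    (an assumed layout). Returns None if every HSP is a perfect match.
    """
    candidates = [h for h in hsps if (h[1], h[2], h[3]) != (0, 0, 0)]
    if not candidates:
        return None
    # Fewer total mismatches means closer homology; probe mismatches break ties.
    return min(candidates, key=lambda h: (h[1] + h[2] + h[3], h[2]))
```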
Two or more highly homologous genes may end up with an identical assay design in regions where the genes have identical sequence. If two or more highly homologous genes have identical assays designed (for example, in a region where the two different genes have identical sequence) then the assays can be assigned a large penalty because it can be assumed that a second hit is to a separate and distinct gene.
Linking Assays to SNPs:
A large number of BLAST searches against a variety of databases can be performed during the assay design process, as outlined above. In one non-limiting example, about 100 BLAST results can be stored for each assay. The BLAST files that can be loaded into a database such as TaqDB contain the mismatch information resulting from the comparison of the primers and probes to these various databases. When there is a BLAST file showing a perfect match to a set of primers and SNP probes (this will, by definition, occur for the SNP from which the assay was originally designed) then a link can be created in the database between the assay and the accession ID of that SNP. When there are additional SNPs that perfectly match the primers and probe, they can also be added to the database and can be “virtually” linked to that particular assay. These links can be considered virtual because they can be links to SNPs that the assay was not originally designed to detect, but which it will detect. Cross referencing all of the BLAST files with all of the assays in this manner allows creation of many-to-many relationships between assays and SNPs, thus defining which SNPs an assay may amplify. As a result of this process, an assay can match multiple SNP accession IDs, for example, multiple RefSeq entries. In addition, other BLAST files that contain small mismatches can also be loaded into the database and linked to the assay as BLAST quality control data.
The SNP to genome relationships can be displayed on the online ordering system so that a researcher will have information on the SNP prior to purchasing the assay (see, for example,
In certain configurations, BLAST searching can be repeated for SNP genotyping assays as updated SNP databases are released in a manner similar to that described above for gene expression assays. This process keeps the information current through the identification of every known SNP that a particular assay can amplify, and it also allows the removal of any assay in the collection that no longer maps to the up-to-date SNPs. In addition, it can often be the case that a link from an existing assay can be found to new sequences identified in an updated SNP database, thus saving time in delivery of assay products to researchers.
In certain configurations, as was described above for gene expression assays, analysis of assay design failures can be performed to provide information for improving the design process. This decreases the failure rate in subsequent manufacturing, and results in better functional assays.
Evaluation of Designed Assays:
In a non-limiting example of an assay design process, several hundred thousand SNPs were run through the assay design process. From these SNPs, over 100,000 assays were sent to manufacturing. There can be many SNPs for which no order has been sent to manufacturing, and these assays fall into the following categories:
1. No assay designed
2. No designed assay passes the current penalty cut-off
Although many of the assays that do not pass the in silico quality control standards may be suitable assays under certain circumstances, especially rigorous standards can be used in certain situations, to avoid manufacturing assays that have the potential to produce difficult-to-interpret results. There can be a variety of reasons why a designed assay may not be a robust assay for a SNP. Thus, not all of these in silico quality control steps may be important to all users of an assay, but it can nevertheless be desirable to provide the most robust quantitative assays that will fit the requirements of the entire spectrum of sample types and sample preparation methodologies utilized by the broad range of users of a particular assay.
Ordering SNP Genotyping Assays:
If the user desires to obtain a stock assay for SNP genotyping products, the user selects this feature at block 12, as shown in
After the user reviews an overview of the stock assay system for SNP genotyping at block 272 (see
Pre-formulated Assays (187.5 µL, 20× mix)
2 unlabeled primers
1 FAM™ dye-labeled TaqMan® MGB probe
1 VIC® dye-labeled TaqMan® MGB probe
Compact disk containing:
Product insert (hard copies of these documents can also be provided)
Assay Information File containing: sales order number, well location, assay ID, vial ID, Celera ID, gene name, gene symbol, category, category ID, group name, group ID, chromosome, cytogenetic band, NCBI gene reference, NCBI SNP reference, SNP minor allele frequencies, SNP type, context sequence, reporter dyes
With each order:
2-D barcode laser-etched on the bottom of each assay tube
1-D barcode printed on each rack of tubes
Instrument platform: 7900 HT; Reaction volume: 5 µL; Reactions/tube: 750
Instrument platform: 7700, 7000; Reaction volume: 25 µL; Reactions/tube: 150
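The reactions-per-tube figures listed above follow from the tube volume, the mix concentration and the reaction volume. A short sketch of the arithmetic, assuming no dead volume or pipetting loss, is:

```python
def reactions_per_tube(tube_volume_ul=187.5, mix_concentration=20, reaction_volume_ul=5.0):
    """Number of reactions supported by one tube of concentrated assay mix.

    A 20x mix contributes 1/20 of the final reaction volume.
    Assumption: the full tube volume is usable.
    """
    mix_per_reaction_ul = reaction_volume_ul / mix_concentration
    return round(tube_volume_ul / mix_per_reaction_ul)
```

A 5 µL reaction uses 0.25 µL of the 20× mix, giving 750 reactions per 187.5 µL tube, and a 25 µL reaction uses 1.25 µL, giving 150.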
The user may also be able to order documentation describing the SNP genotyping products. In this regard, the user selects to order documentation at block 278. If the user selects to order documentation at block 278, documentation relating to SNP genotyping assays can be provided to the user at block 280. In particular, the documentation can be provided by a window pane in a manner similar to that associated with gene expression assays (see
In addition, the user may also be able to decide whether the user would like to receive reference information at block 282. This information may include the general steps for using SNP genotyping products as well as for providing general information regarding allele frequency. If the user decides to obtain reference information at block 282, the user can be provided with this information at block 284. In addition, the user may be able to search for assays used for SNP genotyping at block 286. If the user decides to search for assays at block 286, the user conducts a search at block 288.
If the user decides to perform a keyword search at block 294 with respect to SNP genotyping, the user is able to search selected fields for selected terms. The specific fields which may be searched can be as follows:
LocusLink Gene Name
AB Assay ID
Celera SNP (hCV)
Celera gene (hCG)
Celera transcript (hCT)
Celera protein (hCP)
In addition, it is possible to filter the search by the specific SNP type which include:
acceptor splice site
donor splice site
putative utr (untranslated region) 3
putative utr 5
In addition, it is possible to search all these SNP types together.
In addition, the system also permits the use of a filter to exclude 10 kb flanking sequence. For a given gene, all the RefSeq sequence data associated with the gene in LocusLink are mapped on the genome. A gene may be defined as the furthest 5′ and furthest 3′ base of RefSeq sequence data associated with the gene. When searching on gene-related fields, the user may choose whether to include or exclude 10 Kb flanking sequence. Accordingly, when the system searches it can include up to 10 Kb of upstream sequence and downstream sequence in the query. This filter can be valid for the following fields:
LocusLink Gene Name
Celera Transcript (hCT)
Celera Protein (hCP)
Celera Gene (hCG)
In certain configurations, the system will ignore this filter if the user searches on fields not listed above. If searching for SNP assays by Celera Gene (hCG) ID, a user can select a search within a gene by setting a search filter at 0 Kb or within a gene region which includes 10 Kb of 5′ and 3′ flanking sequence. An exemplary window pane permitting the user to perform a keyword search with search filter is shown in
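By way of non-limiting illustration, the flanking-sequence filter described above can be sketched as follows. The names `Region`, `search_region` and `snps_in_region`, the treatment of 1 Kb as 1000 bases, and the choice to leave the gene region unchanged when the filter is ignored are assumptions made for this sketch and are not taken from the system itself.

```python
from dataclasses import dataclass

# Fields for which the flanking filter is valid, as listed above.
FLANK_FIELDS = {
    "LocusLink Gene Name",
    "Celera Transcript (hCT)",
    "Celera Protein (hCP)",
    "Celera Gene (hCG)",
}


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # furthest 5' base of the mapped RefSeq data (1-based)
    end: int    # furthest 3' base of the mapped RefSeq data (inclusive)


def search_region(gene: Region, field: str, flank_kb: int = 10) -> Region:
    """Return the interval to query: the gene, optionally widened by a flank."""
    if field not in FLANK_FIELDS:
        flank_kb = 0  # assumption: ignoring the filter means no flank is added
    flank = flank_kb * 1000  # assumption: 1 Kb = 1000 bases
    return Region(gene.chrom, max(1, gene.start - flank), gene.end + flank)


def snps_in_region(snps, region: Region):
    """Keep the IDs of (snp_id, chrom, position) tuples inside the region."""
    return [
        snp_id
        for snp_id, chrom, pos in snps
        if chrom == region.chrom and region.start <= pos <= region.end
    ]
```

Setting `flank_kb=0` corresponds to a search within the gene only, while the default of 10 corresponds to a gene region including 10 Kb of 5′ and 3′ flanking sequence.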
Alternatively, it can also be possible to perform an advanced keyword search in which search terms can be placed in one or more of the various fields for SNP genotyping as described above. An exemplary window pane allowing the user to perform an advanced keyword search is shown in
In addition, the system also allows users to select ranges of Caucasian and African-American minor allele frequency. The allele frequency indicates the number of occurrences of an allele seen in the total number of chromosomes sequenced at the SNP site. The allele frequency for stock assay SNP genotyping products may be obtained from 90 individual human genomic DNA samples, 45 African-Americans and 45 Caucasians, from the Coriell Human Variation collection. The samples can be run in a validation laboratory in order to ensure that every SNP provided in the stock assay SNP genotyping product is polymorphic and that the allele frequency is adequate for association studies in a variety of populations. The results obtained from such a validation step also allow inference of haplotype blocks and the analysis of the extent of linkage disequilibrium among these markers. Selection and validation criteria for a set of SNP Genotyping Assays are described in Francisco de la Vega, et al., “Selection of Single Nucleotide Polymorphisms for a Whole Genome Linkage Disequilibrium Mapping Set”.
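As a non-limiting sketch of the allele frequency calculation described above, the following example counts alleles over the chromosomes genotyped in one population panel and reports the frequency of the less common allele. The representation of each genotype as a two-letter string, the function names, and the example panel are assumptions made for illustration only.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Map each allele to its share of all chromosomes in the panel.

    Each genotype is assumed to be a two-character string such as "AG",
    so every individual contributes two chromosomes.
    """
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes):
    """Return the frequency of the less common allele (0.0 if monomorphic)."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())


def in_range(maf, low, high):
    """Check a minor allele frequency against a user-selected range."""
    return low <= maf <= high


# Hypothetical panel of 45 individuals (90 chromosomes): 18 of 90 carry G.
panel = ["AA"] * 30 + ["AG"] * 12 + ["GG"] * 3
maf = minor_allele_frequency(panel)
```

Applying `in_range` separately to the value computed for each population panel would correspond to filtering by Caucasian and African-American minor allele frequency ranges.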
When the user selects to search by location at block 298, the user initially selects the mapping type and relevant identification information at block 300. In this regard, the user can select the assay by SNP, gene or marker location within a given range. Alternatively, the SNP assay may also be determined based on position on the chromosome. This can be done by initially selecting the available chromosome and then the position within the chromosome, which may be reported in units of megabases. Alternatively, the SNP assay may be determined by location using ABI PRISM® Linkage Mapping Sets v2.5. ABI PRISM® Linkage Mapping Sets v2.5 consist of 811 fluorescent-labeled PCR markers selected to amplify highly informative two base-pair repeat microsatellite loci. These markers can be arranged in two sets to provide coverage of the human genome at 5 cM and 10 cM average resolution. The markers are from the 1996 Genethon human genetic map and were selected based on chromosomal location and heterozygosity. More information regarding ABI PRISM® Linkage Mapping Sets v2.5 can be obtained from Applied Biosystems.
After the mapping type and identification information has been entered, the user also has the opportunity to select flanking region display results. A flanking region of 10 Kb can be selected, or alternatively, the system can be configured so that the user can select 0 Kb, 100 Kb, 500 Kb, and 100 Mb. Finally, the user may be able to filter the results using Caucasian and African-American minor allele frequency as well as SNP type. An exemplary window pane allowing the user to search by location is shown at
Finally, the user can also decide whether to search for SNP genotyping assays using a batch ID search at decision block 302. In this regard, the user enters a valid ID type into the system at block 288. The valid ID types can be, for example, one or more of the following:
dbSNP reference cluster or assay ID
AB Assay ID
LocusLink gene symbol
RefSeq accession number
Celera Gene (hCG)
Alternatively, the user can upload a file from the user's computer containing previously exported results from a SNP genotyping search. In this case, if the file contains a list of identification numbers, each of the identification numbers can, in various embodiments, be separated by a tab, a carriage return, a line return, a comma or a space. If the user selects to use previously exported assay results, then a tab-delimited file resulting from the stock assay export feature may be used. An exemplary window pane allowing the user to conduct a batch ID search is shown in
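One possible way of reading such an uploaded list of identification numbers is sketched below. The function name `parse_id_list`, the removal of duplicate entries, and the example identifiers are assumptions made for illustration; only the set of permitted separators is taken from the description above.

```python
import re

# Separators named above: tab, carriage return, line return, comma, space.
_SEPARATORS = re.compile(r"[\t\r\n, ]+")


def parse_id_list(text):
    """Split uploaded text into identifiers, preserving first-seen order."""
    ids = []
    seen = set()
    for token in _SEPARATORS.split(text):
        # Skip empty tokens from leading or trailing separators and,
        # as an assumption of this sketch, drop repeated identifiers.
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


# Hypothetical identifiers mixing several of the permitted separators.
uploaded = "ID0001,ID0002\tID0003\r\nID0004 ID0002\n"
batch = parse_id_list(uploaded)
```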
The results from a SNP genotyping assay search can be provided to the user in a manner similar to that described above for gene expression assays. One non-limiting example of a window pane providing results to the user is shown in
If the user selects a given assay, information concerning this specific assay can be presented in a manner similar to that shown in
High Capacity Manufacturing
In various configurations, high capacity and high throughput equipment can be used for oligonucleotide synthesis and validation. Manufacturing activity can be process driven with well defined and validated procedures for every step in the manufacturing process.
DNA synthesizers are well known in the art. A DNA synthesizer may be used to manufacture oligonucleotides beginning with a primary residue which is the 3′-most nucleotide, anchored to a solid support. Each additional nucleotide can then be added in the desired order to assemble the nucleotide chain while proceeding in the 3′-to-5′ direction.
Phosphoramidite chemistry may be employed for the addition, although alternative chemistries such as the H-phosphonate method can be used (for review see Brown et al., “Modern machine-aided methods of oligonucleotide synthesis”, in Oligonucleotides and Analogues: A Practical Approach, Ed. F. Eckstein, IRL Press, Oxford UK, 1995). Four steps are performed in the synthesis. The first base is attached to a solid support, which is typically controlled pore glass, via an ester linkage to the 3′-hydroxyl of the base. The 5′-trityl blocking group of the base is then cleaved to initiate synthesis using brief treatment with an acid such as, for example, dichloroacetic acid or trichloroacetic acid in dichloromethane. The next monomer of the oligonucleotide being synthesized is then added in the form of a DNA phosphoramidite in tetrazole and coupled to the available 5′-hydroxyl group of the first base. The resulting phosphite linkage is then oxidized to phosphate by treatment with iodine in an aqueous solution containing THF and pyridine to complete the first cycle of oligonucleotide synthesis. This cycle is then repeated for each base being added.
The DNA synthesizer used in some configurations of the present invention can be capable of producing oligonucleotides in amounts of about 40 nmol, about 0.2 µmol or about 1 µmol. In various configurations, a DNA synthesizer can be used that can produce at least about 100, at least about 200 or more primer length oligos in 40 and 200 nmol amounts over a period of about 10 hours.
The DNA synthesizer used in some configurations of the present invention can also be capable of attaching appropriate fluorophores or quenchers to probes after synthesis.
For SNP assays based upon TaqMan® methods, probes and primers can be synthesized for performing TaqMan® assays. In certain embodiments, two TaqMan® MGB probes can be designed and manufactured to distinguish between two SNP alleles. Each TaqMan® MGB probe contains, in some configurations, a reporter dye at the 5′ end of the probe. The reporter dyes can be any of a number of suitable dyes, such as, for example, a VIC™ dye or a 6-FAM™ dye. Thus, for example, a VIC™ dye can be linked to the 5′ end of a first probe specific for one allele of a SNP and a 6-FAM™ dye can be linked to the 5′ end of a second probe specific for the second allele for use in a given assay. An MGB, as described above, can also be included in each probe. This increases the melting temperature (Tm) without increasing probe length, thereby permitting the design of shorter probes. The use of MGBs results in greater differences in Tm values between matched and mismatched probes, which produces more accurate allelic discrimination.
In certain other configurations, probes and primers of gene expression assays can be synthesized. In some configurations a reporter dye can be attached at the 5′ end of each probe. The reporter dye can be any suitable dye, for example, a dye such as a VIC™ dye or a 6-FAM™ dye. An MGB, as described above, can also be included in the probe. Thus, for example, one FAM™ dye-labeled, TaqMan® MGB probe can be synthesized along with two target-specific primers for use in a given assay.
In certain aspects, a quencher can also be attached to the probes for both SNP and gene expression assays. The quencher, in various configurations, can be an NFQ attached to the 3′ end of each probe.
In various configurations of the present invention, the synthesized oligonucleotide can be subjected to purification methods which may include, for example, polyacrylamide gel electrophoresis (PAGE) for oligonucleotides of greater than 50 bases in length and high performance liquid chromatography (HPLC) for oligonucleotides of less than 50 bases in length. A typical anion-exchange HPLC profile of a 23-mer is shown in
The DNA synthesizer used in some configurations of the present invention can be coupled to a computer which allows conditions to be set for automatic performance of the DNA synthesis.
In some configurations of the present invention, the DNA synthesizer used can be capable of synthesizing DNA oligonucleotides with rapid cycle times, low reagent consumption and reliability. One such high-capacity, high-throughput DNA synthesizer suitable for use is the commercially available ABI 3900 DNA Synthesizer (Applied Biosystems, Foster City, Calif.).
In various configurations, a large number of DNA synthesizers, for example at least 10, at least 20, at least 50, at least 70 or more, can be employed in the manufacturing facility. Multiple manufacturing facilities can also be used and the production of oligonucleotides in the individual facilities can be coordinated if desired. The multiple manufacturing facilities may be located in strategic geographic sites so as to efficiently supply a world-wide market.
Post-Manufacturing Validation and Quality Control
In various configurations of the present invention, selected quality checks can be performed by the supplier. Quality checks may include synthesis yield, analytical quality control (which may be performed, for example, using mass spectrometry), functional testing and validation testing. Validation testing can be performed on the manufactured assay prior to delivery to the consumer to verify that the assay meets the specified characteristics. If the assays do not meet the quality check or checks, they may be resynthesized, or other appropriate corrective action may be taken, before the assays are shipped to the requester. The testing may include confirming that a synthesized oligonucleotide sequence is correct by testing primers and/or probes individually by mass spectrometry, and/or, for human SNP assays, functionally testing using human genomic DNA to confirm that amplification occurs and that at least one allelic discrimination cluster (heterozygous or homozygous, compared to no template controls) is produced.
Synthesis Yield Testing:
In various configurations, each component that makes up an assay, i.e., probes and primers, can be tested for yield after synthesis. Such testing can be done as part of the purification process and any suitable method known in the art can be used including PAGE and HPLC. Anion-exchange HPLC can be, in various configurations, used for oligonucleotides having a length of less than about 40 to about 50 bases. Such anion-exchange HPLC can be performed as an integrated function of the ABI 3900 DNA Synthesizers (see
In various configurations, individual components of assays must meet a minimum yield specification. Such minimum yield specification may be, for example, at least about 60% (w/w), at least about 80% (w/w), at least about 90% (w/w) or at least about 95% (w/w) or greater, expressed as the weight of the desired oligonucleotide divided by the total weight of the synthesis product, multiplied by 100. The particular percent yield set as the minimum yield specification will depend upon the application; however, typically at least about 90% yield is desirable. Low yield synthesis reactions, i.e., reactions producing less than about 40%, less than about 80%, less than about 90% or less than about 95%, can be rejected in some configurations of the present invention.
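The percent yield defined above, and its comparison against a minimum yield specification, can be illustrated with the following sketch. The function names and the example weights are hypothetical; the default of 90% reflects the value described above as typically desirable.

```python
def percent_yield(desired_weight, total_weight):
    """Weight of desired oligonucleotide over total product weight, times 100."""
    if total_weight <= 0:
        raise ValueError("total weight of synthesis product must be positive")
    return 100.0 * desired_weight / total_weight


def meets_yield_spec(desired_weight, total_weight, minimum_percent=90.0):
    """True if the synthesis reaches the chosen minimum yield specification."""
    return percent_yield(desired_weight, total_weight) >= minimum_percent


# Hypothetical weights in arbitrary but consistent units.
accepted = meets_yield_spec(9.2, 10.0)   # 92% against a 90% specification
rejected = meets_yield_spec(8.5, 10.0)   # 85% against a 90% specification
```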
In certain aspects, the synthesis yield testing can be performed for each of the probes and primers of every assay.
Analytical Quality Control:
In various configurations, each of the probes and primers can be individually tested to ensure the accuracy of its sequence. Any method known in the art can be used to validate the sequence accuracy of the probes and primers. One such method used in some configurations of the present invention and which is adaptable to high-throughput manufacturing and validation is mass spectrometry. Mass spectrometry is an analytical tool that detects ions and measures their mass-to-charge ratio. Ionization techniques such as matrix assisted laser desorption-ionization and electrospray ionization allow the measurement of high molecular weight molecules such as DNA. Matrix assisted laser desorption-ionization coupled with time of flight mass spectrometry (referred to as MALDI-TOF) allows high-throughput analysis of DNA molecules. One such mass spectrometer suitable for use in analytical quality control is the commercially available ABI Voyager-DE™ STR MALDI-TOF Mass Spectrometer (Applied Biosystems, Foster City, Calif.).
In various configurations, the DNA sample can be mixed with an organic matrix and co-crystallized on a sample plate. A fixed, pulsed laser beam then irradiates the sample plate. The matrix absorbs and transfers the laser energy to the DNA to produce an ionized gaseous phase. An electric field then accelerates the ionized DNA molecules according to their mass such that molecules of smaller mass are accelerated faster than molecules of larger mass. Thus, the mass of the DNA molecule can be determined.
The measured mass is then compared to the calculated mass of the probe or primer. The probe or primer must be of the same mass as calculated, or within an acceptable deviation, to pass specification. Acceptable deviations in various configurations of the present invention can be, for example, such that the actual mass of the DNA molecule is not more than about 1%, not more than about 2%, not more than about 5%, not more than about 10% or not more than about 20% greater or lesser than the calculated mass.
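The comparison of measured and calculated mass can be sketched as follows. The function names, the default tolerance of 1%, and the example masses are assumptions for illustration; the calculated mass is taken as an input rather than derived from the sequence.

```python
def mass_deviation_percent(measured_mass, calculated_mass):
    """Absolute deviation of the measured mass, as a percent of calculated."""
    if calculated_mass <= 0:
        raise ValueError("calculated mass must be positive")
    return 100.0 * abs(measured_mass - calculated_mass) / calculated_mass


def passes_mass_check(measured_mass, calculated_mass, tolerance_percent=1.0):
    """True if the measured mass is within the acceptable deviation."""
    deviation = mass_deviation_percent(measured_mass, calculated_mass)
    return deviation <= tolerance_percent


# Hypothetical masses in daltons for a primer-length oligonucleotide.
ok = passes_mass_check(7010.0, 7000.0)                        # about 0.14% off
off = passes_mass_check(7500.0, 7000.0, tolerance_percent=5)  # about 7.1% off
```

Because the deviation is taken as an absolute value, a measured mass that is greater or lesser than the calculated mass by the same amount is treated identically.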
In some configurations, this analytical quality control test can be performed for every assay.
In various configurations of the present invention, functional testing can be performed on the assays as well; however, different functional tests can be performed on the SNP assays and gene expression assays in some configurations.
In various configurations, all human SNP assays can be tested on samples from a panel of at least 10 to 20 human genomic DNA samples. A sequence detection system capable of performing the assays of the present invention can be used. In some configurations, the sequence detection system can be capable of performing fluorogenic 5′ nuclease chemistry assays using TaqMan® probes. One suitable sequence detection system is the ABI Prism® 7900HT Sequence Detection System (Applied Biosystems, Foster City, Calif.).
Reference human genomic samples can be from a mixed ethnic group or from a single ethnic group and samples can be obtained from human cell repositories such as the Coriell Cell Repositories (Coriell Institute for Medical Research, Camden, N.J.).
In some configurations, a universal master mix, including test probes and primers, can be added directly to plates of dry or fresh DNA samples using standard robotics. Plates can be sealed and cycled using a standard thermal cycler such as, for example, Applied Biosystems Dual 384-well GeneAmp® PCR System 9700 thermal cycler (Applied Biosystems, Foster City, Calif.). Following cycling, plates can be automatically read on the 7900HT Sequence Detector. The availability of thermal cyclers such as the 9700 with automated lid handling can increase throughput by enabling robotics integration for 24-hour unattended operation.
In a two-allele system, TaqMan® probes for each allele can be multiplexed in a single tube, each probe having a different 5′ fluorescent dye. End-point fluorescence can be measured by the 7900HT system and experimental results can be displayed on an allelic discrimination viewer. The discrimination viewer displays fluorescence values of one of the dyes which represents one allele against fluorescence values of the other dye.
Typically four clusters of points, each from a different sample, fall into separate quadrants of a rectangle (
Pseudo-SNPs are a common problem arising from misassemblies, paralogs, or repeat elements. Similar sequences from different regions in the genome may erroneously align due to matching at only a few bases. These differing bases may then be incorrectly assumed to be SNPs. If a pseudo-SNP is genotyped, every sample will appear to be heterozygous since each sample contains both the pseudo-alleles (see
Another problem that can arise is the unexpected clustering of dye intensities as shown in
Although clustering is normally in four quadrants as shown in
Determination of genotype can be done by a trained observer in some configurations or by an automated system in others (see for example, Mein et al., Genome Research 10:330-343, 2000).
In various configurations, an assay can be considered to meet specifications if it amplifies at least one cluster that is distinguishable from the No Template Control (NTC). Excess scattering of clusters such that genotype cannot be distinguished results in the assay not being considered to meet specifications. In some configurations, this test can be performed for both custom assay products and stock assay products.
Gene Expression Tests:
In various configurations, gene expression assays can be tested against both a genomic DNA (gDNA) template and a no-template control (NTC).
In some configurations, gene expression assays can be performed in a two-step RT-PCR reaction. In the reverse transcription (RT) step, cDNA can be reverse transcribed from total RNA samples using a reverse transcriptase. Commercially available RT kits can be used such as the High-Capacity cDNA Archive Kit (Applied Biosystems, Foster City, Calif.). The PCR step uses a DNA polymerase. The process involves preparing the master mix from the kit, preparing the cDNA archive reaction plate and performing the reverse transcription. The RT reaction can be performed in any suitable system such as, for example, the Applied Biosystems Dual 384-well GeneAmp® PCR System 9700 thermal cycler (Applied Biosystems, Foster City, Calif.) or the ABI PRISM™ 6700 Automated Nucleic Acid Workstation (Applied Biosystems, Foster City, Calif.). Target amplification, using cDNA as the template, can be the second step in the gene expression assays in various configurations of the present invention. In this step, AmpliTaq Gold DNA polymerase from the TaqMan® Universal PCR Master Mix (Applied Biosystems, Foster City, Calif.) can be used. This amplifies target cDNA synthesized from the RNA sample, using sequence-specific primers and TaqMan® MGB probe from the Gene Expression Assay Mix (Applied Biosystems, Foster City, Calif.). The PCR step must be performed on an ABI PRISM® Sequence Detection System such as, for example, the 7900HT Sequence Detection System. Performing the PCR step for singleplex assays in 384-well format may involve configuring the sequence detector plate document, preparing the reaction plate and running the plate.
In various configurations, assays to multi-exon genes (denoted with an “_m” in the Assay ID) must show no amplification against gDNA, while assays to single-exon genes (denoted with an “_s” in the Assay ID) will amplify the target in gDNA.
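The gDNA rule stated above can be expressed as a simple check, sketched below. The function names and the example Assay IDs are hypothetical, and treating any amplification in the no-template control as a failure is an assumption of this sketch rather than a stated requirement.

```python
def expected_gdna_amplification(assay_id):
    """Return whether amplification against gDNA is expected for an assay.

    Multi-exon assays ("_m" in the Assay ID) should not amplify gDNA;
    single-exon assays ("_s" in the Assay ID) should.
    """
    multi = "_m" in assay_id
    single = "_s" in assay_id
    if multi == single:
        raise ValueError(f"cannot classify assay ID: {assay_id!r}")
    return single


def passes_gdna_test(assay_id, amplified_in_gdna, amplified_in_ntc):
    """Compare observed amplification with the expectation for the assay."""
    if amplified_in_ntc:
        return False  # assumption: any signal in the NTC fails the test
    return amplified_in_gdna == expected_gdna_amplification(assay_id)


# Hypothetical Assay IDs used only to exercise the rule.
multi_exon_ok = passes_gdna_test("ASSAY0001_m1", False, False)
single_exon_ok = passes_gdna_test("ASSAY0002_s1", True, False)
```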
In various configurations of the present invention, SNP assays and Gene Expression assays undergo validation testing.
In some configurations, for all human SNP assays, each target can be run against a large number of human genomic DNA samples to verify functionality, judge the “robustness” of the assay and validate an allele frequency. One such group of 90 human genomic samples has been obtained from both Caucasian and African American populations. Genomic DNA samples of 45 African Americans and 45 Caucasians can be obtained from the Coriell Human Variation Collection (Coriell Cell Repositories, Coriell Institute for Medical Research, Camden, N.J.).
In various configurations of the present invention, SNP assays can be performed as described above. This validation process provides allele frequency data and confirms assay performance. In some configurations, to pass validation, SNPs must have a minimum defined allele frequency to provide a meaningful assay. In various configurations, the minimum allele frequency can be at least about 8%, at least about 10%, at least about 12%, at least about 15%, at least about 18% or at least about 20% or more or at any desired allele frequency. This test verifies that the SNP can be a true SNP, that the allele frequency meets the minimum defined allele frequency and that the system performs in a manner suitable for a viable assay.
Thus, in various configurations, manufactured and validated products can exhibit low background signal, adequate signal generation, allele signal specificity and at least 2 allele clusters.
In some configurations, only assays that yield a minimum allele frequency and produce a robust assay may be manufactured for sale.
Gene Expression Tests:
In various configurations of the present invention, for gene expression assays, each target can be run against one or more pools of human cDNA produced from RNA to verify functionality. In certain aspects, at least about 10 human cDNA samples comprise such pools.
In various configurations, functional testing of custom assays can be performed in accordance with the procedures described above. For example, a primary template useful in some configurations can be the Universal Human Reference RNA (Stratagene, La Jolla, Calif.); while useful secondary templates include Discovery Line™ pre-isolated human total RNA (Invitrogen, Carlsbad, Calif.) from brain, heart, kidney, liver, and lung, and a pool of the 5 tissues; and Raji-Control human Total RNA (Applied Biosystems, Foster City, Calif.).
As seen in the tables in this example, approximately 98% of the manufactured assays gave positive results in a functional use test in at least one tissue sample tested.
In some configurations, only assays that yield amplification on the human cDNA pools within the specifications, i.e., showing expression against sample tissue RNA references, may be manufactured for sale.
Overall manufacturing and validation systems for some configurations of the present invention are illustrated for SNP and Gene Expression assays in
As shown in
As shown in
In various configurations of the present invention, customers can be informed of assays accepted for order, and of final shipment of assays passing quality control (QC) functional testing. Depending at least in part upon the capacity of the supplier's manufacturing and testing facilities, delivery of assays together with associated information and materials (the “assay kit” 308, a non-limiting example of which is illustrated in
In some configurations, the assay probes include a non-fluorescent dye that can be configured to reduce background fluorescence and increase quenching efficiency. Thus, such assays can be particularly suitable and provide a substantial benefit to consumers using PCR sequence detection systems such as the Applied Biosystems PRISM® 7900HT Sequence Detection System, enabling high-throughput SNP genotyping in which approximately 250,000 genotypes per day can be analyzed, each needing only a small amount of sample DNA. In some configurations, MGB technology can be utilized with non-fluorescent quenchers. Shorter MGB probes provided in these configurations provide more flexibility in assay design, yielding more robust assays as well as a larger number of assays for more targets. The non-fluorescent quencher eliminates background fluorescence, and improves sensitivity.
In various configurations, components of SNP assays (human or non-human) supplied by the supplier may include one or more of the following:
One TaqMan® MGB 6-FAM™ dye-labeled probe;
One TaqMan® MGB VIC™ dye-labeled probe; and/or
Two target-specific primers configured to distinguish between two alleles.
The two TaqMan® MGB probes can be configured to distinguish between two alleles. Each TaqMan® MGB probe contains, in some configurations:
a reporter dye at the 5′ end of each probe, wherein a VIC™ dye is linked to the 5′ end of the allele 1 probe and a 6-FAM™ dye is linked to the 5′ end of the allele 2 probe;
an MGB, which increases the melting temperature (Tm) without increasing probe length, thereby permitting the design of shorter probes. The use of MGBs results in greater differences in Tm values between matched and mismatched probes, which produces more accurate allelic discrimination; and
an NFQ at the 3′ end of the probe. Because the quencher does not fluoresce, various sequence detection systems, including those of Applied Biosystems, can measure reporter dye contributions more accurately.
During PCR, each TaqMan® MGB probe anneals specifically to a complementary sequence between the forward and reverse primer sites. When the probe is intact, the proximity of the reporter dye to the quencher dye results in suppression of the reporter fluorescence primarily by Förster-type energy transfer.
AmpliTaq Gold® DNA polymerase cleaves only probes that are hybridized to the target. (AmpliTaq Gold® DNA Polymerase is a thermostable polymerase complexed with a non-thermostable polymerase inhibitor, for example, an antibody directed against the polymerase. The combination has its activity inhibited until it is heated.)
Cleavage separates a reporter dye from the quencher dye, which results in increased fluorescence by the reporter. The increase in fluorescence signal occurs if the target sequence is complementary to the probe and is amplified during PCR. Thus, the fluorescence signal generated by PCR amplification indicates which alleles are present in the sample.
A correlation exists between fluorescence signals and sequences present in a sample, in various configurations of the present invention. More particularly, in various configurations, a VIC dye fluorescence without a 6-FAM dye fluorescence indicates a homozygosity for allele 1. A 6-FAM dye fluorescence without a VIC dye fluorescence indicates a homozygosity for allele 2. Fluorescence of both dyes indicates an allele 1-allele 2 heterozygosity.
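The correlation between dye fluorescence and genotype described above can be sketched as a simple decision rule. The use of a single fixed signal threshold for both dyes, the function name, and the example signal values are assumptions made for illustration; an actual sequence detection system may instead assign genotypes by clustering end-point fluorescence values.

```python
def call_genotype(vic_signal, fam_signal, threshold):
    """Assign a genotype from end-point VIC and 6-FAM reporter signals.

    VIC reports allele 1 and 6-FAM reports allele 2. A dye is treated as
    "fluorescing" when its signal exceeds the threshold (an assumption).
    """
    vic_on = vic_signal > threshold
    fam_on = fam_signal > threshold
    if vic_on and fam_on:
        return "heterozygous allele 1/allele 2"
    if vic_on:
        return "homozygous allele 1"
    if fam_on:
        return "homozygous allele 2"
    return "no call"  # e.g., a no-template control or failed amplification


# Hypothetical normalized signals and threshold.
calls = [
    call_genotype(2.5, 0.2, threshold=1.0),
    call_genotype(0.3, 2.1, threshold=1.0),
    call_genotype(1.8, 1.9, threshold=1.0),
    call_genotype(0.1, 0.2, threshold=1.0),
]
```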
Also in various configurations, components of gene expression assays supplied by the supplier include one or more of the following:
One TaqMan® MGB 6-FAM™ dye-labeled probe; and/or
Two target-specific primers.
In some configurations, custom assays combine two PCR primers and one FAM™ dye-labeled, TaqMan® MGB probe in a single-tube, ready-to-use, 20× mix (250 µL). Various configurations can be designed and optimized for two-step RT-PCR using TaqMan® Universal PCR Master Mix and complementary DNA (cDNA). An AB High Capacity cDNA Archive Kit (P/N 4322171) for converting RNA to cDNA, for example, can be used. Assays may also be tested for use on the ABI PRISM® 7900HT, 7700, and 7000 Sequence Detection Systems. In various configurations, products can be formulated at preselected universal concentration conditions (for example, final reaction concentrations of 900 nM primer and 250 nM probe) and configured to run using preselected universal thermal cycling parameters. As a result, in a variety of configurations, multiple assays can be run on a single plate, laboratory methods can easily be transferred to other researchers, and gene expression results can be directly compared to those of other researchers and other labs. In some configurations, assays can be configured for running in singleplex format with external endogenous controls run in separate wells on a plate.
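The arithmetic implied by a 20× mix can be illustrated as follows. The concentration factor, tube volume and final concentrations are taken from the example above; the function names and the 20 µL reaction volume are assumptions, and the sketch ignores pipetting losses.

```python
MIX_FOLD = 20            # the mix is supplied at 20x the final concentration
FINAL_PRIMER_NM = 900.0  # example final primer concentration in the reaction
FINAL_PROBE_NM = 250.0   # example final probe concentration in the reaction


def stock_concentration_nm(final_nm, fold=MIX_FOLD):
    """Concentration in the supplied mix needed to reach the final value."""
    return final_nm * fold


def mix_volume_ul(reaction_volume_ul, fold=MIX_FOLD):
    """Volume of the concentrated mix to add to one reaction."""
    return reaction_volume_ul / fold


def reactions_per_tube(tube_volume_ul, reaction_volume_ul, fold=MIX_FOLD):
    """Whole reactions obtainable from one tube of concentrated mix."""
    return int(tube_volume_ul // mix_volume_ul(reaction_volume_ul, fold))


primer_stock_nm = stock_concentration_nm(FINAL_PRIMER_NM)  # 18000 nM = 18 uM
probe_stock_nm = stock_concentration_nm(FINAL_PROBE_NM)    # 5000 nM = 5 uM
per_tube = reactions_per_tube(250.0, 20.0)  # hypothetical 20 uL reactions
```

Under these assumptions, each 20 µL reaction takes 1 µL of the mix, so a 250 µL tube supports 250 such reactions.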
Gene expression products ordered from stock may be used in RT-PCR protocols in configurations in which assays can be optimized for the two-step RT-PCR protocol. To use these products with RNA, the RNA must first be converted to cDNA; an AB High Capacity cDNA Archive Kit (P/N 4322171) or other suitable conversion product can be used for this conversion. A one-step protocol may be used in some configurations, such as by using the TaqMan® One-Step RT-PCR Master Mix Reagents Kit Protocol (P/N 4310299).
Stock assays for gene expression provided by some configurations of the present invention can be used for multiplexing. To use in single-plex reactions, users choose an appropriate endogenous control to be run in a separate well. A set of external, endogenous controls can be provided that have the same concentration and labeling (e.g., a TaqMan® MGB probe, labeled with the FAM™ dye) as the gene expression products. For multiplex reactions the endogenous control of choice can be run in separate wells (single-plex) as it does not require time-consuming validation experiments for the user to confirm that there is no PCR competition. However, if users choose to try multiplex experiments, the user can perform an experiment in which a multiplex versus singleplex assay can be performed to confirm that the PCR reactions and relative quantitation calculations can be unaffected by multiplexing.
Stock assays may be delivered with certain sequence information. For example, some configurations provide sequence context information (forward primer location in the RefSeq sequence) and denote which exon-exon junction the assay covers, so that users can get a sense of where the assay is positioned in the transcript. More information can also be provided.
In some configurations, standardized assay designs can be provided for custom assays and/or stock assays, including either universal concentration or uniform thermal cycling parameters, or both, allowing results to be more easily compared with and/or transferred to other researchers and labs. Also, in some configurations, assays can be formulated in a single-tube 20× mix format that is convenient and easy to use, requiring no preparation or clean-up and providing faster time to results.
In some configurations, the manufactured assays can be shipped as homogeneous assays in a single tube format. For example, in at least some configurations, a single tube, ready to use format can be provided that is suitable for immediate use on ABI PRISM® Sequence Detection System platforms for one or more applications.
In various configurations, an E-datasheet, or Assay Information File, can be provided with an assay. The E-datasheet or Assay Information File can be, in some configurations, an electronic file or data electronically stored on a data storage medium 318. This file or data can contain, for example, information on one or more assays, information on one or more polynucleotide sequences, an alphanumeric sequence representing a polynucleotide sequence, or the like. Alternatively, or in addition, a print copy or a printout of the E-datasheet or E-datasheet information can be provided.
In some configurations, a printed copy of a data sheet can also be provided, containing information about each assay. This information may include, among other things, the position of each assay in the plate rack. Some configurations provide, either in place of, or in addition to the printed copy of the data sheet, a CD-ROM with one or more data files recorded thereon. The data files may include, for example, any or all of the following files: an electronic assay workbook, including data sheet(s) and shipped worksheet(s); an electronically readable and/or printable copy of instructions for assay protocol; an electronically readable and/or printable copy of the order request as well as the submission request protocol; and/or an electronically readable copy of a product insert.
In some configurations of the present invention, a data sheet and/or an electronic assay workbook can be provided with custom assays. In some configurations, an electronic assay workbook can be included with each order of up to 92 assays. In various configurations, the workbook file name includes the number on the bar code on the plate for easy correlation. In some configurations, the workbook contains two worksheets, namely, the “data sheet” worksheet and the “shipped” worksheet. Also in some configurations, the workbook can be a spreadsheet file, such as a Microsoft® Excel® file, which may contain macros and/or be password protected. Cells of the workbook can be copied and pasted into a new worksheet and modified in the new worksheet. A printed copy of the datasheet from the electronic file may be included with a shipment of assays ordered by design. The datasheet includes a correlation of the 2-D barcodes on the tubes to the corresponding assay names and primer and probe specific information.
In some configurations, a datasheet included with an order includes all of the following information: an identification of the assay in each tube; assay names; which target site was used, if the requester submitted a sequence record that included more than one target site; locations of each tube in the assay rack; sequences of the primers and probes; and concentrations (µM) of primers and probes. Other configurations do not necessarily include all of this information and may include either more or less information.
For example, in some configurations, data sheets have the following columns:
The shipped worksheet can be provided to enable a user of the assays to determine whether the tubes are in the same positions in the plate rack as when the assays were shipped. For example, in some configurations, the following columns appear in the shipped worksheet:
Usage of Assays:
The 5′ nuclease allelic discrimination method used in TaqMan® platforms utilized by some configurations of the present invention reduces human labor in the laboratory. Unlike other methods that may require hybridization to chips or separate allele reactions, TaqMan® PCR preparation requires only adding a pre-made master mix containing buffer, deoxyribonucleotides, and DNA polymerase to the sample template and SNP-specific oligonucleotides.
TaqMan® chemistry for SNP genotyping assays employs two allele-specific probes for each SNP in addition to the common PCR primers. Each probe contains a 5′ fluorescent dye, such as, for example, VIC or FAM, to detect the presence of the specific allele, and a 3′ quencher to absorb fluorescence when the allele is not present. The result can be much like that of any microarray or molecular beacon technology: one of the dyes will fluoresce for homozygous alleles and both dyes will fluoresce for heterozygotes.
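The two-dye readout described above can be illustrated with a minimal sketch. It assumes that each well yields one normalized intensity per dye and that a single fixed threshold separates signal from background. The function name, the threshold value, and the pairing of VIC with "allele 1" and FAM with "allele 2" are hypothetical choices made for illustration only and are not taken from the description.

```python
def call_genotype(vic: float, fam: float, threshold: float = 1.0) -> str:
    """Map two dye intensities to a genotype label.

    Assumption: VIC reports allele 1 and FAM reports allele 2, and a
    dye counts as fluorescing when its intensity reaches the threshold.
    """
    vic_on = vic >= threshold
    fam_on = fam >= threshold
    if vic_on and fam_on:
        return "heterozygous"          # both dyes fluoresce
    if vic_on:
        return "homozygous allele 1"   # only VIC fluoresces
    if fam_on:
        return "homozygous allele 2"   # only FAM fluoresces
    return "no call"                   # neither dye reached the threshold


print(call_genotype(2.5, 0.1))
print(call_genotype(2.0, 2.2))
```

A production caller would more likely work on clusters of wells than on a fixed threshold; the sketch only shows the mapping from dye signals to the three genotype classes.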
In some configurations, ABI Prism® 7700 and ABI Prism® 7900HT Sequence Detection Systems available from Applied Biosystems may be used for endpoint analysis of 96 and 384 well plates, respectively, to record the fluorescence of the PCR product of each well. The latter may be bundled with an 84 plate robot for long term hands-free automation.
About 26 dual plate GeneAmp® PCR System 9700 thermal cyclers can be used in some configurations to keep one 7900HT supplied with an adequate number of PCR plates for continuous operation. However, different quantities and/or types of thermal cyclers may be used in some configurations, for example, if continuous operation and/or greater or lesser capacity is desired. Also in some configurations, barcoding can be used to record information on the hardware used, plates, assay probes and primers, technicians, and times, in order to evaluate performance.
In some configurations, assays themselves can be configured to be stored at between −15 and −25° C., but the number of freeze-thaw cycles can be minimized by storing multiple aliquots of the working stocks. In addition, the fluorescent probes can be protected by avoiding direct exposure of the assays to light.
In some configurations, assays can be diluted and aliquoted for routine use to minimize freeze-thaw cycles and to protect assay mixes from exposure to light. To dilute assay mixes, 40× or 80× SNP assay mixes can be diluted to a 20× working stock with 1× TE. The 1× TE can be 10 mM Tris-HCl, 1 mM EDTA pH 8.0, and made using DNase-free, sterile-filtered water. Multiple aliquots of the assay mixes may then be stored at −15 to −25° C.
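The dilution step can be worked through with the standard relation C1 × V1 = C2 × V2. The sketch below is illustrative only: the function name and the 100 µL stock volume are hypothetical, and only the 40×, 80×, and 20× concentrations come from the description.

```python
def diluent_volume(stock_fold: float, working_fold: float, stock_volume_ul: float) -> float:
    """Volume of 1x TE (in microliters) to add to a stock to reach the working stock.

    Uses C1 * V1 = C2 * V2, so the final volume is V1 * C1 / C2 and the
    diluent to add is the final volume minus the starting volume.
    """
    if working_fold <= 0 or stock_fold < working_fold:
        raise ValueError("stock must be at least as concentrated as the working stock")
    final_volume_ul = stock_volume_ul * stock_fold / working_fold
    return final_volume_ul - stock_volume_ul


# Hypothetical 100 uL of stock in each case.
print(diluent_volume(40, 20, 100))  # 100.0 uL of 1x TE for a 40x mix
print(diluent_volume(80, 20, 100))  # 300.0 uL of 1x TE for an 80x mix
```

In other words, a 40× mix is diluted 1:1 with 1× TE, and an 80× mix is diluted with three volumes of 1× TE, to give the 20× working stock.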
A manual method may be used by a user of the assays to validate each tube position in the rack plate. In these configurations, the rack plate position and assay name on the tube label can be compared with the values in the well location and set ID columns in the data sheet worksheet. (This “validation” can be different from the validation of assays, in that validation of each tube position in a plate rack can be performed by the user, and merely confirms that the tubes are in positions matching the “shipped” worksheet. If the tubes are not in the correct position, they may be rearranged to match the worksheet. The operational quality of the assays contained within the tubes can be validated at the supplier's factory.)
In some configurations, an automated method can be used by a user of the assays to validate each tube position in the rack plate. This method includes scanning the plate and tubes using a 2-D bar code reader, and executing a plate validation spreadsheet macro (for example, a Microsoft® Excel® macro). In some configurations, to scan the plate and tubes, the plate rack can be placed on the 2-D bar code reader in a standard orientation. For example, tube position “A1” can be placed in the top left corner of the reader. The 1-D bar code on the plate rack can then be scanned. The bar code reader can then be configured, if necessary, to read positions in one column and to read bar codes in a column next to the positions column. Next, the plate rack can be scanned and the results can be saved to a directory that can be accessed from the computer containing the electronic file. In some configurations, the scanning results can be saved as a tab-delimited file.
To validate, the “shipped” worksheet can be opened in the spreadsheet, macros can be enabled, and the validation macro can be run. In some configurations utilizing software that can generate a text file, the validation can be performed by opening the electronic workbook, clicking a mouse on a “shipped” tab to view the worksheet containing the validation macro, clicking on the “validate” button to start the plate validation macro, and, when an “import plate scan” dialog box is presented, selecting “browse” to locate the file from the 2-D bar code scan. After “browse” is selected, the file that resulted from the 2-D bar code scan can be selected and imported into a new worksheet, which, in some configurations, can be called “received”. The macro then compares each bar code and its position in the plate rack with the corresponding bar code in the “shipped” worksheet (i.e., the value in the “Vial ID” column). The macro then enters the result in a “validation” column in the “shipped” worksheet. The results for each entry may either be “OK” (or any entry understood as indicating a match) or “ERROR” (or any other entry understood as indicating a non-match). Next, a “shipment validation” dialog box alerts that the validation is complete, and the user clicks “OK” to dismiss the dialog box.
Plate validation errors indicate that the tubes may not be in the same position as they were shipped by the supplier to the requester. The user can resolve plate validation errors by rearranging the tubes to match the “shipped” worksheet. The user can then rescan the plate and execute the validation macro again to validate the plate.
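The comparison performed by the plate validation macro can be sketched outside a spreadsheet as follows. This is a minimal illustration, not the macro itself: it assumes the tab-delimited scan file holds one row per tube with the position in the first column and the bar code in the second, and the function names and sample bar codes are hypothetical.

```python
import csv
import io


def parse_scan(text: str) -> dict[str, str]:
    """Parse a tab-delimited scan: one 'position<TAB>barcode' row per tube."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def validate_plate(shipped: dict[str, str], received: dict[str, str]) -> dict[str, str]:
    """Return 'OK' or 'ERROR' for every position listed in the shipped worksheet."""
    return {
        position: "OK" if received.get(position) == vial_id else "ERROR"
        for position, vial_id in shipped.items()
    }


# Hypothetical shipped worksheet (position -> Vial ID) and a scan in which
# the tubes at A2 and A3 have been swapped.
shipped = {"A1": "0001", "A2": "0002", "A3": "0003"}
scan = "A1\t0001\nA2\t0003\nA3\t0002\n"
results = validate_plate(shipped, parse_scan(scan))
print(results)  # {'A1': 'OK', 'A2': 'ERROR', 'A3': 'ERROR'}
```

Rearranging the tubes at A2 and A3 and rescanning would yield "OK" for every position, which corresponds to the rescan-and-revalidate step described above.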
Laboratory Information Management System
In various configurations of the present invention, oligo sets may be supplied in one tube, or in 96 well microtiter plates that can be already barcoded, as described above, to facilitate use of a laboratory information management system employed by the user of the oligo sets. In various configurations, supplied oligos can be scanned into the database, inventory can be tracked, and a nightly report can be generated to notify lab managers of sets ready to be run the following day.
In some configurations of the present invention, the samples supplied to the requester can be arrayed in 96 or 384 well plates and a map of the plate entered into the database. To conserve clinical DNA, various configurations of the present invention supply only SNPs that pass validation and meet the required population frequencies on the clinical samples.
In various configurations of the present invention, an assay can be prepared for a given run using the probe and primer set and a TaqMan® Universal PCR Master Mix. A robot, such as a Protedyne robot, prepares a daughter sample plate by adding the assay mixture to the plate wells. The plate can be thermal cycled using, for example, a GeneAmp® 9700. Each step in the assay performance can be logged in the LIMS to allow software to automatically trigger and create a sequence detection system (SDS) binary file that can be used by the 7900HT. This procedure allows laboratory staff to simply place the plate into a stacker of one of the 7900HTs and select a pre-created file in the robot program. In various configurations, an SDS file need not be manually created using SDS software.
In some configurations, the scanned data file from the 7900HT can be recognized by software that passes it to multicomponent analysis software. This analysis software creates a multicomponent file containing the dye intensities of each well and subsequently passes the file to an autocaller program. As discussed in more detail below, the autocaller program identifies the genotype clusters and assigns appropriate calls to the wells. In some configurations, the putative genotypes can be loaded into the database for either manual review or immediate release, depending on the confidence of the autocaller.
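The routing of putative genotypes by autocaller confidence can be sketched as below. The description does not state how confidence is expressed or what cutoff is used, so the 0-to-1 scale, the 0.95 cutoff, and the function name are assumptions made purely for illustration.

```python
def route_call(confidence: float, release_cutoff: float = 0.95) -> str:
    """Decide where a putative genotype goes after autocalling.

    Assumption: confidence is a score between 0 and 1, and calls at or
    above the cutoff are released without manual review.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "immediate release" if confidence >= release_cutoff else "manual review"


# Hypothetical autocaller output: (well, genotype, confidence).
calls = [("A1", "AA", 0.99), ("A2", "AG", 0.80)]
queues = {well: route_call(conf) for well, _genotype, conf in calls}
print(queues)  # {'A1': 'immediate release', 'A2': 'manual review'}
```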
In various configurations, the 7900HT and multicomponent analysis software can be controlled by a combination of automated software and triggers which allow the anticipation and detection of the steps in the laboratory performance of assays, thereby allowing continuous scanning by the 7900HT without having to manually create, identify, locate, analyze, call genotypes, or export data files.
In some configurations of the present invention, a laboratory information management system can also be used in the post-manufacturing validation process. Thus, an automated computer system can be provided to support high throughput SNP genotyping that satisfies the increasing demand that disease association studies are placing on current genotyping facilities. This system provides target SNP selection, automated oligo design, in silico assay quality validation, laboratory management of samples, reagents and plates, automated allele calling, optional manual review of autocalls, regular status reports, and linkage disequilibrium analysis. In some configurations, it has been found practical to generate over 2.5 million genotypes from more than 10,000 SNPs, with a target capacity of at least 10,000 genotypes per machine per hour utilizing only limited human intervention and laboratory hardware.
In various configurations, information gathered throughout the genotyping process can be stored in a central database, which can be divided into project management and laboratory schemas.
The project schema facilitates management of abstract entities such as SNP, sample donor, or genotype. For example, projects can be created by indicating an intended customer and loading desired SNP information. The requester determines what SNP is ordered, scanned, considered validated, possibly discarded or re-designed, and delivered to the requester. In various configurations, reports can be generated regarding the current progress of a SNP, failure rate of samples, or allele frequencies per population.
The project management component permits fast data analysis, by allowing efficient phenotype relations to both donors and SNPs. In various configurations, the project schema also has the ability to store haplotypes constructed from specific SNP alleles after analysis. The schema may also track literature references for individual SNPs and donors.
In various configurations, the laboratory component provides tracking details of the physical steps performed in the laboratory, and this can be mirrored in the project management component. Samples can be received, barcoded, and placed into plates and freezers. Oligos can be received, diluted, assigned into sets, and also placed into freezers. Plates can be arrayed with particular samples and oligos for specific projects. Each well can be scanned (and, in some configurations, re-scanned many times) to provide high accuracy. However, in various configurations, only a ‘final’ genotype is copied to the project management component where it may eventually be delivered to the customer.
An advantage of having common but separated partitions of the project management and laboratory components is that the laboratory space provides a tracking environment in which experiments can be re-arrayed, rerun, and reviewed multiple times, whereas the project management component remains uncluttered with details as analysis requires a compact schema designed for speed and clarity. This integration of LIMS and data analysis provides for segregated storage to satisfy each schema's different requirements, while keeping the data in one repository for the ability to track an individual genotype's entire history.
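The separation between the two partitions can be illustrated with a small data-structure sketch. It assumes that the laboratory side keeps one record per scan of a well and marks at most one scan per sample and SNP as final. The class names, field names, and the `is_final` flag are hypothetical and do not reflect the actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WellScan:
    """Laboratory-side record: one scan of one well (a well may be scanned repeatedly)."""
    plate_barcode: str
    well: str
    sample_id: str
    snp_id: str
    genotype: str
    is_final: bool = False  # assumption: set once a call is accepted


@dataclass(frozen=True)
class ProjectGenotype:
    """Project-side record: the compact result used for analysis and delivery."""
    sample_id: str
    snp_id: str
    genotype: str


def publish_final(scans: list[WellScan]) -> list[ProjectGenotype]:
    """Copy only the scans marked final into the project management component."""
    return [ProjectGenotype(s.sample_id, s.snp_id, s.genotype) for s in scans if s.is_final]


# Hypothetical history: the same well scanned twice, with the rescan accepted.
history = [
    WellScan("P001", "A1", "donor-1", "snp-1", "no call"),
    WellScan("P001", "A1", "donor-1", "snp-1", "AG", is_final=True),
]
published = publish_final(history)
print(published)
```

The laboratory records retain the full scan history for a genotype, while the project records carry only the single result needed for analysis.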
The database schema also supports large scale resequencing laboratories by adding relatively few tables, thereby combining SNP discovery, validation, and genotyping into one central repository.
As various changes could be made in the above methods and compositions without departing from the scope of the inventions, it is intended that all matter contained in the above description be interpreted as illustrative and not in a limiting sense. Where examples are recited herein, such examples are intended to be non-limiting. Also as used herein, unless otherwise explicitly stated, the terms “a,” “an,” “the,” “said,” and “at least one” are not intended to be limited in number to “one,” but rather are intended to be read as encompassing “more than one” (i.e., a plurality) as well.
The description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the invention. Such variations are not to be regarded as a departure from the spirit and scope of the invention.
All references cited herein are hereby incorporated by reference in their entireties. U.S. application, Attorney Docket No. 4797 (5010-022-12) entitled “Methods of Validating SNPs and Compiling Libraries of Assays”, inventors De La Vega, Francisco et al., filed Jan. 2, 2003 and U.S. application, Attorney Docket No. 4797 (5010-002-13) entitled “Single-tube, Ready to Use Assay Kits, and Methods Using Same”, inventors De La Vega, Francisco et al., filed Jan. 2, 2003 are both hereby incorporated by reference in their entireties.