The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive of genetic variation within and across different species, developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name implies a collection of only one class of polymorphism (single nucleotide polymorphisms, or SNPs), the database in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI's collection of publicly available nucleic acid and protein sequences.
In 2017, NCBI stopped support for all non-human organisms in dbSNP. As of build 153 (released in August 2019), dbSNP had amassed nearly 2 billion submissions representing more than 675 million distinct variants for Homo sapiens.
Originally, dbSNP accepted submissions for any [[organism]] from a wide variety of sources, including individual research laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, other SNP databases (e.g. the SNP Consortium, [[HapMap]]), and private businesses. On September 1, 2017, dbSNP stopped accepting non-human variant data submissions, and two months later its interactive websites and related NCBI services stopped presenting non-human variant data. dbSNP now accepts and presents only human variant data.
Every submitted variation receives a submitted SNP ID number (“ss#”). This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; "refSNP cluster"). However, more than one record of a variation will likely be submitted to dbSNP, especially for clinically relevant variations. To accommodate this, dbSNP routinely assembles identical submitted SNP records into a single reference SNP record, which is also a unique and stable identifier (see below).
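The relationship between ss# and rs# identifiers can be sketched in a few lines of Python. This is a minimal illustration, not dbSNP's actual pipeline: the `Submission` fields, the grouping key (chromosome, position, variation class, alleles), and all identifiers are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    ss_id: str          # submitted SNP ID, unique per submission
    chrom: str
    pos: int
    var_class: str      # e.g. "SNP" or "DIP"
    alleles: frozenset  # observed alleles, order-independent


def cluster_submissions(submissions):
    """Group submissions that describe the same variation (assumed key)."""
    clusters = defaultdict(list)
    for sub in submissions:
        key = (sub.chrom, sub.pos, sub.var_class, sub.alleles)
        clusters[key].append(sub.ss_id)
    # Assign hypothetical rs numbers in order of first appearance.
    return {f"rs{i}": ss_ids
            for i, ss_ids in enumerate(clusters.values(), start=1)}


subs = [
    Submission("ss101", "1", 1000, "SNP", frozenset("AG")),
    Submission("ss102", "1", 1000, "SNP", frozenset("GA")),  # same variation
    Submission("ss103", "2", 500, "SNP", frozenset("CT")),
]
ref_snps = cluster_submissions(subs)
print(ref_snps)
```

Here the first two submissions share one hypothetical refSNP cluster, while each keeps its own ss# identifier.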
To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission. Next, the author is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table. Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic [[DNA]], [[cDNA]], [[mitochondrial]] DNA, [[chloroplast]] DNA), and organism.
| Element | Explanation |
| --- | --- |
| Sequence Context (Required) | An essential component of a submission to dbSNP is an unambiguous location for the variation being submitted. dbSNP now minimally requires that you submit variant location as an asserted position on RefSeq or INSDC sequences. |
| Alleles (Required) | Alleles define each variation class. dbSNP defines single nucleotide variants in its submission scheme as G, A, T, or C, and does not permit ambiguous IUPAC codes, such as N, in the allele definition of a variation. |
| Method (Required) | Each submitter defines the methods in their submission as either the techniques used to assay variation or the techniques used to estimate allele frequencies. dbSNP groups methods by method class to facilitate queries using general experimental technique as a query field. The submitter provides all other details of the techniques in a free-text description of the method. |
| Asserted Allele Origin (Required) | A submitter can provide a statement (assertion) with supporting experimental evidence that a variant has a particular allelic origin. Assertions for a single refSNP are summarized and given an attribute value of germline or unknown. |
| Population (Required) | Each submitter defines population samples either as the group used to initially identify variations or as the group used to identify population-specific measures of allele frequencies. These populations may be one and the same in some experimental designs. |
| Sample Size (Optional) | There are two sample-size fields in dbSNP. One field, SNPASSAY SAMPLE SIZE, reports the number of chromosomes in the sample used to initially ascertain or discover the variation. The other sample size field, SNPPOPUSE SAMPLE SIZE, reports the number of chromosomes used as the denominator in computing estimates of allele frequencies. |
| Population-specific Allele Frequencies (Optional) | Frequency data are submitted to dbSNP as allele counts or binned frequency intervals, depending on the precision of the experimental method used to make the measurement. dbSNP contains records of allele frequencies for specific population samples that are defined by each submitter and used in validating submitted variations. |
| Population-specific Genotype Frequencies (Optional) | Similar to alleles, genotypes have frequencies in populations that can be submitted to dbSNP, and are used in validating submitted variations. |
| Individual genotypes | dbSNP accepts individual genotypes from samples provided by donors who have consented to having their DNA sequence housed in a public database (e.g. HapMap or the 1000 Genomes Project). |
| Validation Information (Optional) | Assays validated directly by the submitter through the VALIDATION section show the type of evidence used to confirm the variation. |
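
Two of the rules in the table lend themselves to a small check: single nucleotide alleles are limited to G, A, T, or C, and allele frequencies are computed against a sample size counted in chromosomes. The following Python sketch illustrates both. The function names and the example counts are hypothetical, and the code does not reproduce dbSNP's submission file format.

```python
VALID_SNV_ALLELES = {"A", "C", "G", "T"}


def validate_snv_alleles(alleles):
    """Return True if the list is non-empty and every allele is an
    unambiguous base (no IUPAC ambiguity codes such as N)."""
    return bool(alleles) and all(a in VALID_SNV_ALLELES for a in alleles)


def allele_frequencies(allele_counts):
    """Convert per-allele counts to frequencies.

    The denominator is the total number of chromosomes observed, the
    role the table assigns to the SNPPOPUSE sample size field.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no chromosomes observed")
    return {allele: count / total for allele, count in allele_counts.items()}


ok = validate_snv_alleles(["A", "G"])   # unambiguous bases only
bad = validate_snv_alleles(["A", "N"])  # N is an ambiguity code
freqs = allele_frequencies({"A": 150, "G": 50})  # 200 chromosomes in total
print(ok, bad, freqs)
```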
There are two exceptions to the above merging criteria. First, variations of different classes (e.g. a SNP and a DIP) are not merged. Second, clinically important refSNPs that have been cited in the literature are termed “precious”; a merger that would eliminate such a refSNP is never performed, since it could later cause confusion.
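The two exceptions can be expressed as a guard that runs before any merge. The sketch below is an assumption-laden illustration: the `RefSNP` type, its fields, and the idea that the caller chooses which record survives are all hypothetical, and the underlying merging criteria themselves are not modelled.

```python
from dataclasses import dataclass


@dataclass
class RefSNP:
    rs_id: int
    var_class: str          # e.g. "SNP" or "DIP"
    precious: bool = False  # clinically important and cited in the literature


def can_merge(survivor, retired):
    """Return True if `retired` may be folded into `survivor`."""
    # Exception 1: variations of different classes are never merged.
    if survivor.var_class != retired.var_class:
        return False
    # Exception 2: a merge must not eliminate a precious refSNP.
    if retired.precious:
        return False
    return True


a = RefSNP(100, "SNP")
b = RefSNP(200, "SNP", precious=True)
c = RefSNP(300, "DIP")

print(can_merge(a, b))  # would eliminate the precious record b
print(can_merge(b, a))  # b survives, so the merge is allowed
print(can_merge(a, c))  # different variation classes
```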
The dbSNP is also linked to many other NCBI resources including the nucleotide, protein, gene, taxonomy and structure databases, as well as PubMed, UniSTS, PubMed Central, OMIM, and UniGene.
Errors in dbSNP can hamper candidate gene association studies and [[haplotype]]-based investigations. They may also lead to false conclusions in association studies: every false SNP included in a study adds a hypothesis test, and correcting for the larger number of tests lowers the per-test significance threshold (alpha level). Because false SNPs cannot actually be associated with any trait, the threshold becomes more stringent than a rigorous test of only the true SNPs would require, and the false negative rate increases. Musumeci ''et al.'' (2010) suggested that authors of negative association studies inspect their previous studies for false SNPs (single nucleotide differences, SNDs), which could then be removed from the analysis.
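The multiple-testing argument can be made concrete with a Bonferroni correction. It is used here only as one common example of such an adjustment, since the source does not name a specific method, and the SNP counts are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


alpha = 0.05
true_snps = 1000   # hypothetical number of real SNPs tested
false_snps = 250   # hypothetical number of erroneous entries also tested

# Threshold actually applied when the false SNPs are included.
strict = bonferroni_threshold(alpha, true_snps + false_snps)
# Threshold that would suffice if only the true SNPs were tested.
needed = bonferroni_threshold(alpha, true_snps)

print(f"threshold with false SNPs:     {strict:.2e}")
print(f"threshold with true SNPs only: {needed:.2e}")
```

Under these assumed numbers, a true association whose p-value falls between the two thresholds would be declared non-significant solely because false SNPs inflated the number of tests.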