Abstract

Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.

1. Introduction

Most genetic switches are in the form of sequence motifs that interact with proteins [1]. Position weight matrix or PWM [26] is one of the key bioinformatic tools used extensively in characterizing and predicting motifs in nucleotide and amino acid sequences. The popularity of PWM has been further increased since its implementation as a component in PSI-BLAST [7], which is frequently used to generate PWM for motif characterization and prediction [811].

PWM has been applied extensively in studies of cis-regulatory elements in the genome such as translation initiation sites [12], transcription initiation sites [13], transcription factor binding sites [1422], yeast intron splicing sites [23], whole-genome identification of transcription units [24], and whole-genome screening of transcription regulatory elements [25, 26]. The PWM scores (PWMSs) for individual motifs have been found to be useful as a measure of the motif strength, for example, PWMS for individual splice sites has been used as a proxy of splicing efficiency in eukaryotes [27, 28].

PWM has been used not only as an independent tool for summarizing and predicting sequence motifs, but also as a key component in more advanced bioinformatic algorithms such as the variable-order Bayesian network [29], Gibbs sampler [3032, pp. 113–147] and related algorithms based on the Monte Carlo method [33], MEME [34], and support vector machines [3539]. While PWM has been used mainly to characterize and predict motifs in nucleotide sequences, recent studies have demonstrated its potential in characterizing and predicting functional protein motifs [4042], signal peptides [43], and protein-protein-binding sites [44]. In particular, the method was successful in predicting tyrosine sulfation sites [4548].

A PWM-based sequence analysis involves three types of output: the site-specific frequency distribution, the PWM itself, and PWMS for each input sequence (and optionally PWMS resulting from scanning new sequences with the trained PWM). Here I briefly review the PWM method in the context of motif discovery, followed by a detailed illustration of the Gibbs sampler of which PWM is a key component, and then propose statistical significance tests appropriate for each of the three types of PWM output.

2. PWM in the Context of Motif Discovery Methods

The simplest input for a PWM-based method consists of an aligned set of sequences and the specification of the background (prior) frequencies. The main output of PWM, other than the PWM itself, consists of the site-specific information content and the motif information content [6] as well as PWMS for individual motifs, together with the associated statistical tests.

We first illustrate the PWM method by applying it to the 246 donor splice sites of yeast introns each represented by 5 nucleotide sites on the exon side and 12 nucleotide sites on the intron side (Table 1). The four columns on the left of Table 1 headed by A, C, G, and U are the site-specific counts of nucleotides A, C, G, and U. When all site-specific counts are greater than zero, each element in the PWM, designated by (where , and 4 corresponding to A, C, G, and U, respectively, and is site index), is computed as where is the background frequency of nucleotide , and is the site-specific nucleotide frequency for nucleotide at site (e.g., in Table 1). Plotting these site-specific values graphically over sites yields the sequence logo [49, 50]. The PWM score (PWMS) for a particular motif is computed as where is the length of the motif which equal 17 for our example shown in Table 1.

Note that PWMS is the logarithm of a likelihood ratio, or log-odds. Given a 17 mer, say, = ACGGTACCACGTAAGTT, we have two hypotheses. The first hypothesis is that the 17 mer belongs to a motif, constrained to have specific nucleotides at specific sites (), and the second is that each site in the 17 mer is sampled from a nucleotide pool with no site-specific constraints (). The likelihoods of observing sequence , given the two different hypotheses, are specified, respectively, as

This leads to the following that is identical to (2):

2.1. Specification of the Background Frequencies

The background frequencies () have been specified in three different ways in previous publications. The first is simply to assume equal background frequencies [51, 52] in characterizing splice sites with PWM. This is equivalent to the classic sequence logo method for graphic display of site patterns [49] which does not take background frequencies into consideration. In sequences with biased nucleotide frequencies, equal values will generate a false site pattern when there is in fact no pattern. For example, the AT-biased background genome in the yeast implies that and will be greater than and on average even when the sequences contain no site-specific information. Similarly, the classic sequence logo will display A and T more prominently than C and G even when the sequences of interest contains no site-specific information.

The second approach to specify is to compute it from the input sequences. As such, in our example, can be computed from the four columns headed by A, C, G, and U on the left side of Table 1. This approach also has a problem. Suppose a certain motif is a poly-U sequence, and all input sequences are “UUUUUUUU”. This will generate background nucleotide frequencies with and . Note that the site-specific frequencies, given the input sequences all being “UUUUUUUU”, are and . So the resulting PWM would then suggest that the motif is not informative, which is contrary to our intuition, that is, a stretch of UUUUUUUU conserved across a set of aligned sequences is likely to be biologically informative.

The third approach is to specify according to the specific problem one wishes to solve. For example, when characterizing splice sites of introns in a particular species, one may use the nucleotide frequencies of all transcripts (including all exons and introns) annotated in the genome as the background frequencies [28]. Similarly, a study of site patterns of branchpoint sequences in introns could have values computed from all intron sequences. I suggest that only this third approach be used to avoid the impression that PWM could have an infinite number of null hypotheses (each associated with a different specification of ).

The computer program DAMBE [53, 54] offers different choices for specifying in computing PWM. Similarly, the new sequence logo method allows more appropriate specification of background (prior) frequencies [50]. The resulting PWM for the 246 donor splice sites of yeast introns, with background frequencies computed from all introns, is shown in Table 1.

2.2. Specification of Pseudocounts

When some values are zero, as is the case in our example, (1) is inapplicable because the logarithm of zero is undefined. Three approaches can be taken to avoid this problem [55, 56]. The first is to compute by which approaches with increasing (where is the site-specific count for nucleotide or amino acid at site ). values can then be computed by (1). This approach is poor when is small.

The second approach is to use explicit pseudocounts by defining where is the frequency of nucleotide , and is then

It is important to keep small (e.g., 0.0001) because the expected from random sequences is 0 in (1). A large will substantially increase above 0 with random sequences.

The two approaches above share one main disadvantage. Suppose we have 10 aligned motifs of 10 amino acids each. Position 3 is occupied by amino acids K (lysine) and R (arginine) and position 5 by amino acid E (glutamic acid). The two approaches above will specify pseudocounts for positions 3 and 5 in the same way, which is unreasonable for the following reason. If position 3 requires a positively charged amino acid, and position 5 a negatively charged amino acid, then amino acids K, R, and H (histidine) should be more likely found than other amino acids at position 3, and amino acid D (aspartic acid) should be more likely found than other amino acids at position 5. By using other aligned protein sequence data of roughly the same divergence we can derive frequency distributions for positions that require a positively charge or negatively charged amino acid and use these frequency distributions to produce pseudocounts [56]. In our case, the pseudocounts at positions 3 and 5 will be assigned quite differently because the frequency distribution for a position requiring a positively charged amino acid is typically quite different from that for a position requiring a negatively charged amino acid.

PWM and PWMS can potentially be used to measure codon usage bias. For example, given the frequency of nucleotide as , the background frequency of a codon, say AGC, can be specified as , and compared to the observed frequency of AGC. Such an approach would eliminate one major weakness of commonly used codon bias indices such as CAI [57, 58] and Nc [59, 60].

3. Gibbs Sampler with PWM as a Key Component

While PWM is a technique for characterizing a set of identified motifs, Gibbs sampler [61], named after the mathematical physicist, J. W. Gibbs, is for de novo motif discovery. For example, given a set of yeast intron sequences, what and where is the branchpoint site? All information we have is that each intron should have one branchpoint site, but what sequence signature does it have and where is it located along the intron sequence? This scenario (Figure 1) is where the Gibbs sampler will shine.

A similar scenario involves the discovery of regulatory equence motifs given a set of coexpressed genes (i.e., genes that increase or decrease their transcription level synchronously over time) by microarray [62, 63], SAGE [64, 65], or deep-sequencing [6668] experiments. If the coexpressed genes are also coregulated, then they may share a certain yet-unknown transcription factor binding site controlled by the same or similar transcription factor. Given that the binding site is often located upstream of the translation initiation codon, one may extracted the upstream sequences from these coexpressed genes and let the Gibbs sampler to find the candidate regulatory motifs. A recent study has shown that shared motifs may also present in the 5′ UTR of mRNA to modulate translation initiation [69].

Gibbs sampler is one of the Monte Carlo algorithms that rely on repeated random sampling to estimate desired parameters. Monte Carlo method was envisioned by the famous mathematician Stanislaw Ulam, following the successful assembly of the first electronic computer ENIAC in 1945, and further developed by physicists and mathematicians working on nuclear weapon projects in the Los Alamos National Laboratory in mid-1940s [70]. The term “Monte Carlo method” was coined by Nicholas Metropolis to designate this class of computational algorithms. While the general application of the method unsurprisingly followed the operation of ENIAC in 1945, the physicist Enrico Fermi is known to have independently developed and applied the method nearly 15 years earlier with mechanical calculators [70].

Gibbs sampler simplifies computation in parameter estimation when analytical solution is very difficult or impossible to obtain. In biology, it has been used in the identification of functional motifs in proteins [31, 71, 72], biological image processing [73], pairwise sequence alignment [74], and multiple sequence alignment [75, 76]. However, the most frequent biological application of Gibbs sampler remains in the identification of regulatory sequences of genes [30, 7784].

There are two slightly different applications of Gibbs sampler in motif prediction. The first assumes that each sequence contains exactly one motif [30] and the associated algorithm is called a site sampler. The second is more flexible and allows each sequence to have none or multiple motifs [71] and the algorithm is termed a motif sampler. We will illustrate the site sampler and then briefly discuss the motif sampler.

I numerically illustrate the Gibbs sampler algorithm for motif discovery. The main output of the Gibbs sampler is typically of three parts. The first is the shared motif in an aligned format (bottom panel in Figure 1). The second is a PWM summarizing the discovered motif, and the third contains the associated significance tests which will be reviewed in a later section. The derived PWM, just like any other PWM, can be used to scan sequences not in the input data to discover the presence of the motif present elsewhere.

3.1. Computational Details of the Gibbs Sampler

We will use the erythroid nucleotide sequences [85], listed in Figure 2, to illustrate the Gibbs sampler algorithm. Our main objective is to infer the location and sequence of the unknown motif shared among the sequences so that we can align the motifs as shown in the bottom panel of Figure 1. The aligned motifs will allow us to generate a PWM that characterizes the motif by site-specific nucleotide frequency distributions. The PWM can be used to scan for the presence of the identified motif in other sequences.

We need first to count all nucleotides, with their numbers designated as , , , and , respectively, in the sequences. The total number of nucleotides of all 29 sequences (Figure 2) is 1209, with , , , and equal to 325, 316, 267, and 301, respectively. These values are needed for specifying pseudocounts (which we encountered in the previous section on PWM).

Let be the number of input sequences designated as . Let be the length of , and be the length of the motif, which typically is of length 4–8. For our illustration, we will use . One typically would run the Gibbs sampler several times with different values if one knows little about the length of the motif. The PWM is of dimension for nucleotide sequences, and for amino acid sequences. Let be the unknown starting position of the motif in .

The main algorithm of Gibbs sampler is of two steps. The first is random initialization in which a random set of values is assigned and site-specific nucleotide frequencies are calculated. The second step is predictive updating until a local solution of values is obtained, together with site-specific nucleotide frequencies that can be made into a PWM. This is repeated multiple times and previously stored locally optimal solutions are replaced by better ones. Convergence is typically declared when two or more local solutions are identical. These steps are numerically illustrated in the following sections.

3.2. Initialization

The initiation step randomly assigns a value to , with the constraint that . So our first set of “motifs” is essentially a random set of sequences of length m and is not expected to have any pattern. For readers who are curious, the first set of 29 random values happen to be: 29, 31, 23, 28, 10, 2, 18, 32, 20, 15, 11, 25, 24, 30, 18, 15, 10, 23, 14, 15, 26, 36, 8, 6, 30, 19, 27, 26, and 14. The site-specific distribution of nucleotides from the 29 random motifs is shown in Table 2. There is hardly any site-specific pattern, as one would have expected.

The second column in Table 2 will be referred to as the C0 vector with , , , and equal to 278, 279, 230, and 248, respectively. The matrix, occupying the last six columns in Table 2, will be referred to as the matrix. The matrix is tabulated from the 29 random motifs whereas the C0 vector is tabulated from nucleotides outside of the motifs. Thus, the sum of the first, second, third, and fourth rows of Table 2 should be equal to , , , and , respectively. Also note that each of the six columns in the matrix should add up to 29.

3.3. Predictive Update

The predictive update consists of obtaining (= 29 in our example) random numbers ranging from 1 to , and use these numbers as an index to choose the sequences sequentially to update the site-specific distribution of nucleotides (the matrix) and the associated frequencies (the C0 vector). For example, the random numbers in my first run of the Gibbs sampler happen to be 11, 18, 26, 22, 2, 28, 12, 9, 7, 3, 17, 16, 1, 4, 21, 15, 14, 24, 19, 27, 29, 6, 10, 20, 13, 8, 23, 25, and 5, respectively. This means that will be used first, and last, for the first cycle of the predictive update. It is important to use a random series of numbers instead of choosing sequences according to the input order. The latter increases the likelihood of trapping Gibbs sampler within a local optimum.

Our first randomly chosen sequence happens to be and its randomly chosen motif starts at site 11, that is, , with the motif being AGTGTG. This initial motif will now be taken out of the matrix and put into the C0 vector. This motif has one A, zero C, three G’s, and two U’s. By adding these values to the C0 vector in Table 2, we obtain the C0 vector in Table 3. We also need to take this motif out of the matrix by subtracting the first A from the first value in the first column in the matrix in Table 2 (i.e., new = old ), the second G from the third value in the second column in the matrix in Table 2 (i.e., new = old ), and so on. This converts the matrix in Table 2 to the C matrix in Table 3.

At this point the matrix is made of the 28 randomly chosen motifs, one from each sequence (excluding ). You will notice that each of the six columns in the matrix has a sum of 28. The reason for taking the initial motif in out of the matrix and put it back into the C0 vector is that we are going to find a better motif in , and put it into the matrix so that the matrix will again be based on 29 motifs. How are we going to get a better motif? Recall that a position weight matrix (PWM) can be used to scan a sequence in a sliding window of length m to get position weight matrix scores (PWMSs) for each window. We will make a PWM out of the C0 vector and the matrix and use the resulting PWM to scan and get a new motif that has the highest PWMS.

One may wonder why such a practice would get us anywhere given the fact that the matrix is initially made of random motifs. The resulting PWM would exhibit no pattern, and the resulting PWMSs will therefore be uninformative. The key concept here is that when one takes a random walk over a terrain with multiple peaks, one sooner or later will encounter a peak, and climbing the peak will at least bring us to a local maximum. After reaching the top of one peak and recording the height, we will land ourselves at another randomly chosen location and start climbing local peaks again. This process continues until we reach the highest peak or after a fixed number of computer iterations without finding any higher peak.

Typically, the PWM is generated by using the C0 vector as background frequencies () and the matrix as site-specific frequencies . However, although most algorithmic illustration of the Gibbs sampler computes this way (e.g., [32, pp. 133–147]), computed from the C0 vector has serious problems when input sequences are almost as short as the motif. For example, if the true motif has many nucleotide A and few nucleotide U, then the C0 vector will also have many A and few U. Now a motif with a few nucleotide U will be taken as deviating substantially from the background and will tend to have a high PWMS, leading to a biased estimate of the true motif. Thus, when input sequences are short, one should specify the background frequencies instead of using C0 to compute . One may refer to the previous section on PWM for more information on background frequencies.

For pseudocounts, we may use . The resulting PWM is then used to scan which is 40 bases long, with 35 possible motif starting points (i.e., possible values along the sequence). The 35 PWMS values for these 35 possible motifs in (Table 4) are normalized to have a sum of 1 ( in Table 4). We now proceed to update the initial () by a new value based on result in Table 4. How should we choose the new value?

There are two strategies to choose the new value. The first is to randomly pick up an value according to the magnitude of (Table 4). You may visualize a dartboard with 35 slices with their respective areas being proportional to values. When you throw a dart at the dartboard, large slices will have a better chance of being hit than small slices. If the dart happens to land on the 7th slice, then the initial will be updated to , with the original motif AGTGTG replaced by the new motif CTCAAG.

The second strategy is simply to use the largest value for updating initial to the new value. As the motif starting at site 25 has the largest , we will set the new equal to 25 and replace the initial motif (= AGTGTG) by the new motif (= TCACAG). With this approach we do not need norm as we can choose based on the largest odds ratio in Table 4. This strategy is faster than the first, but did not seem to lose any sensitivity in motif discovery based on limited simulation studies. However, if one is concerned about the possibility of missing motifs, one should use the first strategy.

Regardless of how the new is chosen, the updating is the same. Suppose we have taken the second strategy and set the new equal to 25. The matrix in Table 2 is then revised by replacing the original motif (= AGTGTG) by the new motif (= TCACAG). This leads to an updated C0 vector and matrix (Table 5).

We repeat this process for the rest of the sequences to update the rest of values. After the last sequence has been updated, we have obtained a new set of values, a new set of 29 motifs, together with the PWM based on the associated C0 vector and C matrix. At this point we compute a weighted alignment score (i.e., a weighted PWMS) as follows: where is the motif width, and is the number of different symbols in the sequences (4 for nucleotide and 20 for amino acid sequences). is a measure of the quality of alignment of the motifs. The larger the value, the better.

The value, as defined in (8), has many different names. It has been called the Kullback-Leibler information or Kullback-Leibler divergence in information theory [8688], or large-deviation rate function in statistical estimation [89]. In bioinformatics, especially in motif characterization and prediction involving a PWM, it is most often referred to as the information content [6]. The fact that the Kullback-Leibler information is a special case of the so-called -divergence that measures the difference between two probability distributions and leads naturally to the use of the letter in (8).

The predictive updating is repeated again and again. Each time when we get a new set of values, a new set of motifs and the PWM based on the C0 vector and the matrix, we compute a new value. If the new value is greater than the previously stored value, then the new value, the new set values, and the new set of motifs will replace the previously stored ones. This continues until we reach a local maximum of or when the preset maximum number of local loops has been reached. The resulting value, the set of values, the new set of motifs and the associated PWM are stored as the locally optimal output. In the hill-climbing analogy, represents the height of a local peak.

The entire process is now repeated from the very beginning, that is, we again perform the initialization by choosing another random set of values, and go through the local iteration to obtain another locally optimal output. If the new locally optimal output is better than previously stored ones (i.e., the new value is larger than the previously stored one), the new output will replace the previously stored output. This process is repeated multiple times until convergence is reached, that is, when new values are consistently the same as the previously stored one, or until a fixed number of computation iteration has been reached without finding an value better than what has already been recorded. The final site-specific nucleotide distribution (Table 6) displays a much stronger pattern than the initial distribution (Table 2) from 29 randomly chosen motifs.

The final aligned motifs (Figure  7-2 in [32]) share in general a consensus of (C/T)TATC(A/T). Its reverse complement (A/T)GATA(A/G) is known to be the binding site of GATA-binding transcription factors [9095]. This discovery of the motif suggests that this set of sequences may indeed be coregulated by the same type of GATA-binding transcription factors. Such findings are crucial in transcriptomic and proteomic studies aiming to understand gene regulation networks. Algorithms such as Gibbs sampler help us understand interactions among genes and gene products.

It might be relevant here to summarize essential biology about the GATA box and GATA-binding transcription factors. A living cell is a system with many genetic switches that can be turned on or off in response to intracellular and extracellular environment. It is these switches that distinguish a normal living cell from a cancer cell or a dead cell. The GATA motif (or GATA box) is one of such switches and it is switched on or off by specific transcription factors (which are proteins that bind to the motif and turn on or off the transcription of the gene containing such motifs). One of the better known GATA-binding transcription factors is GATA-1 which binds to the GATA motif found in cis-elements of the vast majority of erythroid-expressed genes of all vertebrate species examined [96, 97]. The core promoter of the rat platelet factor 4 (PF4) gene contains such a GATA motif and the binding of such GATA motif by GATA-binding proteins such as GATA-1 suppresses the transcription of the PF4 gene [91]. It is now known that GATA regulatory motifs and the GATA-binding transcription factors are present in a variety of organisms ranging from cellular slime mold to vertebrates, including plants, fungi, nematodes, insects, and echinoderms [98], suggesting that the function of the genetic switch is far beyond erythropoiesis. In human, the GATA motif and the GATA-binding proteins are implicated in several diseases [99]. The sequence divergence of GATA motifs and their binding proteins should shed light on the coevolution of the components of genetic switches.

One may have noted that some sequences have a strong (C/T)TATC(A/T) motif, whereas others (e.g., the second, the fourth and the fifth sequences) have only weak and highly doubtful signals. Computer programs implementing Gibbs sampler typically would output a quantitative measure of the strength of the signal, and PWMS is the most often used index for this purpose (Table 7). Recall that PWMS is the log-odds, but one may use the odds ratio directly as a measure of relative motif strength. Also recall that an odds ratio is the ratio of two probabilities associated with two hypotheses. Define as the hypothesis that the 6-mer is a motif with its site-specific constraints, and as the hypothesis that the 6-mer is not a motif and has its probabilities specified only by the four overall nucleotide frequencies. The odds ratio is the ratio of the probability that is true over the probability that is true. One generally should take a cut-off value of 20, that is, is 20 times more likely than .

One should note that Gibbs sampler, being started from a random set of values, may not necessarily converge to the same motif. This is both an advantage and a disadvantage of the algorithm. The advantage is that repeated running of the algorithm will allow us to identify other types of hidden motifs (i.e., other than the reverse complement of the GATA motif) in the sequences. The disadvantage is that users not familiar with the algorithm often get confused when the same input generates quite different results. For example, another set of putative motifs, in the form of RGVAGR (where R is A or G and V is “not T”), has been found to be shared among the sequences [32, p. 146].

It is possible that the input sequences may contain two or more different biologically significant motifs. If one motif is much stronger (more over-represented among the input sequences) than other motifs, and if the search by the Gibbs sampler algorithm outlined before is exhaustive, then we will always end up with the strongest motif and miss all other biologically interesting motifs. However, one could run Gibbs sampler by specifically exclude the strongest motif already identified so that weaker motifs can then be identified.

3.4. Motif Sampler

The Gibbs sampler has two versions. The one that we have just illustrated is called site sampler. It assumes that each sequence contains exactly one motif [30]. The other version is more flexible and allows each sequence to have none or multiple motifs [71] and the algorithm is termed motif sampler. The GATA-binding transcription factors comprise a protein family whose members contain either one or two highly conserved zinc finger DNA-binding domains [98] and it is consequently likely that a sequence may contain more than one GATA box. For example, the erythroid Kruppel-like factor (EKLF, which is a zinc finger transcription factor required for -globin gene expression) has in its 5′-region two GATA motifs flanking an E box motif characterized by CANNTG [100]. This calls for an algorithm that can identify multiple motifs in a single sequence.

The site sampler can be extended to motif sampler by post-processing. The PWM generated from the site sampler can be used to re-scan the sequences for motifs and compute the associated PWMS or odds ratio for all 6-mers in each sequence. All what we need is to have a cut-off score to keep those motifs with a PWMS or odds ratio greater than the cut-off score.

4. Statistical Significance Tests

The PMW, be it from alignment of known motifs or from running the Gibbs sampler, need to be assessed for its statistical significance. One continuous problem with PWM is the lack of generally applicable and accurate significance tests, either for individual sites of the motif, on PWM or on PWMS. There are two reasons why accurate significance tests are desirable. First, after characterizing a motif with PWM, one naturally wants to know whether the characterized PWM is significant, which sites contribute to the significance and which sequence has a PWMS that is significantly greater than random expectation. Second, after finding a significant PWM, one typically would want to use the PWM to scan other sequences to identify new motifs, and one needs a good significance test to show the reliability of the identified motif. This would reduce the number of putative sequence motifs going through experimental verification which is typically tedious and expensive [101, 102].

In short, three separates significance tests are required: one for individual sites, one for PWM per se and one for PWMS. These tests are detailed in the following sections.

4.1. Statistical Significance Tests for Individual Sites

The statistical significance of individual sites can be done by -tests with type I error rate controlled for by the false discovery rate [103, 104]. Take the data in Table 1 for example. The background frequencies are , , , and , which allow us to obtain expected counts of A, C, G, and T. With 17 -tests (Table 1), we face the problem of multiple comparisons and need to control for the familywise error rate [105] which is synonymous to experimentwise error rate.

Designate the error rate by , then the exact critical for rejection in individual tests is where is the number of tests and is equal to 17 in our case. If we set , then . The Bonferroni criterion is based on the approximation that which leads to . The second order Bonferroni , when relevant assumptions are met [105], is based on which leads to . In practice, these different values make little difference. In our case, all three values lead to the conclusion that the frequency distribution at sites 1, 2, 13, and 16 do not deviate significantly from the background frequencies.

The statistical protocol for controlling for the familywise error rate has been considered too conservative, and the protocol for controlling for the false discovery rate (FDR) has consequently been proposed recently [103, 104]. The classical FDR approach [103], now commonly referred to as the Benjamini-Hochberg procedure or simply the BH procedure, sorts values in descending order and computes for the th value (where the subscript BH stands for the BH procedure) as where is FDR (e.g., 0.05), is the rank of the value in the sorted array of values, and is the number of tests (i.e., the number of values). If is the largest satisfying the condition of , then we reject hypotheses from to . In our case, all the sites are statistically significant based on (Table 8).

The FDR procedure above assumes that the test statistics are independent or positively dependent (in the extreme case of perfect positive dependence, all tests are the same and there is really only just one test with no multiple comparison problem). A more conservative FDR procedure has been developed that relaxes the assumption [104]. This method, now commonly referred to as the Benjamini-Yekutieli or simply the BY procedure, computes for the th hypothesis as

With in our case, . Based on , the -tests pertaining to sites 1 and 2 are not statistically significant (Table 8). The BY procedure was found to be too conservative and several alternatives have been proposed [106]. For large , converges to (Euler’s constant equal approximately to 0.57721566). Thus, for , is close to 10. So is nearly 10 times smaller than . Both FDR procedures above have been used in significance tests concerning yeast splicing sites [23].

4.2. Evaluating Statistical Significance of PWM When Pseudocounts Are Used

Whether a PWM represents a motif with site-specific constraints can be tested by using the statistic [6] specified in (8). However, the distribution of is altered by pseudocounts as specified in (5) and (7). For example, the expectation of is no longer zero with pseudocounts when there is no site-specific pattern.

A more straightforward method for evaluating the significance of PWM is by resampling. With the tetranomial distribution defined by , where is the nucleotide frequency of nucleotide , we can obtain a new set of sequences (246 sequences of 17 nt each) and compute . This is repeated for, say, 5000 times to obtain 5000 values. The 95th or 99th percentile of the values can be taken as critical values at 0.05 and 0.01 significance levels, respectively. An observed for the PWM is significant if it is greater than the critical . Based on this criterion, the PWM from the 246 donor splice sites is highly significant. The same resampling technique can also be used to evaluate the significance of the site-specific patterns in the previous section or the significance of PWMS in the next section.

4.3. Statistical Significance of PWMS

One of the purposes of constructing a PWM is to facilitate the computation of PWMSs. For example, the PWMS for sequence UAAAGGUAUGUUUAAUU, given the PWM in Table 1 (the four columns headed by A, C, G, and U on the right side), is simply

Thus, we can use the PWM to predict a new donor splice site by scanning a nucleotide sequence with a window of 17 nucleotide sites and computing the PWMS. The larger the PWMS, the more likely the 17-mer is a donor splice site. However, we need to address the question of how large is large in such in silico predictions.

PWMS from random sequences follows approximately the normal distribution (Figure 3), with mean 0 (or slightly greater than 0 when pseudocounts are used with a small ). The distribution in Figure 3 has a mean equal to 0.068884 and a standard deviation equal to 0.314714254.

Suppose we are to use our PWM to scan a target sequence of 1000 nt for a possible donor splice site. There are 984 ) different 17 mers along the sequence , resulting in 984 PWMS values. If the maximum PWMS is 1, how statistically significant is it?

If the length of the target sequence were only 17 nt instead of 1000 nt, then the answer is easy. The upper 99% confidence limit for a normal distribution with mean equal to 0.0689 and standard deviation equal to 0.3147 is 0.8808 (), which implies that a PWMS of 1 is significant at the 0.01 level. However, because our target sequence is 1000 nt, with the maximum of PWMS equal to 1 out of a total of 984 PWMS values, we need to go a long way to evaluate the significance of this maximum PWMS value.

Suppose we perform many sampling experiments from the same normal distribution as in Figure 3: In each experiment, we sample times to obtain . The maximum in each experiment is . This is equivalent to use PWM to scan a sequence to obtain PWMS1, PWMS2,, PWMSN, with the maximum PWMS designated as . What is the distribution of , designated as ? Note that is an extreme value of values, so it is natural to call an extreme value distribution (EVD).

Extreme value distribution or EVD, also referred to as the Gumbel distribution in honour of the pioneer of the statistics of extremes [107], is used in BLAST [108, 109] and new versions of FASTA [110] to attach statistical significance to a match score between two sequences. It is also used to perform significance tests involving PWM [5, 6, 55]. Here I will outline the mathematical framework of EVD pertaining to PWMS.

The probability of getting an value smaller than is

Note that can be either or , with possibilities.    values are smaller than in each experiment. This leads us to which is plotted for , , and (Figure 4). Compared to the distribution of in Figure 3, the distribution of () has been shifted substantially to the right and peaks at .

Now we can answer the question of whether our observed is statistically significant. The probability of observing an value equal to 1 or greater is which is approximately 0.7986, that is, it is not statistically significant.

A much simpler, but likely less accurate, method based on only without deriving (16)–(18), is to use the Bonferroni criterion in (10). With , which requires a PWMS value equal to 1.292076814 to be marginally significant, given , and . As our observed maximum PWMS is , it is not significant at the 0.05 significance level.

In summary, a PWM-based sequence analysis involves three types of output: the site-specific deviation from the background frequencies, the position weight matrix itself and the position weight matrix score for each input sequence. The significance of the first can be evaluated with -tests using the false discovery rate as the criterion for rejection of the null hypothesis, the second by the resampling method, and the third by statistics based extreme value distribution. These tests have been implemented in the most recent versions of DAMBE [53, 54].

Acknowledgments

This study is supported by the Discovery, and Strategic Research Grants of Natural Science and Engineering Research Council of Canada. The author thanks S. Aris-Brosou, S. Findlay, M. Ragonnet, and A. van Weringh for comments. In particular, S. Findlay has corrected several errors and helped clarify a number of ambiguities.