In a previous post we looked at the positions of the trinucleotide ATG in the genome of the *Mycoplasma genitalium* bacterium.

In that genome there are 9,020 occurrences of ATG, and the locations of these begins and ends 214, 263, 355, 452, 467, 547, 568, 686, 734, 822, 831, 850, 930, 1023, … , 579349, 579358, 579437, 579508, 579579, 579717, 579804, 579846, 579889, 579892, 579927, 579961, 580026, 580042.

A question we can ask is: **are these locations of the ATG trinucleotides uniformly distributed along the genome?**

The ATG trinucleotide is both a start codon and also codes for methionine. As *catalysts-cradle* at reddit/r/bioinformatics remarks:

“There is a difference between looking for ATG trinucleotide sequences and looking for start codons.

- For a DNA sequence to be a codon, it must encode protein. Not all DNA sequences in an organism encode protein (for example, only 1.5% of the human genome encodes protein).
- Not all instances of ATG inside of a protein coding region will be ATG codons. The ATG must be in frame for it to be an ATG codon. For example, the short peptide-coding sequence: ATG AAA TGC CCC TAA (read as Met Lys Cys Pro Stop) has only one ATG codon but two instances of ATG trinucleotide sequences.
- Not all ATG codons are start codons. Only the first ATG codon in a protein coding sequence is the start codon. Subsequent instances of ATG encode the amino acid methionine but do not act as start codons.
Because M. genitalium encodes only 485 proteins, there are 485 start codons in the M. genitalium genome. Therefore, only ~5% of the instances of ATG are bona fide start codons.”

To address the question of the uniform distribution, or otherwise, of ATG we can use one or more of several statistical tests for uniform distribution of a sequence of integers. Probably the simplest of these to understand is the -square (pronounced “ky-square”) goodness of fit test.

**However, in the spirit of exploratory analysis, before we do any fancy statistical tests, let’s just take a look at how the number of ATG trinucleotides varies by genome region.**

The idea is very simple.

We first divide the possible locations into a bunch of non-overlapping intervals and count how many ATG trinucleotides occur in each interval.

The length of the *Mycoplasma genitalium* genome is 580,076 nucleotides. The starting location of an ATG trinucleotide can therefore be potentially anywhere from 1 through 580,076-2 = 580,074.

The number 580,074 is exactly divisible by 47: 580,074 = 47 12,342, so we can begin by dividing the potential locations of the ATG trinucleotide into one of 47 bins, each of length 12,342.

These bins are long enough so that we expect to see a few of the 9,020 ATG trinucleotides in each of them. In fact, if the ATG trinucleotides *are* uniformly distributed across the genome we expect to see about 9,020/47 192 of them in each of the 47 bins.

How many ATG trinucleotides do we *actually* see in each of the 47 bins?

Following the procedure described in this post we get R to return the number of ATG trinucleotides in each of the blocks from locations 12342 (n-1) + 1 through 12342 n, for .

We then plot the number of ATG trinucleotides in locations 12342 (n-1) + 1 through 12342 n, versus n, to get a visual representation of the variation of the number of ATG trinucleotides in the 47 books of length 12342.

Here’s R code that will do all this from the beginning, after the *Mycoplasma genitalium* genome is downloaded as a .fasta file from the National Center for Biotechnology Information (NCBI) database (see this post for details):

- With RStudio open, load the
**sequinr**and**stringi**packages:

> install.packages(c(“seqinr”,”stringi”))

> library(c(“seqinr”,”stringi”)

- Set your working directory to where you saved the downloaded sequence.fasta file:

> setwd(“/Users/Me/Documents”)

- Now use the
**sequinr**package to read in, and name, the sequence.fasta file:

> Mgenitalium <- read.fasta(file = “sequence.fasta”)

> genome<- Mgenitalium[[1]]

- Now define a function that counts the number of ATG codons in locations 12342 (n-1) + 1 through 12342 n, as a function of n:

> ATGcount_func<-function(n){a=12342*(n-1)+1; b=12342*n; stri_count(paste(genome[a:b],collapse=””), fixed=”atg”)}

- The code “paste(genome[a:b],collapse=””)” transforms the separate nucleotide symbols in positions a through b into a string of length 12342. My thanks go to “Deto” at Reddit/r/bioinformatics for spotting an earlier error in this code.
- Now create a table of values of this function:

>ATGcounts<-lapply(1:47, ATGcount_func)

and then plot this as a function of n, for n from 1 through 47 (using a “b” to plot both points and lines) and then add a red horizontal line at the 192 level to show the expected uniform value:

>plot(1:47,ATGcounts,”b”)

>abline(h=192, col=”red”)

This plot shows clearly that up to about the middle of the genome the the numbers of ATG trinucleotides in the location blocks of length 12342 is considerably greater than the expected uniform value of 192. From about the middle of the genome on, the numbers of ATG codons in the location blocks of length 12342 are generally below the expected value of 192 for a uniform distribution.

Because the distribution of ATG trinucleotides across the *Mycoplasma genitalium* genome is strikingly not uniform we did not need a fancy statistical test to tell us this: a simple exploration and visualization was enough.

Always explore your data before doing fancy tests. Get a feel for what’s going on.

## Leave a Reply