Making Sense of Biology

Nothing in Biology Makes Sense Except in the Light of Evolution, Theodosius Dobzhansky (1973). The American Biology Teacher, 35(3), 125-129.

“See the lilies of the field…their genomes are 36 times larger than ours…”

Those with a naïve understanding of evolution would have a difficult time understanding how a flowering plant could contain 36 times more DNA than humans. After all, DNA is the code of life. DNA codes for proteins, and thus one might think that more DNA would mean greater biochemical complexity. Indeed, those who think of evolution as inexorably leading from simplicity to complexity would have a hard time fathoming how this could be true. It is true, for example, that prokaryotes have much less DNA than eukaryotes. The bacterium Escherichia coli contains 4.6 x 10^6 base pairs (bp), whereas the fungus Neurospora crassa has 3.99 x 10^7 bp and the animal Homo sapiens sapiens (humans) contains 5.93 x 10^9 bp. However, there is a greater than 200,000-fold diversity in genome size among eukaryotes, and this variation has no relationship to organismal complexity [1].

A decade ago this disparity was called the c-value paradox [2]. The c-value refers to the haploid genome size of an organism. If the mean genome sizes of groups were plotted, they would fall in this order: bacteria (~10^6 bp); algae and fungi (between 10^7 and 10^8 bp); worms (10^8 bp); insects (10^8 – 10^9 bp); echinoderms (10^9 – 10^10 bp); fish (10^9 – 10^10 bp); reptiles (10^9 – 10^10 bp); birds (10^9 – 10^10 bp); mammals (10^9 – 10^10 bp); amphibians (10^9 – 10^11 bp); flowering plants (10^9 – 10^10 bp). There is no clear or easy way to make sense of these numbers. For example, this order doesn’t correlate with organismal complexity or even age in the fossil record [3]. The genome sizes within narrower groups defy easy explanation as well. For example, the genus Pinus (pines) has a range from 1.8 x 10^10 to 4.0 x 10^10 bp, which is generally larger than flowering plants [4]. The flowering plant Arabidopsis thaliana has a tiny genome in comparison, only 1.57 x 10^8 bp.
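To make these disparities concrete, the short sketch below computes fold differences between the genome sizes quoted in the preceding paragraphs. The dictionary and function names are illustrative choices, not part of any cited source, and the figures are simply those given in the text.

```python
# Genome sizes in base pairs, copied from the figures quoted in the text.
GENOME_SIZE_BP = {
    "Escherichia coli": 4.6e6,
    "Neurospora crassa": 3.99e7,
    "Arabidopsis thaliana": 1.57e8,
    "Homo sapiens": 5.93e9,
    "Pinus (low end)": 1.8e10,
    "Pinus (high end)": 4.0e10,
}


def fold_difference(larger: str, smaller: str) -> float:
    """Return how many times larger the first genome is than the second."""
    return GENOME_SIZE_BP[larger] / GENOME_SIZE_BP[smaller]


if __name__ == "__main__":
    baseline = GENOME_SIZE_BP["Escherichia coli"]
    # List organisms from smallest to largest genome, relative to E. coli.
    for name, size in sorted(GENOME_SIZE_BP.items(), key=lambda kv: kv[1]):
        print(f"{name:22s} {size:10.3g} bp  ({size / baseline:,.0f}x E. coli)")
```

On these numbers, the largest pine genome is roughly 250 times the size of the Arabidopsis genome, and even the smallest pine genome is about three times the human figure.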

What is really interesting, however, is what percentage of the DNA of eukaryotes actually codes for proteins (or has a known function relating to the coding of proteins). For example, in humans only 3% of the genome codes for proteins, and maybe another 10% has an identified role in regulating gene expression [5]. In this sense humans are typical, which means that at least 85% of eukaryotic DNA does not directly influence protein coding. So what are the characteristics of this non-coding DNA? How did it get there? How is this non-coding DNA maintained through evolutionary time? Is it really junk-DNA, as described by many early researchers?

The recent sequencing of a large number of eukaryotic genomes has provided a more accurate understanding of their DNA content and type. These are shown below:


Type | Description | Percent of genome
-----|-------------|------------------
Highly repetitive sequences (non-coding) | Satellites (5 – 200 bp), microsatellites (1 – 4 bp), minisatellites (5 – 50 bp), macrosatellites (> 1 kb) | 5 – 50%
Moderately repetitive sequences | Mobile genetic elements (retroelements, e.g. LINEs and SINEs, and DNA transposons); gene families (coding and non-coding parts, pseudogenes), e.g. histones | 5 – 50%
Unique sequences | Appear only once in the genome; many structural genes here | 15 – 98%

Explaining the existence of structural genes is simple. Clearly, all organisms require proteins to carry out the biochemical processes of life. Some of these proteins must be produced in large amounts, and so the existence of repetitive structural genes is also easy to explain. How then do we explain the rest? This problem has recently been called the c-value enigma.

There are five main theories addressing this problem: junk-DNA, selfish-DNA, nucleoskeletal-DNA, nucleotypic-DNA, and genome protection [6]. The first two theories can be lumped together as mutation pressure theories, and the last three are optimal DNA content theories. The junk-DNA hypothesis says that the accumulation of genetic material in various organisms occurs by random processes (genetic drift). If this hypothesis were true, it would predict that there are no grand-level correlations between DNA content and organismal complexity. However, there do seem to be correlations between cell volume and genome size. The junk-DNA hypothesis treats this as a result, not a cause: since there is more DNA, the cells of that particular organism must get larger. The hypothesis has three weaknesses, however. It fails to adequately explain the correlations between DNA content and life-history features within lineages, it assumes that cells cannot delete excess DNA, and there is no evidence of a steady accumulation of DNA content through evolutionary time (which would be required by a drift mechanism).

Originally, it was thought that the c-value paradox could easily be explained by selfish-DNA. Selfish genetic elements can copy themselves and in theory have no impact on the host organism’s fitness so long as they replicate in non-coding regions of the genome. Replication into coding regions, by contrast, would cause mutations with major impacts on fitness, and several human diseases result from transposable genetic elements inserting into a coding region. For example, the Alu elements are short interspersed elements (SINEs) that contain a recognition sequence for the restriction endonuclease AluI. They are 200 – 300 base pairs long and can be present in the genome up to 900,000 times. A common Alu insertion at the angiotensin converting enzyme (ACE) locus results in a polymorphism that affects the activity of this enzyme. The D-allele is characterized by the absence of this Alu insertion and has higher enzymatic activity than the I-allele, which carries the insertion. The frequencies of the D and I alleles are similar in persons of African descent in Nigeria, Jamaica, and the United States. However, individuals with the D-allele are more likely to develop hypertension only in the Western hemisphere (indicating a gene x environment interaction).

Another Alu insertion polymorphism, 306 bp long, occurs in the progesterone receptor gene (PROGINS, on chromosome 11) and is associated with an increased risk of breast cancer. The insertion was found at frequencies of 5%, 10%, and 14.6% in East Indian women with endometriosis, uterine fibroids, and breast cancer, respectively. The control group (women without these diseases) had a frequency of only 5.5%. Thus, only the breast cancer group showed a statistically significant difference [7]. The frequency of this insertion polymorphism differs widely in human populations:

European American, 0.208, N = 72

African American, 0.021, N = 71

Hispanic American, 0.164,  N = 76

Mvskoke (Creek), 0.041, N = 37

Pakistani, 0.09, N = 55

East Indian, 0.055, N = 490

From this we would conclude that, all other factors being equal, PROGINS-associated breast cancer should be about 10 times as frequent in European American women as in African American women. The fact that transposable genetic elements do cause disease indicates there should be serious selection for means to regulate their insertion sites away from exons.
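The "10 times" figure comes from dividing the two listed frequencies. A minimal sketch of that arithmetic follows. The second function goes one step beyond the text: it assumes the listed values are allele frequencies and that genotypes follow Hardy-Weinberg proportions, neither of which is stated in the source, so its output should be read as illustrative only.

```python
# PROGINS Alu insertion frequencies from the list above: (frequency, sample size N).
PROGINS_FREQ = {
    "European American": (0.208, 72),
    "African American": (0.021, 71),
    "Hispanic American": (0.164, 76),
    "Mvskoke (Creek)": (0.041, 37),
    "Pakistani": (0.09, 55),
    "East Indian": (0.055, 490),
}


def frequency_ratio(pop_a: str, pop_b: str) -> float:
    """Ratio of the listed insertion frequencies between two populations."""
    return PROGINS_FREQ[pop_a][0] / PROGINS_FREQ[pop_b][0]


def carrier_fraction(freq: float) -> float:
    """Fraction of people carrying at least one insertion allele.

    ASSUMPTION (not from the source): `freq` is an allele frequency and
    genotypes are in Hardy-Weinberg proportions, so non-carriers are (1 - freq)^2.
    """
    return 1.0 - (1.0 - freq) ** 2


if __name__ == "__main__":
    ea, aa = "European American", "African American"
    print(f"Frequency ratio: {frequency_ratio(ea, aa):.1f}")
    carrier_ratio = carrier_fraction(PROGINS_FREQ[ea][0]) / carrier_fraction(PROGINS_FREQ[aa][0])
    print(f"Carrier ratio under the Hardy-Weinberg assumption: {carrier_ratio:.1f}")
```

The raw frequency ratio is about 9.9. Under the extra assumptions, the ratio of carriers is closer to 9, so the order of magnitude of the claim holds either way. The small sample sizes for some groups mean any of these ratios carries considerable uncertainty.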

Overall, transposable genetic elements (TGEs) make up approximately 51.3% of the human genome (SINEs 16.1%, LINEs 22.3%, LTR retrotransposons 9.3%, DNA transposons 3.6%). This figure varies widely across phylogenetic groups (F. assyriaca, lily, 95-99%; Zea mays, corn, 60%; A. thaliana, crucifer, 14%; T. nigroviridis, fish, 0.14%; T. rubripes, fish, 2%; R. esculenta, frog, 77%; X. laevis, frog, 37%; D. melanogaster, fruit fly, 15-22%; A. gambiae, mosquito, 16%). Again, the selfish-DNA hypothesis fails to explain the existing correlations between DNA content and life-history features, assumes that cells cannot delete excess DNA or regulate transposition, and would require a steady accumulation of DNA content across evolutionary time. For example, how does it explain the disparities in TGE content among fish, amphibians, and mammals (all of which share a common ancestor)?

The nucleoskeletal hypothesis suggests that the increased amounts of DNA in eukaryotic genomes are a mechanism by which nuclear size is adjusted by selection to meet the needs of the cell for balanced growth. This idea is supported by the strong relationship between cellular DNA content and cell volume. Gregory (2000) showed a highly significant correlation between these variables for erythrocytes in 159 species of vertebrates (jawless fishes, cartilaginous fishes, teleost fishes (excluding lungfishes), lungfishes, urodele amphibians, anuran amphibians, reptiles, and birds) [8]. One mechanism accounting for this relationship is the idea that the transfer of messenger and ribosomal RNA from the nucleus to the cytoplasm via nuclear pores is a limiting factor on cell growth. Thus we can imagine cases in which we have smaller cells with smaller transfer rates but quicker growth, versus larger cells with larger pores and slower growth rates. DNA amount in the cell is seen as a way of “providing space for the nucleus,” which in turn impacts the size of its connections to the cellular transport membranes. The problem with this hypothesis is that it doesn’t explain imperfect scaling between cell size and DNA content in some groups. For example, imperfect scaling occurs in lungfish and salamanders, where cell volume scales in a strongly positive allometric fashion with changes in genome size and there is a negative correlation between C-value and both cell division and developmental rates [9]. One extreme example of this is the relationship between very large genome size and neoteny in salamanders. Neoteny is the retention of juvenile characters in adults. Neotenic salamanders have the largest genomes among amphibians (and as you already know, amphibians have some of the largest genomes of all organisms) [10]. The hypothesis also cannot explain the instantaneous reductions of cell size that follow a reduction of DNA content, and it is incompatible with observations of quantum shifts in genome size (as happens with polyploidy).

The nucleotypic hypothesis states that DNA content directly affects cell size and cell division rate. There is strong evidence that DNA content has a strong negative correlation with rates of both meiotic and mitotic cell division in a variety of organisms [11]. The proposed mechanism is that more DNA should lengthen the S-phase of the cell cycle (DNA replication). This results, among other mechanisms, from an imperfect relationship between increased DNA content and replicon number. In addition, the duration of every phase of the cell cycle increases with greater DNA content. There is some variation in the nature of the relationship between mitotic rate and DNA content. For example, in dicots the cell cycle is longer per amount of DNA than it is in monocots (although the scaling is similar). The relationship between meiotic rates and DNA content is more complex. Animals tend to have longer meiotic durations than plants per unit of DNA content, and within animals, mammals have longer durations than amphibians or insects. The main problem with the nucleotypic hypothesis is that there is not yet enough data to fully test it, and no satisfactory mechanism has been proposed to explain why it works.

Finally, Patrushev and Minkevich propose a unique optimal model to explain excess DNA content [12]. Under normal cellular conditions, endogenous chemical mutagens cause spontaneous mutations. In a variety of biochemical reactions, reactive metabolites are formed as intermediate byproducts which can wreak havoc on DNA. In substrate redox reactions involving oxygen, free radicals are always formed, and in aerobic organisms 4-5% of molecular oxygen is transformed during respiration to reactive oxygen species (ROS). In a typical human cell this can mean as many as 50,000 – 200,000 mutations daily! These authors argue that one role of the non-coding DNA is to act as a sink to absorb ROS. Thus, by chance alone, in a human genome of 3 x 10^9 bp there is a ~97% chance that a ROS damage event occurs in a non-coding segment, and an 87% chance that it occurs in a segment that does not impact coding or regulation of coding. If there were no non-coding DNA, then 100% of mutations would occur in regions that impact the organism’s fitness. Thus, under this model, there would be positive selection for genomes that allow the increase of TGEs within non-coding segments of the genome. The fact that there are variations in genome sizes amongst eukaryotes would be predicted, especially if any of the other mechanisms mentioned above were in action.
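The 97% and 87% figures follow directly from the coding (3%) and regulatory (10%) fractions quoted earlier, if one assumes that damage lands uniformly at random across the genome. That uniformity is a simplifying assumption of this sketch rather than something the source establishes, and the function names are hypothetical.

```python
CODING_FRACTION = 0.03      # protein-coding share of the human genome (from the text)
REGULATORY_FRACTION = 0.10  # share with an identified regulatory role (from the text)


def hit_probabilities(coding: float = CODING_FRACTION,
                      regulatory: float = REGULATORY_FRACTION) -> tuple[float, float]:
    """Return (P(hit is non-coding), P(hit is neither coding nor regulatory)).

    ASSUMPTION: each damage event strikes a position chosen uniformly at
    random, so probabilities equal the fractions of the genome involved.
    """
    p_noncoding = 1.0 - coding
    p_neutral = 1.0 - coding - regulatory
    return p_noncoding, p_neutral


def expected_functional_hits(events_per_day: float,
                             coding: float = CODING_FRACTION,
                             regulatory: float = REGULATORY_FRACTION) -> float:
    """Expected daily events landing in coding or regulatory DNA."""
    return events_per_day * (coding + regulatory)


if __name__ == "__main__":
    p_noncoding, p_neutral = hit_probabilities()
    print(f"P(non-coding) = {p_noncoding:.0%}, P(no functional impact) = {p_neutral:.0%}")
    for events in (50_000, 200_000):
        print(f"{events:,} events/day -> ~{expected_functional_hits(events):,.0f} in functional DNA")
```

Under the same assumption, the quoted range of 50,000 to 200,000 events per day would still leave roughly 6,500 to 26,000 events in coding or regulatory DNA, which suggests that any sink effect dilutes damage rather than eliminating it.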

In conclusion, the c-value enigma is a scientific question that requires more attention. Genome sizes are distributed in a non-random fashion, but at present we don’t have a unifying theory that can explain the variation. Mutation pressure theories explain some issues but fail at others. Optimal size evolution theories suffer from some of the same difficulties as all adaptive program hypotheses. My intuition suggests that genome sizes result from compromises between mutational and optimal forces. We may never be able to develop a general c-theory, leaving us with only specific mechanisms that account for variation within specific groups. What our knowledge of genome variation does accomplish is the demolition of progression theories of evolution. Lily DNA content results from evolutionary issues specific to flowering plants, just as human DNA content results from those experienced by mammals.


  1. Gregory, T.R., Coincidence, coevolution, or causation? DNA content, cell size, and the c-value enigma, Biol. Rev. 76: 65-101, 2001.
  2. Klug, W. and Cummings, R., Genetics, 5th edition (Upper Saddle River, NJ: Prentice Hall), 1997.
  3. Genome sizes are provided by Klug and Cummings, as well as by Bionumbers.
  4. Morse, A.M. et al., Evolution of genome size and complexity in Pinus, PLOS One 4(2): e4332, 2009.
  5. Patrushev, L.I. and Minkevich, I.G., The problem of eukaryotic genome size, Biochemistry (Moscow) 73(13): 1519-1551, 2008.
  6. Gregory, T.R., 2001, and Patrushev, L.I. and Minkevich, I.G., 2008.
  7. Govindan, S. et al., Association of progesterone receptor gene polymorphism (PROGINS) with endometriosis, uterine fibroids and breast cancer, Cancer Biomark. 3(2): 73-78, 2007.
  8. Gregory, T.R., Nucleotypic effects without nuclei: genome size and erythrocyte size in mammals, Genome 43: 895-901, 2000.
  9. Cavalier-Smith, T., Skeletal DNA and the evolution of genome size, Annual Rev. Biophysics and Bioengineering 11: 273-302, and Gregory, T.R., 2001.
  10. Cavalier-Smith, T., Coevolution of vertebrate genome, cell, and nuclear sizes. In Selected Symposia and Monographs U.Z.I., vol. 4 (ed. G. Ghiara et al.), pp. 51-86.
  11. Gregory, T.R., 2001.
  12. Patrushev, L.I. and Minkevich, I.G., 2008.

Joseph Graves

About Joseph Graves

Dr. Joseph Graves, Jr. received his Ph.D. in Environmental, Evolutionary and Systematic Biology from Wayne State University in 1988. In 1994 he was elected a Fellow of the Council of the American Association for the Advancement of Science (AAAS). In April 2002, he received the ASU-West award for Scholarly Research and Creative Activity. His research concerns the evolutionary genetics of postponed aging and biological concepts of race in humans. He has published over sixty papers and book chapters and has appeared in six documentary films and numerous television interviews on these general topics. He has been a Principal Investigator on grants from the National Institutes of Health, National Science Foundation and the Arizona Disease Research Commission. His books on the biology of race are entitled: The Emperor's New Clothes: Biological Theories of Race at the Millennium, Rutgers University Press, 2001, 2005 and The Race Myth: Why We Pretend Race Exists in America, Dutton Press, 2004, 2005. A summary of Dr. Graves’s research career can be found on Wikipedia, and he is also featured in the ABC-CLIO volume on Outstanding African American scientists. In November 2007, he was featured in the CNN Anderson Cooper 360 program on Dr. James Watson. He has served as a member of the external advisory board for the National Human Genome Center at Howard University. In January 2006, he became a member of the “New Genetics and the African Slave Trade” working group of the W.E.B. Du Bois Institute of Harvard University, chaired by professors Henry Louis Gates and Evelyn Hammonds. He is currently serving on the Senior Advisory Board for the National Evolutionary Synthesis Center (NESCent) at Duke University. He has been an active participant in the struggle to protect and improve the teaching of science, particularly evolutionary biology, in the public schools.
In 2007, he became a member of the inaugural editorial board of Evolution: Education and Outreach, published by Springer-Verlag. He has been a leader in addressing the underrepresentation of minorities in science careers, having directed successful programs in California and Arizona. He currently serves as a member of the board of the Guilford Education Alliance. From 2005 to 2009, he was a leading force in aiding underserved youth in Greensboro via the YMCA chess program.