Professor Sung-Hou Kim and his colleagues from the University of California, Berkeley, have applied a new way of thinking to the Tree of Life, a concept that has been around since Darwin’s time. Drawing from their collective expertise, they applied an Information Theory-based non-alignment method to compare whole-proteome sequences, the protein sequences coded by all genes of each organism. Their research group has studied over 4,000 organisms and their results show evidence of a ‘deep burst’ of the founders of all six major groups of life occurring near the root of the Tree of Life (Figure 1).
To understand and help us to visualise the connectedness of life on Earth, scientists construct phylogenetic trees. A phylogenetic tree of all organism groups, or ‘Tree of Life’, is a branching diagram that shows the relationship between organisms based on certain characteristics (Pace, 2009). The great naturalist, Charles Darwin, first sketched his ideas about how species might evolve through time (Darwin, 1859). His ideas quickly came to symbolise the theory of evolution by natural selection (Figure 2). Organisms that are more closely related are found closer together on the tree. For example, a human and a chimpanzee would be situated close together; a wolf and a shark would be far apart. The Tree of Life is still used today, as a model and research tool, to explore evolutionary relationships and to provide a simplified narrative of evolutionary history, which itself continues to evolve. Understanding the connectedness between species helps us to answer important broad scientific questions such as ‘How and when did all diversity of Life emerge?’ or more specific questions such as ‘How did HIV evolve?’ or the currently urgent question ‘How did the COVID-19 virus emerge and how are its variants evolving?’
Revisiting the tree of life in the genome era
As a concept, the Tree of Life gives us a valuable ‘peek’ at the incredibly complex picture that is evolution. It’s important to remember that a Tree of Life is not an actual record but a model, an approximation of life’s history. Darwin himself wrote about evolution and ecosystems as a ‘tangled bank’. However, as data become more comprehensive and our tools and methods for comparing organisms become more sophisticated, so do our approximations.
In Darwin’s time, a Tree of Life was constructed by comparing the external morphology of organisms. Later, internal skeletal morphology was introduced, yielding richer insights into organisms’ degrees of affinity. These methods of comparative morphology helped scientists to start to organise the ‘big picture’ of the relatedness and evolution of all organisms.
Following advancements in molecular biology, evolutionary biologists applied analysis of selected genes to describe the relationships between organisms. Genes are made up of DNA (deoxyribonucleic acid) and are the main unit of heredity. Variations in gene sequences offer another way to shine a light on evolutionary variation. Most genes act as codes to make proteins, the workhorses of all cells; for example, the enzymes that produce melanin, the pigment that influences eye or skin colour in humans, are encoded by genes passed down through many generations. Over time, genes can mutate and change, leading to differences between organisms. Woese and Fox (1977) turned the Tree of Life on its head. Studying the sequences of ribosomal RNA, a part of ribosomes (the cellular machinery that makes proteins in our cells), the researchers added a third ‘trunk’ to the Tree of Life through the discovery of the third domain, Archaea (distinct from bacteria and eukaryotes). Importantly, this discovery highlighted the fantastic diversity of microbial organisms. We now appreciate that these single-celled organisms represent the vast majority of the Earth’s genetic, metabolic and ecological niche diversity, occupying habitats as extreme as hot springs (Figure 3).
Gene or protein alignment methods
Constructing an organism Tree of Life using only gene sequences has its own limitations. It gives just a glimpse of evolution based on the particular gene or set of genes selected, which accounts for only a very small fraction of all genes of each organism. The same limitations apply when using only a set of select protein sequences to construct a Tree of Life. Until now, the most common methods of constructing Trees of Life using a set of select genes or proteins from the whole-genome sequence have required sequences to be lined up. For this reason, they are called alignment-based methods: the sequences of genes of the same function from all study organisms are compared after they have been multiply aligned. All alignment-based methods look for similarities between the sequences within only the aligned regions and use these to calculate relatedness.
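To make the idea of comparing ‘only the aligned regions’ concrete, here is a minimal Python sketch. It assumes two sequences that have already been aligned, with ‘-’ marking gaps, and scores them by the fraction of identical letters in the columns where both sequences have a letter. The function name and the toy sequences are hypothetical, and real alignment-based methods use far richer models of sequence change than simple identity.

```python
def aligned_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical letters over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which both sequences are aligned (no gap).
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return matches / len(columns)


# Hypothetical, already-aligned protein fragments.
fragment_1 = "MKT-AYLLG"
fragment_2 = "MKTQAY-VG"
identity = aligned_identity(fragment_1, fragment_2)
print(f"identity over aligned columns: {identity:.3f}")
```

Everything outside the aligned columns, including any gene that cannot be lined up at all, contributes nothing to this score, which is the limitation the article goes on to discuss.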
More recent advances in phylogenetic analysis (the study of the evolutionary relationships among organisms) have used a set of alignable genes selected from whole-genome sequencing, where the alignable genes account for a small fraction of the whole genome. A genome is the entire genetic material of an organism, including all genes coding for proteins and RNAs as well as non-coding regions. As much as 99% of the human genome is made up of non-coding portions of DNA (Venter et al, 2001).
Although alignment-based methods have been the most common way to construct the Tree of Life for several decades, they have serious limitations (Zielezinski et al, 2017). Since only a small fraction of a whole genome can be aligned, such a tree tells us about the evolution of that small fraction of each organism’s genome, not of the whole genome that represents each organism.
A new view: alignment-free methods
Ten years ago, Professor Sung-Hou Kim and his colleagues at the University of California, Berkeley applied a novel approach to sequence comparisons for whole genomes or whole proteomes without alignment. The researchers took inspiration from computer algorithms used in Information Theory to compare books without alignment of words. This approach compares the ‘word frequency profile’, which is a collection of all unique words (letters in context) and their frequency in each book, to estimate the degree of ‘divergence’ between two books without aligning sentences or paragraphs. Professor Kim’s team adapted this approach to compare whole-genome or whole-proteome sequences without the need to select sequences for alignment at any stage of the analysis. A major benefit of alignment-free methods is that they can be used to compare the entire contents of a pair of genomes or proteomes.
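The comparison of two ‘word frequency profiles’ can be sketched in a few lines of Python. The article does not name the divergence measure, so this sketch assumes the Jensen-Shannon divergence, a standard Information Theory measure; the function names and the three toy ‘books’ are hypothetical and are not taken from the Berkeley study.

```python
import math
from collections import Counter


def word_profile(text):
    """Relative frequency of each unique word in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two frequency profiles.

    Assumed measure: 0 for identical profiles, at most 1 when no word is shared.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = 0.5 * (pw + qw)
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mean)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mean)
    return divergence


# Three toy 'books': the first two share most of their vocabulary.
book_a = "the whale swam and the whale dived"
book_b = "the whale swam and the ship sailed"
book_c = "love and pride and love and prejudice"

profile_a, profile_b, profile_c = (word_profile(t) for t in (book_a, book_b, book_c))
print("A vs B:", js_divergence(profile_a, profile_b))
print("A vs C:", js_divergence(profile_a, profile_c))
```

No sentence of one book is ever lined up against a sentence of another; only the two profiles are compared, and the pair with the more similar vocabulary receives the smaller divergence.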
The tree of ‘books’
Since there is no ‘true reference tree’ for the Tree of Life against which to test the method, the first developmental step taken by Kim and his colleagues involved English literature. The words of a book can be considered similar to a whole-genome sequence of an organism: they provide information on specific topics to the reader, and similar books (for example all books written by the same author in a given genre) tend to have similar vocabulary. For their simulation test, they chose a few well-known books from each author in a given genre. The team then removed all spaces, punctuation, and any short common words (for example: it, the, and), so that each book was transformed into a long string of letters, similar to a genome. The whole string was then divided into overlapping short strings of an optimal length, called ‘features’, with each feature holding information about the letters it contains and their context. The frequency of each unique feature in the ‘book’ was then counted. Thus, the collection of all unique features and their frequencies, called the ‘Feature Frequency Profile’ (FFP), yields all the information needed to reconstruct the original ‘genome book’. Using various mathematical operations, each ‘book’, represented as an FFP vector, can easily be compared to other ‘books’, and a tree of ‘books’ can be constructed.
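The steps described above can be sketched in Python as follows. This is a minimal illustration, not the team’s software: the stop-word list contains only the three words the article mentions, the feature length of three is chosen arbitrarily, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list: only the three examples named in the article.
STOP_WORDS = {"it", "the", "and"}


def book_to_string(text):
    """Drop punctuation, spaces and short common words, leaving one long string of letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return "".join(word for word in words if word not in STOP_WORDS)


def feature_frequency_profile(sequence, length):
    """Count every overlapping feature of the given length in the sequence."""
    return Counter(
        sequence[start:start + length]
        for start in range(len(sequence) - length + 1)
    )


book = "The cat and the dog sat; it rained."
genome_like = book_to_string(book)
profile = feature_frequency_profile(genome_like, 3)

print(genome_like)
print(profile.most_common(5))
```

The same `feature_frequency_profile` function would work unchanged on a genome or proteome string, since it only ever sees a sequence of letters.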
Once trees had been constructed for a wide range of feature lengths, they selected the ‘optimal’ tree as the one with the most stable tree topology, by identifying the two adjacent trees (with a feature-length difference of one) that have minimal topological difference. The optimal tree of the ‘books’ revealed that similar books (author/topic/genre) were grouped together (Sims 2009a). Encouraged by the result, they tested the applicability of their FFP method on living organisms: Kim and colleagues repeated the process using whole-genome sequences of some of the best characterised mammals (Sims 2009b) with good agreement.
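The selection of the most stable tree can be illustrated with a small Python sketch. The article does not say how ‘topological difference’ is measured, so this sketch assumes a simple count of the groupings (clades) found in one tree but not in the other, in the spirit of the Robinson-Foulds distance. The trees, written as nested tuples, and the feature lengths are invented for illustration.

```python
def collect_clades(tree):
    """Return (leaves, clades) for a rooted tree written as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    clades = set()
    for child in tree:
        child_leaves, child_clades = collect_clades(child)
        leaves = leaves | child_leaves
        clades |= child_clades
    clades.add(leaves)  # every internal node defines one grouping of leaves
    return leaves, clades


def topology_difference(tree_a, tree_b):
    """Number of groupings present in exactly one of the two trees (assumed measure)."""
    return len(collect_clades(tree_a)[1] ^ collect_clades(tree_b)[1])


def most_stable_pair(trees_by_length):
    """Find the adjacent feature lengths whose trees differ least: (difference, l, l + 1)."""
    lengths = sorted(trees_by_length)
    best = None
    for shorter, longer in zip(lengths, lengths[1:]):
        if longer - shorter != 1:
            continue  # only compare trees whose feature lengths differ by one
        diff = topology_difference(trees_by_length[shorter], trees_by_length[longer])
        if best is None or diff < best[0]:
            best = (diff, shorter, longer)
    return best


# Hypothetical trees of four 'books' built at feature lengths 7 to 10.
trees = {
    7: ((("A", "C"), "B"), "D"),
    8: ((("A", "B"), "C"), "D"),
    9: ((("A", "B"), "C"), "D"),
    10: (("A", "B"), ("C", "D")),
}
print(most_stable_pair(trees))
```

In this toy example the trees at lengths 8 and 9 are identical, so that pair is reported as the most stable; which of the two trees is then taken as the ‘optimal’ one is a choice the sketch leaves open.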
A note about the properties of ‘features’: Let us look at three different character strings built from two different sets of letters. In each string, the shortest possible feature length is one and the longest possible feature length is six. If the feature length were one, the FFPs of the two strings containing As and Bs would be identical, because they each contain three of each letter, even though the strings themselves are different. If we instead build an FFP using a feature length of three, we can see that two of the strings share features, but with different frequencies (Figure 4).
Thus, the feature length plays an important role in telling similar character strings apart. The ideal feature length is the one that yields the highest number of different features able to distinguish the strings from one another.
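The effect of feature length can be checked with a short Python sketch. The three six-letter strings below are hypothetical stand-ins chosen to behave as the text describes; they are not necessarily the strings shown in Figure 4.

```python
from collections import Counter


def count_features(string, length):
    """Count every overlapping feature of the given length."""
    return Counter(string[i:i + length] for i in range(len(string) - length + 1))


# Hypothetical strings: two built from A and B (three of each), one from C and D.
first, second, third = "ABABAB", "ABABBA", "CDCDCD"

for length in (1, 3):
    print(f"feature length {length}")
    for string in (first, second, third):
        print("  ", string, dict(count_features(string, length)))

# Features of length three that the two A/B strings have in common.
shared = set(count_features(first, 3)) & set(count_features(second, 3))
print("shared features:", sorted(shared))
```

At a feature length of one, the first two strings are indistinguishable. At a feature length of three, they still share the features ABA and BAB, but the first string contains each of them twice and the second only once, so their profiles now differ.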
Shaping the whole-proteome tree of life
The Berkeley researchers have taken the FFP method one step further by considering whole proteomes (the complete set of proteins encoded by all genes). They have now completed analysis of the proteomes derived from whole-genome sequences of over 4,000 organisms available in the public genome database (the National Center for Biotechnology Information, National Institutes of Health, USA). Their FFP tree shows many similarities in grouping pattern to the historic morphological and conventional alignment-based gene Trees of Life, but some fascinating and sometimes radical differences in the order and timing of emergence of the early groups (Choi, Kim, 2020).
This method can shed light on a previously understudied evolutionary phenomenon: the ‘deep burst’ of life.
On grouping, the whole-proteome FFP tree of life suggests: 1) All organisms studied assort into two ‘trunks’ (domains of ‘Akarya’ (Prokarya) and Eukarya), six ‘major branches’ (kingdoms of Archaea, Bacteria, Protists, Fungi, Plants, and Animals), or 35+ ‘minor branches’ (‘minor groups’); 2) the third major group, the Protists (single-cell Eukaryotes), subdivides into three types: the first emerging at the basal position to all multi-cellular eukaryotes; the second and the third at the basal positions to all plant and all animal major groups, respectively (Figure 1).
On the order and the progression stage of emergence of the various groups, the differences are much more pronounced and unexpected: 1) life started with the simultaneous emergence of the founders of two domains, Akarya (Prokarya) and Eukarya, not sequentially from Prokarya to Eukarya as predicted by all alignment-based trees; 2) there was a staged ‘deep burst’ of organism diversity, where the founders of all six major branches of all living organisms emerged within the first 0.20% of the entire progression of the evolution of life (Figure 5). Following this, a more gradual and step-wise evolution seems to have taken place for the remaining 99.80% of the evolutionary progression scale, similar to the gradual evolution by natural selection that was described by Darwin. Kim’s group sees in the staged ‘deep burst’ of life’s diversity some similarity to the ‘Big Bang’ theory of the origin of the universe, as suggested by Koonin (2007), but with some key differences. Kim and his colleagues observed a similar ‘burst’ of the founders of all major groups of insects near the root of the insect tree (Choi, Kim, Kim, 2020). The team is now focusing its research efforts on the detailed features of the whole-proteome trees of other major groups and minor groups.
Our image of the Tree of Life has been evolving with the explosion of new whole-genome sequences and the evolution of the scientific tools and methods used to study it. With their alignment-free methods, Kim and his team at Berkeley are challenging the way we construct our evolutionary history once again. Their use of massive whole-genome information and their ability to think outside the box are unsettling the status quo and driving our understanding of evolution to potentially exciting new places.
What sparked your interest in phylogeny?
My research interest started with my desire to get detailed views of the 3D structures of individual proteins and nucleic acids to find their architectural motifs and their relationship to their respective functions. After some successes, my interest expanded to get a view of all the structural motifs in the protein structural ‘universe’. Finally, recognising the fact that most of the 3D structures of proteins are encoded in their amino acid sequences, my interest expanded further to find ways to get a ‘wide-angle view’ of all organisms, each represented by their proteome or genome, and to find possible evolutionary relationships among them.
Do you think your background in crystallography gave you a different perspective?
In crystallography, the more comprehensive (high resolution) data one uses, the more true and accurate view of the whole molecule one gets. Thus, to construct a new organism Tree of Life, I wanted to start with the most comprehensive data available for each organism, which, at present, are whole-genome or whole-proteome sequences.