arxiv – recent articles on phylogenetics

obs: manuscripts deposited on the arxiv repository may have not been peer reviewed.

An alternative marginal likelihood estimator for phylogenetic models. (arXiv:1001.2136v1 [stat.CO]) by Serena Arima, Luca Tardella

Bayesian phylogenetic methods are generating noticeable enthusiasm in the field of molecular systematics. Several phylogenetic models are often at stake and different approaches are used to compare them within a Bayesian framework. The Bayes factor, defined as the ratio of the marginal likelihoods of two competing models, plays a key role in Bayesian model selection. However, its computation is still a challenging problem. Several computational solutions have been proposed none of which can be considered outperforming the others simultaneously in terms of simplicity of implementation, computational burden and precision of the estimates. Available Bayesian phylogenetic software has privileged so far the simplicity of the harmonic mean estimator (HM) and the arithmetic mean estimator (AM). However it is known that the resulting estimates of the Bayesian evidence in favor of one model are often biased and inaccurate up to having an infinite variance so that the reliability of the corresponding conclusions is doubtful.

We focus on an alternative generalized harmonic mean (GHM) estimator which, recycling MCMC simulations from the posterior, shares the computational simplicity of the original HM estimator, but, unlike it, overcomes the infinite variance issue. We show that the Inflated Density Ratio (IDR) estimator when applied to some standard phylogenetic benchmark data, produces fully satisfactory results outperforming those simple estimators currently provided by most of the publicly available software.

Pure Parsimony Xor Haplotyping. (arXiv:1001.1210v1 [cs.CE]) by Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, Romeo Rizzi

The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given SNP. Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.

Visualizing the Structure of Large Trees. (arXiv:1001.0951v2 [stat.AP]) by Burcu Aydin, Gabor Pataki, Haonan Wang, Alim Ladha, Elizabeth Bullitt, J.S. Marron

This study introduces a new method of visualizing complex tree structured objects. The usefulness of this method is illustrated in the context of detecting unexpected features in a data set of very large trees. The major contribution is a novel two-dimensional graphical representation of each tree, with a covariate coded by color. The motivating data set contains three dimensional representations of brain artery systems of 105 subjects. Due to inaccuracies inherent in the medical imaging techniques, issues with the reconstruction algo- rithms and inconsistencies introduced by manual adjustment, various discrepancies are present in the data. The proposed representation enables quick visual detection of the most common discrepancies. For our driving example, this tool led to the modification of 10% of the artery trees and deletion of 6.7%. The benefits of our cleaning method are demonstrated through a statistical hypothesis test on the effects of aging on vessel structure. The data cleaning resulted in improved significance levels.

Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction. (arXiv:0912.4196v1 [q-bio.GN]) by Cedric Chauve, Utz-Uwe Haus, Tamon Stephen, Vivija P. You

A binary matrix has the Consecutive Ones Property (C1P) if its columns can be ordered in such a way that all 1’s on each row are consecutive. A Minimal Conflicting Set is a set of rows that does not have the C1P, but every proper subset has the C1P. Such submatrices have been considered in comparative genomics applications, but very little is known about their combinatorial structure and efficient algorithms to compute them.

We first describe an algorithm that detects rows that belong to Minimal Conflicting Sets. This algorithm has a polynomial time complexity when the number of 1’s in each row of the considered matrix is bounded by a constant. Next, we show that the problem of computing all Minimal Conflicting Sets can be reduced to the joint generation of all minimal true clauses and maximal false clauses for some monotone boolean function. We use these methods on simulated data related to ancestral genome reconstruction to show that computing Minimal Conflicting Set is useful in discriminating between true positive and false positive ancestral syntenies. We also study a dataset of yeast genomes and address the reliability of an ancestral genome proposal of the Saccahromycetaceae yeasts.

Combining Partial Order Alignment and Progressive Near-Optimal Alignment. (arXiv:0912.2813v1 [cs.DS]) by Dai Tri Man Le

In this paper, I proposed to utilize partial-order alignment technique as a heuristic method to cope with the state-space explosion problem in progressive near-optimal alignment. The key idea of my approach is a formal treatment of progressive partial order alignment based on the graph product construction.

Automated languages phylogeny from Levenshtein distance. (arXiv:0911.3280v1 [cs.CL]) by Maurizio Serva

In order to verify hypotheses concerning relationship between two languages it is necessary to define evaluate their distance from lexical differences. This concept seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative words lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin.

Recently, we proposed a new automated method which has some advantages: the first is that it avoids subjectivity, the second is that results can be replicated by other scholars assuming that the database is the same, the third is that no specific linguistic knowledge is requested, and the last, but surely not the least, is that it allows for rapid comparison of a very large number of languages. We applied our method to the Indo-European and to the Austronesian groups considering, in both cases, fifty different languages and we obtained two genealogical trees using the Unweighted Pair Group Method Average. The trees are similar to those found by previous research with some important differences concerning the position of few languages and subgroups. Indeed, we think that these differences carry some fresh information about the structure of the tree and about the phylogenetic relations inside the families.

Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding. (arXiv:0911.0650v1 [q-bio.QM]) by Yoojin Hong, Jaewoo Kang, Dongwon Lee, Randen L. Patterson, Damian B. van Rossum

We theorize that phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of phylogenetic profiles is the interoperable data format (e.g. alignment information, physiochemical information, genomic information, etc). Indeed, we have previously demonstrated Position Specific Scoring Matrices (PSSMs) are an informative M-dimension which can be scored from quantitative measure of embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, even in the twilight zone of sequence similarity (<25% identity)(1-5). Although powerful, our previous embedding strategy suffered from contaminating alignments(embedded AND unmodified) and computational expense.

Herein, we describe the logic and algorithmic process for a heuristic embedding strategy (Adaptive GDDA-BLAST, Ada-BLAST). Ada-BLAST on average up to ~19-fold faster and has similar sensitivity to our previous method. Further, we provide data demonstrating the benefits of embedded alignment measurements for isolating secondary structural elements and the classifying transmembrane-domain structure/function. We theorize that sequence-embedding is one of multiple ways that low-identity alignments can be measured and incorporated into high-performance PSSM-based phylogenetic profiles.

Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement. (arXiv:0910.5780v1 [q-bio.GN]) by Aaron E. Darling, Bob Mau, Nicole T. Perna

Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods.

We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content conserved among all taxa and total unique content of 15.2Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from this http URL ( .


About Leonardo de Oliveira Martins

I am currently a postdoc working in David Posada's lab at the University of Vigo, Spain. I did my Ph.D. at the Universty of Tokyo, and have both an M.Sc. in Biotechnology and a B.Sc. in Molecular Sciences completed at the University of Sao Paulo, Brasil.
This entry was posted in Abstracts, New Publications and tagged , , , , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s