I’ll change the info on Research Blogging to point only to the new address. I copied some blog posts from here to there, but to avoid duplication I didn’t include the researchblogging meta-data.

I hope to see you at my new home!

This figure doesn’t seem essential for the comprehension of the article (it was not present in the arXiv version of the manuscript, for instance), but you may feel lost as I did when reading the paragraph that depends on it. So here it is for completeness:

Oh, and by the way, the paper discusses the kinds of phylogenetic models that lead to statistical inconsistency (SIN) of tree inference; the “no common mechanism” models are those where the number of generating processes (each with its own parameters) increases as the sequence length increases.

**Reference**:

You do not see anyone publishing papers with titles like “N-fold speed up of algorithm X by using N computers”.

The first time I heard a similar criticism was from Alexandros Stamatakis, who warned that we should always compare the performance of a GPU algorithm against the same effort focused on a multi-core CPU environment. Since then I have tried to be more cautious about grand statements of improvement. And yes, I completely agree with them, even though I don’t attach so much importance to speed gains.

My feeling is that thinking about novel GPU algorithms is always worth the effort, since a different path for doing a given computation, though currently slow or inefficient, can lead to faster algorithms in the future. It can also give feedback to vendors on where they should put their efforts – provided there’s competition, which in the end is the problem with the GPU approach…

(via Jason Stajich)

So the scientists, knowing the kind of minds they needed to emerge from the school system, teamed up with the school teachers to understand their challenges and discuss some potential solutions: the teachers consistently reported boredom among pupils when confronted by the traditional learning experience.

(doi: 10.1002/bies.201090050)

The motivation for the development of the model was to be able to say, for a given mosaic structure, whether the breakpoints can be explained by one recombination event or several. The recombination mosaic structure is usually inferred assuming the parental sequences (those not recombining) are known beforehand – in the figure they are the subtypes B and F – and recombination is then inferred when there is a change in the parental closest to the query sequence. Another problem is that it is common to analyze each query sequence independently against the parentals – if all one wants is the “coloring”, then this might be enough. For the above figure I analyzed each query sequence against one reference sequence from each of the subtypes B, F and C (thus comprising a quartet for each analysis). And we know that these mosaics don’t tell the whole story.

If we know the topologies for both segments separated by the recombination breakpoint, then we can determine, at least in theory, the minimum number of recombination events necessary to explain the difference (the real number can be much larger, since we only detect those that lead to a change in the topology…). This minimum number is the Subtree Prune-and-Regraft (SPR) distance, and is related to the problem of detecting horizontal gene transfers. In our case we devised an approximation to this distance based on the disagreement between all pairs of bipartitions belonging to the topologies: at each iteration we remove the smallest number of leaves such that the topologies become more similar, and our approximate “uSPR distance” is the number of times we iterate this removal.
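The bipartition bookkeeping behind such comparisons can be sketched in a few lines – a toy illustration, not the biomc2 code: here trees are nested tuples and we compute the plain Robinson-Foulds count, the simpler distance mentioned below as a competitor.

```python
def clades(tree, acc):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    if not isinstance(tree, tuple):          # a leaf: just return its own label
        return frozenset([tree])
    leaves = frozenset().union(*(clades(child, acc) for child in tree))
    acc.add(leaves)                          # record this internal node's clade
    return leaves

def rf_distance(t1, t2):
    """Robinson-Foulds count: clades present in one topology but not the other."""
    c1, c2 = set(), set()
    clades(t1, c1)
    clades(t2, c2)
    return len(c1 ^ c2) // 2
```

The uSPR approximation goes further by iteratively pruning the fewest leaves that reconcile the two clade sets, but the comparison of bipartitions is the common starting point.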

It is just an approximation, but it is closer to the true SPR distance than the Robinson-Foulds or the (complementary) Maximum Agreement Subtree distances, which compete in speed with our algorithm. For larger topologies it apparently works better than for smaller ones, but this is an artifact of the simulation – one realized SPR “neutralizes” previous ones, and this happens more often for small trees.

Our Bayesian model works with a partitioning of the alignment, where recombination can only occur between segments and never within them. This doesn’t pose a problem in practice, since it will “shift” the recombinations to the border – the idea is that several neighboring breakpoints are equivalent to one breakpoint with a larger distance. These segments could be composed of one site each, but for computational reasons we usually set them to five or ten base pairs. The drawback is the loss of the ability to detect rate heterogeneity within a segment.

Each segment will have its own topology (represented by Tx, Ty and Tz in the figure), which will coincide for many neighboring segments since we have the distance between them as a latent variable penalizing against too many breakpoints:

This is a truncated Poisson distribution, modified so that it can handle underdispersion – the parameter *w* will make the Poisson sharper around the mean – and each potential breakpoint has its own lambda and *w*.
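One way to realize such a modification – a sketch only, since the exact parameterization used in biomc2 may differ – is to raise the truncated Poisson probability mass function to the power *w* and renormalize over the allowed distances:

```latex
P(d \mid \lambda, w) \;=\; \frac{\left( e^{-\lambda}\, \lambda^{d} / d! \right)^{w}}
{\sum_{j=0}^{d_{\max}} \left( e^{-\lambda}\, \lambda^{j} / j! \right)^{w}},
\qquad d = 0, 1, \ldots, d_{\max}
```

With *w* = 1 this is the ordinary truncated Poisson; *w* > 1 concentrates the mass around the mode, producing the underdispersion (sharpness around the mean) described above.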

The posterior distribution will have *K* terms for the segments (topology likelihood and evolutionary model priors) and *K–1* terms for the potential breakpoints (distances between segments), as well as the hyper-priors. I use the term “potential breakpoint” because if two consecutive segments happen to have the same topology (distance equal to zero) then we don’t have an actual breakpoint – again, considering only the recombinations that change the topology. This posterior distribution is sampled through MCMC in a program called biomc2.

To test the algorithm, we ran simulations with eight- and twelve-taxa datasets, simulating one (for the eight-taxa datasets) or two recombinations per breakpoint. We present the output of the program biomc2.summarise, which interprets the posterior samples for one replicate: based on the posterior distribution of distances for each potential breakpoint, we neglect the actual distances and focus on whether they are larger than zero (second figure of the panel). Based on this multimodal distribution of breakpoints we infer the regions where no recombination was detected (which we call “cold spots”), credible intervals around each mode (red bars on top) or based on all values (red dots at bottom, together with the cold spots).
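The summarising step can be sketched as follows – a hypothetical illustration, assuming the posterior samples are available as lists of per-breakpoint distances (biomc2.summarise itself reads its own file format):

```python
def breakpoint_support(samples):
    """For each potential breakpoint, the fraction of MCMC samples in which
    its topological distance is non-zero, i.e. the posterior probability
    that it is an actual breakpoint."""
    n_samples = len(samples)
    n_breaks = len(samples[0])        # K-1 potential breakpoints per sample
    return [sum(s[k] > 0 for s in samples) / n_samples
            for k in range(n_breaks)]
```

Breakpoints whose support stays near zero across the alignment would correspond to the “cold spots” mentioned above.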

We also looked at the average distance for each potential breakpoint per replicate dataset, and show that the software can indeed correctly infer the location and amount of recombination for most replicates. It is worth remembering that we were generous in our simulations, in that there is still phylogenetic signal preserved in alignments with many mutations and a few recombinations. If recombination is much more frequent, then any two sites might represent distinct evolutionary histories and our method will fail.

We then analyzed HIV genomic sequences with a very similar mosaic structure, as inferred by cBrother (an implementation of DualBrothers). Here it is important to say that we used cBrother only to estimate the mosaic structure of the recombinants, running an independent analysis for each sequence against three reference parental sequences. Therefore the figure is not a direct comparison of the programs, contrary to what its unfortunate caption might lead us to think. The distinction is between analyzing all sequences at once or independently, in quartets of sequences. If we superpose the panels it might become easier to compare them:

The curve in blue shows the positions where there is a change in closest parental for the query sequence, if each query sequence is analyzed neglecting the others. In red we have our algorithm estimating recombinations between all eleven sequences (eight query and three parental sequences). We can see that:

- all breakpoints detected by the independent analysis were also detected by our method;
- many recombinations were detected only when all sequences were analyzed at once, indicating that they do not involve the parental sequences – *de novo* recombination;
- if we look at the underlying topologies estimated by our method (figure S2 of the PLoS ONE paper), we see that the breakpoints also detected by the independent analysis in fact involve the parentals while the others don’t;
- biomc2 not only infers the location of recombination, but also its “strength” – given by the distance between topologies.

Finally, we show two further developments of the software: a point estimate for the recombination mosaic, and the relevance of the chosen prior over distances. The point estimate came from the need for a more easily interpretable summary of the distribution of breakpoints: instead of looking at the whole multimodal distribution, we may want to pay attention only to the peaks, or some other similar measure. This is a common problem in bioinformatics: representing a collection of trees by a single one, or finding a protein structure that best represents an ensemble of structures. In our case we have a collection of recombination mosaics (one per sample of the posterior distribution), and we elect the one with the smallest distance from all other mosaics – we had to devise a distance for this as well…
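As a toy sketch of such a medoid-style point estimate – the mosaic distance here (symmetric difference of breakpoint position sets) is a made-up stand-in for the one devised in the paper:

```python
def mosaic_distance(m1, m2):
    """Hypothetical distance between two mosaics, each encoded as the set of
    its breakpoint positions: how many breakpoints they disagree on."""
    return len(set(m1) ^ set(m2))

def medoid_mosaic(mosaics):
    """Elect the sampled mosaic with the smallest total distance to all others."""
    return min(mosaics, key=lambda m: sum(mosaic_distance(m, other)
                                          for other in mosaics))
```

The same medoid idea applies whenever one posterior sample must stand in for the whole collection, be it trees, structures or mosaics.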

To show the importance of the prior distribution of distances, we compared it with simplified versions, like setting the penalty parameter *w* fixed at a low or high value. The overall behavior for all scenarios is lower resolution around breakpoints, and for weaker penalties we reconstruct the topologies better than for stronger ones, at the cost of inferring spurious breakpoints more often. We also compared the original model with a simplification where the topological distance is neglected and the prior considers only if the topologies are equal or not. This is similar to what cBrother and other programs do, and by looking at the top panel we observe that the results were also equivalent (blue lines labeled “cBrother” and “m=0”). In the same panel we plot the performance using our original (“unrestricted”) model as a gray area.

I also submitted the poster to the F1000 Poster Bank, let’s see how it works…

**References:**

de Oliveira Martins, L., Leal, É., & Kishino, H. (2008). Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE, 3(7). DOI: 10.1371/journal.pone.0002651

Oliveira Martins, L., & Kishino, H. (2009). Distribution of distances between topologies and its effect on detection of phylogenetic recombination. Annals of the Institute of Statistical Mathematics, 62(1), 145-159. DOI: 10.1007/s10463-009-0259-8

98% identical, 100% wrong: per cent nucleotide identity can lead plant virus epidemiology astray (by Duffy, S., Seah, Y. M.)

(…) Many of these sentinel publications include viral sequence data, but most use that information only to confirm the virus’ species. When researchers use the standard technique of per cent nucleotide identity to determine that the new sequence is closely related to another sequence, potentially erroneous conclusions can be drawn from the results. Multiple introductions of the same pathogen into a country are being ignored because researchers know fast-evolving plant viruses can accumulate substantial sequence divergence over time, even from a single introduction. (…)

The virulence-transmission trade-off in vector-borne plant viruses: a review of (non-)existing studies (by Froissart, R., Doumayrou, J., Vuillaume, F., Alizon, S., Michalakis, Y.)

The adaptive hypothesis invoked to explain why parasites harm their hosts is known as the trade-off hypothesis, which states that increased parasite transmission comes at the cost of shorter infection duration. (…) We found only very few appropriate studies testing such a correlation, themselves limited by the fact that they use symptoms as a proxy for virulence and are based on very few viral genotypes. Overall, the available evidence does not allow us to confirm or refute the existence of a transmission–virulence trade-off for vector-borne plant viruses. (…)

Pathways to extinction: beyond the error threshold (by Manrubia, S. C., Domingo, E., Lazaro, E.)

(…) Current models of viral evolution take into account more realistic scenarios that consider compensatory and lethal mutations, a highly redundant genotype-to-phenotype map, rough fitness landscapes relating phenotype and fitness, and where phenotype is described as a set of interdependent traits. Further, viral populations cannot be understood without specifying the characteristics of the environment where they evolve and adapt. Altogether, it turns out that the pathways through which viral quasispecies go extinct are multiple and diverse.

Lethal mutagenesis and evolutionary epidemiology (by Martin, G., Gandon, S.)

The lethal mutagenesis hypothesis states that within-host populations of pathogens can be driven to extinction when the load of deleterious mutations is artificially increased with a mutagen, and becomes too high for the population to be maintained. (…) We derive the epidemiological and evolutionary equilibrium of the system. At this equilibrium, the density of the pathogen is expected to decrease linearly with the genomic mutation rate U. We also provide a simple expression for the critical mutation rate leading to extinction. Stochastic simulations show that these predictions are accurate for a broad range of parameter values. As they depend on a small set of measurable epidemiological and evolutionary parameters, we used available information on several viruses to make quantitative and testable predictions on critical mutation rates. In the light of this model, we discuss the feasibility of lethal mutagenesis as an efficient therapeutic strategy.

Mutational fitness effects in RNA and single-stranded DNA viruses: common patterns revealed by site-directed mutagenesis studies (by Sanjuan, R.)

The fitness effects of mutations are central to evolution, yet have begun to be characterized in detail only recently. Site-directed mutagenesis is a powerful tool for achieving this goal, which is particularly suited for viruses because of their small genomes. Here, I discuss the evolutionary relevance of mutational fitness effects and critically review previous site-directed mutagenesis studies. The effects of single-nucleotide substitutions are standardized and compared for five RNA or single-stranded DNA viruses infecting bacteria, plants or animals. (…)

I am not involved in the project, but I was in the very comfortable position of being one of the beta testers: all I needed to do was find the largest and most obscure datasets I had and try them, then complain to the authors about the smallest details. I tried some big datasets (I think it was influenza H3N2 HA and HIV-1 complete genomes from South America, around 2 and 4 Mbytes each), and my simulated alignments created “by hand” with PAML. And ALTER could handle them in the end: they even sent me a report explaining how each of my comments was used to improve the software, and asked me to try again until I felt satisfied.

The ALTER web server is a converter between multiple sequence alignment (MSA) formats, for DNA or protein, focused not only on the format itself (like FASTA or NEXUS) but on the software that generated the alignment and the software in which the alignment is going to be used (e.g. Clustal or MrBayes). They mention that this program-oriented format conversion is necessary since all useful software eventually violates the (outdated) format specifications. In their own words:

[D]uring the last years MSA’s formats have `evolved’ very much like the sequences they contain, with mutational events consisting of long names, extra spaces, additional carriage returns, etc.

The web service can automatically recognize the input format and generate output for several programs, in several formats. I found it very easy to use: as you proceed it automatically shows you the possible next steps on the same page. Another very nice feature is the possibility of collapsing duplicate (identical) sequences, working then only with the haplotypes (unique sequences). If later you need the information about the collapsed duplicates, check out the “info” panel at the bottom of the screen (inside the “log” window).

The obvious case when this elimination of duplicates is useful is when doing phylogenetic reconstruction (in many cases you can safely remove identical sequences), but another option offered by ALTER is to remove very similar sequences, where you can define the threshold of similarity. Sometimes when I’m doing a preliminary analysis on a dataset, I want to discard sequences too similar in order to get an overall picture of the data, and some other times I **must** remove closely-related sequences since my recombination-detection program has a limitation on the number of taxa…
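The duplicate-collapsing step itself is simple to sketch – a minimal illustration, not ALTER’s code; sequences are assumed pre-aligned and compared for exact identity:

```python
def collapse_haplotypes(alignment):
    """Group identical sequences: one haplotype per group, plus the names of
    the duplicates it represents (the kind of report ALTER's log gives back)."""
    haplotypes = {}
    for name, seq in alignment:
        haplotypes.setdefault(seq, []).append(name)
    return haplotypes
```

Recovering the original dataset is then just a matter of re-expanding each haplotype into its list of names.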

Besides the user-friendly web service, they also offer a geek-friendly API – if you want your program to communicate directly with the service – and the source code, licensed under the LGPL.

**Reference**:

Glez-Pena, D., Gomez-Blanco, D., Reboiro-Jato, M., Fdez-Riverola, F., & Posada, D. (2010). ALTER: program-oriented conversion of DNA and protein alignments. Nucleic Acids Research. DOI: 10.1093/nar/gkq321

The main result of the paper, published in Genome Biology and Evolution, is that there is a correlation between the mean number of anatomical systems (human tissues or cell types) where the gene is expressed and the time when the gene appeared on the phylogeny of the species. In other words, recent gene families are expressed in fewer anatomical systems (are more specific) than ancient ones. An anatomical system is a hierarchical classification of human tissues (e.g. the first level of the hierarchy: nervous, dermal, embryo, etc) available from gene expression data. So the age of appearance of a gene is an indicator of its specificity. Since the genes are subject to duplication we may have more than one member of the gene family in the same species, and the authors show that this correlation is maintained if we consider the appearance of the gene itself (as a result of duplication) or the appearance of the whole gene family to which the gene belongs.

They worked with gene families identified by MANTiS, a pipeline that 1) downloads data from metazoan genomes at ENSEMBL, 2) infers the gene tree based on the protein alignment of the gene family and 3) detects duplications through a reconciliation with a given species tree. Each gene tree is produced by EnsemblCompara which, as I understand it, employs an extension of “reciprocal best hits” (that allows for many-to-many relations) to find the members of the family, and then maximum likelihood to find the tree itself. I will talk more about gene tree/species tree reconciliation in the future, but for now it is enough to say that it finds the minimal set of nodes on the gene tree that represent duplications. We have an example of such a reconciled gene tree below, where the duplications are represented by the red boxes:

MANTiS creates a new character (the brown polygons, which I think of as an orthologous group) for each duplication event, and the phylogenetic profile generated by these characters is then used to calculate the branch lengths of the species tree through a least squares approach. The phylogenetic profiles are represented by 0’s and 1’s in the inset figure above, from which a distance matrix must be calculated in order to obtain the branch lengths.
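As a minimal sketch of that step – hypothetical, assuming the profiles are plain 0/1 lists; MANTiS presumably applies its own distance correction before the least squares fit:

```python
def profile_distance_matrix(profiles):
    """Pairwise Hamming distances between presence/absence (0/1) phylogenetic
    profiles: the number of characters (duplication events or families) in
    which two species differ. Branch lengths could then be fit to this matrix
    by least squares."""
    names = list(profiles)
    return {(a, b): sum(x != y for x, y in zip(profiles[a], profiles[b]))
            for a in names for b in names}
```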

In the study two datasets were created for the presence/absence of genes: one called “families only” composed of one character for each single gene and for each protein family, and another called “with duplications” where a new character is created for each duplication event. Both analyses were necessary since gene gain through duplications is important in explaining genome size increase.

MANTiS creates a database relating each gene to its biological function and anatomical system: the biological processes and molecular functions (ontology terms) of protein families are given by the PANTHER database for human, mouse, rat and D. melanogaster, while the gene expression data (related to the anatomical systems) comes from eGenetics, GNF and HMDEG. When comparing the time of appearance of the gene (as explained above) and the expression data for the genes we have a figure like the following:

We must notice that in this graph the X axis is inverted (that is, left is older with the present day at the right) giving the impression of a negative correlation. So older gene families – or duplications – are expressed in more cell types in humans. Similar results were obtained using rat expression data – since the expression datasets had information for both – or using the other expression datasets.

The authors say that a possible explanation for this behaviour is the increase in the number of distinct cell types (blue line, notice the inverted axis again :D), where new genes are likely to be more specific to a cell type (which may have appeared recently itself). Associated with this explanation is the subfunctionalization of duplicated genes, and the tendency to subfunctionalize (“specialize”) can explain the decreased extent of expression. The subfunctionalization process itself might be related to the generation of a new cell phenotype.

One shortcoming of the analysis is that the gene family inference might fail to detect distantly related genes, and therefore what appears to be a gene gain (the “birth” of a new gene family) might be in fact a duplication of a more ancient single gene family. For example if after the duplication number 3 on the first figure the sequences diverged too much, we might wrongly classify them as two gene families. But to be free from this problem is a tall order. The authors also call our attention to the problem of low coverage of some genomes and taxonomic bias.

**References**:

Milinkovitch, M., Helaers, R., & Tzika, A. (2009). Historical Constraints on Vertebrate Genome Evolution. Genome Biology and Evolution, 2010, 13-18. DOI: 10.1093/gbe/evp052

Tzika, A., Helaers, R., Van de Peer, Y., & Milinkovitch, M. (2007). MANTiS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics, 24(2), 151-157. DOI: 10.1093/bioinformatics/btm567

The Tile64 card is composed of 64 processor cores, each core running its own Linux OS and standard programs, and communicating through the Tilera API. The Tile64 is a System-on-Chip (SoC) that can be plugged into a PCI slot and used independently of the CPU. On the other hand it can handle only integer instructions, which limits its usability for numerical computations.

The Needleman–Wunsch algorithm is used for **global** sequence alignment. That is, given two sequences, it tries to maximize the alignment score while including as few insertions as possible in each of the sequences. It is closely related to the Smith-Waterman algorithm for **local** alignment, which tries to find the best-scoring pair of subsequences – where the scoring function is almost the same as for Needleman–Wunsch.

Both algorithms are dynamic programming methods where a matrix is built with the scores for all possible pairwise combinations (the solution is found by backtracking after the matrix is complete). After initialization of the matrix (first row and first column), the score of a cell can be calculated by looking at its immediate top, left and top-left neighbor cells, represented by the arrows in the figure below. For example, the score of cell q4d4 depends only on q4d3, q3d3 and q3d4.
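A minimal sketch of the matrix fill, with a toy scoring scheme (not the FastLSA implementation from the paper):

```python
def nw_matrix(query, ref, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix: cell (i, j) depends only on its
    top-left (diagonal), top, and left neighbours."""
    n, m = len(query), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap                      # first column: gaps in ref
    for j in range(1, m + 1):
        H[0][j] = j * gap                      # first row: gaps in query
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,  # diagonal: (mis)match
                          H[i - 1][j] + gap,    # from top: gap in ref
                          H[i][j - 1] + gap)    # from left: gap in query
    return H
```

The optimal global alignment score is `H[n][m]`, and the alignment itself is recovered by backtracking from that cell.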

In the article they use an implementation of the FastLSA algorithm, a parallel version of Needleman–Wunsch which, instead of storing the whole matrix, stores one row/column combination per block, since depending on the sequence length the memory requirements for the whole matrix can become prohibitive. In other words, it stores the score values only for a grid of rows and columns (e.g. at every ten sites). In [1] they claim that this implementation is therefore well suited for very long sequences, which cannot be handled, for instance, by the “needle” application of the EMBOSS package or the CUDA implementation of the Smith-Waterman algorithm [2].

The parallelism comes from noticing that the cells belonging to the same anti-diagonal (one such anti-diagonal is shown in gray) can be calculated independently. Thus distinct cores can calculate the scores of these cells at the same time, in so-called wavefront parallelism. Their solution achieved gains of 20 times over similar programs – even though their SoC implementation is in C while the other CPU implementations are in Java.
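The wavefront order itself can be sketched by grouping cells by anti-diagonal – an illustration of the dependency structure, not the Tile64 code:

```python
def antidiagonals(n, m):
    """Group the cells of an (n+1) x (m+1) DP matrix by anti-diagonal i + j = k.
    Cells within one group depend only on cells in earlier groups, so each
    group can be scored in parallel across cores (wavefront parallelism)."""
    for k in range(n + m + 1):
        yield [(i, k - i) for i in range(max(0, k - m), min(n, k) + 1)]
```

A scheduler would hand the cells of each successive group to the available cores, synchronizing once per anti-diagonal.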

**References:**

[1] Galvez, S., Diaz, D., Hernandez, P., Esteban, F., Caballero, J., & Dorado, G. (2010). Next-generation bioinformatics: using many-core processor architecture to develop a web service for sequence alignment. Bioinformatics, 26(5), 683-686. DOI: 10.1093/bioinformatics/btq017

[2] Manavski, S., & Valle, G. (2008). CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl 2). DOI: 10.1186/1471-2105-9-S2-S10

There is a paper (with discussion) in Statistical Science 2009, Vol. 24, No. 2 about Harold Jeffreys’ “Theory of Probability”, a book which is one of the foundations of Bayesian statistics. The Institute of Mathematical Statistics (responsible for the journal) has made the papers available on arXiv:

**Harold Jeffreys’s Theory of Probability Revisited** (Christian P. Robert, Nicolas Chopin, Judith Rousseau) DOI: 10.1214/09-STS284 (arXiv:0804.3173v7)

Published exactly seventy years ago, Jeffreys’s Theory of Probability (1939) has had a unique impact on the Bayesian community and is now considered to be one of the main classics in Bayesian Statistics as well as the initiator of the objective Bayes school. In particular, its advances on the derivation of noninformative priors as well as on the scaling of Bayes factors have had a lasting impact on the field. However, the book reflects the characteristics of the time, especially in terms of mathematical rigor. In this paper we point out the fundamental aspects of this reference work, especially the thorough coverage of testing problems and the construction of both estimation and testing noninformative priors based on functional divergences. Our major aim here is to help modern readers in navigating in this difficult text and in concentrating on passages that are still relevant today.

**Comment on “Harold Jeffreys’s Theory of Probability Revisited”** (José M. Bernardo) DOI: 10.1214/09-STS284E (arXiv:1001.2967v1)

The authors provide an authoritative lecture guide of Theory of Probability, where they clearly state that the more useful material today is that contained in Chapters 3 and 5, which respectively deal with estimation, and hypothesis testing. We argue that, from a contemporary viewpoint, the impact of Jeffreys proposals on those two problems is rather different, and we describe what we perceive to be the state of the question nowadays, suggesting that Jeffreys’s dramatically different treatment is not necessary, and that a joint objective approach to those two problems is indeed possible.

**Bayes, Jeffreys, Prior Distributions and the Philosophy of Statistics** (Andrew Gelman) DOI: 10.1214/09-STS284D (arXiv:1001.2968v1)

(…) In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in the process of data analysis.

**Comment: The Importance of Jeffreys’s Legacy** (Robert Kass) DOI: 10.1214/09-STS284A (arXiv:1001.2970v1)

Theory of Probability is distinguished by several high-level philosophical attitudes, some stressed by Jeffreys, some implicit. By reviewing these we may recognize the importance in this work in the historical development of statistics.

**Comment on “Harold Jeffreys’s Theory of Probability Revisited”** (Dennis Lindley) DOI: 10.1214/09-STS284F (arXiv:1001.3073v1)

I was taught by Harold Jeffreys, having attended his postgraduate lectures at Cambridge in the academic year 1946–1947, and also knew him when I joined the Faculty there. I thought I appreciated the Theory of Probability rather well, so was astonished to read this splendid paper, which so successfully sheds new light on the book by placing it in the context of recent developments.

**Comment on “Harold Jeffreys’s Theory of Probability Revisited”** (Stephen Senn) DOI: 10.1214/09-STS284B (arXiv:1001.2975v1)

I have always felt very guilty about Harold Jeffreys’s Theory of Probability (referred to as ToP, hereafter). I take seriously George Barnard’s injunction (Barnard, 1996) to have some familiarity with the four great systems of inference. I also consider it a duty and generally find it a pleasure to read the classics, but I find Jeffreys much harder going than Fisher, Neyman and Pearson fils or De Finetti. So I was intrigued to learn that Christian Robert and colleagues had produced an extensive chapter by chapter commentary on Jeffreys, honored to be invited to comment but apprehensive at the task.

**Comment on “Harold Jeffreys’s Theory of Probability Revisited”** (Arnold Zellner) DOI: 10.1214/09-STS284C (arXiv:1001.2985v1)

The authors are to be congratulated for their deep appreciation of Jeffreys’s famous book, Theory of Probability, and their very impressive, knowledgeable consideration of its contents, chapter by chapter. Many will benefit from their analyses of topics in Jeffreys’s book. As they state in their abstract, “Our major aim here is to help modern readers in navigating this difficult text and in concentrating on passages that are still relevant today.” From what follows, it might have been more accurate to use the phrase, “modern well-informed Bayesian statisticians” rather than “modern readers” since the authors’ discussions assume a rather advanced knowledge of modern Bayesian statistics.

**Rejoinder: Harold Jeffreys’s Theory of Probability Revisited** (Christian P. Robert, Nicolas Chopin, Judith Rousseau) DOI: 10.1214/09-STS284REJ (arXiv:0909.1008v2)

We are grateful to all discussants of our re-visitation for their strong support in our enterprise and for their overall agreement with our perspective. Further discussions with them and other leading statisticians showed that the legacy of Theory of Probability is alive and lasting.

**An alternative marginal likelihood estimator for phylogenetic models**. (arXiv:1001.2136v1 [stat.CO]) by Serena Arima, Luca Tardella

Bayesian phylogenetic methods are generating noticeable enthusiasm in the field of molecular systematics. Several phylogenetic models are often at stake and different approaches are used to compare them within a Bayesian framework. The Bayes factor, defined as the ratio of the marginal likelihoods of two competing models, plays a key role in Bayesian model selection. However, its computation is still a challenging problem. Several computational solutions have been proposed none of which can be considered outperforming the others simultaneously in terms of simplicity of implementation, computational burden and precision of the estimates. Available Bayesian phylogenetic software has privileged so far the simplicity of the harmonic mean estimator (HM) and the arithmetic mean estimator (AM). However it is known that the resulting estimates of the Bayesian evidence in favor of one model are often biased and inaccurate up to having an infinite variance so that the reliability of the corresponding conclusions is doubtful.

We focus on an alternative generalized harmonic mean (GHM) estimator which, recycling MCMC simulations from the posterior, shares the computational simplicity of the original HM estimator but, unlike it, overcomes the infinite-variance issue. We show that the Inflated Density Ratio (IDR) estimator, when applied to some standard phylogenetic benchmark data, produces fully satisfactory results, outperforming the simple estimators currently provided by most publicly available software.
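To see what the "recycling posterior draws" idea amounts to, here is a minimal sketch of the plain harmonic mean estimator in a toy conjugate-normal model where the exact marginal likelihood is known in closed form. The model, prior, and sample sizes are mine for illustration, not from the paper:

```python
import math
import random

random.seed(1)

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy conjugate model: y ~ N(theta, 1) with prior theta ~ N(0, 0.5).
# Exact marginal likelihood: N(y; 0, 1.5); posterior: N(y/3, 1/3).
y = 1.3
exact = norm_pdf(y, 0.0, 1.5)

# Harmonic mean estimator: recycle posterior draws and average 1/likelihood;
# the reciprocal of that average is a consistent estimator of the marginal likelihood.
draws = [random.gauss(y / 3, math.sqrt(1 / 3)) for _ in range(200_000)]
hm = 1.0 / (sum(1.0 / norm_pdf(y, t, 1.0) for t in draws) / len(draws))

print(f"exact marginal likelihood: {exact:.4f}")
print(f"harmonic mean estimate:    {hm:.4f}")
```

With this deliberately tight prior the estimator behaves well; make the prior diffuse and the very same code becomes an infinite-variance estimator, which is exactly the pathology the GHM/IDR construction is designed to avoid.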

**Pure Parsimony Xor Haplotyping**. (arXiv:1001.1210v1 [cs.CE]) by Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, Romeo Rizzi

The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given SNP. Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.

**Visualizing the Structure of Large Trees**. (arXiv:1001.0951v2 [stat.AP]) by Burcu Aydin, Gabor Pataki, Haonan Wang, Alim Ladha, Elizabeth Bullitt, J.S. Marron

This study introduces a new method of visualizing complex tree structured objects. The usefulness of this method is illustrated in the context of detecting unexpected features in a data set of very large trees. The major contribution is a novel two-dimensional graphical representation of each tree, with a covariate coded by color. The motivating data set contains three dimensional representations of brain artery systems of 105 subjects. Due to inaccuracies inherent in the medical imaging techniques, issues with the reconstruction algorithms and inconsistencies introduced by manual adjustment, various discrepancies are present in the data. The proposed representation enables quick visual detection of the most common discrepancies. For our driving example, this tool led to the modification of 10% of the artery trees and deletion of 6.7%. The benefits of our cleaning method are demonstrated through a statistical hypothesis test on the effects of aging on vessel structure. The data cleaning resulted in improved significance levels.

**Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction**. (arXiv:0912.4196v1 [q-bio.GN]) by Cedric Chauve, Utz-Uwe Haus, Tamon Stephen, Vivija P. You

A binary matrix has the Consecutive Ones Property (C1P) if its columns can be ordered in such a way that all 1’s on each row are consecutive. A Minimal Conflicting Set is a set of rows that does not have the C1P, but every proper subset has the C1P. Such submatrices have been considered in comparative genomics applications, but very little is known about their combinatorial structure and efficient algorithms to compute them.

We first describe an algorithm that detects rows that belong to Minimal Conflicting Sets. This algorithm has a polynomial time complexity when the number of 1’s in each row of the considered matrix is bounded by a constant. Next, we show that the problem of computing all Minimal Conflicting Sets can be reduced to the joint generation of all minimal true clauses and maximal false clauses for some monotone boolean function. We use these methods on simulated data related to ancestral genome reconstruction to show that computing Minimal Conflicting Sets is useful in discriminating between true positive and false positive ancestral syntenies. We also study a dataset of yeast genomes and address the reliability of an ancestral genome proposal for the Saccharomycetaceae yeasts.
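Both definitions in this abstract are concrete enough to check by hand on tiny matrices. The brute-force sketch below (illustrative only; the paper's whole point is doing this efficiently) tests the C1P by trying every column order, and its second example is a minimal conflicting set: the full matrix fails the C1P while every proper subset of its rows passes.

```python
from itertools import permutations

def rows_consecutive(matrix, order):
    # Under the given column order, check that each row's 1s are contiguous.
    for row in matrix:
        bits = [row[j] for j in order]
        ones = [i for i, b in enumerate(bits) if b]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def has_c1p(matrix):
    # Brute force over all column orders; feasible only for tiny matrices.
    ncols = len(matrix[0])
    return any(rows_consecutive(matrix, p) for p in permutations(range(ncols)))

# This matrix has the C1P (the identity column order already works).
yes = [[1, 1, 0, 0],
       [0, 1, 1, 0],
       [0, 0, 1, 1]]

# A minimal conflicting set: three pairwise-overlapping rows on three columns.
# No linear order can make all three pairs of columns adjacent at once,
# but dropping any single row restores the C1P.
no = [[1, 1, 0],
      [0, 1, 1],
      [1, 0, 1]]

print(has_c1p(yes), has_c1p(no))
```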

**Combining Partial Order Alignment and Progressive Near-Optimal Alignment**. (arXiv:0912.2813v1 [cs.DS]) by Dai Tri Man Le

In this paper, I propose to utilize the partial-order alignment technique as a heuristic method to cope with the state-space explosion problem in progressive near-optimal alignment. The key idea of my approach is a formal treatment of progressive partial-order alignment based on the graph product construction.

**Automated languages phylogeny from Levenshtein distance**. (arXiv:0911.3280v1 [cs.CL]) by Maurizio Serva

In order to verify hypotheses concerning the relationship between two languages, it is necessary to define and evaluate their distance from lexical differences. This concept seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin.

Recently, we proposed a new automated method which has some advantages: the first is that it avoids subjectivity; the second is that results can be replicated by other scholars, assuming the database is the same; the third is that no specific linguistic knowledge is required; and the last, but surely not the least, is that it allows for rapid comparison of a very large number of languages. We applied our method to the Indo-European and Austronesian groups, considering in both cases fifty different languages, and we obtained two genealogical trees using the Unweighted Pair Group Method Average. The trees are similar to those found by previous research, with some important differences concerning the position of a few languages and subgroups. Indeed, we think that these differences carry some fresh information about the structure of the tree and about the phylogenetic relations inside the families.
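The core distance computation behind this kind of automated lexicostatistics is easy to sketch. Below is a toy version with made-up three-word lists (real analyses use Swadesh-style lists of around 200 items); normalizing each edit distance by the longer word's length is one common convention:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lexical_distance(list_a, list_b):
    # Average normalized edit distance across a paired word list.
    d = [levenshtein(x, y) / max(len(x), len(y)) for x, y in zip(list_a, list_b)]
    return sum(d) / len(d)

# Tiny illustrative word lists (meanings "one", "two", "three").
italian = ["uno", "due", "tre"]
spanish = ["uno", "dos", "tres"]
german  = ["eins", "zwei", "drei"]

print(lexical_distance(italian, spanish))  # Romance pair: small distance
print(lexical_distance(italian, german))   # cross-branch pair: larger distance
```

A matrix of such pairwise distances is exactly what a clustering method like UPGMA turns into a genealogical tree.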

**Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding**. (arXiv:0911.0650v1 [q-bio.QM]) by Yoojin Hong, Jaewoo Kang, Dongwon Lee, Randen L. Patterson, Damian B. van Rossum

We theorize that phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of phylogenetic profiles is the interoperable data format (e.g. alignment information, physiochemical information, genomic information, etc.). Indeed, we have previously demonstrated that Position Specific Scoring Matrices (PSSMs) are an informative M-dimension which can be scored from quantitative measures of embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative even in the twilight zone of sequence similarity (<25% identity) (1-5). Although powerful, our previous embedding strategy suffered from contaminating alignments (embedded AND unmodified) and computational expense.

Herein, we describe the logic and algorithmic process for a heuristic embedding strategy (Adaptive GDDA-BLAST, Ada-BLAST). Ada-BLAST is on average up to ~19-fold faster than, and similarly sensitive to, our previous method. Further, we provide data demonstrating the benefits of embedded alignment measurements for isolating secondary structural elements and for classifying transmembrane-domain structure/function. We theorize that sequence embedding is one of multiple ways that low-identity alignments can be measured and incorporated into high-performance PSSM-based phylogenetic profiles.

**Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement**. (arXiv:0910.5780v1 [q-bio.GN]) by Aaron E. Darling, Bob Mau, Nicole T. Perna

Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods.

We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content conserved among all taxa and total unique content of 15.2Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve/.
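The breakpoint-based scoring mentioned in the abstract rests on a simple notion that is worth seeing in code. The sketch below is not Mauve's sum-of-pairs breakpoint score, just a toy count of breakpoints between two signed gene orders, where an inversion shows up as exactly two broken adjacencies:

```python
def breakpoints(perm_a, perm_b):
    """Count breakpoints of signed permutation perm_a relative to perm_b:
    adjacencies of perm_a (including the implicit ends) that are not
    adjacencies, in the same relative orientation, of perm_b."""
    def adjacencies(p):
        # Frame the permutation with sentinels 0 and n+1, collect ordered pairs.
        ext = [0] + list(p) + [max(abs(x) for x in p) + 1]
        return {(ext[i], ext[i + 1]) for i in range(len(ext) - 1)}
    adj_b = adjacencies(perm_b)
    # An adjacency (x, y) survives if (x, y) or its reversal (-y, -x) occurs in b.
    return sum((x, y) not in adj_b and (-y, -x) not in adj_b
               for (x, y) in adjacencies(perm_a))

identity = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]   # same genome with the segment (2, 3) inverted
print(breakpoints(identity, identity))  # 0
print(breakpoints(inverted, identity))  # 2
```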

**Reversible jump Markov chain Monte Carlo**. (arXiv:1001.2055v1 [stat.ME]) by Y Fan, S A Sisson

To appear in the MCMC Handbook, S. P. Brooks, A. Gelman, G. Jones and X.-L. Meng (eds), Chapman & Hall.

**Likelihood-free Markov chain Monte Carlo**. (arXiv:1001.2058v1 [stat.ME]) by S A Sisson, Y Fan

To appear in the MCMC Handbook, S. P. Brooks, A. Gelman, G. Jones and X.-L. Meng (eds), Chapman & Hall.

**Consistency of the Maximum Likelihood Estimator for general hidden Markov models**. (arXiv:0912.4480v1 [math.ST]) by Randal Douc (CITI), Eric Moulines (LTCI), Jimmy Olsson, Ramon Van Handel

Consider a parametrized family of general hidden Markov models, where both the observed and unobserved components take values in a complete separable metric space. We prove that the maximum likelihood estimator (MLE) of the parameter is strongly consistent under a rather minimal set of assumptions. As special cases of our main result, we obtain consistency in a large class of nonlinear state space models, as well as general results on linear Gaussian state space models and finite state models. A novel aspect of our approach is an information-theoretic technique for proving identifiability, which does not require an explicit representation for the relative entropy rate. Our method of proof could therefore form a foundation for the investigation of MLE consistency in more general dependent and non-Markovian time series. Also of independent interest is a general concentration inequality for $V$-uniformly ergodic Markov chains.
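For readers who want a concrete anchor for "the MLE of the parameter" in a hidden Markov model, here is a toy sketch, far simpler than the general state-space setting of the paper: a two-state discrete HMM with invented parameters, whose likelihood is computed with the scaled forward algorithm and whose switching parameter is recovered by grid search.

```python
import math
import random

random.seed(7)

pi = [0.5, 0.5]                   # initial state distribution
B = [[0.9, 0.1], [0.2, 0.8]]      # emission probabilities B[state][symbol]

def forward_loglik(obs, pi, A, B):
    # Scaled forward algorithm: log p(obs) under a discrete HMM.
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    s = sum(alpha)
    loglik = math.log(s)
    alpha = [a / s for a in alpha]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
        s = sum(alpha)
        loglik += math.log(s)
        alpha = [a / s for a in alpha]
    return loglik

def simulate(T, p):
    # Sample a hidden chain that stays in its state with probability p,
    # then emit a symbol from B at each step.
    A = [[p, 1 - p], [1 - p, p]]
    state, obs = 0, []
    for _ in range(T):
        obs.append(0 if random.random() < B[state][0] else 1)
        state = state if random.random() < A[state][state] else 1 - state
    return obs

true_p = 0.9
obs = simulate(1000, true_p)
grid = [i / 20 for i in range(1, 20)]
mle_p = max(grid, key=lambda p: forward_loglik(obs, pi, [[p, 1 - p], [1 - p, p]], B))
print("grid MLE of the switching parameter:", mle_p)
```

Consistency, in the sense the paper proves under far weaker assumptions, is what guarantees that this estimate concentrates around the generating value as the sequence grows.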

**Inferring Multiple Graphical Models**. (arXiv:0912.4434v1 [stat.ME]) by Julien Chiquet, Yves Grandvalet, Christophe Ambroise

Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received considerable interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements but, as wet-lab data is typically scarce, several assays, in which the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by rendering the closeness assumption into an empirical prior or group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches.

**Bayesian model selection in Gaussian regression**. (arXiv:0912.4387v1 [math.ST]) by Felix Abramovich, Vadim Grinshtein

We consider a Bayesian approach to model selection in Gaussian linear regression, where the number of predictors might be much larger than the number of observations. From a frequentist view, the proposed procedure results in the penalized least squares estimation with a complexity penalty associated with a prior on the model size. We investigate the optimality properties of the resulting estimator. We establish the oracle inequality and specify conditions on the prior that imply its asymptotic minimaxity within a wide range of sparse and dense settings for “nearly-orthogonal” and “multicollinear” designs.

**Geometric Representations of Hypergraphs for Prior Specification and Posterior Sampling**. (arXiv:0912.3648v1 [math.ST]) by Simón Lunagómez, Sayan Mukherjee, Robert L. Wolpert

A parametrization of hypergraphs based on the geometry of points in $\mathbb{R}^D$ is developed. Informative prior distributions on hypergraphs are induced through this parametrization by priors on point configurations via spatial processes. This prior specification is used to infer conditional independence models or Markov structure of multivariate distributions. Specifically, we can recover both the junction tree factorization as well as the hyper Markov law. This approach offers greater control on the distribution of graph features than Erdős–Rényi random graphs, supports inference of factorizations that cannot be retrieved by a graph alone, and leads to new Metropolis–Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space. We illustrate the utility of this parametrization and prior specification using simulations.

**Bayesian Inference from Composite Likelihoods, with an Application to Spatial Extremes**. (arXiv:0911.5357v1 [stat.ME]) by Mathieu Ribatet, Daniel Cooley, Anthony C. Davison

Composite likelihoods are increasingly used in applications where the full likelihood is analytically unknown or computationally prohibitive. Although the maximum composite likelihood estimator has frequentist properties akin to those of the usual maximum likelihood estimator, Bayesian inference based on composite likelihoods has yet to be explored. In this paper we investigate the use of the Metropolis–Hastings algorithm to compute a pseudo-posterior distribution based on the composite likelihood. Two methodologies for adjusting the algorithm are presented and their performance on approximating the true posterior distribution is investigated using simulated data sets and real data on spatial extremes of rainfall.
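A minimal sketch of the basic mechanics may help here. Everything below is invented for illustration: an independence composite likelihood for a normal location parameter, a flat prior, and a vanilla random-walk Metropolis step targeting the resulting pseudo-posterior.

```python
import math
import random

random.seed(3)

# Toy data: n noisy observations of an unknown location mu (true value 2.0).
true_mu = 2.0
data = [random.gauss(true_mu, 1.0) for _ in range(100)]

def log_composite(mu):
    # Independence composite log-likelihood: sum of marginal normal
    # log-densities (constants dropped).
    return sum(-0.5 * (x - mu) ** 2 for x in data)

def metropolis(logpost, start, steps, scale):
    # Random-walk Metropolis targeting exp(logpost).
    cur, lp = start, logpost(start)
    chain = []
    for _ in range(steps):
        prop = cur + random.gauss(0, scale)
        lp_prop = logpost(prop)
        if math.log(random.random()) < lp_prop - lp:
            cur, lp = prop, lp_prop
        chain.append(cur)
    return chain

# Flat prior, so the pseudo-posterior is proportional to the composite likelihood.
chain = metropolis(log_composite, 0.0, 20_000, 0.3)
burned = chain[2_000:]
post_mean = sum(burned) / len(burned)
print(f"pseudo-posterior mean: {post_mean:.2f}")
```

For a genuine composite likelihood (e.g. a pairwise one on dependent data), this unadjusted pseudo-posterior is typically too concentrated; correcting its spread is exactly the adjustment problem the paper addresses.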

Genome Biology 2009, 10:R141 doi:10.1186/gb-2009-10-12-r141

Background

Accurate, high-throughput genotyping allows for the fine characterization of genetic ancestry. Here we applied recently developed statistical and computational techniques to the question of African ancestry in African Americans using data on more than 500,000 single nucleotide polymorphisms (SNPs) genotyped in 94 Africans of diverse geographic origins included in the HGDP, as well as 136 African Americans and 38 European Americans participating in the Atherosclerotic Disease Vascular Function and Genetic Epidemiology (ADVANCE) study. To focus on African ancestry, we reduced the data to include only those genotypes in each African American determined statistically to be African in origin.

Results

From cluster analysis, we found that all the African Americans are admixed in their African components of ancestry, with the majority contributions being from West and West-Central Africa, and only modest variation in these African ancestry proportions among individuals. Furthermore, by principal components analysis, we found little evidence of genetic structure within the African component of ancestry in African Americans.

Conclusions

These results are consistent with historical mating patterns among African Americans that are largely uncorrelated to African ancestral origins, and cast doubt on the general utility of mtDNA or Y chromosome markers alone to delineate the full African ancestry of African Americans. Our results also indicate that the genetic architecture of African Americans is distinct from Africans, and that the greatest source of potential genetic stratification bias in case control studies of African Americans derives from the proportion of European ancestry.

**Genome-wide patterns of population structure and admixture in West Africans and African Americans**, by Katarzyna Bryc, Adam Auton, Matthew R. Nelson, Jorge R. Oksenberg, Stephen L. Hauser, Scott Williams, Alain Froment, Jean-Marie Bodo, Charles Wambebe, Sarah A. Tishkoff, and Carlos D. Bustamante

PNAS January 12, 2010 vol. 107 no.2, 786-791 doi: 10.1073/pnas.0909559107

Quantifying patterns of population structure in Africans and African Americans illuminates the history of human populations and is critical for undertaking medical genomic studies on a global scale. To obtain a fine-scale genome-wide perspective of ancestry, we analyze Affymetrix GeneChip 500K genotype data from African Americans (n = 365) and individuals with ancestry from West Africa (n = 203 from 12 populations) and Europe (n = 400 from 42 countries). We find that population structure within the West African sample reflects primarily language and secondarily geographical distance, echoing the Bantu expansion. Among African Americans, analysis of genomic admixture by a principal component-based approach indicates that the median proportion of European ancestry is 18.5% (25th–75th percentiles: 11.6–27.7%), with very large variation among individuals. In the African-American sample as a whole, few autosomal regions showed exceptionally high or low mean African ancestry, but the X chromosome showed elevated levels of African ancestry, consistent with a sex-biased pattern of gene flow with an excess of European male and African female ancestry. We also find that genomic profiles of individual African Americans afford personalized ancestry reconstructions differentiating ancient vs. recent European and African ancestry. Finally, patterns of genetic similarity among inferred African segments of African-American genomes and genomes of contemporary African populations included in this study suggest African ancestry is most similar to non-Bantu Niger-Kordofanian-speaking populations, consistent with historical documents of the African Diaspora and trans-Atlantic slave trade.

Molecular Biology and Evolution 2010 27(2):221-224; doi:10.1093/molbev/msp259

We present SeaView version 4, a multiplatform program designed to facilitate multiple alignment and phylogenetic tree building from molecular sequence data through the use of a graphical user interface. SeaView version 4 combines all the functions of the widely used programs SeaView (in its previous versions) and Phylo_win, and expands them by adding network access to sequence databases, alignment with arbitrary algorithm, maximum-likelihood tree building with PhyML, and display, printing, and copy-to-clipboard of rooted or unrooted, binary or multifurcating phylogenetic trees. In relation to the wide present offer of tools and algorithms for phylogenetic analyses, SeaView is especially useful for teaching and for occasional users of such software. SeaView is freely available at http://pbil.univ-lyon1.fr/software/seaview.

Rapid Likelihood Analysis on Large Phylogenies Using Partial Sampling of Substitution Histories, by de Koning, A. P. J., Gu, W., Pollock, D. D.

Molecular Biology and Evolution 2010 27(2):249-265; doi:10.1093/molbev/msp228

Likelihood-based approaches can reconstruct evolutionary processes in greater detail and with better precision from larger data sets. The extremely large comparative genomic data sets that are now being generated thus create new opportunities for understanding molecular evolution, but analysis of such large quantities of data poses escalating computational challenges. Recently developed Markov chain Monte Carlo methods that augment substitution histories are a promising approach to alleviate these computational costs. We analyzed the computational costs of several such approaches, considering how they scale with model and data set complexity. This provided a theoretical framework to understand the most important computational bottlenecks, leading us to combine novel variations of our conditional pathway integration approach with recent advances made by others. The resulting technique (“partial sampling” of substitution histories) is considerably faster than all other approaches we considered. It is accurate, simple to implement, and scales exceptionally well with dimensions of model complexity and data set size. In particular, sampling unobserved substitution histories with the new method is much faster than with previously existing methods, and model parameter and branch length updates are independent of data set size. We compared the performance of methods on a 224-taxon set of mammalian cytochrome-b sequences. For a simple nucleotide substitution model, partial sampling was at least 10 times faster than the PhyloBayes program, which samples substitutions in continuous time, and about 100 times faster than when using fully integrated substitution histories. Under a general reversible model of amino acid substitution, the partial sampling method was 1,600 times faster than when using fully integrated substitution histories, confirming significantly improved scaling with model state-space complexity.
Partial sampling of substitutions thus dramatically improves the utility of likelihood approaches for analyzing complex evolutionary processes on large data sets.

Phylogenetic Distributions and Histories of Proteins Involved in Anaerobic Pyruvate Metabolism in Eukaryotes, by Hug, L. A., Stechmann, A., Roger, A. J.

Molecular Biology and Evolution 2010 27(2):311-324; doi:10.1093/molbev/msp237

Protists that live in low oxygen conditions often oxidize pyruvate to acetate via anaerobic ATP-generating pathways. Key enzymes that commonly occur in these pathways are pyruvate:ferredoxin oxidoreductase (PFO) and [FeFe]-hydrogenase (H2ase) as well as the associated [FeFe]-H2ase maturase proteins HydE, HydF, and HydG. Determining the origins of these proteins in eukaryotes is of key importance to understanding the origins of anaerobic energy metabolism in microbial eukaryotes. We conducted a comprehensive search for genes encoding these proteins in available whole genomes and expressed sequence tag data from diverse eukaryotes. Our analyses of the presence/absence of eukaryotic PFO, [FeFe]-H2ase, and H2ase maturase sequences across eukaryotic diversity reveal orthologs of these proteins encoded in the genomes of a variety of protists previously not known to contain them. Our phylogenetic analyses revealed: 1) extensive lateral gene transfers of both PFO and [FeFe]-H2ase in eubacteria, 2) decreased support for the monophyly of eukaryote PFO domains, and 3) that eukaryotic [FeFe]-H2ases are not monophyletic. Although there are few eukaryote [FeFe]-H2ase maturase orthologs characterized, phylogenies of these proteins do recover eukaryote monophyly, although a consistent eubacterial sister group for eukaryotic homologs could not be determined. An exhaustive search for these five genes in diverse genomes from two representative eubacterial groups, the Clostridiales and the α-proteobacteria, shows that although these enzymes are nearly universally present within the former group, they are very rare in the latter. No α-proteobacterial genome sequenced to date encodes all five proteins. Molecular phylogenies and the extremely restricted distribution of PFO, [FeFe]-H2ases, and their associated maturases within the α-proteobacteria do not support a mitochondrial origin for these enzymes in eukaryotes.
However, the unexpected prevalence of PFO, pyruvate:NADP oxidoreductase, [FeFe]-H2ase, and the maturase proteins encoded in genomes of diverse eukaryotes indicates that these enzymes have an important role in the evolution of microbial eukaryote energy metabolism.

Infrequent Transitions between Saline and Fresh Waters in One of the Most Abundant Microbial Lineages (SAR11), by Logares, R., Brate, J., Heinrich, F., Shalchian-Tabrizi, K., Bertilsson, S.

Molecular Biology and Evolution 2010 27(2):347-357; doi:10.1093/molbev/msp239

The aquatic bacterial group SAR11 is one of the most abundant organisms on Earth, with an estimated global population size of 2.4 × 10²⁸ cells in the oceans. Members of SAR11 have also been detected in brackish and fresh waters, but the evolutionary relationships between the species present in the different environments have been ambiguous. In particular, it was not clear how frequently this lineage has crossed the saline–freshwater boundary during its evolutionary diversification. Due to the huge population size of SAR11 and the potential of microbes for long-distance dispersal, we hypothesized that environmental transitions could have occurred repeatedly during the evolutionary diversification of this group. Here, we have constructed extensive 16S rDNA–based molecular phylogenies and undertaken metagenomic data analyses to assess the frequency of saline–freshwater transitions in SAR11 and to investigate the evolutionary implications of this process. Our analyses indicated that very few saline–freshwater transitions occurred during the evolutionary diversification of SAR11, generating genetically distinct saline and freshwater lineages that do not appear to exchange genes extensively via horizontal gene transfer. In contrast to lineages from saline environments, extant freshwater taxa from diverse, and sometimes distant, geographic locations were very closely related. This points to a rapid diversification and dispersal in fresh waters or to slower evolutionary rates in freshwater SAR11 when compared with marine counterparts. In addition, the colonization of both saline and fresh waters appears to have occurred early in the evolution of SAR11. We conclude that the different biogeochemical conditions that prevail in saline and fresh waters have likely prevented the environmental transitions in SAR11, promoting the evolution of clearly distinct lineages in each environment.

A Dirichlet Process Covarion Mixture Model and Its Assessments Using Posterior Predictive Discrepancy Tests, by Zhou, Y., Brinkmann, H., Rodrigue, N., Lartillot, N., Philippe, H.

Molecular Biology and Evolution 2010 27(2):371-384; doi:10.1093/molbev/msp248

Heterotachy, the variation of substitution rate at a site across time, is a prevalent phenomenon in nucleotide and amino acid alignments, which may mislead probabilistic phylogenetic inferences. The covarion model is a special case of heterotachy, in which sites change between the “ON” state (allowing substitutions according to any particular model of sequence evolution) and the “OFF” state (prohibiting substitutions). In current implementations, the switch rates between ON and OFF states are homogeneous across sites, a hypothesis that has never been tested. In this study, we developed an infinite mixture model, called the covarion mixture (CM) model, which allows the covarion parameters to vary across sites, controlled by a Dirichlet process prior. Moreover, we combine the CM model with other approaches. We use a second independent Dirichlet process that models the heterogeneities of amino acid equilibrium frequencies across sites, known as the CAT model, and general rate-across-site heterogeneity is modeled by a gamma distribution. The application of the CM model to several large alignments demonstrates that the covarion parameters are significantly heterogeneous across sites. We describe posterior predictive discrepancy tests and use these to demonstrate the importance of these different elements of the models.
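The ON/OFF covarion idea, and the site-to-site variation in switch rates that the CM model targets, can be visualized with a toy Gillespie simulation (all rates below are invented for illustration): sites that switch slowly spend long stretches fully ON or fully OFF, inflating the variance of their substitution counts relative to fast-switching sites with the same average rate.

```python
import random

random.seed(11)

def simulate_site(t_total, s_on_off, s_off_on, sub_rate):
    """Gillespie simulation of one covarion site: substitutions occur
    at rate sub_rate only while the site is in the ON state."""
    t, on, subs = 0.0, True, 0
    while True:
        rate = (s_on_off + sub_rate) if on else s_off_on
        t += random.expovariate(rate)
        if t >= t_total:
            return subs
        if on and random.random() < sub_rate / rate:
            subs += 1        # a substitution happened while ON
        else:
            on = not on      # the site toggled ON <-> OFF

# A fast-switching and a slow-switching site class, as in a CM-style model
# where the switch rates themselves vary across sites.
fast = [simulate_site(50.0, 1.0, 1.0, 1.0) for _ in range(2000)]
slow = [simulate_site(50.0, 0.05, 0.05, 1.0) for _ in range(2000)]
print("mean substitutions, fast switching:", sum(fast) / len(fast))
print("mean substitutions, slow switching:", sum(slow) / len(slow))
```

Both classes spend about half their time ON, so the mean counts are similar, but the slow class shows a much larger spread; that excess dispersion is the kind of signal the posterior predictive discrepancy tests are designed to pick up.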

Phylodynamics of HIV-1 from a Phase-III AIDS Vaccine Trial in North America, by Perez-Losada, M., Jobes, D. V., Sinangil, F., Crandall, K. A., Posada, D., Berman, P. W.

Molecular Biology and Evolution 2010 27(2):417-425; doi:10.1093/molbev/msp254

In 2003, a phase III placebo-controlled trial (VAX004) of a candidate HIV-1 vaccine (AIDSVAX B/B) was completed in 5,403 volunteers at high risk for HIV-1 infection from North America and the Netherlands. A total of 368 individuals became infected with HIV-1 during the trial. The envelope glycoprotein gene (gp120) from the HIV-1 subtype B viruses infecting 349 patients was sequenced from clinical samples taken as close as possible to the time of diagnosis, rendering a final data set of 1,047 sequences (1,032 from North America and 15 from the Netherlands). Here, we used these data in combination with other sequences available in public databases to assess HIV-1 variation as a function of vaccination treatment, geographic region, race, risk behavior, and viral load. Viral samples did not show any phylogenetic structure for any of these factors, but individuals with different viral loads showed significant differences (P = 0.009) in genetic diversity. The estimated time of emergence of HIV-1 subtype B was 1966–1970. Despite the fact that the number of AIDS cases has decreased in North America since the early 90s, HIV-1 genetic diversity seems to have remained almost constant over time. This study represents one of the largest molecular epidemiologic surveys of viruses responsible for new HIV-1 infections in North America and could help the selection of epidemiologically representative vaccine antigens to include in the next generation of candidate HIV-1 vaccines.

Motivation: It has been proven that the accessibility of the target sites has a critical influence on RNA–RNA binding in general, and on the specificity and efficiency of miRNAs and siRNAs in particular. Recently, O(N⁶) time and O(N⁴) space dynamic programming (DP) algorithms have become available that compute the partition function of RNA–RNA interaction complexes, thereby providing detailed insights into their thermodynamic properties.

Results: Modifications to the grammars underlying earlier approaches enable the calculation of interaction probabilities for any given interval on the target RNA. The computation of the ‘hybrid probabilities’ is complemented by a stochastic sampling algorithm that produces a Boltzmann weighted ensemble of RNA–RNA interaction structures. The sampling of k structures requires only negligible additional memory resources and runs in O(k·N³).

Availability: The algorithms described here are implemented in C as part of the rip package. The source code of rip2 can be downloaded from http://www.combinatorics.cn/cbpc/rip.html and http://www.bioinf.uni-leipzig.de/Software/rip.html.

Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments — Bioinformatics 26 (2): 182

Motivation: The accurate prediction of the quality of 3D models is a key component of successful protein tertiary structure prediction methods. Currently, clustering- or consensus-based Model Quality Assessment Programs (MQAPs) are the most accurate methods for predicting 3D model quality; however, they are often CPU intensive as they carry out multiple structural alignments in order to compare numerous models. In this study, we describe ModFOLDclustQ—a novel MQAP that compares 3D models of proteins without the need for CPU intensive structural alignments by utilizing the Q measure for model comparisons. The ModFOLDclustQ method is benchmarked against the top established methods in terms of both accuracy and speed. In addition, the ModFOLDclustQ scores are combined with those from our older ModFOLDclust method to form a new method, ModFOLDclust2, that aims to provide increased prediction accuracy with negligible computational overhead.

Results: The ModFOLDclustQ method is competitive with leading clustering-based MQAPs for the prediction of global model quality, yet it is up to 150 times faster than the previous version of the ModFOLDclust method at comparing models of small proteins (<60 residues) and over five times faster at comparing models of large proteins (>800 residues). Furthermore, a significant improvement in accuracy can be gained over the previous clustering-based MQAPs by combining the scores from ModFOLDclustQ and ModFOLDclust to form the new ModFOLDclust2 method, with little impact on the overall time taken for each prediction.

Availability: The ModFOLDclustQ and ModFOLDclust2 methods are available to download from http://www.reading.ac.uk/bioinf/downloads/

AQUA: automated quality improvement for multiple sequence alignments — Bioinformatics 26 (2): 263

Summary: Multiple sequence alignment (MSA) is a central tool in most modern biology studies. However, despite generations of valuable tools, human experts are still able to improve automatically generated MSAs. In an effort to automatically identify the most reliable MSA for a given protein family, we propose a very simple protocol, named AQUA for ‘Automated quality improvement for multiple sequence alignments’. Our current implementation relies on two alignment programs (MUSCLE and MAFFT), one refinement program (RASCAL) and one assessment program (NORMD), but other programs could be incorporated at any of the three steps.

Availability: AQUA is implemented in Tcl/Tk and runs in command line on all platforms. The source code is available under the GNU GPL license. Source code, README and Supplementary data are available at http://www.bork.embl.de/Docu/AQUA.

Massimo Pigliucci (2008) Biology and Philosophy, DOI 10.1007/s10539-008-9124-z

Sewall Wright introduced the metaphor of evolution on “adaptive landscapes” in a pair of papers published in 1931 and 1932. The metaphor has been one of the most influential in modern evolutionary biology, although recent theoretical advancements show that it is deeply flawed and may have actually created research questions that are not, in fact, fecund. In this paper I examine in detail what Wright actually said in the 1932 paper, as well as what he thought of the matter at the very end of his career, in 1988. While the metaphor is flawed, some of the problems which Wright was attempting to address are still with us today, and are in the process of being reformulated as part of a forthcoming Extended Evolutionary Synthesis.

**Does nothing in evolution make sense except in the light of population genetics?**

Lindell Bromham (2009) Biology and Philosophy, DOI 10.1007/s10539-008-9146-6

“The Origins of Genome Architecture” by Michael Lynch (2007) may not immediately sound like a book that someone interested in the philosophy of biology would grab off the shelf. But there are three important reasons why you should read this book. Firstly, if you want to understand biological evolution, you should have at least a passing familiarity with evolutionary change at the level of the genome. This is not to say that everyone interested in evolution should be a geneticist or a bioinformatician, but that a working knowledge of genetic change is an essential part of the intellectual toolkit of modern evolutionary biology, even if your primary focus is the evolution of behaviour or the diversity of communities. Secondly, this book provides excellent examples of another important tool in the biologist’s intellectual toolkit, but one that is rarely explained or illustrated to such an extent: null (or neutral) models. The role null models play in testing hypotheses in evolution is a central focus of this book. Thirdly, as an accomplished work of advocacy for a strictly microevolutionary view of evolution, this book provides grist for the mill for the important debate about whether population genetic processes are the sine qua non of evolutionary explanations.

**When monophyly is not enough: exclusivity as the key to defining a phylogenetic species concept**

Joel D. Velasco (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9151-4

A natural starting place for developing a phylogenetic species concept is to examine monophyletic groups of organisms. Proponents of “the” Phylogenetic Species Concept fall into one of two camps. The first camp denies that species even could be monophyletic and groups organisms using character traits. The second groups organisms using common ancestry and requires that species must be monophyletic. I argue that neither view is entirely correct. While monophyletic groups of organisms exist, they should not be equated with species. Instead, species must meet the more restrictive criterion of being genealogically exclusive groups where the members are more closely related to each other than to anything outside the group. I carefully spell out different versions of what this might mean and arrive at a working definition of exclusivity that forms groups that can function within phylogenetic theory. I conclude by arguing that while a phylogenetic species concept must use exclusivity as a grouping criterion, a variety of ranking criteria are consistent with the requirement that species can be placed on phylogenetic trees.

**Individuals, groups, fitness and utility: multi-level selection meets social choice theory**

Samir Okasha (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9154-1

In models of multi-level selection, the property of Darwinian fitness is attributed to entities at more than one level of the biological hierarchy, e.g. individuals and groups. However, the relation between individual and group fitness is a controversial matter. Theorists disagree about whether group fitness should always, or ever, be defined as total (or average) individual fitness. This paper tries to shed light on the issue by drawing on work in social choice theory, and pursuing an analogy between fitness and utility. Social choice theorists have long been interested in the relation between individual and social utility, and have identified conditions under which social utility equals total (or average) individual utility. These ideas are used to shed light on the biological problem.

**Sober & Wilson’s evolutionary arguments for psychological altruism: a reassessment**

Armin W. Schulz (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9179-5

In their book Unto Others, Sober and Wilson argue that various evolutionary considerations (based on the logic of natural selection) lend support to the truth of psychological altruism. However, recently, Stephen Stich has raised a number of challenges to their reasoning: in particular, he claims that three out of the four evolutionary arguments they give are internally unconvincing, and that the one that is initially plausible fails to take into account recent findings from cognitive science and thus leaves open a number of egoistic responses. These challenges make it necessary to reassess the plausibility of Sober & Wilson’s evolutionary account—which is what I aim to do in this paper. In particular, I try to show that, as a matter of fact, Sober & Wilson’s case remains compelling, as some of Stich’s concerns rest on a confusion, and those that do not are not sufficiently strong to establish all the conclusions he is after. The upshot is that no reason has been given to abandon the view that evolutionary theory has advanced the debate surrounding psychological altruism.

**Sober and Elgin on laws of biology: a critique**

Lane DesAutels (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9182-x

In this short discussion note, I discuss whether any of the generalizations made in biology should be construed as laws. Specifically, I examine a strategy offered by Elliott Sober (1997) and supported by Mehmet Elgin (2006) to reformulate certain biological generalizations so as to eliminate their contingency, thereby allowing them to count as laws. I argue that this strategy entails a conception of laws that is unacceptable on two counts: (1) Sober and Elgin’s approach allows the possibility of formulating laws describing any biological phenomenon whatsoever; and (2) on Sober and Elgin’s view, any interesting contrast between so-called laws and obviously accidental generalizations collapses. I conclude by offering suggestions to refine their view in order to avoid these theoretical problems.

**Simulation of biological evolution under attack, but not really: a response to Meester**

Stefaan Blancke, Maarten Boudry, Johan Braeckman (2010) Biology and Philosophy, DOI 10.1007/s10539-009-9192-8

The leading Intelligent Design theorist William Dembski (Rowman & Littlefield, Lanham MD, 2002) argued that the first No Free Lunch theorem, first formulated by Wolpert and Macready (IEEE Trans Evol Comput 1: 67–82, 1997), renders Darwinian evolution impossible. In response, Dembski’s critics pointed out that the theorem is irrelevant to biological evolution. Meester (Biol Phil 24: 461–472, 2009) agrees with this conclusion, but still thinks that the theorem does apply to simulations of evolutionary processes. According to Meester, the theorem shows that simulations of Darwinian evolution, as these are typically set in advance by the programmer, are teleological and therefore non-Darwinian. Therefore, Meester argues, they are useless in showing how complex adaptations arise in the universe. Meester uses the term “teleological” inconsistently, however, and we argue that, no matter how we interpret the term, a Darwinian algorithm does not become non-Darwinian by simulation. We show that the NFL theorem is entirely irrelevant to this argument, and conclude that it does not pose a threat to the relevance of simulations of biological evolution.

Massimo Pigliucci (2009) Biology and Philosophy, DOI 10.1007/s10539-007-9101-y

The debate about the levels of selection has been one of the most controversial both in evolutionary biology and in philosophy of science. Okasha’s book makes the sort of contribution that simply cannot be ignored by anyone interested in this field for many years to come. However, my interest here is in highlighting some examples of how Okasha goes about discussing his material to suggest that his book is part of an increasingly interesting trend that sees scientists and philosophers coming together to build a broadened concept of “theory” through a combination of standard mathematical treatments and conceptual analyses. Given the often contentious history of the relationship between philosophy and science, such a trend cannot but be welcome.

**Moving past the levels of selection debates**

Stephen M. Downes (2009) Biology and Philosophy, DOI 10.1007/s10539-008-9130-1

Book review: Samir Okasha, Evolution and the Levels of Selection, Oxford University Press, Oxford, 2006

**Philosophical foundations for the hierarchy of life**

Deborah E. Shelton, Richard E. Michod (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9160-3

We review Evolution and the Levels of Selection by Samir Okasha. This important book provides a cohesive philosophical framework for understanding levels-of-selections problems in biology. Concerning evolutionary transitions, Okasha proposes that three stages characterize the shift from a lower level of selection to a higher one. We discuss the application of Okasha’s three-stage concept to the evolutionary transition from unicellularity to multicellularity in the volvocine green algae. Okasha’s concepts are a provocative step towards a more general understanding of the major evolutionary transitions; however, the application of certain ideas to the volvocine model system is not straightforward.

**Replies to my critics**

Samir Okasha (2009) Biology and Philosophy, DOI 10.1007/s10539-009-9158-x

This paper contains replies to the reviews of my book by Steven Downes, Massimo Pigliucci and Deborah Shelton & Rick Michod.

Radu V. Craiu, Jeffrey Rosenthal, Chao Yang. Journal of the American Statistical Association. December 1, 2009, 104(488): 1454-1466. doi:10.1198/jasa.2009.tm08393.

Starting with the seminal paper of Haario, Saksman, and Tamminen (2001), a substantial amount of work has been done to validate adaptive Markov chain Monte Carlo algorithms. In this paper we focus on two practical aspects of adaptive Metropolis samplers. First, we draw attention to the deficient performance of standard adaptation when the target distribution is multimodal. We propose a parallel chain adaptation strategy that incorporates multiple Markov chains which are run in parallel. Second, we note that the current adaptive MCMC paradigm implicitly assumes that the adaptation is uniformly efficient on all regions of the state space. However, in many practical instances, different “optimal” kernels are needed in different regions of the state space. We propose here a regional adaptation algorithm in which we account for possible errors made in defining the adaptation regions. This corresponds to the more realistic case in which one does not know exactly the optimal regions for adaptation. The methods focus on the random walk Metropolis sampling algorithm but their scope is much wider. We provide theoretical justification for the two adaptive approaches using the existing theory built for adaptive Markov chain Monte Carlo. We illustrate the performance of the methods using simulations and analyze a mixture model for real data using an algorithm that combines the two approaches.
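For readers unfamiliar with the building block these methods extend, here is a minimal sketch of a 1-D random-walk Metropolis sampler whose proposal scale adapts to the chain’s running variance, loosely in the spirit of Haario et al. (2001). The target, tuning constants, and function name are illustrative assumptions, not the authors’ parallel or regional algorithms:

```python
import math
import random

def adaptive_metropolis(log_target, x0, n_iter=5000, rng=None):
    """1-D random-walk Metropolis whose proposal scale adapts to the
    chain's running standard deviation (a toy take on Haario et al. 2001)."""
    rng = rng or random.Random()
    x, lp = x0, log_target(x0)
    chain = []
    mean, m2 = 0.0, 0.0
    for t in range(1, n_iter + 1):
        # After a short burn-in, scale the proposal by 2.38 * running s.d.
        # (the classic asymptotically optimal 1-D random-walk scaling).
        sd = 2.38 * math.sqrt(m2 / t) if t > 100 and m2 > 0 else 1.0
        prop = x + rng.gauss(0.0, sd)
        lp_prop = log_target(prop)
        # Metropolis accept/reject step.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
        chain.append(x)
        # Welford's update of the running mean and sum of squares.
        delta = x - mean
        mean += delta / t
        m2 += delta * (x - mean)
    return chain

# Toy target: a standard normal log-density (up to an additive constant).
chain = adaptive_metropolis(lambda x: -0.5 * x * x, 0.0,
                            n_iter=20000, rng=random.Random(1))
post = chain[5000:]
print(sum(post) / len(post))  # posterior mean, roughly 0
```

The paper’s point is precisely that a single adaptation rule like this one can fail on multimodal targets, which motivates running and pooling several chains, and adapting regionally.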

Jing Qin, Biao Zhang, Denis H. Y. Leung. Journal of the American Statistical Association. December 1, 2009, 104(488): 1492-1503. doi:10.1198/jasa.2009.tm08163.

Missing data is a ubiquitous problem in medical and social sciences. It is well known that inferences based only on the complete data may not only lose efficiency, but may also lead to biased results if the data is not missing completely at random (MCAR). The inverse-probability weighting method proposed by Horvitz and Thompson (1952) is a popular alternative when the data is not MCAR. The Horvitz–Thompson method, however, is sensitive to the inverse weights and may suffer from loss of efficiency. In this paper, we propose a unified empirical likelihood approach to missing data problems and explore the use of empirical likelihood to effectively combine unbiased estimating equations when the number of estimating equations is greater than the number of unknown parameters. One important feature of this approach is the separation of the complete data unbiased estimating equations from the incomplete data unbiased estimating equations. The proposed method can achieve semiparametric efficiency if the probability of missingness is correctly specified. Simulation results show that the proposed method has better finite sample performance than its competitors. Supplemental materials for this paper, including proofs of the main theoretical results and the R code used for the NHANES example, are available online on the journal website.
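The Horvitz–Thompson weighting that this paper builds on can be sketched in a few lines: each observed response is weighted by the inverse of its (here assumed known) observation probability. The data-generating setup below is invented for illustration:

```python
import random

def ipw_mean(ys, observed, probs):
    """Horvitz-Thompson (inverse-probability weighted) estimate of the
    mean of y from partially observed data, given each unit's known
    probability of being observed."""
    n = len(ys)
    return sum(y / p for y, o, p in zip(ys, observed, probs) if o) / n

rng = random.Random(0)
ys, obs, probs = [], [], []
for _ in range(50000):
    y = rng.gauss(2.0, 1.0)
    p = 0.9 if y > 2.0 else 0.5   # missingness depends on y: not MCAR
    ys.append(y)
    probs.append(p)
    obs.append(rng.random() < p)

naive = sum(y for y, o in zip(ys, obs) if o) / sum(obs)
print(naive)                      # biased upward: large y observed more often
print(ipw_mean(ys, obs, probs))   # close to the true mean, 2.0
```

The sensitivity the abstract mentions shows up when some `p` values are tiny: a few huge weights then dominate the sum, which is part of what the empirical likelihood approach is designed to mitigate.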

Yeonseung Chung, David B. Dunson. Journal of the American Statistical Association. December 1, 2009, 104(488): 1646-1660. doi:10.1198/jasa.2009.tm08302.

This article considers a methodology for flexibly characterizing the relationship between a response and multiple predictors. Goals are (1) to estimate the conditional response distribution addressing the distributional changes across the predictor space, and (2) to identify important predictors for the response distribution change both within local regions and globally. We first introduce the probit stick-breaking process (PSBP) as a prior for an uncountable collection of predictor-dependent random distributions and propose a PSBP mixture (PSBPM) of normal regressions for modeling the conditional distributions. A global variable selection structure is incorporated to discard unimportant predictors, while allowing estimation of posterior inclusion probabilities. Local variable selection is conducted relying on the conditional distribution estimates at different predictor points. An efficient stochastic search sampling algorithm is proposed for posterior computation. The methods are illustrated through simulation and applied to an epidemiologic study.
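The probit stick-breaking construction at the heart of the PSBP can be sketched directly: latent Gaussian scores are pushed through the standard normal CDF to give stick-breaking fractions, which are turned into mixture weights. The specific scores below are arbitrary illustrative values:

```python
import math

def probit_stick_breaking_weights(alphas):
    """Map latent Gaussian scores to mixture weights via probit
    stick-breaking: v_k = Phi(alpha_k), w_k = v_k * prod_{j<k}(1 - v_j)."""
    def phi(a):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    weights, stick = [], 1.0
    for a in alphas:
        v = phi(a)
        weights.append(stick * v)   # break off a fraction v of the stick
        stick *= 1.0 - v            # what remains for later components
    weights.append(stick)           # leftover mass goes to a final component
    return weights

w = probit_stick_breaking_weights([0.5, -0.2, 0.1])
print(w, sum(w))  # positive weights summing to 1
```

Predictor dependence enters by letting each score `alpha_k` be a function of the covariates, so the mixture weights, and hence the conditional response distribution, vary smoothly across the predictor space.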

Paul Gustafson. Journal of the American Statistical Association. December 1, 2009, 104(488): 1682-1695. doi:10.1198/jasa.2009.tm08603.

In health research and other fields, the observational data available to researchers often fall short of the data that ideally would be available, due to the inherent limitations of study design and data acquisition. Were they available, these ideal data might be readily analyzed via straightforward statistical models with such desirable properties as parameter identifiability. Conversely, realistic models for the available data that incorporate uncertainty about the link between ideal and available data may be nonidentified. While there is no conceptual difficulty in implementing Bayesian analysis with nonidentified models and proper prior distributions, it is important to know to what extent data can be informative about parameters of interest. Determining the large-sample limit of the posterior distribution is one way to characterize the informativeness of data. In some nonidentified models, it is relatively straightforward to determine the limit via a particular reparameterization of the model; however, in other nonidentified models there is no such obvious approach. Thus we have developed an algorithm for determining the limiting posterior distribution for at least some such more difficult models. The work is motivated by two specific nonidentified models that arise quite naturally, and the algorithm is applied to reveal how informative the data are for these models. This article has supplementary material online.

by McIntyre, L. M.

Personal Reflections on the Origins and Emergence of Recombinant DNA Technology [Perspectives]

by Berg, P., Mertz, J. E.

The emergence of recombinant DNA technology occurred via the appropriation of known tools and procedures in novel ways that had broad applications for analyzing and modifying gene structure and organization of complex genomes. Although revolutionary in their impact, the tools and procedures per se were not revolutionary. Rather, the novel ways in which they were applied were what transformed biology.

Human Triallelic Sites: Evidence for a New Mutational Mechanism? [Population and evolutionary genetics]

by Hodgkinson, A., Eyre-Walker, A.

Most SNPs in the human genome are biallelic; however, there are some sites that are triallelic. We show here that there are approximately twice as many triallelic sites as we would expect by chance. This excess does not appear to be caused by natural selection or mutational hotspots. Instead we propose that a new mutation can induce another mutation either within the same individual or subsequently during recombination. We provide evidence for this model by showing that the rarer two alleles at triallelic sites tend to cluster on phylogenetic trees of human haplotypes. However, we find no association between the density of triallelic sites and the rate of recombination, which leads us to suggest that triallelic sites might be generated by the simultaneous production of two new mutations within the same individual on the same genetic background. Under this model we estimate that simultaneous mutation contributes ~3% of all distinct SNPs. We also show that there is a twofold excess of adjacent SNPs. Approximately half of these seem to be generated simultaneously since they have identical minor allele frequencies. We estimate that the mutation of adjacent nucleotides accounts for a little less than 1% of all SNPs.

Bayesian Computation and Model Selection Without Likelihoods [Population and evolutionary genetics]

by Leuenberger, C., Wegmann, D.

Until recently, the use of Bayesian inference was limited to a few cases because for many realistic probability models the likelihood function cannot be calculated analytically. The situation changed with the advent of likelihood-free inference algorithms, often subsumed under the term approximate Bayesian computation (ABC). A key innovation was the use of a postsampling regression adjustment, allowing larger tolerance values and as such shifting computation time to realistic orders of magnitude. Here we propose a reformulation of the regression adjustment in terms of a general linear model (GLM). This allows the integration into the sound theoretical framework of Bayesian statistics and the use of its methods, including model selection via Bayes factors. We then apply the proposed methodology to the question of population subdivision among western chimpanzees, Pan troglodytes verus.
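The basic likelihood-free step that the regression adjustment improves on is plain rejection ABC, which can be sketched as follows. The toy model (a normal mean estimated from a sample average) and all names are illustrative assumptions, not the paper’s GLM reformulation:

```python
import random
import statistics

def abc_rejection(observed_stat, simulate, prior_draw,
                  n_sims=20000, tol=0.1, rng=None):
    """Keep prior draws whose simulated summary statistic lands
    within `tol` of the observed one (basic rejection ABC)."""
    rng = rng or random.Random()
    accepted = []
    for _ in range(n_sims):
        theta = prior_draw(rng)                      # draw from the prior
        if abs(simulate(theta, rng) - observed_stat) < tol:
            accepted.append(theta)                   # approximate posterior draw
    return accepted

# Toy model: data are N(theta, 1), summarised by the mean of 25 points.
def simulate(theta, rng):
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(25))

rng = random.Random(7)
posterior = abc_rejection(2.0, simulate,
                          lambda r: r.uniform(-5.0, 5.0), rng=rng)
print(statistics.fmean(posterior))  # concentrates near the observed mean, 2.0
```

The cost of a small `tol` is a tiny acceptance rate; the postsampling regression adjustment the abstract refers to is what lets one use a larger tolerance and correct the accepted draws afterwards.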

A Genetic Analysis of Mortality in Pigs [Genetics of complex traits]

by Varona, L., Sorensen, D.

An analysis of mortality is undertaken in two breeds of pigs: Danish Landrace and Yorkshire. Zero-inflated and standard versions of hierarchical Poisson, binomial, and negative binomial Bayesian models were fitted using Markov chain Monte Carlo (MCMC). The objectives of the study were to investigate whether there is support for genetic variation for mortality and to study the quality of fit and predictive properties of the various models. In both breeds, the model that provided the best fit to the data was the standard binomial hierarchical model. The model that performed best in terms of the ability to predict the distribution of stillbirths was the hierarchical zero-inflated negative binomial model. The best fit of the binomial hierarchical model and of the zero-inflated hierarchical negative binomial model was obtained when genetic variation was included as a parameter. For the hierarchical binomial model, the estimate of the posterior mean of the additive genetic variance (posterior standard deviation in brackets) at the level of the logit of the probability of a stillbirth was 0.173(0.039) in Landrace and 0.202(0.048) in Yorkshire. The implications of these results from a breeding perspective are briefly discussed.
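A quick illustration of what “zero-inflated” means in these count models: the sketch below simulates a zero-inflated Poisson and shows the excess of zeros over a plain Poisson with the same rate. The parameter values are arbitrary, and the paper’s hierarchical models fitted by MCMC are of course far richer:

```python
import math
import random

def sample_zip(lam, pi0, n, rng):
    """Draw n counts from a zero-inflated Poisson: a structural zero
    with probability pi0, otherwise an ordinary Poisson(lam) count."""
    counts = []
    for _ in range(n):
        if rng.random() < pi0:
            counts.append(0)        # structural zero
            continue
        # Knuth's multiplicative Poisson sampler (fine for small lam).
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            prod *= rng.random()
            k += 1
        counts.append(k)
    return counts

rng = random.Random(3)
zip_counts = sample_zip(1.5, 0.3, 20000, rng)
zero_frac = zip_counts.count(0) / len(zip_counts)
# Expected ~0.456 = 0.3 + 0.7*exp(-1.5), versus exp(-1.5) ~ 0.223
# for a plain Poisson(1.5): the hallmark excess of zeros.
print(zero_frac)
```

Whether that extra zero mass is needed, and whether an additive genetic effect belongs in the linear predictor, is exactly the kind of model-comparison question the study addresses.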

Preface by Hugh Loxdale, Mike Claridge and Jim Mallet:

Although several people had previously seriously considered the possibility of evolution as the driving force in the origin of species, including Jean-Baptiste Lamarck (1744–1829), Robert Chambers (1802–1871), and Erasmus Darwin (1731–1802), it was Erasmus’s grandson, Charles Robert Darwin (1809–1882), along with Alfred Russel Wallace (1823–1913), who independently put forward the theory of evolution by natural selection in 1858. Darwin subsequently wrote his seminal work, On the Origin of Species, the following year, whilst Wallace became a devout follower of Darwin, even publishing a book entitled Darwinism in 1889.

(…)

Darwin and Wallace are, of course, famous for other important contributions to biological thought, including sexual selection and zoogeography, respectively. The year 2009 is the bicentenary of the birth of Darwin in Shrewsbury, England, in 1809 and the 150th anniversary of the publication of the Origin in 1859 and the 120th anniversary of the publication of Wallace’s book Darwinism in 1889. The present collection of papers derives from a RES meeting on insect evolution below the species level, held at Rothamsted Research, Harpenden, Hertfordshire, UK, on 22 April 2009. It is thus a celebration of the works of these two great biologists, concentrating on the first (and last) stages of evolution… that is to say, at the ecological level. We as a society are fortunate to have attracted an international array of ecological and evolutionary entomologists as speakers. The papers derived from the meeting are published in this special issue of Ecological Entomology, one of the seven major entomology journals currently published by the RES.

The origin of flowering plant (angiosperm) diversity, which is intimately connected to the diversification of floral form and floral biology, is also of great interest because, as the dominant autotrophs of terrestrial environments, angiosperms provide the energy on which most of the rest of biological diversity depends. The evolution of flowers and flowering plants is therefore both of fundamental significance and of contemporary relevance.

(…)

The aim was to examine Darwin’s key contributions to understanding the biology of flowers in light of current knowledge, but also to feature emerging areas of research and the advances now possible with new ideas and approaches.

Biological systems function cooperatively across different spatial and temporal scales, from nanoscale biomolecules to microscale cells, and to macroscale tissues and organs. To understand the mechanisms of various biological functions, it is important to study biological systems at different scales. This Theme Issue of the Royal Society’s Philosophical Transactions A, entitled ‘Multi-scale biothermal and biomechanical behaviours of biological materials’, aims to provide some insight into the biothermal–mechanical–neural behaviour at different scales. In this issue, biological behaviours at different scales are re-cast in engineering systems parlance. It focuses on the frontiers of this fast-growing field with emphasis on the thermal behaviour, mechanical behaviour, the coupled thermomechanical behaviour and corresponding neural response/signalling of biological materials at subcellular, cellular and tissue levels.