This blog is being put to sleep

I realized that I have too many blogs and too little time. Since I never quite got to grips with WordPress (I cannot customize it the way I want, I don’t have full control over the final formatting, I cannot add gadgets, etc.), I decided to merge this site with its namesake at Blogspot. I had planned to use the Blogspot version just for my software, but I was not updating it at all.

I’ll change the info on Research Blogging to point only to the new address. I copied some blog posts from here to there, but to avoid duplication I didn’t include the researchblogging meta-data.

I hope to see you at my new home!


Repent that you lack the figure!

This week (yes, it’s taking me the whole week ;) I am reading the article ‘Can We Avoid “SIN” in the House of “No Common Mechanism”?’ by Mike Steel, and realized that something was missing from the final version: figure 2. To get it you must download the “Advance Access” version of the article, since the figure got lost in the final, printed edition.

This figure doesn’t seem essential for the comprehension of the article (it was not present in the arxiv version of the manuscript, for instance), but you may feel lost as I did when reading the paragraph that depends on it. So here it is for completeness:

Oh, and by the way, the paper talks about the kinds of phylogenetic models that lead to statistical inconsistency (SIN) of the tree inference, and the “no common mechanism” models are those where the number of generating processes (each with unique parameters) increases as the sequence length increases.


  1. Syst Biol (2011) 60 (1): 96-109. doi: 10.1093/sysbio/syq069

The GPU hype and bioinformatics algorithms

I am a big fan of GPU computing (and other “alternative” hardware), but it was refreshing to read this commentary by Lars Jensen, which I could summarize by this statement:

You do not see anyone publishing papers with titles like “N-fold speed up of algorithm X by using N computers”.

The first time I heard a similar criticism was from Alexandros Stamatakis, who warned that we should always compare the performance of a GPU algorithm to the same effort focused on a multi-core CPU environment. Since then I try to be more cautious about grand statements of improvement. And yes, I completely agree with them, even though I don’t give so much importance to speed gains.

My feeling is that thinking about novel GPU algorithms is always worth the effort, since a different path for doing a given computation, though currently slow or inefficient, can lead to faster algorithms in the future. It can also give feedback to vendors as to where they should put their efforts – provided there’s competition, which in the end is the problem with the GPU approach…

(via Jason Stajich)


Facts are there to be used, not regurgitated

A nice editorial in BioEssays, written by Andrew Moore, about an experimental examination in a biology class (excerpt):

So the scientists, knowing the kind of minds they needed to emerge from the school system, teamed up with the school teachers to understand their challenges and discuss some potential solutions: the teachers consistently reported boredom among pupils when confronted by the traditional learning experience.

(doi: 10.1002/bies.201090050)


Distribution of recombination distances between trees – poster at SMBE2010

I just came back from SMBE2010, where I presented a poster about our recombination detection software and had the chance to see the awesome research other people are doing. The poster can be downloaded here (1.MB in pdf format) and I’m distributing it under the Creative Commons License. Given the great feedback I got from other participants of the meeting, I thought it might be a good opportunity to comment on the work, guided by the poster figures. Please refer to the poster to follow the figures and the explanation; I’ll try to reproduce my presentation here, taking into account the comments I received.

The motivation for the development of the model was to be able to say, for a given mosaic structure, whether the breakpoints can be explained by one recombination event or several. The recombination mosaic structure is usually inferred assuming the parental sequences (those not recombining) are known beforehand – in the figure they are the subtypes B and F – and recombination is then inferred when there is a change in the parental closest to the query sequence. Another problem is that it is common to analyze each one of the query sequences independently against the parentals – if all one wants is the “coloring”, then this might be enough. For the above figure I analyzed each query sequence against one reference sequence from each of the subtypes B, F and C (thus comprising a quartet for each analysis). And we know that these mosaics don’t tell the whole story.

If we know the topologies for both segments separated by the recombination breakpoint, then we can say, at least in theory, the minimum number of recombination events necessary to explain the difference (the real number can be much larger since we only detect those that lead to a change in the topology…). This minimum number is the Subtree Prune-and-Regraft distance, and is related to the problem of detection of horizontal gene transfers. In our case we devised an approximation to this distance based on the disagreement between all pairs of bipartitions belonging to the topologies: at each iteration we remove the smallest number of “leaves” such that the topologies will become more similar, and our approximate “uSPR distance” will be how many times we iterate this removal.

It is just an approximation, but it is better (closer to the SPR distance) than the Robinson-Foulds or the (complementary) Maximum Agreement Subtree distances, which compete in speed with our algorithm. It apparently works better for larger topologies than for smaller ones, but this is an artifact of the simulation – one realized SPR “neutralizes” previous ones, and this happens more often for small trees.
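To make the idea of comparing topologies through their bipartitions concrete, here is a minimal sketch of the Robinson-Foulds distance mentioned above. It is not the uSPR approximation itself and not code from biomc2: the nested-tuple tree representation and the function names are assumptions made only for illustration.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) induced by the internal branches."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # canonical form: store the side that lacks this leaf
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in one topology but not in the other."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Toy topologies (hypothetical): B and C swap places between the two trees.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_a, tree_b))
```

The two toy trees differ by a single rearrangement, yet two bipartitions disagree, which hints at why the Robinson-Foulds value is only a rough proxy for the number of SPR moves.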

Our Bayesian model works with a partitioning of the alignment, where recombination can only occur between segments and never within. This doesn’t pose a problem in practice since it will “shift” the recombinations to the border – the idea is that several neighboring breakpoints are equivalent to one breakpoint with a larger distance. These segments could be composed of one site each, but for computational reasons we usually set them to five or ten base pairs. The drawback is the loss of the ability to detect rate heterogeneity within the segment.

Each segment will have its own topology (represented by Tx, Ty and Tz in the figure), which will coincide for many neighboring segments since we have the distance between them as a latent variable penalizing against too many breakpoints:

This is a truncated Poisson distribution, modified so that it can handle underdispersion – the parameter w will make the Poisson sharper around the mean – and each potential breakpoint has its own lambda and w.
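The formula from the poster is not reproduced in this text, so the following is only a sketch of one way such a prior could look: a Poisson mass function raised to the power w and renormalized over a truncated range. That functional form, the function names and the values of lambda, w and the truncation point are all assumptions for illustration, not the definition used in biomc2.

```python
import math


def modified_truncated_poisson(lam, w, d_max):
    """Return P(d) for d = 0..d_max, proportional to Poisson(d; lam) ** w (assumed form)."""
    # log of the Poisson mass, multiplied by w; lgamma(d + 1) is log(d!)
    log_weights = [w * (d * math.log(lam) - lam - math.lgamma(d + 1))
                   for d in range(d_max + 1)]
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [x / total for x in weights]


def variance(pmf):
    """Variance of a distribution given as a list of probabilities for 0, 1, 2, ..."""
    mean = sum(d * p for d, p in enumerate(pmf))
    return sum((d - mean) ** 2 * p for d, p in enumerate(pmf))


plain = modified_truncated_poisson(lam=3.5, w=1.0, d_max=10)  # ordinary truncated Poisson
sharp = modified_truncated_poisson(lam=3.5, w=4.0, d_max=10)  # larger w concentrates the mass
print(variance(plain), variance(sharp))
```

With w = 1 this reduces to an ordinary truncated Poisson; with w > 1 the mass concentrates and the variance drops below that of the plain version, which is the underdispersion described above.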

The posterior distribution will have K terms for the segments (topology likelihood and evolutionary model priors) and K-1 terms for the potential breakpoints (distances between segments), as well as the hyper-priors. I use the term “potential breakpoint” because if two consecutive segments happen to have the same topology (distance equals zero) then we don’t have an actual breakpoint. Again, we are considering only the recombinations that change the topology. This posterior distribution is calculated through MCMC sampling in the program called biomc2.

To test the algorithm, we did simulations with eight- and twelve-taxon datasets, simulating one (for the eight-taxon datasets) or two recombinations per breakpoint. We present the output of the program biomc2.summarise, which interprets the posterior samples for one replicate: based on the posterior distribution of distances for each potential breakpoint, we neglect the actual distances and focus on whether each one is zero or larger than zero (second figure of the panel). Based on this multimodal distribution of breakpoints we infer the regions where no recombination was detected (which we call “cold spots”) and credible intervals, either around each mode (red bars on top) or based on all values (red dots at bottom, together with the cold spots).
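The summary step described above can be sketched roughly as follows. This is not the code of biomc2.summarise: the input layout (one list of K-1 distances per posterior sample), the 5% threshold for a cold spot and the toy numbers are hypothetical.

```python
def breakpoint_support(samples):
    """Fraction of posterior samples in which each potential breakpoint has distance > 0."""
    n_samples = len(samples)
    n_breakpoints = len(samples[0])
    return [sum(1 for s in samples if s[i] > 0) / n_samples
            for i in range(n_breakpoints)]


def cold_spots(support, threshold=0.05):
    """Return inclusive (start, end) index ranges where the support stays below threshold."""
    spots, start = [], None
    for i, p in enumerate(support):
        if p < threshold:
            if start is None:
                start = i          # a new cold region begins here
        elif start is not None:
            spots.append((start, i - 1))
            start = None
    if start is not None:          # region that extends to the last breakpoint
        spots.append((start, len(support) - 1))
    return spots


# Toy posterior samples (made up): four samples, five potential breakpoints.
samples = [
    [0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 0, 0],
]
support = breakpoint_support(samples)
print(support)
print(cold_spots(support))
```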

We also looked at the average distance for each potential breakpoint per replicate dataset, and show that indeed the software can correctly infer the location and amount of recombination for most replicates. It is worth remembering that we were generous in our simulations, in that there is still phylogenetic signal preserved on alignments with many mutations and a few recombinations. If recombination is much more frequent, then any two sites might represent distinct evolutionary histories and our method will fail.

We then analyzed HIV genomic sequences with a very similar mosaic structure, as inferred by cBrother (an implementation of DualBrothers). Here it is important to say that we used cBrother only to estimate the mosaic structure of the recombinants, doing an independent analysis for each sequence against three reference parental sequences. Therefore the figure is not a direct comparison of the programs, contrary to what its unfortunate caption might lead us to think. The distinction is between analyzing all sequences at once or independently, in quartets of sequences. If we superimpose the panels the comparison might become clearer:

(this figure is not on the poster)

The curve in blue shows the positions where there is a change in closest parental for the query sequence, if each query sequence is analyzed neglecting the others. In red we have our algorithm estimating recombinations between all eleven sequences (eight query and three parental sequences). We can see that:

  1. all breakpoints detected by the independent analysis were also detected by our method;
  2. many recombinations were detected only when all sequences were analyzed at once, indicating that they do not involve the parental sequences – de novo recombination;
  3. if we look at the underlying topologies estimated by our method (figure S2 of the PLoS ONE paper), we see that those also detected by the independent analysis in fact involve the parentals while the others don’t;
  4. biomc2 not only infers the location of recombination, but also its “strength” – given by the distance between topologies;

Finally, we show two further developments of the software: a point estimate for the recombination mosaic, and the relevance of the chosen prior over distances. The point estimate came from the need for a more easily interpretable summary of the distribution of breakpoints: instead of looking at the whole multimodal distribution, we may want to pay attention only to the peaks, or some other similar measure. This is a common problem in bioinformatics: to represent a collection of trees by a single one, or to find a protein structure that best represents an ensemble of structures. In our case we have a collection of recombination mosaics (one per sample of the posterior distribution), and we elect the one with the smallest distance from all other mosaics – we had to devise a distance for this as well…
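Electing the sample with the smallest total distance to all the others is a medoid selection, which can be sketched as below. The distance used here (the number of potential breakpoints where two mosaics disagree about the presence of a breakpoint) is a hypothetical stand-in, since the distance devised for the software is not described in this post.

```python
def mosaic_distance(a, b):
    """Hypothetical distance: positions where only one mosaic has a breakpoint."""
    return sum((x > 0) != (y > 0) for x, y in zip(a, b))


def medoid(mosaics, distance=mosaic_distance):
    """Return the mosaic with the smallest summed distance to all the mosaics."""
    return min(mosaics, key=lambda m: sum(distance(m, other) for other in mosaics))


# Toy posterior mosaics (made up): one list of breakpoint distances per sample.
mosaics = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
print(medoid(mosaics))
```

Unlike an average, the medoid is always one of the sampled mosaics, so the point estimate is guaranteed to be a valid mosaic.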

To show the importance of the prior distribution of distances, we compared it with simplified versions, like setting the penalty parameter w fixed at a low or high value. The overall behavior for all scenarios is lower resolution around breakpoints, and for weaker penalties we reconstruct the topologies better than for stronger ones, at the cost of inferring spurious breakpoints more often. We also compared the original model with a simplification where the topological distance is neglected and the prior considers only if the topologies are equal or not. This is similar to what cBrother and other programs do, and by looking at the top panel we observe that the results were also equivalent (blue lines labeled “cBrother” and “m=0”). In the same panel we plot the performance using our original (“unrestricted”) model as a gray area.

I also submitted the poster to the F1000 Poster Bank, let’s see how it works…

de Oliveira Martins, L., Leal, É., & Kishino, H. (2008). Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE, 3 (7). DOI: 10.1371/journal.pone.0002651
de Oliveira Martins, L., & Kishino, H. (2009). Distribution of distances between topologies and its effect on detection of phylogenetic recombination. Annals of the Institute of Statistical Mathematics, 62 (1), 145-159. DOI: 10.1007/s10463-009-0259-8


New experimental and theoretical approaches towards the understanding of the emergence of viral infections

Some articles of interest, with quotes from abstracts, from the Theme Issue ‘New experimental and theoretical approaches towards the understanding of the emergence of viral infections’ compiled and edited by Santiago F. Elena and Rémy Froissart – Philosophical Transactions of the Royal Society B, June 27, 2010, 365 (1548):

98% identical, 100% wrong: per cent nucleotide identity can lead plant virus epidemiology astray (by Duffy, S., Seah, Y. M.)

(…) Many of these sentinel publications include viral sequence data, but most use that information only to confirm the virus’ species. When researchers use the standard technique of per cent nucleotide identity to determine that the new sequence is closely related to another sequence, potentially erroneous conclusions can be drawn from the results. Multiple introductions of the same pathogen into a country are being ignored because researchers know fast-evolving plant viruses can accumulate substantial sequence divergence over time, even from a single introduction. (…)

The virulence-transmission trade-off in vector-borne plant viruses: a review of (non-)existing studies (by Froissart, R., Doumayrou, J., Vuillaume, F., Alizon, S., Michalakis, Y.)

The adaptive hypothesis invoked to explain why parasites harm their hosts is known as the trade-off hypothesis, which states that increased parasite transmission comes at the cost of shorter infection duration. (…)  We found only very few appropriate studies testing such a correlation, themselves limited by the fact that they use symptoms as a proxy for virulence and are based on very few viral genotypes. Overall, the available evidence does not allow us to confirm or refute the existence of a transmission–virulence trade-off for vector-borne plant viruses. (…)

Pathways to extinction: beyond the error threshold (by Manrubia, S. C., Domingo, E., Lazaro, E.)

(…) Current models of viral evolution take into account more realistic scenarios that consider compensatory and lethal mutations, a highly redundant genotype-to-phenotype map, rough fitness landscapes relating phenotype and fitness, and where phenotype is described as a set of interdependent traits. Further, viral populations cannot be understood without specifying the characteristics of the environment where they evolve and adapt. Altogether, it turns out that the pathways through which viral quasispecies go extinct are multiple and diverse.

Lethal mutagenesis and evolutionary epidemiology (by Martin, G., Gandon, S.)

The lethal mutagenesis hypothesis states that within-host populations of pathogens can be driven to extinction when the load of deleterious mutations is artificially increased with a mutagen, and becomes too high for the population to be maintained. (…) We derive the epidemiological and evolutionary equilibrium of the system. At this equilibrium, the density of the pathogen is expected to decrease linearly with the genomic mutation rate U. We also provide a simple expression for the critical mutation rate leading to extinction. Stochastic simulations show that these predictions are accurate for a broad range of parameter values. As they depend on a small set of measurable epidemiological and evolutionary parameters, we used available information on several viruses to make quantitative and testable predictions on critical mutation rates. In the light of this model, we discuss the feasibility of lethal mutagenesis as an efficient therapeutic strategy.

Mutational fitness effects in RNA and single-stranded DNA viruses: common patterns revealed by site-directed mutagenesis studies (by Sanjuan, R.)

The fitness effects of mutations are central to evolution, yet have begun to be characterized in detail only recently. Site-directed mutagenesis is a powerful tool for achieving this goal, which is particularly suited for viruses because of their small genomes. Here, I discuss the evolutionary relevance of mutational fitness effects and critically review previous site-directed mutagenesis studies. The effects of single-nucleotide substitutions are standardized and compared for five RNA or single-stranded DNA viruses infecting bacteria, plants or animals. (…)


fault-tolerant conversion between sequence alignments

Although I’m very charitable when testing my own programs, I’m not so nice when asked to scrutinize other people’s work. That’s why I was happy to see the announcement about the ALTER web server being published in Nucleic Acids Research (open access!).

I am not involved in the project, but I was in the very comfortable position of being one of the beta testers: all I needed to do was to find the largest and most obscure datasets I had and try them; then complain to the authors about the smallest details. I tried some big datasets (I think it was influenza H3N2 HA and HIV-1 complete genomes from South America, around 2 and 4 Mbytes each), and my simulated alignments created “by hand” from PAML. And ALTER could handle them in the end: they even sent me a report explaining how each one of my comments was used to improve the software, and asking me to try again until I felt satisfied.

The ALTER web server is a converter between multiple sequence alignment (MSA) formats, for DNA or protein, focused not only on the format itself (like FASTA or NEXUS) but more on the software that generated the alignment and the software where the alignment is going to be used (e.g. clustal or MrBayes). They mention that this program-oriented format conversion is necessary since all useful software eventually violates the (outdated) format specification. In their own words

[D]uring the last years MSA’s formats have `evolved’ very much like the sequences they contain, with mutational events consisting of long names, extra spaces, additional carriage returns, etc.

The web service can automatically recognize the input format, and generate an output for several programs, in several formats. I found it very easy to use: as you proceed, it automatically shows you the possible next steps on the same page. Another very nice feature is the possibility of collapsing duplicate (identical) sequences, working then only with the haplotypes (unique sequences). If you later need the information about the collapsed duplicates, check out the “info” panel at the bottom of the screen (inside the “log” window).

The obvious case when this elimination of duplicates is useful is when doing phylogenetic reconstruction (in many cases you can safely remove identical sequences), but another option offered by ALTER is to remove very similar sequences, where you can define the threshold of similarity. Sometimes when I’m doing a preliminary analysis on a dataset, I want to discard sequences that are too similar in order to get an overall picture of the data, and some other times I must remove closely-related sequences since my recombination-detection program has a limitation on the number of taxa…
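For readers who want to see what these two operations amount to, here is a minimal sketch on an already aligned set of sequences. It does not use ALTER or its API, and the greedy thinning rule, the function names and the toy sequences are assumptions for illustration only.

```python
def collapse_duplicates(alignment):
    """Map each unique sequence (haplotype) to the names of the sequences sharing it."""
    haplotypes = {}
    for name, seq in alignment.items():
        haplotypes.setdefault(seq.upper(), []).append(name)
    return haplotypes


def identity(a, b):
    """Fraction of identical columns between two aligned sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def thin_similar(sequences, threshold):
    """Greedily keep a sequence only if it is less similar than threshold to all kept ones."""
    kept = []
    for seq in sequences:
        if all(identity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


# Toy alignment (made up).
alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGT",
    "s3": "ACGTACGA",
    "s4": "TTGTACGA",
}
haplotypes = collapse_duplicates(alignment)
print(haplotypes)
print(thin_similar(list(haplotypes), threshold=0.85))
```

Keeping the list of names per haplotype plays the role of the “info” panel mentioned above: the collapsed duplicates can still be recovered later. Note that the greedy rule depends on the input order, so a different ordering may keep a different representative.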

Besides the user-friendly web service, they also offer a geek-friendly API – if you want your program to communicate directly with the service – and the source code, licensed under the LGPL.

Glez-Pena, D., Gomez-Blanco, D., Reboiro-Jato, M., Fdez-Riverola, F., & Posada, D. (2010). ALTER: program-oriented conversion of DNA and protein alignments. Nucleic Acids Research. DOI: 10.1093/nar/gkq321


The specialization of novel genes

Recently a paper about the software MANTiS caught my attention, and I’ve been trying to write about it for a while. This announcement on the EvolDir list seemed like the perfect opportunity. I must warn you though that I’ve never used the software and I am not familiar with the underlying databases, but the article is easy to follow.

The main result of the paper, published in Genome Biology and Evolution, is that there is a correlation between the mean number of anatomical systems (human tissues or cell types) where the gene is expressed and the time when the gene appeared on the phylogeny of the species. In other words, recent gene families are expressed in fewer anatomical systems (are more specific) than ancient ones. An anatomical system is a hierarchical classification of human tissues (e.g. the first level of the hierarchy: nervous, dermal, embryo, etc) available from gene expression data. So the age of appearance of a gene is an indicator of its specificity. Since the genes are subject to duplication we may have more than one member of the gene family in the same species, and the authors show that this correlation is maintained if we consider the appearance of the gene itself (as a result of duplication) or the appearance of the whole gene family to which the gene belongs.

They worked with gene families identified by MANTiS, which is a pipeline that 1) downloads data from metazoan genomes at ENSEMBL, 2) infers the gene tree based on the protein alignment of the gene family and 3) detects duplications through a reconciliation with a given species tree. Each gene tree is produced by EnsemblCompara which, as I understand, employs an extension of “reciprocal best hits” (that allows for many-to-many relations) to find the members of the family, and then maximum likelihood to find the tree itself. I will talk more about the gene tree/species tree reconciliation in the future, but it is enough to say that it’s the minimal list of nodes on the gene tree that represent duplications. We have an example of such a reconciled gene tree below, where the duplications are represented by the red boxes:

MANTiS creates a new character (the brown polygons, which I think of as an orthologous group) for each duplication event, and the phylogenetic profile generated by these characters is then used to calculate the branch lengths of the species tree through a least squares approach. The phylogenetic profiles are represented by 0’s and 1’s in the inset figure above, from which a distance matrix must be calculated in order to have the branch lengths.
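As an illustration of the step from presence/absence profiles to a distance matrix, here is a minimal sketch. The post does not say which distance MANTiS uses, so the simple count of differing characters, the species names and the profiles are all assumptions, and the least squares fitting of branch lengths is left out.

```python
def profile_distance(p, q):
    """Number of characters present in one profile but absent in the other."""
    return sum(a != b for a, b in zip(p, q))


def distance_matrix(profiles):
    """Return the sorted species names and the matrix of pairwise profile distances."""
    names = sorted(profiles)
    matrix = [[profile_distance(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Toy phylogenetic profiles (made up): 1 = character present, 0 = absent.
profiles = {
    "human": [1, 1, 1, 0],
    "mouse": [1, 1, 0, 0],
    "fly":   [1, 0, 0, 1],
}
names, matrix = distance_matrix(profiles)
print(names)
for row in matrix:
    print(row)
```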

In the study two datasets were created for the presence/absence of genes: one called “families only” composed of one character for each single gene and for each protein family, and another called “with duplications” where a new character is created for each duplication event. Both analyses were necessary since gene gain through duplications is important in explaining genome size increase.

MANTiS creates a database relating each gene to its biological function and anatomical system: the biological processes and molecular functions (ontology terms) of protein families are given by the PANTHER database for human, mouse, rat and D. melanogaster, while the gene expression data (related to the anatomical systems) comes from eGenetics, GNF and HMDEG. When comparing the time of appearance of the gene (as explained above) and the expression data for the genes we have a figure like the following:

Note that in this graph the X axis is inverted (that is, left is older, with the present day at the right), giving the impression of a negative correlation. So older gene families – or duplications – are expressed in more cell types in humans. Similar results were obtained using rat expression data – since the expression datasets had information for both – or using the other expression datasets.

The authors say that a possible explanation for this behaviour is the increase in the number of distinct cell types (blue line, notice the inverted axis again :D), where new genes are likely to be more specific to a cell type (which may have appeared recently itself). Associated with this explanation is the subfunctionalization of duplicated genes, and the tendency to subfunctionalize (“specialize”) can explain the decreased extent of expression. The subfunctionalization process itself might be related to the generation of a new cell phenotype.

One shortcoming of the analysis is that the gene family inference might fail to detect distantly related genes, and therefore what appears to be a gene gain (the “birth” of a new gene family) might be in fact a duplication of a more ancient single gene family. For example if after the duplication number 3 on the first figure the sequences diverged too much, we might wrongly classify them as two gene families. But to be free from this problem is a tall order. The authors also call our attention to the problem of low coverage of some genomes and taxonomic bias.

Milinkovitch, M., Helaers, R., & Tzika, A. (2009). Historical Constraints on Vertebrate Genome Evolution. Genome Biology and Evolution, 2010, 13-18. DOI: 10.1093/gbe/evp052
Tzika, A., Helaers, R., Van de Peer, Y., & Milinkovitch, M. (2007). MANTIS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics, 24 (2), 151-157. DOI: 10.1093/bioinformatics/btm567


Using System-on-a-Chip hardware to speed up alignments

In recent years there has been an explosion of parallel algorithms for solving bioinformatics problems, namely phylogenetic reconstruction and sequence alignment. These algorithms follow the growth of new hardware solutions like Field-Programmable Gate Arrays (integrated circuits capable of performing simple instructions in parallel), Cell microprocessors (like the one inside the PlayStation 3), Graphics Processing Units (powerful graphics cards from NVIDIA and ATI) and massively parallel cluster architectures (like the IBM BlueGene). There is now an article describing a parallelized Needleman–Wunsch alignment algorithm for the Tile64 RISC processor.

TILE64 Processor Block Diagram

The Tile64 card is composed of 64 processor cores, with each core running its own Linux OS and standard programs, and communicating using the Tilera API. The Tile64 is a System on Chip (SoC), which can therefore be plugged into a PCI slot and used independently from the CPU. On the other hand it can handle only integer instructions, which limits its usability for numerical computations.

The Needleman–Wunsch algorithm is used for global sequence alignment. That is, given two sequences, it tries to maximize the alignment score by including as few gaps as possible in each of the sequences. It is closely related to the Smith-Waterman algorithm for local alignment, which tries to find the pair of subsequences with the highest positive score – where the score function is almost the same as for Needleman–Wunsch.

Both algorithms are dynamic programming methods where a matrix is built with the scores for all possible pairwise combinations (the solution is found by backtracking after the matrix is complete). After initialization of the matrix (first row and first column) the score of a cell can be calculated by looking at its immediate top, left and top-left neighbor cells, represented by the arrows in the figure below. For example the score of cell q4d4 depends only on q4d3, q3d3 and q3d4.
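The matrix fill described above can be sketched in a few lines. This is a plain textbook Needleman–Wunsch with a linear gap penalty, not the FastLSA implementation from the article; the scoring values (match 1, mismatch -1, gap -1) are arbitrary assumptions and the backtracking step is omitted.

```python
def needleman_wunsch_matrix(query, database, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score matrix; the last cell holds the optimal score."""
    rows, cols = len(query) + 1, len(database) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: the first column and first row contain only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell depends only on its top-left, top and left neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == database[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in the database sequence
                score[i][j - 1] + gap,       # gap in the query sequence
            )
    return score


matrix = needleman_wunsch_matrix("ACGT", "AGT")
print(matrix[-1][-1])
```

For Smith-Waterman the recurrence would additionally clamp each cell at zero and the answer would be the maximum over the whole matrix rather than the last cell.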

Alignment matrix for a pair of sequences, adapted from ref. 2

In the article they use an implementation of the FastLSA algorithm, a parallel version of Needleman–Wunsch that, instead of storing the whole matrix, stores one row/column combination per block, since depending on the sequence length the memory requirements for the whole matrix can become prohibitive. In other words it stores the score values only for a grid of rows and columns (e.g. at every ten sites). In [1] they claim that this implementation is therefore well suited for very long sequences, which cannot be handled for instance by the “needle” application of the EMBOSS package or the CUDA implementation of the Smith-Waterman algorithm [2].

The parallelism is achieved by noticing that the cells belonging to the same anti-diagonal (one such anti-diagonal is represented in gray) can be calculated independently. Thus distinct cores can calculate the scores of these cells at the same time, with the so-called wavefront parallelism. Their solution achieved speed-ups of 20 times over similar programs – even though their SoC implementation is in C and the other CPU implementations are in Java.
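The wavefront order can be sketched as below. The code runs sequentially and only shows the order in which cells become available: every cell of one anti-diagonal reads exclusively from the two previous anti-diagonals, so the inner loop is the part that could be handed to different cores. It does not use the Tilera API, and the function names and scoring values are assumptions.

```python
def antidiagonals(rows, cols):
    """Yield the (i, j) cells of each anti-diagonal, skipping row 0 and column 0."""
    for k in range(2, rows + cols - 1):  # k = i + j is constant along an anti-diagonal
        yield [(i, k - i) for i in range(max(1, k - cols + 1), min(rows, k))]


def wavefront_fill(query, database, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch matrix one anti-diagonal at a time."""
    rows, cols = len(query) + 1, len(database) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for cells in antidiagonals(rows, cols):
        # The cells in this list are mutually independent: a parallel version
        # would distribute them among the cores and synchronize after each list.
        for i, j in cells:
            pair = match if query[i - 1] == database[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score


print(list(antidiagonals(3, 3)))
print(wavefront_fill("ACGT", "AGT")[-1][-1])
```

The number of cells per anti-diagonal grows and then shrinks, so the cores are not all busy at the start and at the end of the fill.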

References:
[1] Galvez, S., Diaz, D., Hernandez, P., Esteban, F., Caballero, J., & Dorado, G. (2010). Next-generation bioinformatics: using many-core processor architecture to develop a web service for sequence alignment. Bioinformatics, 26 (5), 683-686. DOI: 10.1093/bioinformatics/btq017
[2] Manavski, S., & Valle, G. (2008). CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9 (Suppl 2). DOI: 10.1186/1471-2105-9-S2-S10


Harold Jeffreys’ Theory of Probability Revisited (Statistical Science 2009, Vol. 24, No. 2)

There is a paper (with discussion) in Statistical Science 2009, Vol. 24, No. 2 about Harold Jeffreys’ “Theory of Probability” book, which is one of the foundations of Bayesian Statistics. The Institute of Mathematical Statistics (responsible for Statistical Science) encourages the deposit of the preprint-formatted articles in arXiv, thus providing us with the advantages of both publishing worlds: the broad sharing of knowledge provided by the open repository and the quality of academic peer review. Here are the articles, with brief quotations from the abstracts or first paragraphs:

Harold Jeffreys’s Theory of Probability Revisited (Christian P. Robert, Nicolas Chopin, Judith Rousseau) DOI: 10.1214/09-STS284 (arXiv:0804.3173v7)

Published exactly seventy years ago, Jeffreys’s Theory of Probability (1939) has had a unique impact on the Bayesian community and is now considered to be one of the main classics in Bayesian Statistics as well as the initiator of the objective Bayes school. In particular, its advances on the derivation of noninformative priors as well as on the scaling of Bayes factors have had a lasting impact on the field. However, the book reflects the characteristics of the time, especially in terms of mathematical rigor. In this paper we point out the fundamental aspects of this reference work, especially the thorough coverage of testing problems and the construction of both estimation and testing noninformative priors based on functional divergences. Our major aim here is to help modern readers in navigating in this difficult text and in concentrating on passages that are still relevant today.

Comment on “Harold Jeffreys’s Theory of Probability Revisited” (José M. Bernardo) DOI: 10.1214/09-STS284E  (arXiv:1001.2967v1)

The authors provide an authoritative lecture guide of Theory of Probability, where they clearly state that the more useful material today is that contained in Chapters 3 and 5, which respectively deal with estimation, and hypothesis testing. We argue that, from a contemporary viewpoint, the impact of Jeffreys proposals on those two problems is rather different, and we describe what we perceive to be the state of the question nowadays, suggesting that Jeffreys’s dramatically different treatment is not necessary, and that a joint objective approach to those two problems is indeed possible.

Bayes, Jeffreys, Prior Distributions and the Philosophy of Statistics (Andrew Gelman) DOI: 10.1214/09-STS284D  (arXiv:1001.2968v1)

 (…) In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of  us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in  the process of data analysis.

Comment: The Importance of Jeffreys’s Legacy (Robert Kass) DOI: 10.1214/09-STS284A  (arXiv:1001.2970v1)

Theory of Probability is distinguished by several high-level philosophical attitudes, some stressed by Jeffreys, some implicit. By reviewing these we may recognize the importance in this work in the historical development of statistics.

Comment on “Harold Jeffreys’s Theory of Probability Revisited” (Dennis Lindley) DOI: 10.1214/09-STS284F  (arXiv:1001.3073v1)

I was taught by Harold Jeffreys, having attended his postgraduate lectures at Cambridge in the academic year 1946-1947, and also knew him when I joined the Faculty there. I thought I appreciated the Theory of Probability rather well, so was astonished to read this splendid paper, which so successfully sheds new light on the book by placing it in the context of recent developments.

Comment on “Harold Jeffreys’s Theory of Probability Revisited” (Stephen Senn) DOI: 10.1214/09-STS284B  (arXiv:1001.2975v1)

I have always felt very guilty about Harold Jeffreys’s Theory of Probability (referred to as ToP, hereafter). I take seriously George Barnard’s injunction (Barnard, 1996) to have some familiarity with the  four great systems of inference. I also consider it a duty and generally find it a pleasure to read the classics, but I find Jeffreys much harder going than Fisher, Neyman and Pearson fils or De Finetti. So I was intrigued to learn that Christian Robert and colleagues had produced an extensive chapter by chapter commentary on Jeffreys, honored to be invited to comment but apprehensive at the task.

Comment on “Harold Jeffreys’s Theory of Probability Revisited” (Arnold Zellner) DOI: 10.1214/09-STS284C  (arXiv:1001.2985v1)

The authors are to be congratulated for their deep appreciation of Jeffreys’s famous book, Theory of Probability, and their very impressive, knowledgeable consideration of its contents, chapter by chapter. Many will benefit from their analyses of topics in Jeffreys’s book. As they state in their abstract, “Our major aim here is to help modern readers in navigating this difficult text and in concentrating on  passages that are still relevant today.” From what follows, it might have been more accurate to use the phrase, “modern well-informed Bayesian statisticians” rather than “modern readers” since the  authors’ discussions assume a rather advanced knowledge of modern Bayesian statistics.

Rejoinder: Harold Jeffreys’s Theory of Probability Revisited (Christian P. Robert, Nicolas Chopin, Judith Rousseau) DOI: 10.1214/09-STS284REJ  (arXiv:0909.1008v2)

We are grateful to all discussants of our re-visitation for their strong support in our enterprise and for their overall agreement with our perspective. Further discussions with them and other leading statisticians showed that the legacy of Theory of Probability is alive and lasting.
