Relating the Phylogenetic Tree result to a Transmission Tree of Infectious Disease Outbreaks

Hi all,

First, thanks for this amazing tool and the accompanying documentation.

In my research, I aim to infer Avian Influenza transmission events for the period 2019-2021 from an outbreak dataset. To validate my proposed model, I would like to obtain some kind of ground truth for Avian Influenza transmission events. I thought that phylogenetic analysis could address this validation issue at a coarser level. This is how I came across the site and the article “Nextstrain: real-time tracking of pathogen

I do not have any bioinformatics background. There is some interesting work on Avian Influenza transmission based on phylogenetic analysis (e.g. link, see also Fig 3 in this paper), but no code is provided.
So, for now, my aim is just to rely on the four phylogenetic analyses for Avian Influenza that you have on Nextstrain: H5N1, H5NX, H7N9, H9N2 (e.g. auspice). From these, I should be able to download the corresponding phylogenetic trees in Nexus format. Then, I can extract a transmission graph between locations by combining these four trees for a specific period (what you have called “transmission lines”).
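
To make the idea concrete, here is a minimal sketch of how location-to-location transitions could be counted along the branches of a tree and merged across several trees. It assumes each tree has already been loaded into a nested dict in which every node carries a location label and a decimal date; the field names `location`, `date`, and `children`, and both function names, are hypothetical and do not come from any Nextstrain export. Note that locations on internal nodes would themselves be inferred, so a location change on a branch is only a candidate transmission link, not a confirmed event.

```python
from collections import Counter

def count_location_transitions(root, start=None, end=None):
    """Count parent -> child location changes in a nested-dict tree.

    Assumed (hypothetical) node layout:
      {"location": str, "date": float, "children": [...]}
    Only branches whose child date falls within [start, end] are counted
    when those bounds are given.
    """
    counts = Counter()
    stack = [root]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        for child in node.get("children", []):
            stack.append(child)
            src, dst = node.get("location"), child.get("location")
            if src is None or dst is None or src == dst:
                continue  # no location change on this branch
            date = child.get("date")
            if start is not None and (date is None or date < start):
                continue
            if end is not None and (date is None or date > end):
                continue
            counts[(src, dst)] += 1
    return counts

def merge_counts(per_tree_counts):
    """Sum transition counts from several trees (e.g. one per subtype)."""
    total = Counter()
    for counts in per_tree_counts:
        total.update(counts)
    return total

# Toy tree with made-up locations and dates, for illustration only.
toy_tree = {
    "location": "A", "date": 2018.5,
    "children": [
        {"location": "A", "date": 2019.2, "children": [
            {"location": "B", "date": 2019.8},
            {"location": "A", "date": 2020.1},
        ]},
        {"location": "C", "date": 2022.0},
    ],
}

all_edges = count_location_transitions(toy_tree)
in_period = count_location_transitions(toy_tree, start=2019.0, end=2021.99)
print(dict(all_edges))
print(dict(in_period))
```

The merged counter can be read as a weighted directed graph between locations, where each key is a (source, destination) pair and each value is the number of branches showing that change.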

I would like to ask a couple of questions:

  1. In auspice, we see this information: “Showing 2103 of 2103 genomes sampled between Dec 1996 and Mar 2022.” How can I obtain the sources of the genome data used in these four analyses? For instance, are you using GitHub repositories of genomic data?
  2. Unless I am mistaken, the phylogenetic trees I downloaded in Nexus format do not contain any probability-related values on the edges between nodes that would indicate Bayes factor results. Is there a way to get these values?
  3. Do you have any recommendations regarding my objective of obtaining a transmission event dataset? If you can point me to other works or tools (in R or Python) on Avian Influenza, I would be grateful.


Hi @arinik9! Thanks for your interest in the avian-flu builds.

  1. The sequence data in the trees come originally from GISAID and/or the Influenza Research Database (IRD). To access data from GISAID, you need to create an account and sign a user agreement on their website. All data from the IRD is pulled from GenBank and is publicly available for use. Due to GISAID policy, we are not able to repost or republish any sequence data. The builds are also rebuilt each month, and for each build, we randomly subsample the genomes as detailed in the Snakefile in the GitHub repo. For example, for H5N1, we sample 10 genomes per country per year every time the builds are redone, meaning that the exact strains used in each build change from month to month.

  2. That’s correct, the downloadable trees do not contain information about transmission inference.

  3. To do a proper transmission analysis, you would need to take care to generate a dataset with consideration for sampling depending on the transmission model you are using. There is no simple or straightforward answer that I could point you to here, and the circumstances and details of the ongoing avian influenza outbreak are not fully understood or characterized. To my knowledge, there is no existing ground truth dataset for this ongoing outbreak. However, you may want to read this paper, which did a really nice and thorough analysis of a previous H5N8 outbreak in North America.
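
To illustrate why the exact strains differ between builds (point 1 above), here is a minimal sketch of random subsampling with a cap per country and year. This is not the logic from the Snakefile; the record fields (`strain`, `country`, `year`) and the function name are assumptions made for the example.

```python
import random
from collections import defaultdict

def subsample_per_group(records, per_group=10, seed=None):
    """Randomly keep at most `per_group` records per (country, year) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["country"], rec["year"])].append(rec)

    sampled = []
    for key in sorted(groups):  # fixed group order; randomness is within groups
        members = groups[key]
        k = min(per_group, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled

# Made-up metadata: 25 genomes from one group, 3 from another.
toy_records = (
    [{"strain": f"x{i}", "country": "X", "year": 2020} for i in range(25)]
    + [{"strain": f"y{i}", "country": "Y", "year": 2021} for i in range(3)]
)

build_a = subsample_per_group(toy_records, per_group=10, seed=1)
build_b = subsample_per_group(toy_records, per_group=10, seed=2)
print(len(build_a), len(build_b))
```

Both runs keep the same number of genomes per group, but with different seeds the selected strains from the larger group will generally differ, which is the behaviour described for the monthly rebuilds.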
