Hi, I am working on a bacterial project and want to build a time-calibrated phylogenetic tree with Nextstrain. However, 90% of my samples are missing a collection date. I tried augur refine and augur traits, but samples without dates were filtered out automatically, leaving only a very small number of samples with date information.
Secondly, I am also missing country information, although this is less severe: only 30% of samples lack it. Is it possible to estimate this missing information and reconstruct the complete tree, and if so, how? Is sample collection date considered a trait?
Hi Swan. Thanks for getting in touch. Yes, inferring the date and location for a subset of samples is definitely possible. For an example, you can look at https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome. Here, the “sra-via-andersen-lab” samples from the SRA have a collection date of 2024-XX-XX and lack information for admin division. However, they have estimated collection dates and estimated admin divisions, which you can see here: nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=division&f_data_source=sra-via-andersen-lab. You can also mouse over these tips to see the confidence in the assigned date and assigned admin division.
To do this, augur refine was run with these samples having a date of 2024-XX-XX in the metadata. If you don’t know the collection year of your samples, I believe you can enter the date as XXXX-XX-XX and it should work.
Similarly, augur traits was run with these samples having a division of ? in the metadata.
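As a rough illustration of the placeholders described above, here is a minimal Python sketch that fills missing values in a tab-separated metadata table before it is passed to augur. The column names `date` and `country`, the helper names `fill_missing` and `fill_metadata`, and the list of strings treated as "missing" are all assumptions for illustration; adjust them to match your own metadata. It only rewrites the metadata and does not call augur itself.

```python
import csv
import io

# Strings treated as "missing" (an assumption; extend to match your metadata).
MISSING = {"", "?", "na", "n/a", "unknown"}


def fill_missing(row, date_col="date", trait_col="country"):
    """Return a copy of one metadata row with placeholders for missing values."""
    fixed = dict(row)
    # Fully unknown date: ambiguous-date placeholder, as suggested above.
    if (fixed.get(date_col) or "").strip().lower() in MISSING:
        fixed[date_col] = "XXXX-XX-XX"
    # Unknown discrete trait: "?" placeholder, as suggested above.
    if (fixed.get(trait_col) or "").strip().lower() in MISSING:
        fixed[trait_col] = "?"
    return fixed


def fill_metadata(tsv_text, date_col="date", trait_col="country"):
    """Apply fill_missing to every row of a TSV string and return the new TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(fill_missing(row, date_col, trait_col))
    return out.getvalue()


if __name__ == "__main__":
    example = "strain\tdate\tcountry\nA\t2020-05-01\tKenya\nB\t\t\n"
    print(fill_metadata(example), end="")
```

Note that the two columns get different placeholders: the date column uses the XXXX-XX-XX form, while the trait column uses ?.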
Hopefully this helps. If you’d like specific input on your workflow file, we can take a look as well.
Thank you for your helpful response. I replaced all the missing data (date and country) with ?. My next question is: how do I know whether strains with missing data are still kept in the refined tree? After running augur filter, I have 16,538 sequences, which is expected. However, after running augur refine, I was left with 13,105. Eyeballing the tree, I saw only those with complete metadata. The resulting tree may still work, but I am curious and want to troubleshoot where I went wrong in retaining sequences with missing data.
My workflow is as follows (I explicitly align filtered.fasta prior to running iqtree):