Samples have missing collection date and country info

Hi, I am working on a bacterial project and want to build a time-calibrated phylogenetic tree with Nextstrain. However, about 90% of my samples are missing a collection date. I tried augur refine and augur traits, but samples without dates were automatically filtered out, leaving only a very small number of samples with date information.

Secondly, I am also missing country information, though this is less severe: only 30% of samples are affected. Is it possible to estimate the missing information and reconstruct the complete tree, and if so, how? Is sample collection date considered a trait?

Thanks a million!!
Swan

Hi Swan. Thanks for getting in touch. Yes, inferring the date and location for a subset of samples is definitely possible. For an example, you can look at https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome. Here, the “sra-via-andersen-lab” samples from the SRA have a collection date of 2024-XX-XX and lack admin division information. However, they have estimated collection dates and estimated admin divisions, which you can see here: nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=division&f_data_source=sra-via-andersen-lab. You can also mouse over these tips to see the confidence in the assigned date and admin division.

To do this, augur refine was run with the date for these samples set to 2024-XX-XX in the metadata. If you don’t know the collection year of your samples, I believe you can enter the date as XXXX-XX-XX and it should work.
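In case it helps, here is a minimal Python sketch for preparing the date column before passing the metadata to augur refine. The function name to_ambiguous_date and the rule that anything unparseable becomes XXXX-XX-XX are my own assumptions for illustration, not part of augur; adjust them to however missing dates appear in your file.

```python
import re

# Complete or partially masked date, e.g. 2024-03-15, 2024-03-XX, 2024-XX-XX.
DATE_PATTERN = re.compile(r"^(\d{4}|XXXX)-(\d{2}|XX)-(\d{2}|XX)$")


def to_ambiguous_date(raw):
    """Return a date string with unknown parts masked by X (hypothetical helper)."""
    value = (raw or "").strip()
    if DATE_PATTERN.match(value):
        return value  # already in YYYY-MM-DD form, possibly with X masks
    if re.fullmatch(r"\d{4}", value):
        return f"{value}-XX-XX"  # only the year is known
    if re.fullmatch(r"\d{4}-\d{2}", value):
        return f"{value}-XX"  # year and month are known
    # Assumption: empty or unrecognized values are treated as fully unknown.
    return "XXXX-XX-XX"


if __name__ == "__main__":
    for raw in ["2019-07-04", "2019-07", "2019", "", "unknown"]:
        print(repr(raw), "->", to_ambiguous_date(raw))
```

You would apply this to the date column of every row and write the result back to the metadata file, so that no sample is left with an empty date.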

Similarly, augur traits was run with the division for these samples set to ? in the metadata.
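For the country column, a rough sketch along the same lines might look like the following. It assumes a tab-separated metadata file; the helper names and the list of tokens treated as missing are hypothetical, so check them against your own data.

```python
import csv
import io

# Assumption: these spellings mean "missing" in your metadata (compared in lowercase).
MISSING_TOKENS = {"", "na", "n/a", "unknown"}


def mark_missing(rows, column, placeholder="?"):
    """Replace missing values in one column with a placeholder (hypothetical helper)."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        value = (row.get(column) or "").strip()
        if value.lower() in MISSING_TOKENS:
            value = placeholder
        row[column] = value
        cleaned.append(row)
    return cleaned


def rewrite_metadata(tsv_text, column):
    """Read TSV text, mark missing values in `column`, and return new TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = mark_missing(reader, column)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    example = "strain\tcountry\nA\tKenya\nB\t\nC\tunknown\n"
    print(rewrite_metadata(example, "country"))
```

The rewritten file can then be given to augur traits as its metadata, with country as the column to reconstruct.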

Hopefully this helps. If you’d like specific input on your workflow file, we can take a look as well.