Samples have missing collection date and country info

Hi, I am working on a bacterial project and want to build a time-calibrated phylogenetic tree with Nextstrain. However, 90% of my samples are missing a collection date. I tried augur refine and augur traits, but samples without dates were automatically filtered out, leaving only a very small number of samples with date information.

Secondly, I am also missing country information, although this is less severe: only 30% of samples lack it. Is it possible to estimate this missing information and reconstruct the complete tree, and if so, how? Is sample collection date considered a trait?

Thanks a million!!
Swan

Hi Swan. Thanks for getting in touch. Yes, inferring the date and location for a subset of samples is definitely possible. For an example, you can look at https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome. Here, the “sra-via-andersen-lab” samples from the SRA have a collection date of 2024-XX-XX and lack admin division information. However, they have estimated collection dates and estimated admin divisions, which you can see here: nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=division&f_data_source=sra-via-andersen-lab. You can also mouse over these tips to see the confidence in the assigned date and admin division.

To do this, augur refine was run with the date for these samples set to 2024-XX-XX in the metadata. If you don’t know the collection year of your samples, I believe you can enter the date as XXXX-XX-XX and it should work.

Similarly, augur traits was run with the division for these samples set to ? in the metadata.
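
As a rough sketch of the metadata preparation described above, the shell function below rewrites a tab-separated metadata file so that an empty or ? date becomes XXXX-XX-XX and an empty country becomes ?. The function name fill_missing_metadata and the column names date and country are assumptions for illustration, so adjust them to match your own file.

```shell
# Sketch only: assumes a TSV with a header row containing columns named
# "date" and "country" (both names are assumptions about your metadata).
fill_missing_metadata() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            # Locate the date and country columns from the header row.
            for (i = 1; i <= NF; i++) {
                if ($i == "date") d = i
                if ($i == "country") c = i
            }
            print
            next
        }
        # Unknown dates: use the fully masked form XXXX-XX-XX.
        d && ($d == "" || $d == "?") { $d = "XXXX-XX-XX" }
        # Unknown countries: use ? as the missing-value marker.
        c && $c == "" { $c = "?" }
        { print }
    ' "$1"
}
```

For example, running fill_missing_metadata metadata.tsv > metadata_filled.tsv writes a new file (metadata_filled.tsv is a hypothetical name) that could then be passed to augur refine and augur traits with --metadata.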

Hopefully this helps. If you’d like specific input on your workflow file, we can take a look as well.

Hi Dr. Bedford,

Thank you for your helpful response. I replaced all the missing data (date and country) with ?. My next question is: how do I know whether strains with missing data are still kept in the refined tree? After augur filter, I have 16,538 sequences, which is expected. However, after augur refine, I am left with 13,105. Eyeballing the tree, I see only strains with complete metadata. The resulting tree may still work, but I am curious and would like to troubleshoot where I went wrong in retaining sequences with missing data.

My workflow is as follows (I explicitly align filtered.fasta prior to running iqtree):

augur index --sequences sequence.fasta --output sequence_index.tsv
augur filter --sequences sequence.fasta --sequence-index sequence_index.tsv --metadata metadata.tsv --output filtered.fasta --exclude-all --include included_strains.txt 
augur tree --method iqtree --alignment filtered.fasta --output tree_raw.nwk --nthreads auto
augur refine --tree tree_raw.nwk --alignment filtered.fasta --metadata metadata.tsv --output-tree tree.nwk --output-node-data branch_lengths.json --timetree --root best --divergence-units mutations --clock-filter-iqd 4 --stochastic-resolve
augur ancestral --tree tree.nwk --alignment filtered.fasta --output-node-data nt_muts.json --inference joint --output-sequence ancestral.fasta --keep-ambiguous --keep-overhangs
augur translate --tree tree.nwk --ancestral-sequences nt_muts.json --reference-sequence ref.gb --output-node-data aa_muts.json --alignment-output aligned_aa_%GENE.fasta
augur traits --tree tree.nwk --metadata metadata.tsv --columns country --output-node-data traits.json 
augur export v2 --tree tree.nwk --metadata metadata.tsv --node-data branch_lengths.json nt_muts.json aa_muts.json traits.json --lat-longs lat_longs.tsv --auspice-config auspice_config.json --output auspice/result.json

Thanks again!!
Swan
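
For the question above about whether strains with missing data survive augur refine, one possible check is to compare the sequence names in filtered.fasta with the tip names in tree.nwk. The sketch below is a minimal version of that comparison. The function names fasta_names, newick_tips and dropped_strains are hypothetical helpers, not part of augur, and the Newick parsing assumes unquoted tip names that contain no parentheses, commas, colons or semicolons.

```shell
# Sketch only: these helpers are illustrative and not part of augur.

# Print one sequence name per FASTA record (text after ">" up to the
# first whitespace).
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print tip names from a Newick file. Tip labels follow "(" or ",",
# while internal node labels follow ")" and are therefore skipped.
# Assumes unquoted names without ( ) , : ; characters.
newick_tips() {
    tr -d '\n' < "$1" | grep -oE '[(,][^(),:;]+' | sed 's/^[(,]//'
}

# List names present in the FASTA file but absent from the tree.
dropped_strains() {
    workdir=$(mktemp -d)
    fasta_names "$1" | LC_ALL=C sort -u > "$workdir/fasta.txt"
    newick_tips "$2" | LC_ALL=C sort -u > "$workdir/tips.txt"
    # comm -23 keeps lines that appear only in the first file.
    LC_ALL=C comm -23 "$workdir/fasta.txt" "$workdir/tips.txt"
    rm -rf "$workdir"
}
```

Running dropped_strains filtered.fasta tree.nwk > dropped.txt would list the strains that are missing from the refined tree (dropped.txt is a hypothetical file name). Looking up those names in metadata.tsv then shows whether the missing strains are the ones whose date is ? rather than the XXXX-XX-XX form suggested earlier in the thread.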