Hi all,
I’m creating weekly SARS-CoV-2 builds consisting of a combination of GISAID data and internal data. This usually works fine, but for BA.5 only very few sequences end up in the final build: far fewer than I put into the analysis, and also fewer than seem to pass the filters along the way.
For example, today I included more than 200 BA.5 sequences from Norway (I focus on Norway) and every BA.5 from GISAID, but only 17 Norwegian sequences ended up in the build. In the build file I subsample with a query on the Pango lineage, but perhaps something is happening during the tree refinement? How can I best troubleshoot to figure out why sequences are dropped?
Also, it’d be good if you could share the exact command you run for augur filter. It’s hard to help without these details.
It could well be that your sequences are wrongly classified by pangolin. Try running your sequences through Nextclade and use the Nextstrain_clade or Nextclade_pango field for lineage filtering. This is more robust.
You’re right that some of the sequences could be wrongly classified, at least the ones from GISAID. However, our internal sequences are already classified with Nextclade, and they are still lost somehow.
I use the ncov workflow cloned with git clone https://github.com/nextstrain/ncov.git and pull all the recent changes before running. When I inspect the logs more carefully, though, I see that the missing sequences are dropped because they are listed in the file excluded_by_diagnostics.txt. Here are the augur filter commands and their output:
augur filter --metadata results/sanitized_metadata_bn.tsv.xz --sequences results/aligned_bn.fasta.xz --sequence-index results/combined_sequence_index.tsv.xz --exclude-all --include results/omicron_ba_five/sample-country.txt --output-sequences results/omicron_ba_five/sample-country.fasta 2>&1 | tee logs/extract_subsampled_sequences_omicron_ba_five_country.txt
34929 strains were dropped during filtering
35084 of these were dropped by `--exclude-all`
204 strains were added back because they were in results/omicron_ba_five/sample-country.txt
155 strains passed all filters
augur filter --metadata results/sanitized_metadata_bn.tsv.xz --include defaults/include.txt --exclude defaults/exclude.txt --query '(country != '"'"'Norway'"'"') & (pango_lineage.str.startswith('"'"'BA.5'"'"'))' --priority results/omicron_ba_five/priorities_country.tsv --group-by country year month --subsample-max-sequences 2000 --probabilistic-sampling --output-strains results/omicron_ba_five/sample-related.txt 2>&1 | tee logs/subsample_omicron_ba_five_related.txt
Sampling at 22 per group.
33201 strains were dropped during filtering
204 of these were filtered out by the query: "(country != 'Norway') & (pango_lineage.str.startswith('BA.5'))"
43 were dropped during grouping due to ambiguous month information
32954 of these were dropped because of subsampling criteria
1883 strains passed all filters
augur filter --sequences results/aligned_bn.fasta.xz --metadata results/sanitized_metadata_bn.tsv.xz --exclude-all --include results/omicron_ba_five/sample-country.txt results/omicron_ba_five/sample-related.txt --output-sequences results/omicron_ba_five/omicron_ba_five_subsampled_sequences.fasta.xz --output-metadata results/omicron_ba_five/omicron_ba_five_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_omicron_ba_five.txt
33231 strains were dropped during filtering
35084 of these were dropped by `--exclude-all`
204 strains were added back because they were in results/omicron_ba_five/sample-country.txt
1883 strains were added back because they were in results/omicron_ba_five/sample-related.txt
1853 strains passed all filters
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
1822 strains were dropped during filtering
1821 of these were dropped because they were in results/omicron_ba_five/excluded_by_diagnostics.txt
1 of these were filtered out by the query: "(_length >= 28500)"
31 strains passed all filters
I see that the file excluded_by_diagnostics.txt is generated during the pipeline, but I’m not sure why these sequences are added to it.
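To check which of the focal strains actually land in the exclusion list, something like the following could help. It is only a rough sketch: it assumes both files are plain text with one strain name per line, and the helper names (read_strains, excluded_focal_strains) are made up for illustration. The paths are the ones from the logs above.

```python
from pathlib import Path


def read_strains(lines):
    """Collect strain names from an iterable of lines, skipping blank lines and # comments."""
    strains = set()
    for line in lines:
        name = line.strip()
        if name and not name.startswith("#"):
            strains.add(name)
    return strains


def excluded_focal_strains(focal_lines, excluded_lines):
    """Return the focal strains that also appear in the exclusion list, sorted by name."""
    return sorted(read_strains(focal_lines) & read_strains(excluded_lines))


if __name__ == "__main__":
    # Paths taken from the logs above; adjust to your own build name.
    build_dir = Path("results/omicron_ba_five")
    focal_path = build_dir / "sample-country.txt"
    excluded_path = build_dir / "excluded_by_diagnostics.txt"
    if focal_path.exists() and excluded_path.exists():
        with focal_path.open() as focal, excluded_path.open() as excluded:
            lost = excluded_focal_strains(focal, excluded)
        print(f"{len(lost)} focal strains are in {excluded_path.name}")
        for strain in lost:
            print(strain)
```

If most of the missing Norwegian sequences show up in that list, the diagnostics step rather than the subsampling or tree refinement is the place to look.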
Could it have something to do with the snp_clusters parameter to diagnostic.py? Perhaps the BA.5 sequences are too divergent? I’m not entirely sure how to interpret this parameter, though.
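One way to test that guess would be to count how many sequences exceed a cutoff on each QC metric in the metadata. The sketch below rests on several assumptions I have not verified: that the diagnostics step flags strains whose per-sequence metric is above a threshold, that the metadata has numeric columns named snp_clusters and rare_mutations, and that the cutoffs shown are anywhere near the real ones (they are placeholders). The function names are hypothetical as well.

```python
import csv
import lzma


def read_metadata(path):
    """Load a tab-separated metadata file (optionally .xz compressed) as a list of dicts."""
    opener = lzma.open if str(path).endswith(".xz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def count_exceeding(rows, thresholds):
    """Count, per column, the rows whose numeric value is strictly above the threshold.

    Missing, empty, or non-numeric values are ignored rather than counted.
    """
    counts = {column: 0 for column in thresholds}
    for row in rows:
        for column, limit in thresholds.items():
            try:
                value = float(row.get(column, ""))
            except (TypeError, ValueError):
                continue
            if value > limit:
                counts[column] += 1
    return counts


# Placeholder cutoffs and assumed column names; replace both with what your
# workflow configuration and metadata actually contain.
example_thresholds = {"snp_clusters": 1, "rare_mutations": 45}
example_rows = [
    {"strain": "a", "snp_clusters": "2", "rare_mutations": "10"},
    {"strain": "b", "snp_clusters": "0", "rare_mutations": "50"},
    {"strain": "c", "snp_clusters": "", "rare_mutations": "5"},
]
print(count_exceeding(example_rows, example_thresholds))
```

For the real data, the rows could come from read_metadata("results/omicron_ba_five/omicron_ba_five_subsampled_metadata.tsv.xz"). If one metric accounts for nearly all of the 1821 excluded strains, that would point to the cutoff that is too strict for BA.5.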