Hi,
I was running the Nextstrain job on a server, and all of a sudden all the samples from the Czech Republic that should have been included were dropped:
4770074 strains were dropped during filtering
    1427549 had no sequence data
    39 of these were dropped because they were in defaults/exclude.txt
    1987126 of these were dropped because of 'region=Europe'
    116264 strains were added back because they were in defaults/include.txt
    1438229 of these were dropped because of subsampling criteria
83931 strains passed all filters

Sampling at 4 per group.
4768722 strains were dropped during filtering
    1427549 had no sequence data
    39 of these were dropped because they were in defaults/exclude.txt
    185556 of these were dropped because of 'country=Germany'
    1439291 of these were dropped because of 'region!=Europe'
    116264 strains were added back because they were in defaults/include.txt
    2 were dropped during grouping due to ambiguous month information
    1799064 of these were dropped because of subsampling criteria
85283 strains passed all filters

4765027 strains were dropped during filtering
    4854005 of these were dropped by `--exclude-all`
    116264 strains were added back because they were in results/November_metadata/sample-division.txt
    118897 strains were added back because they were in results/November_metadata/sample-country.txt
    118678 strains were added back because they were in results/November_metadata/sample-region.txt
    117326 strains were added back because they were in results/November_metadata/sample-global.txt
88978 strains passed all filters
As you can see, there is a difference between "117326 strains were added back because they were in results/November_metadata/sample-global.txt" and "88978 strains passed all filters".
I am sure the Czech sequences should be included, because I wrote all their names in the defaults/include.txt file. At least all the Polish sequences that I also listed in this file were in the final tree.
I tried to understand why this is happening, and here is what I found:
In /results/aligned_November-data.fasta.xz the Czech sequences are named, for example, >CzechRepublic/IAB_10/2020, while in the sanitized metadata the same sequence is named Czech_Republic/IAB_10/2020 (they are also named this way in the results/November_metadata/sample-*.txt files).
In defaults/include.txt I listed the sequence names in this format: Czech_Republic/IAB_10/2020.
I believe that renaming all the Czech sequences in defaults/include.txt might help, but I am not sure about it…
I also checked your SARS-CoV-2 analysis for Europe, and there the name of a sequence from Czechia in the tree is hCoV-19/CzechRepublic/77SVUPHA_8645989933/2021 (without “_” between Czech and Republic).
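A minimal sketch of how the name mismatch could be checked before rerunning the whole build. It assumes the include list holds one strain name per line with optional '#' comments, and that sequence names are the first whitespace-delimited token of each FASTA header. The helper names (open_text, fasta_names, include_names, normalize, find_mismatches) and the normalization rule (ignore case, underscores, and whitespace) are illustrative assumptions, not part of the Nextstrain workflow.

```python
import lzma
import re


def open_text(path):
    """Open a plain or .xz-compressed text file for reading."""
    if str(path).endswith(".xz"):
        return lzma.open(path, "rt")
    return open(path, "r")


def fasta_names(lines):
    """Yield the name from each FASTA header (text after '>' up to the first whitespace)."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def include_names(lines):
    """Yield strain names from an include list, skipping blank lines and '#' comments."""
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            yield name


def normalize(name):
    """Assumed normalization: ignore case, underscores, and whitespace."""
    return re.sub(r"[_\s]+", "", name).lower()


def find_mismatches(include, fasta):
    """Map each include name missing from the FASTA to a likely variant spelling, or None."""
    fasta = set(fasta)
    by_normalized = {}
    for fasta_name in fasta:
        by_normalized.setdefault(normalize(fasta_name), fasta_name)
    return {
        name: by_normalized.get(normalize(name))
        for name in include
        if name not in fasta
    }
```

With the files from this build, the check could be run as find_mismatches(include_names(open_text("defaults/include.txt")), fasta_names(open_text("results/aligned_November-data.fasta.xz"))). Any entry whose value is not None is a name that exists in the alignment only under a different spelling, such as Czech_Republic versus CzechRepublic.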
Here is my build. It might also give you more information on why this happened.
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]
inputs:
  - name: November-data
    sequences: /projects/p_cov2muta/data/sequences.fasta
    metadata: /projects/p_cov2muta/data/metadata_latin.tsv
builds:
  November_metadata:
    subsampling_scheme: custom_division
    region: Europe
    country: Germany
    division: Saxony
traits:
  default:
    sampling_bias_correction: 2.5
    columns: ["country","division"]
files:
  auspice_config: "my_profiles/Saxony/auspice_config.json"
  description: "my_profiles/example/my_description.md"
subsampling:
  custom_division:
    division:
      group_by: "year month"
      max_sequences: 10000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
    country:
      group_by: "division year month"
      max_sequences: 4400
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country!={country}' 'division={division}'"
    region:
      group_by: "country year month"
      max_sequences: 3000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    global:
      group_by: "country year month"
      max_sequences: 1100
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"
Unfortunately, I also ran into one more problem. I am interested in reconstructing ancestral nodes in the tree at the division level. During the run I got the following error:
ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.
I worked around this by changing the limit in the script from 300 to 999999999. I don't think this was the best way to solve the problem, but what is done is done.
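A minimal sketch for checking how many distinct division values the metadata contains, which is presumably what the 300-state limit in the error message is counted against. It assumes a tab-separated metadata file with a header row. Treating empty strings and "?" as missing values is an assumption, as are the function name distinct_states and the tiny example table.

```python
import csv
import io

STATE_LIMIT = 300  # the threshold quoted in the error message


def distinct_states(tsv_file, column, missing=("", "?")):
    """Return the set of distinct non-missing values in one column of a TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    states = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value not in missing:  # assumption: "" and "?" mean "no data"
            states.add(value)
    return states


# Illustrative data only; a real run would pass an open metadata file instead.
example = io.StringIO(
    "strain\tcountry\tdivision\n"
    "a\tGermany\tSaxony\n"
    "b\tGermany\tBavaria\n"
    "c\tPoland\t?\n"
    "d\tGermany\tSaxony\n"
)
states = distinct_states(example, "division")
print(len(states), "distinct divisions; limit is", STATE_LIMIT)
```

Running the same function on the metadata of the subsampled strains, rather than on the full input metadata, would show how far over the limit the division column actually is for this tree.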
When my tree was completed, I realized that something was wrong with the reconstruction. You can see this in the attached screenshot.
The probability that this node belongs to Germany is 0.55, but the division is North America with a probability of 0.95, which makes no sense.
Is there a way to reconstruct divisions and not only countries?
I also have an idea for solving this problem, but I’m not sure whether it is a good approach. For the sequences I am interested in, I could replace the region (or country) names in the metadata file with the division names, and then specify in builds.yaml that I want to reconstruct regions. That way the reconstruction would contain only the divisions I am interested in, which might avoid the problem. However, I am afraid this will affect the final tree, and in the end the reconstruction will not make sense at all.
Or maybe I can specify the divisions that I want to reconstruct?
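A minimal sketch of a variant of this idea: instead of overwriting the region values, it writes a new metadata column that holds the division for sequences from one focal country and the region for everything else. The original region, country, and division columns stay untouched, so filters that refer to them keep seeing the same values. The column name focus_state and both function names are hypothetical. Whether the workflow accepts such a column in the traits "columns" setting is an assumption that would need to be checked.

```python
import csv
import io


def add_focus_column(rows, focus_country, new_column="focus_state"):
    """Return row copies with new_column = division inside focus_country, else region."""
    out = []
    for row in rows:
        row = dict(row)
        if row.get("country") == focus_country:
            row[new_column] = row.get("division", "")
        else:
            row[new_column] = row.get("region", "")
        out.append(row)
    return out


def rewrite_metadata(src, dst, focus_country, new_column="focus_state"):
    """Copy a TSV from src to dst, appending the collapsed-state column."""
    reader = csv.DictReader(src, delimiter="\t")
    rows = add_focus_column(reader, focus_country, new_column)
    fieldnames = list(reader.fieldnames) + [new_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Illustrative data only; strain names and non-German divisions are made up.
src = io.StringIO(
    "strain\tregion\tcountry\tdivision\n"
    "a\tEurope\tGermany\tSaxony\n"
    "b\tEurope\tPoland\tMazovia\n"
    "c\tNorth America\tUSA\tTexas\n"
)
dst = io.StringIO()
rewrite_metadata(src, dst, "Germany")
print(dst.getvalue())
```

With this layout the number of distinct states is the number of divisions in the focal country plus the number of regions, which should stay far below 300.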
Thank you in advance!
Best wishes,
Dmitrii