Drop of Czech Republic samples & division reconstruction

Hi,

I was running the Nextstrain job on a server, and all of a sudden all the samples for the Czech Republic that should have been included were dropped:

4770074 strains were dropped during filtering
	1427549 had no sequence data
	39 of these were dropped because they were in defaults/exclude.txt
	1987126 of these were dropped because of 'region=Europe'
	116264 strains were added back because they were in defaults/include.txt
	1438229 of these were dropped because of subsampling criteria
83931 strains passed all filters
Sampling at 4 per group.
4768722 strains were dropped during filtering
	1427549 had no sequence data
	39 of these were dropped because they were in defaults/exclude.txt
	185556 of these were dropped because of 'country=Germany'
	1439291 of these were dropped because of 'region!=Europe'
	116264 strains were added back because they were in defaults/include.txt
	2 were dropped during grouping due to ambiguous month information
	1799064 of these were dropped because of subsampling criteria
85283 strains passed all filters
4765027 strains were dropped during filtering
	4854005 of these were dropped by `--exclude-all`
	116264 strains were added back because they were in results/November_metadata/sample-division.txt
	118897 strains were added back because they were in results/November_metadata/sample-country.txt
	118678 strains were added back because they were in results/November_metadata/sample-region.txt
	117326 strains were added back because they were in results/November_metadata/sample-global.txt
88978 strains passed all filters

As you can see, there is a difference between "117326 strains were added back because they were in results/November_metadata/sample-global.txt" and "88978 strains passed all filters".

I am sure the Czech sequences should be included, because I wrote all of their names in the defaults/include.txt file. At least all the Polish sequences that I also listed in this file ended up in the final tree.

I tried to understand why this is happening and here is what I found out:

In /results/aligned_November-data.fasta.xz the Czech sequences are named, for example, >CzechRepublic/IAB_10/2020, while in the sanitized metadata the same sequence is named Czech_Republic/IAB_10/2020 (the sequences are also named this way in the results/November_metadata/sample-*.txt files).

In defaults/include.txt I included sequence names in this format: Czech_Republic/IAB_10/2020.

I believe that renaming all the Czech sequences in defaults/include.txt to match the FASTA names might help, but I am not sure about it.

I also checked your SARS-CoV-2 analysis for Europe, and there a Czech sequence appears in the tree as hCoV-19/CzechRepublic/77SVUPHA_8645989933/2021 (without “_” between Czech and Republic).
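
The mismatch described above could be checked with a short script before editing the include file. This is only a minimal sketch: it assumes the two spellings differ by nothing more than spaces, underscores, or letter case. The helper names (`normalize`, `read_fasta_names`, `suggest_renames`) are made up for illustration and are not part of augur or the ncov workflow, and the Polish strain name in the example is invented.

```python
import re


def normalize(name: str) -> str:
    """Collapse spelling variants: drop spaces and underscores, lowercase."""
    return re.sub(r"[\s_]+", "", name).lower()


def read_fasta_names(lines):
    """Yield the sequence name from every FASTA header line (text after '>')."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def suggest_renames(include_names, fasta_names):
    """Map include-file names that are absent from the FASTA to the FASTA
    name that matches them after normalization (if there is one)."""
    fasta_names = list(fasta_names)
    exact = set(fasta_names)
    by_key = {normalize(n): n for n in fasta_names}
    renames = {}
    for name in include_names:
        if name in exact:
            continue  # spelled identically in both files, nothing to do
        match = by_key.get(normalize(name))
        if match is not None:
            renames[name] = match
    return renames


# Tiny in-memory example; the second strain name is invented.
fasta_lines = [
    ">CzechRepublic/IAB_10/2020", "ACGT",
    ">Poland/PL_1/2020", "ACGT",
]
include = ["Czech_Republic/IAB_10/2020", "Poland/PL_1/2020"]
renames = suggest_renames(include, read_fasta_names(fasta_lines))
print(renames)
```

For the real files, the header lines could be read with `lzma.open(path, "rt")` for the `.xz` alignment and a plain `open()` for the include list. Because the normalization also removes underscores inside sample IDs (so `IAB_10` and `IAB10` would be treated as the same), any suggested rename should be reviewed by eye before it is applied.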

Here is my build. It might also give you more information on why this happened.

genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]

inputs:
  - name: November-data
    sequences: /projects/p_cov2muta/data/sequences.fasta
    metadata: /projects/p_cov2muta/data/metadata_latin.tsv



builds:

  November_metadata:
    subsampling_scheme: custom_division
    region: Europe
    country: Germany
    division: Saxony

traits:
  default:
     sampling_bias_correction: 2.5
     columns: ["country","division"]

files:
  auspice_config: "my_profiles/Saxony/auspice_config.json"
  description: "my_profiles/example/my_description.md"

subsampling:
  custom_division:

    division:
      group_by: "year month"
      max_sequences: 10000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
    country:
      group_by: "division year month"
      max_sequences: 4400
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country!={country}' 'division={division}'"

    region:
      group_by: "country year month"
      max_sequences: 3000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"

    global:
      group_by: "country year month"
      max_sequences: 1100
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

Unfortunately I also ran into one more problem. I am interested in reconstructing ancestral nodes in a tree at the division level. During the run I got the following error:

ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.

I worked around this by editing the script and changing the limit from 300 to 999999999. I do not think that was the best way to solve the problem, but what is done is done.

When my tree was completed, I realized that something was wrong with the reconstruction. You can see this in the attached screenshot.


The probability that this node belongs to Germany is 0.55, but the division is North America with a probability of 0.95, which makes no sense.

Is there a way to reconstruct the division and not only countries?

I also have an idea for how to solve this problem, but I’m not sure it is a good approach. For the sequences I am interested in, I could replace the region (or country) names in the metadata file with the division names, and then specify in builds.yaml that I want to reconstruct regions. That way the reconstruction would contain only the divisions I am interested in, which might avoid the problem. However, I am afraid this would affect the final tree, and that in the end the reconstruction would not make sense at all.

Maybe I can specify the divisions that I want to reconstruct?
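
A variation on this idea that leaves the existing region and country columns untouched could be sketched as follows: keep only the divisions of interest as separate states and collapse everything else into a single "other" state, written to a new metadata column. This is a rough sketch under assumptions. The column name `division_focus` and the function `collapse_divisions` are hypothetical, the strain names in the example are invented, and whether the workflow accepts an extra column in the traits `columns` list has not been verified here.

```python
import csv
import io


def collapse_divisions(rows, keep, source="division",
                       target="division_focus", other="other"):
    """Return copies of the metadata rows with a new column `target` that
    keeps the divisions listed in `keep` and maps all others to `other`."""
    keep = set(keep)
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the original row is not modified
        value = row.get(source, "")
        new_row[target] = value if value in keep else other
        out.append(new_row)
    return out


# Invented three-row metadata table in the same TSV layout.
example_tsv = (
    "strain\tcountry\tdivision\n"
    "Germany/A/2020\tGermany\tSaxony\n"
    "Germany/B/2020\tGermany\tBavaria\n"
    "Poland/C/2020\tPoland\tMazovia\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
collapsed = collapse_divisions(rows, keep={"Saxony"})
states = sorted({row["division_focus"] for row in collapsed})
print(states)
```

This reduces the number of discrete states to the number of kept divisions plus one, well below the 300-state limit. It does not address how sensitive the reconstruction is to uneven sampling across states.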

Thank you in advance!

Best wishes,
Dmitrii

Hi Dmitrii,

Variation in spelling and in the handling of spaces and special characters is a continuous problem, and there is not much short-term help we can offer.

Regarding your other issue: it is true that with Unicode characters the ancestral reconstruction can handle more than 300 states, but the results are unlikely to be sensible, as you have discovered. Generally, this discrete trait reconstruction is very sensitive to sampling and to the degree to which the Markovian state transition model describes reality. Furthermore, the inferences for division and country happen completely independently, so discrepancies like the one you observed are possible. I would generally not assign a lot of significance to these ancestral reconstructions without carefully inspecting the data on which they are based.
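
One simple way to start inspecting the underlying data is to count how many distinct states a metadata column has and how many strains support each state. The sketch below is illustrative only: `state_counts` and `summarize` are made-up helper names, the limit of 300 is taken from the error message quoted above, and the `min_samples=5` threshold for calling a state sparse is an arbitrary assumption.

```python
from collections import Counter

STATE_LIMIT = 300  # from the TreeTime error message quoted in the question


def state_counts(rows, column):
    """Count how many strains carry each non-empty value of `column`."""
    return Counter(row[column] for row in rows if row.get(column))


def summarize(rows, column, min_samples=5):
    """Report the number of states, whether the limit from the error message
    is reached, and which states have fewer than `min_samples` strains."""
    counts = state_counts(rows, column)
    return {
        "n_states": len(counts),
        "over_limit": len(counts) >= STATE_LIMIT,
        "sparse_states": sorted(s for s, n in counts.items() if n < min_samples),
    }


# Invented example: six Saxony rows, two Bavaria rows, one row with no division.
rows = (
    [{"division": "Saxony"}] * 6
    + [{"division": "Bavaria"}] * 2
    + [{"division": ""}]
)
summary = summarize(rows, "division")
print(summary)
```

For a real metadata file, the rows could come from `csv.DictReader(handle, delimiter="\t")`. States supported by only a handful of strains are the ones whose inferred ancestral assignments deserve the most caution.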

best,
richard