ERROR: 300 or more distinct discrete states found

I am currently analyzing SARS-CoV-2 with a focus on Myanmar and am encountering the following error:

augur traits is using TreeTime version 0.8.5
ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.
[Wed Nov  1 03:28:48 2023]
Error in rule traits:
    jobid: 31
    output: results/Myanmar/traits.json
    log: logs/traits_Myanmar.txt (check log file(s) for error message)
    shell:
        
        augur traits             --tree results/Myanmar/tree.nwk             --metadata results/Myanmar/metadata_adjusted.tsv.xz             --output results/Myanmar/traits.json             --columns division             --confidence             --sampling-bias-correction 2.5 2>&1 | tee logs/traits_Myanmar.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/traits_Myanmar.txt:
augur traits is using TreeTime version 0.8.5
ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.


tree frequencies written to results/Myanmar/tip-frequencies.json
[Wed Nov  1 03:28:54 2023]
Finished job 33.
23 of 37 steps (62%) done
WARNING: supplied genes don't match the annotation
the following features are in the annotation by not supplied as genes: {'nuc'}
the following features are in the supplied as genes but not the annotation: set()

0.00    -TreeAnc: set-up

28.61   -SequenceData: loaded alignment.

28.62   -SeqData: making compressed alignment...

Here is my build:

builds:

  Myanmar:
    subsampling_scheme: switzerland 
    region: Asia
    country: Myanmar

    colors: "my_profiles/Myanmar/colors.tsv"
  
subsampling:
  switzerland:
    country:
      group_by: "division year month"
      max_sequences: 15000
      exclude: "--exclude-where 'country!={country}'"
    region:
      group_by: "country year month"
      seq_per_group: 40
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

  
      
files:
  colors: "my_profiles/Myanmar/colors.tsv"
traits:
  Myanmar:
    columns: ["division"]

I found a similar post but could not resolve it.
Can someone comment on what I am doing wrong here?

Hello @ryoya!

What you’re encountering is a limitation of TreeTime. It can’t infer traits when there are 300 or more distinct discrete states, which in this case are the values of the division column. That’s what the error says:

ERROR: 300 or more distinct discrete states found. 
  TreeTime is currently not set up to handle that many states.

You can try to reduce the number of divisions in your metadata.tsv. As a first step, it's worth checking how many there are, e.g. with tsv-summarize -H -g division --count metadata.tsv (the column is called division, as in your --columns argument).
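
If you don't have tsv-summarize at hand, here is a rough Python sketch of the same check, plus one possible way to reduce the number of states. It only uses the standard library and reads plain or .xz-compressed TSV files. The function names, the "Other" label and the cutoff of 299 are my own choices for illustration, not something augur or TreeTime provides. Also note that the metadata file may contain more rows than the tree has tips, so the count here may be higher than what augur traits actually sees.

```python
import csv
import lzma
from collections import Counter


def open_text(path):
    """Open a plain or .xz-compressed text file for reading."""
    if str(path).endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def count_states(path, column="division"):
    """Count how often each value of `column` occurs in a TSV file."""
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[column] for row in reader)


def collapse_rare_states(counts, max_states=299, other_label="Other"):
    """Map each state to itself, or to `other_label` if it is too rare.

    Keeps the (max_states - 1) most common states so that, together with
    `other_label`, at most `max_states` distinct values remain.
    The label and the cutoff are hypothetical choices for this sketch.
    """
    if len(counts) <= max_states:
        return {state: state for state in counts}
    keep = {state for state, _ in counts.most_common(max_states - 1)}
    return {state: (state if state in keep else other_label) for state in counts}


if __name__ == "__main__":
    # Path taken from the log above; adjust to your own build.
    counts = count_states("results/Myanmar/metadata_adjusted.tsv.xz", "division")
    print(f"{len(counts)} distinct divisions")
    for state, n in counts.most_common(10):
        print(f"{n}\t{state}")
```

The mapping returned by collapse_rare_states could then be applied to the division column before running augur traits. Whether lumping rare divisions together is acceptable depends on what you want to show in the build, so treat this as a starting point only.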

Alternatively, you can disable the traits rule altogether, then the build should work out ok.

It looks like you’re running a version of the ncov-simple workflow here. That one is still quite complex/complicated. You might want to start with a simple workflow that you understand from beginning to end. There’s a tutorial here for Zika, but this can be adjusted to work with SARS-CoV-2: Creating a pathogen workflow — Nextstrain documentation

Lastly, it looks like you’re on TreeTime version 0.8.5, which is quite old. TreeTime is on 0.11.1 now! There's nothing wrong with using the old version, but the new one might be slightly faster, less buggy, etc. The error you get would still be the same, though.

I hope this helps!

Best,

Cornelius