Repeated runs of the same data will give different results

Repeated runs of the same data will give different results.
I have run the same data three times, but the results were different, including root age, root country and evolutionary rate. Could you tell me how can I get a stable and credible result?

Hi @GenLi,

I assume you are referring to the output of augur refine. This documentation page may help, notably the --seed parameter which should help with reproducibility of results.

If you are using a pre-defined pathogen workflow (e.g. SARS-CoV-2 workflow), note that not all Augur parameters are available as Snakemake parameters, and the workflow file may have to be modified.

Best,
Victor

Hi - I used “–seed 20” in augur refine, and did two runs on the same data, but got different results.

Are there other places I need to use seed (it’s basically an edit of the zika-tutorial Snakefile)

1 Like

Hi @mboursnell, thanks for reporting this issue! Could you please add a bit more details so we can try to reproduce?

In particular:

  • How are you invoking the build? Are you using the nextstrain cli, if so which runtime are you using, or are you using your own conda/mamba environment?
  • What’s full augur refine command that your run?
  • What’s the output of:
augur version
treetime version
  • When you reran things, did you just rerun augur refine or also other parts of the workflow? You might need to set the seed in other augur commands as well to make the whole workflow reproducible.

There was a PR that changed how seeds are handled in refine, but this shouldn’t affect end users: Update TreeTime by rneher · Pull Request #1207 · nextstrain/augur · GitHub

Best wishes,

Cornelius

Hi. I was running a Snakefile, in the terminal on a Mac.
I re-ran the whole Snakefile (deleted all the results files first)

Versions are:

augur 22.0.3
treetime 0.10.1

I can’t see how to attach the Snakefile, so I’ll paste it here. Hope that’s OK.

rule all:
    input:
        auspice_json = "auspice/56_files_Percy_DNA_20h_root_oldest_seed.json",

input_fasta = "data/55_files_Percy_DNA.fasta",
input_metadata = "data/song_trees_metadata.tsv",
dropped_strains = "config/dropped_strains.txt",
reference = "config/percy_outgroup.gb",
colors = "config/colors.tsv",
lat_longs = "config/lat_longs.tsv",
auspice_config = "config/auspice_config.json"

rule index_sequences:
    message:
        """
        Creating an index of sequence composition for filtering (Song Trees).
        """
    input:
        sequences = input_fasta
    output:
        sequence_index = "results/55_files_Percy_index.tsv"
    shell:
        """
        augur index \
            --sequences {input.sequences} \
            --output {output.sequence_index}
        """

rule filter:
    message:
        """
        Filtering to
          - {params.sequences_per_group} sequence(s) per {params.group_by!s}
          - from {params.min_date} onwards
          - excluding strains in {input.exclude}
        """
    input:
        sequences = input_fasta,
        sequence_index = "results/55_files_Percy_index.tsv",
        metadata = input_metadata,
        exclude = dropped_strains
    output:
        sequences = "results/55_files_Percy_filtered.fasta"
    params:
        group_by = "country division location year",
        sequences_per_group = 100,
        min_date = 1700
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --sequence-index {input.sequence_index} \
            --metadata {input.metadata} \
            --exclude {input.exclude} \
            --output {output.sequences} \
            --group-by {params.group_by} \
            --sequences-per-group {params.sequences_per_group} \
            --min-date {params.min_date} \
            --subsample-seed 20
        """

rule align:
    message:
        """
        Aligning sequences to {input.reference}
          - filling gaps with N
        """
    input:
        sequences = rules.filter.output.sequences,
        reference = reference
    output:
        alignment = "results/55_files_Percy_aligned.fasta"
    shell:
        """
        augur align \
            --sequences {input.sequences} \
            --reference-sequence {input.reference} \
            --output {output.alignment} \
            --fill-gaps
        """

rule tree:
    message: "Building raw tree (Song Trees)"
    input:
        alignment = rules.align.output.alignment
    output:
        tree = "results/55_files_Percy_tree_raw.nwk"
    shell:
        """
        augur tree \
            --alignment {input.alignment} \
            --output {output.tree}
        """

rule refine:
    message:
        """
        Refining tree
          - estimate timetree
          - use {params.coalescent} coalescent timescale
          - estimate {params.date_inference} node dates
          - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation
        """
    input:
        tree = rules.tree.output.tree,
        alignment = rules.align.output,
        metadata = input_metadata
    output:
        tree = "results/55_files_Percy_tree.nwk",
        node_data = "results/55_files_Percy_tree_branch_lengths.json"
    params:
        coalescent = "opt",
        date_inference = "marginal",
        clock_filter_iqd = 4
    shell:
        """
        augur refine \
            --tree {input.tree} \
            --alignment {input.alignment} \
            --metadata {input.metadata} \
            --output-tree {output.tree} \
            --output-node-data {output.node_data} \
            --timetree \
            --coalescent {params.coalescent} \
            --date-confidence \
            --date-inference {params.date_inference} \
            --clock-filter-iqd {params.clock_filter_iqd} \
            --root oldest \
            --seed 20
        """

rule ancestral:
    message: "Reconstructing ancestral sequences and mutations"
    input:
        tree = rules.refine.output.tree,
        alignment = rules.align.output
    output:
        node_data = "results/55_files_Percy_tree_nt_muts.json"
    params:
        inference = "joint"
    shell:
        """
        augur ancestral \
            --tree {input.tree} \
            --alignment {input.alignment} \
            --output-node-data {output.node_data} \
            --inference {params.inference}
        """

rule translate:
    message: "Translating amino acid sequences (Song Trees)"
    input:
        tree = rules.refine.output.tree,
        node_data = rules.ancestral.output.node_data,
        reference = reference
    output:
        node_data = "results/55_files_Percy_tree_aa_muts.json"
    shell:
        """
        augur translate \
            --tree {input.tree} \
            --ancestral-sequences {input.node_data} \
            --reference-sequence {input.reference} \
            --output-node-data {output.node_data} \
        """

rule traits:
    message: "Inferring ancestral traits for {params.columns!s}"
    input:
        tree = rules.refine.output.tree,
        metadata = input_metadata
    output:
        node_data = "results/55_files_Percy_tree_traits.json",
    params:
        columns = "date country division location"
    shell:
        """
        augur traits \
            --tree {input.tree} \
            --metadata {input.metadata} \
            --output-node-data {output.node_data} \
            --columns {params.columns} \
            --confidence
        """

rule export:
    message: "Exporting data files for for auspice (Song Trees) "
    input:
        tree = rules.refine.output.tree,
        metadata = input_metadata,
        branch_lengths = rules.refine.output.node_data,
        traits = rules.traits.output.node_data,
        nt_muts = rules.ancestral.output.node_data,
        aa_muts = rules.translate.output.node_data,
        colors = colors,
        lat_longs = lat_longs,
        auspice_config = auspice_config
    output:
        auspice_json = rules.all.input.auspice_json,
    shell:
        """
        augur export v2 \
            --tree {input.tree} \
            --metadata {input.metadata} \
            --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \
            --colors {input.colors} \
            --lat-longs {input.lat_longs} \
            --auspice-config {input.auspice_config} \
            --include-root-sequence \
            --output {output.auspice_json}
        """

rule clean:
    message: "Removing directories: {params}"
    params:
        "results ",
        "auspice"
    shell:
        "rm -rfv {params}"
1 Like

Thanks @mboursnell

It looks like you reran the whole workflow from scratch. There are multiple commands that include randomness, going through the command help message, apart from refine, the seed for filter can be fixed:

augur filter --subsample-seed 42 ...

Other commands that might be stochastic but where there seems to be no control over seeds are:

  • tree
  • traits
  • ancestral
  • align

Feel free to open a feature request here to add fixable seeds to those subcommands: new issue

Thanks. I had added “seed” to refine and “subsample-seed” to filter.

So if I want to avoid this do I have to keep the output files from each of these command? e.g. keep the xx_aligned.fasta file from align…

and then run the whole Snakefile again, and it will ignore the places where files exist?