Creating tree for genomic subregion(s)

Hi Nextstrain team,

I tried to create a tree for the first 2/3 of the genome only, i.e., ORF1a and ORF1b, but I was running into an error in terms of tree building or rule distance.

To accommodate the smaller sequences in the workflow, I tried the following in the builds.yaml file, but without success:

mask:
  mask_from_end: 8341   # to cover bp 1 - 21562 of the SARS-CoV-2 genome only (= ORF1a/b)

filter:
  min_length: 17000   # to allow shorter (partial) genomes
# or alternatively, I tried:
  skip_diagnostics: True

# Genes to display:
genes: ["ORF1a", "ORF1b"]

rule distances:
    params:
        genes = 'ORF1a',
        comparisons = ['root'],
        attribute_names = ['ORF1a_mutations']
# I tried modifying more in here, but nothing worked

Are trees for genomic subregions supported at all by Nextstrain?
If yes, I’d greatly appreciate hearing your tweaks.

Thank you!

Ralf

I can’t immediately tell where the problem may be.

Could you paste in the error message(s) you get?

The distance rule as parametrized in ncov assumes you’re trying to count Spike mutations. That’s not really applicable here in your case. So you should probably just disable it by commenting out this line: ncov/main_workflow.smk at fb985678b052809d7829f9e8e72e82250a22a142 · nextstrain/ncov · GitHub

Then Snakemake won’t ask for this rule to be run and so there won’t be an error.

The distances are just added as an extra coloring, you don’t loose anything important.

Is there an error regarding tree building, too?

Hi Cornelius,

When I ran the initial build:

mask:
  mask_from_end: 8341   # to cover bp 1 - 21562 of the SARS-CoV-2 genome only (= ORF1a/b)

filter:
  min_length: 17000   # to allow shorter (partial) genomes
# or alternatively, I tried:
  skip_diagnostics: True

I received the following error:


WARNING: more threads requested than there are sequences; falling back to IQ-TREE's `-nt AUTO` mode.

ERROR: Shell exited 2 when running: iqtree -nt AUTO -s results/NYU_rec_Delta-part_testthree/masked_filtered-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -ninit 10 -n 4 > results/NYU_rec_Delta-part_testthree/masked_filtered-delim.iqtree.log
Command output was:
  ERROR: Alignment must have at least 3 sequences

7 masking sites read from defaults/sites_ignored_for_tree_topology.txt
Building a tree via:
        iqtree -nt AUTO -s results/NYU_rec_Delta-part_testthree/masked_filtered-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -ninit 10 -n 4 > results/NYU_rec_Delta-part_testthree/masked_filtered-delim.iqtree.log
        Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
        Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

ERROR: TREE BUILDING FAILED
Please see the log file for more details: results/NYU_rec_Delta-part_testthree/masked_filtered-delim.iqtree.log

Building original tree took 0.008130550384521484 seconds
[Tue Mar 22 04:34:26 2022]
Error in rule tree:
    jobid: 21
    output: results/NYU_rec_Delta-part_testthree/tree_raw.nwk
    log: logs/tree_NYU_rec_Delta-part_testthree.txt (check log file(s) for error message)
    shell:

        augur tree             --alignment results/NYU_rec_Delta-part_testthree/filtered.fasta             --tree-builder-args '-ninit 10 -n 4'             --exclude-sites defaults/sites_ignored_for_tree_topology.txt             --output results/NYU_rec_Delta-part_testthree/tree_raw.nwk             --nthreads 4 2>&1 | tee logs/tree_NYU_rec_Delta-part_testthree.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Based on your suggestion, I tried to include a new snakemake rule by:

  1. creating a my_rule.smk.txt file:

def _get_node_data_by_wildcards(wildcards):
    """Return a list of node data files to include for a given build's wildcards.
    """
    # Define inputs shared by all builds.
    wildcards_dict = dict(wildcards)
    inputs = [
        rules.refine.output.node_data,
        rules.ancestral.output.node_data,
        rules.translate.output.node_data,
        rules.rename_emerging_lineages.output.clade_data,
        rules.clades.output.clade_data,
        rules.recency.output.node_data,
        rules.traits.output.node_data,
        rules.logistic_growth.output.node_data,
        rules.mutational_fitness.output.node_data,
#        rules.distances.output.node_data,
        rules.calculate_epiweeks.output.node_data,
    ]

    if "run_pangolin" in config and config["run_pangolin"]:
        inputs.append(rules.make_pangolin_node_data.output.node_data)

    # Convert input files from wildcard strings to real file names.
    inputs = [input_file.format(**wildcards_dict) for input_file in inputs]

    return inputs

…where I removed the rules.distances line as you suggested (not sure I did it the right way), and also referring to the my_rules.smk.txt file in the builds.yaml file:


custom_rules:
  - my_profiles/recomb/my_rules.smk.txt

but I received the following error message:


Error in rule tree:
    jobid: 21
    output: results/NYU_rec_Delta-part_testfour/tree_raw.nwk
    log: logs/tree_NYU_rec_Delta-part_testfour.txt (check log file(s) for error message)
    shell:

        augur tree             --alignment results/NYU_rec_Delta-part_testfour/filtered.fasta             --tree-builder-args '-ninit 10 -n 4'             --exclude-sites defaults/sites_ignored_for_tree_topology.txt             --output results/NYU_rec_Delta-part_testfour/tree_raw.nwk             --nthreads 4 2>&1 | tee logs/tree_NYU_rec_Delta-part_testfour.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Any thoughts?

Thank you!!

1 Like

It sounds like you didn’t pass more than 2 sequences to augur tree.

Can you check what’s in the alignment file filtered.fasta? Maybe that’s nearly empty?

That would cause the error you’re seeing. Can’t build a tree with no sequences!

Hi Cornelius,

That’s exactly the problem. Only the Wuhan reference sequences remain after filtering and I don’t know how to lift the diverse filters when running subgenomic trees.

Do you have a set of commands I can add to the builds.yaml file to make this work?

Many thanks for your dedication to looking into this.