Installation Issue

I installed and updated the Nextstrain Docker container today and ran into this issue when I began a build looking at Delta sequences in Florida:

(nextstrain) gvestal@grants-mbp ncov % nextstrain build . --cores 4 --use-conda \
  --configfile ./my_profiles/fl_delta/builds.yaml
Building DAG of jobs…
CreateCondaEnvironmentException:
The 'conda' command is not available in the shell /bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with conda activate base.
File "/usr/local/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 232, in create
File "/usr/local/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 343, in __new__
File "/usr/local/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 356, in __init__
File "/usr/local/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 396, in _check

I’ve reinstalled, but still get the same issue. I’m sure the issue is on my end, but if anyone could give some guidance, I would appreciate it.

Hi @gvestal, the current Nextstrain container doesn’t support conda, so I think this can be fixed by dropping the --use-conda argument (which is being passed to Snakemake running in the container).

nextstrain build --docker . --cores 4 --configfile ./my_profiles/fl_delta/builds.yaml
# P.S. the --docker argument isn't needed if docker is the default environment, 
# run `nextstrain check-setup` to see the default

If you want to use conda to manage dependencies etc, then we can also run nextstrain “natively” (i.e. not within a container) via:

# ensure conda & snakemake are available in the current environment
nextstrain build --native . --cores 4 --use-conda --configfile ./my_profiles/fl_delta/builds.yaml
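
Before a native run, it can help to confirm that both tools are actually visible to the shell, since that is what the error message above complains about. A minimal sketch, assuming a bash-compatible shell; the helper name `require_tool` is made up for illustration and is not part of Nextstrain or Snakemake:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command can be found on PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the two tools a native build with --use-conda relies on.
# "|| true" keeps the loop going so both results are reported.
for tool in conda snakemake; do
  require_tool "$tool" || true
done
```

If either line says "missing", activating the relevant conda environment first (as the error message suggests) is the likely next step.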

@james

Thank you for the clarification. I managed to get a build done, but I had a follow-up question about subsampling. Below is the subsampling scheme we used for the FL Delta build. We uploaded our sequences to UShER, found the nearest neighbor sequences, combined both sequence sets and are attempting to build a tree to determine the transmission of Delta into FL. The subsampling scheme worked, but is there an optimal way to create a subsampling scheme for this? Or, as I suspect, is Nextstrain really not ideal for building that? Any feedback would be appreciated!

# data
inputs:
  - name: data
    metadata: data/metadata.tsv
    sequences: data/sequences.fasta
use_nextalign: true
#my_profiles/fl_delta/builds.yaml
builds:
  fl_delta:
    subsampling_scheme: delta
    region: North America
    country: USA
    division: Florida
#Delta Subsampling for sequences
subsampling:
  delta:
    division:
      group_by: "year month"
      max_sequences: 1000
      exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
    country:
      group_by: "year month"
      max_sequences: 1000
      exclude: "--exclude-where 'division={division}'"
      priorities:
        type: "proximity"
        focus: "division"
    region:
      group_by: "country year month"
      max_sequences: 1000
      exclude: "--exclude-where 'country={country}'"
      priorities:
        type: "proximity"
        focus: "division"
    global:
      group_by: "region year month"
      max_sequences: 500
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "division"
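
As a rough sanity check on a scheme like this one, the per-rule sequence caps can be summed to get an upper bound on how many sequences subsampling can return (the real number may be lower once filters are applied). A minimal sketch using awk; the function name `max_total_sequences` is hypothetical, and the parsing is a plain text match rather than a real YAML parser:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum every max_sequences value found in a config file.
# Both the underscore and hyphen spellings of the key are matched.
max_total_sequences() {
  awk -F: '/^[ \t]*max[_-]sequences[ \t]*:/ { total += $2 }
           END { print total + 0 }' "$1"
}

# Path taken from the build command earlier in the thread.
config="my_profiles/fl_delta/builds.yaml"
if [ -f "$config" ]; then
  echo "upper bound: $(max_total_sequences "$config")"
fi
```

For the four rules listed above, the caps add up to 1000 + 1000 + 1000 + 500 = 3500 sequences.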

We hope that Nextstrain can be used for workflows such as these, and are excited to hear about your plan.

We uploaded our sequences to UShER, found the nearest neighbor sequences, combined both sequence sets and are attempting to build a tree to determine the transmission of Delta into FL.

If you want to use all of the nearest neighbors (and your data) provided by UShER, and assuming data/sequences.fasta represents this, then you can build a tree by using a dummy subsampling scheme which essentially doesn’t do any subsampling:

subsampling:
  delta:
    division:
      group_by: "year"
      seq_per_group: 10000000 # i.e. no subsampling

If you want to combine the above dataset with some other contextual sequences, or if the above dataset is too large and you need to reduce it via subsampling, then let me know and I’ll try to help.
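
To judge whether the dataset is small enough to build without subsampling, it may help to count the input records first. A minimal sketch, assuming the input paths from the config above and a metadata TSV with a single header line; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Count FASTA records by counting header lines (those starting with '>').
# grep exits non-zero when the count is 0, so "|| true" keeps scripts going.
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# Count metadata rows, skipping the single header line of the TSV.
count_metadata_rows() {
  tail -n +2 "$1" | grep -c '' || true
}

# Only report when both input files are present.
if [ -f data/sequences.fasta ] && [ -f data/metadata.tsv ]; then
  echo "sequences:     $(count_fasta_records data/sequences.fasta)"
  echo "metadata rows: $(count_metadata_rows data/metadata.tsv)"
fi
```

If the two counts differ, some sequences probably lack metadata (or the reverse), which is worth resolving before the build.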

Adding to @james’s response above, you can also skip subsampling by omitting the subsampling_scheme from your build definition or setting the value to all.

Either of the following examples will allow you to effectively skip subsampling and use all sequences defined in your inputs (that also pass your standard filters).

builds:
  fl_delta:
    region: North America
    country: USA
    division: Florida

Or with an explicit subsampling scheme:

builds:
  fl_delta:
    subsampling_scheme: all
    region: North America
    country: USA
    division: Florida