Error with a flu reference sequence for alignment

I am running in Docker on Windows and I have run through the tutorials, which worked fine. Now I am trying to run Nextstrain with a collection of flu sequences and I’m getting stuck on this command (as in the tutorial):

augur align
–sequences results/filtered.fasta
–reference-sequence config/flu_outgroup.gb
–output results/aligned.fasta
–fill-gaps

I have a file flu_outgroup.gb from Genbank, but I consistently get the following error: ERROR: Cannot read reference sequence. make sure the file config/flu_outgroup.gb contains one sequence in genbank or fasta format. I’ve read through this documentation and I think my reference looks okay based on that guidance. I’ve tried it with another reference sequence in FASTA format and get the same result. Can you provide any insight into why this reference sequence isn’t working?

flu_outgroup.gb (30.8 KB)

Thank you.

Hi @maryj, thank you for sharing your reference GenBank file! It looks like the issue is that there are multiple loci in your GenBank file (one per gene segment) and augur align expects a single reference sequence. In this case, the augur error is not entirely accurate, since augur can read the GenBank file but it doesn’t know which reference sequence to use.

Our seasonal flu phylogenetic workflows usually produce a single phylogeny per gene segment. For example, we would make a HA-specific tree using a reference sequence like the one I’ve attached here which only includes “LOCUS 4” from your original file. Can you try running the augur align command above with this reference file and your HA input sequences and let us know if that works?

flu_outgroup_ha.gb (4.0 KB)

For more examples of how we run our seasonal flu analyses, you may want to try out “example build” for the seasonal-flu workflow and the quickstart guide for GISAID data. This seasonal flu repository has other useful configuration files like lineage- and segment-specific reference FASTA and gene map files and lineage-specific outliers.

For our own reference on the Nextstrain team, I created a GitHub issue about this potentially confusing error message from augur align.

Thanks for this explanation - I wondered if that might be the case. Using the HA reference you provided does seem to work. To clarify, does this mean I should use multiple reference sequences to make multiple, locus-specific trees with influenza data?

That’s right, you’ll need one reference sequence file per locus so you can make one tree per locus. You can see this pattern in the configuration file for the quickstart GISAID analysis where the path to the reference FASTA is parameterized by the {segment} value. For each of the gene segments defined in the configuration file, the workflow will look for the corresponding reference file path (e.g., config/h3n2/ha/reference.fasta for H3N2 HA).

Most steps in our seasonal flu workflow are parameterized to run per segment within each “build” defined in the build configuration file like the one linked to above. For example, we produce one alignment FASTA per build and gene segment.

To run the workflow with all 8 gene segments, you’d modify the configuration file linked above such that the segments list looks like this:

segments:
  - ha
  - na
  - pa
  - pb1
  - pb2
  - np
  - ns
  - mp

And you’d update the reference and annotation entries in the build configuration file to point to corresponding FASTA and GFF files per gene segment. If you were to use the existing Nextstrain seasonal flu workflow, you’d need to convert your GenBank file to these separate FASTA and GFF files or you could use one of the existing references in the config/ directory of the seasonal-flu repository.

Thank you for this information! But now I’m hopelessly lost trying to make sense of the configuration files and Snakemake rules in the seasonal flu workflow. I can’t even get the example build to run. When I run this:

nextstrain build .  --configfile profiles/ci/builds.yaml

I get the following error message that I can’t decipher:

Error in rule select_titers:
    jobid: 36
    input: builds/ci_build/strains.txt, example_data/cdc_h3n2_cell_fra_titers.tsv
    output: builds/ci_build/titers/cdc_cell_fra.tsv
    log: logs/select_titers_ci_build_cdc_cell_fra.txt (check log file(s) for error details)
    conda-env: /nextstrain/build/.snakemake/conda/dae81987b7ec4d339b11532c7759e4d4_
    shell:

        head -n 1 example_data/cdc_h3n2_cell_fra_titers.tsv > builds/ci_build/titers/cdc_cell_fra.tsv;
        tsv-join             --key-fields 1             --filter-file builds/ci_build/strains.txt             example_data/cdc_h3n2_cell_fra_titers.tsv >> builds/ci_build/titers/cdc_cell_fra.tsv 2> logs/select_titers_ci_build_cdc_cell_fra.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Can you please help me make sense of this?

I’m sorry about that, @maryj! Can you share the contents of the log file (logs/select_titers_ci_build_cdc_cell_fra.txt)? This file should include details about the nature of the error in the workflow.

It may also be useful to work through this synchronously through a Zoom discussion, if the error message in that file doesn’t explain what’s happening and that’s something you are available for. We host weekly office hours on Thursdays from 10-11am Pacific Time, that might be a good venue for continuing this troubleshooting session.

Thanks, I think office hours might be helpful! The contents of the log file are here:

Error [tsv-join]: Windows/DOS line ending found. Convert file to Unix newlines before processing (e.g. 'dos2unix').
  File: example_data/cdc_h3n2_cell_fra_titers.tsv, Line: 1

Thanks! That was a helpful error message. I just made a couple of changes to the seasonal flu repository that should fix that issue and also help avoid it in the future. You can download the most recent version of the workflow by running git pull in the seasonal-flu directory.

The example build we were recommending in the README for that repository is not really the ideal starting point for most users, so I’ve updated the README to reference a simpler example build. You can run this build like so:

nextstrain build .  --configfile profiles/example/builds.yaml

And then you can run nextstrain view auspice when the workflow completes.

Can you try these commands with the latest repository contents, @maryj, and let us know if that works as expected?