Augur align insertions output - clarification

Hi,
I am using augur align for multiple sequence alignment,
and I am not sure how to read the format of the insertions CSV output file.

Each column that represents an insertion has a title of the form: Xbp @ ref pos Y
for example:
insertion: 2706bp @ ref pos 1968.

What does the Xbp stand for?
In some cases it matches the nucleotide fragment length that was inserted in that sample, but in other cases it does not.

Thank you in advance,
Dana.

Hi Dana,

Could you provide more detail? How exactly are you running this? I thought the X bp should match the length of the insertion.

best,
richard

I am running augur align on a FASTA file that contains multiple consensus sequences.
I run the following command:

augur align \
--sequences not_aligned.fasta \
--reference-sequence REF_NC_045512.2.fasta \
--output aligned.fasta

The alignment output file looks fine.
Example of the insertions CSV output:
(image)

Thank you (:

Hi @dana. I believe this is because we remove “-”, “N” and “?” characters from the insertion before reporting it, so the sequence shown for a sample can be shorter than the X bp in the column title. In this case it looks like the 739bp insertion is largely due to missing data.
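
To make the length mismatch concrete, here is a minimal Python sketch. It is not augur's actual code: the helper names are hypothetical, and it assumes the X in the column title counts the full width of the insertion while the reported sequence has “-”, “N” and “?” stripped out.

```python
import re

# Characters that, per the explanation above, are dropped before reporting.
MISSING_CHARS = "-N?"

def parse_header(header):
    """Split a title like 'insertion: 2706bp @ ref pos 1968' into (length, ref_pos)."""
    match = re.search(r"(\d+)bp @ ref pos (\d+)", header)
    if match is None:
        raise ValueError(f"unrecognised insertion header: {header!r}")
    return int(match.group(1)), int(match.group(2))

def strip_missing(raw_insertion):
    """Remove '-', 'N' and '?' from a raw insertion (hypothetical helper)."""
    return raw_insertion.translate(str.maketrans("", "", MISSING_CHARS))

# Made-up raw insertion: 19 characters wide, but only 6 informative bases.
raw = "ACGT" + "N" * 10 + "--" + "GG?"
reported = strip_missing(raw)
```

With this made-up input, `len(raw)` is 19 but `len(reported)` is 6, which is the kind of gap between the title and the reported fragment described in the question.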

I don’t think we have an easy way for augur align to produce an alignment which does not remove insertions, but you could try running the following and then examining the alignment file itself to see exactly what the insertion in strain “18925” is:

mafft --reorder --anysymbol --nomemsave --adjustdirection --thread <num_threads> <input_fasta> > <output_fasta>
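
Once that mafft alignment exists, one way to examine it is to collect the columns where the reference row is a gap and look at what the strain has there. This is only a sketch under assumptions: the function names are hypothetical, the FASTA reader is deliberately minimal, and the reference record name in the usage note is a guess based on the reference file name in this thread.

```python
def read_fasta(path):
    """Minimal FASTA reader returning {record_name: sequence}."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # keep the first word of the header
                records[name] = []
            elif name is not None and line:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def insertions_vs_reference(ref_aln, sample_aln):
    """Return (ref_pos, raw_insertion) for each run of gaps in the aligned reference.

    ref_pos is the number of reference bases that precede the insertion.
    raw_insertion is the sample's characters in those columns, unfiltered.
    """
    found, run, ref_pos = [], [], 0
    for ref_char, sample_char in zip(ref_aln, sample_aln):
        if ref_char == "-":
            run.append(sample_char)  # column exists only because of an insertion
        else:
            if run:
                found.append((ref_pos, "".join(run)))
                run = []
            ref_pos += 1
    if run:  # insertion at the very end of the alignment
        found.append((ref_pos, "".join(run)))
    return found
```

Usage would look something like `aln = read_fasta("<output_fasta>")` followed by `insertions_vs_reference(aln["NC_045512.2"], aln["18925"])`, with the record names adjusted to whatever actually appears in the file. Printing each raw insertion shows how much of it is “-”, “N” or “?”.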

You might also be interested in nextalign, which does reference alignments (and translations). We are using it for large SARS-CoV-2 alignments, and it reports insertions in a similar way. See here for details:

https://docs.nextstrain.org/projects/nextclade/en/latest/user/nextalign-cli.html