Aligned nucleotide sequence does not match the CDS translation

Hello everyone,

First of all, I would like to thank your team for your work and contributions to the scientific community.

I am currently facing the following issue: I have aligned sequences of a gene (UL73) from the Human Cytomegalovirus using the CLI tool Nextclade. I provided the GFF3 annotation file and specified the CDS of interest (see the command below):

nextclade.exe run --verbose
--in-order
--input-ref data/Human_betaherpesvirus_5/UL73/FJ527563.fasta
--input-annotation=data/Human_betaherpesvirus_5/UL73/FJ527563.gff3
--cds-selection=ACL51142.1
--output-all data/temp/
data/Human_betaherpesvirus_5/UL73/UL73_nucleotide_sequences.fasta

However, when I check the output file nextclade.aligned.fasta, I notice that it does not match the alignment of the CDS translation. Below is a specific example to illustrate the issue.

AF390802.1|Human_betaherpesvirus_5|UL73|3b
MEWKTRVLSFLVLSVAVGSYGNSSSTSTSASTX-XXSSVSTVKSTTSVTTSTTTTTTTTL--TSTKPGSTTHNPNVMKRHDHDDFYNAHCTSHMYELSLSSFAAWWTMLNALILMGAFCIVLRHCCFQNFTATTTKGY*

AF390802.1|Human_betaherpesvirus_5|UL73|3b
ATGGAGTGGAAAACACGAGTACTAAGTTTTTTGGTTTTATCGGTGGCGGTAGGGAGTTATGGTAACAGCTCATCTACGTCAACCTCTGCAAGTACACCG-AGTCCTCTTCTAGTGTATCAACGGTAAAATCGACTACCAGCGTAACAACCTCCACAACACCTACGACGACCACAACCACATTAACAAGT---ACTAAACCAGGTTCTACCACTCACAACCCTAATGTGATGAAACGACACGATCACGATGATTTTTACAATGCACATTGCACATCGCATATGTATGAACTCTCACTGTCCAGCTTTGCAGCCTGGTGGACTATGCTCAATGCTCTCATTCTGATGGGAGCTTTTTGTATCGTACTACGACATTGCTGCTTCCAGAACTTTACTGCAACCACCACCAAAGGCTATTGA

When the sequence (from nextclade.aligned.fasta) is manually translated, it gives the following:
MEWKTRVLSFLVLSVAVGSYGNSSSTSTSASTPSPLLVYQR.NRLPA.QPPQHLRRPQPH.QVLNQVLPLTTLM…NDTITMIFTMHIAHRICMNSHCPALQPGGLCSMLSF.WELFVSYYDIAASRTLLQPPPKAI

I do not receive any errors or warnings. Is it normal that the aligned nucleotide sequence does not match the CDS translation? Are correction mechanisms applied after alignment?

My primary goal is to retrieve the nucleotide sequence corresponding to the translation. Perhaps I can reconstruct it using the information contained in the nextclade.json file?

I remain available if you need additional information.

Best regards,

Dylan.

This is probably because the nucleotide sequence in nextclade.aligned.fasta is the query aligned to the reference sequence, with insertions relative to the reference stripped out. When a stripped insertion has a length that is not a multiple of three, the remaining sequence no longer reads in the same frame as the original, so translating the aligned nucleotides by hand can give a frame-shifted protein. For this reason, Nextclade provides separate amino acid alignments, again relative to the reference.
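To make the effect concrete, here is a small Python sketch on a made-up toy sequence (not UL73, and not Nextclade's own code). It assumes the standard genetic code and that a manual translation simply drops the gap characters, as in the example above. The names translate, raw_query and aligned are only illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc):
    """Translate in frame 0 after dropping gaps; unknown codons become X."""
    nuc = nuc.replace("-", "").upper()
    return "".join(
        CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, len(nuc) - 2, 3)
    )


# Toy query: relative to a toy reference ATGGCCAAGTTTGGCTAA it has a
# 1-nt deletion (reference position 6) and a 1-nt insertion (after position 9).
raw_query = "ATGGCAAGTTTTGGCTAA"

# The same query as it would look in a reference-coordinate alignment:
# the deletion is a gap, the inserted T has been stripped.
aligned = "ATGGC-AAGTTTGGCTAA"

print(translate(raw_query))  # MASFG*
print(translate(aligned))    # MASLA
```

The two translations agree until the point where the stripped insertion would have been, and diverge after it.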

If you want to reconstruct the full pairwise alignment, you can look at the insertions column of the output table nextclade.tsv and "put them back" into the aligned sequence.
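A minimal Python sketch of "putting them back" could look like the following. It rests on two assumptions that you should check against the Nextclade documentation for your version: that the insertions field is a comma-separated list of position:sequence entries, and that position is the 1-based reference position after which the fragment was inserted. The function names parse_insertions and reinsert are hypothetical helpers, not part of Nextclade.

```python
def parse_insertions(field):
    """Parse a string like '9:T,120:ACG' into a list of (position, sequence).

    ASSUMPTION: comma-separated 'pos:seq' entries; verify against your TSV.
    """
    items = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # empty field means no insertions
        pos, seq = entry.split(":", 1)
        items.append((int(pos), seq))
    return items


def reinsert(aligned, insertions_field):
    """Return the aligned sequence with the stripped insertions restored.

    ASSUMPTION: 'pos' is the 1-based reference position after which the
    fragment sits, so it equals the 0-based index in the aligned string
    (which has the same length as the reference) to insert at.
    """
    seq = aligned
    # Work from the rightmost insertion so earlier indices stay valid.
    for pos, ins in sorted(parse_insertions(insertions_field), reverse=True):
        seq = seq[:pos] + ins + seq[pos:]
    return seq


# Toy example (made-up data): one T inserted after reference position 9.
restored = reinsert("ATGGC-AAGTTTGGCTAA", "9:T")
print(restored)                   # ATGGC-AAGTTTTGGCTAA
print(restored.replace("-", ""))  # ATGGCAAGTTTTGGCTAA
```

Removing the gap characters from the restored sequence gives back the original query nucleotides, which can then be translated in frame. To get a true pairwise alignment, the reference would need gaps added at the same positions.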