Aligned nucleotide sequence does not match the CDS translation

Dylan_Lebatteux · January 8, 2025, 6:55pm

Hello everyone,

First of all, I would like to thank your team for your work and contributions to the scientific community.

I am currently facing the following issue: I have aligned sequences of a gene (UL73) from the Human Cytomegalovirus using the CLI tool Nextclade. I provided the GFF3 annotation file and specified the CDS of interest (see the command below):

nextclade.exe run --verbose
--in-order
--input-ref data/Human_betaherpesvirus_5/UL73/FJ527563.fasta
--input-annotation=data/Human_betaherpesvirus_5/UL73/FJ527563.gff3
--cds-selection=ACL51142.1
--output-all data/temp/
data/Human_betaherpesvirus_5/UL73/UL73_nucleotide_sequences.fasta

However, when I check the output file nextclade.aligned.fasta, I notice that it does not match the alignment of the CDS translation. Below is a specific example to illustrate the issue.

AF390802.1|Human_betaherpesvirus_5|UL73|3b
MEWKTRVLSFLVLSVAVGSYGNSSSTSTSASTX-XXSSVSTVKSTTSVTTSTTTTTTTTL--TSTKPGSTTHNPNVMKRHDHDDFYNAHCTSHMYELSLSSFAAWWTMLNALILMGAFCIVLRHCCFQNFTATTTKGY*

AF390802.1|Human_betaherpesvirus_5|UL73|3b
ATGGAGTGGAAAACACGAGTACTAAGTTTTTTGGTTTTATCGGTGGCGGTAGGGAGTTATGGTAACAGCTCATCTACGTCAACCTCTGCAAGTACACCG-AGTCCTCTTCTAGTGTATCAACGGTAAAATCGACTACCAGCGTAACAACCTCCACAACACCTACGACGACCACAACCACATTAACAAGT---ACTAAACCAGGTTCTACCACTCACAACCCTAATGTGATGAAACGACACGATCACGATGATTTTTACAATGCACATTGCACATCGCATATGTATGAACTCTCACTGTCCAGCTTTGCAGCCTGGTGGACTATGCTCAATGCTCTCATTCTGATGGGAGCTTTTTGTATCGTACTACGACATTGCTGCTTCCAGAACTTTACTGCAACCACCACCAAAGGCTATTGA

When the sequence (from nextclade.aligned.fasta) is manually translated, it gives the following:
MEWKTRVLSFLVLSVAVGSYGNSSSTSTSASTPSPLLVYQR.NRLPA.QPPQHLRRPQPH.QVLNQVLPLTTLM…NDTITMIFTMHIAHRICMNSHCPALQPGGLCSMLSF.WELFVSYYDIAASRTLLQPPPKAI

I do not receive any errors or warnings. Is it normal that the aligned nucleotide sequence does not match the CDS translation? Are correction mechanisms applied after alignment?

My primary goal is to retrieve the nucleotide sequence corresponding to the translation. Perhaps I can reconstruct it using the information contained in the nextclade.json file?

I remain available if you need additional information.

Best regards,

Dylan.

rneher · January 9, 2025, 9:27pm

This is probably because the nucleotide sequence in nextclade.aligned.fasta is the query aligned to the reference sequence, with insertions relative to the reference stripped out. Translating that sequence directly can therefore give a frame-shifted result. For this reason, Nextclade provides separate amino acid alignments, again relative to the reference.
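To make the effect concrete, here is a minimal Python sketch using toy sequences invented for illustration (they are not taken from the UL73 data). It assumes a query that carries a single-base insertion relative to the reference. Once that insertion is stripped, the aligned sequence translates like the reference, while the original query translates in a shifted frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc: str) -> str:
    """Translate in frame 1; incomplete trailing codons are dropped,
    codons containing anything other than A/C/G/T become 'X'."""
    nuc = nuc.upper()
    codons = (nuc[i:i + 3] for i in range(0, len(nuc) - len(nuc) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy data (hypothetical): the query has one extra "C" after reference position 6.
reference = "ATGGCTAAAGGTTGA"
query = "ATGGCTCAAAGGTTGA"
aligned_query = "ATGGCTAAAGGTTGA"  # the query with the insertion stripped out

print(translate(reference))      # reference protein
print(translate(aligned_query))  # same as the reference: the insertion is gone
print(translate(query))          # frame-shifted downstream of the insertion
```

The same reasoning applies in the other direction: removing gap characters from the aligned sequence before translating also moves every downstream codon boundary.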

If you want to reconstruct the full pairwise alignment, you can look at the insertions column of the output table nextclade.tsv and “put them back” into the aligned sequence.
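A rough Python sketch of "putting them back" could look like the following. It rests on assumptions that should be checked against your own nextclade.tsv: that the insertions field holds comma-separated `position:sequence` entries, and that each position is the 1-based reference position after which the fragment was inserted. The helper names `parse_insertions` and `reinsert` are made up for this example and are not part of Nextclade.

```python
def parse_insertions(field: str) -> list[tuple[int, str]]:
    """Parse 'pos:SEQ,pos:SEQ' into [(pos, seq), ...]; an empty field gives []."""
    items = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        pos, seq = entry.split(":")
        items.append((int(pos), seq))
    return items


def reinsert(aligned: str, insertions: list[tuple[int, str]]) -> str:
    """Re-insert fragments into a sequence that is in reference coordinates.

    Assumption: each fragment goes directly after 1-based reference
    position `pos`, i.e. at string index `pos`.
    """
    out = aligned
    # Process right to left so earlier positions are not shifted.
    for pos, seq in sorted(insertions, reverse=True):
        out = out[:pos] + seq + out[pos:]
    return out


# Toy example (hypothetical values, not from the UL73 run).
aligned = "ATGGCTAAAGGTTGA"
restored = reinsert(aligned, parse_insertions("6:C"))
print(restored)

# Dropping gap characters afterwards gives back the ungapped query.
print(restored.replace("-", ""))
```

Processing the insertions from the highest position down keeps the string indices of the remaining entries valid, which is why the list is sorted in reverse before splicing.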
