Entire recent SARS-CoV-2 batch from Cote d’Ivoire not output from Nextclade CLI

I was just refreshing my data from GISAID and noticed that the entire recent batch of 65 samples from Cote d’Ivoire did not appear in the output TSV produced by the Nextclade CLI. No messages were produced.

For example: EPI_ISL_20233322

Hi Mike @mike_honey

I just ran nextclade in verbose mode (-v) on EPI_ISL_20233322:

$ nextclade run -v --input-dataset=datasets/nextstrain/sars-cov-2/wuhan-hu-1/orfs -O tmp/nextstrain/sars-cov-2/wuhan-hu-1/orfs EPI_ISL_20233322.fasta
...
2025-11-05 08:17:31.185 [I] nextclade_loop.rs:103: Processing sequence 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19'
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF1a": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF1b": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF3a": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "E": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "M": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF6": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF7a": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF7b": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF8": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "N": The extracted gene sequence is empty or consists entirely from gaps
2025-11-05 08:17:31.631 [I] nextclade_ordered_writer.rs:163: In sequence #0 'hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19': When processing gene "ORF9b": The extracted gene sequence is empty or consists entirely from gaps

These are Spike-only sequences, so many things will be missing, but I do get a non-empty TSV file.

What exact commands are you running? (for me to reproduce and debug the code in case there’s a bug)

Hi Ivan, thanks for looking into this so quickly. I have tracked down the issue: it seems the Nextclade CLI is making a subtle edit when producing the seqName column. It appears to convert the apostrophe in the country name, which is embedded in the values of that field.

I can’t type it in - this forum seems to convert the apostrophe characters, but here is an image from a text editor, where the difference is clearer:

I’m then using that field to merge GISAID and Nextclade data, which is where those samples are falling out. Apologies for the incorrect information above.

I can just work around that by standardising the quote values when I build my merge key.
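A minimal sketch of that workaround in Python, assuming the two files differ only in which apostrophe-like character they use. The helper name `normalise_key` and the set of variants are illustrative, not part of Nextclade or GISAID tooling, and the `metadata_name` value is a hypothetical example of the mismatch, not a confirmed record:

```python
# Hypothetical helper: map typographic apostrophe variants to the plain
# ASCII apostrophe (U+0027) before using a sequence name as a merge key.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalise_key(name: str) -> str:
    """Return the name with apostrophe variants replaced by U+0027."""
    return name.translate(_TABLE)


# Name as it appears in the FASTA header (U+0027).
fasta_name = "hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19"
# Assumed variant using U+2019, for illustration only.
metadata_name = "hCoV-19/Cote d\u2019Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19"

print(fasta_name == metadata_name)
print(normalise_key(fasta_name) == normalise_key(metadata_name))
```

Applying the same function to the key column on both sides of the merge keeps the original values intact while making the join insensitive to the quote character.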

@mike_honey

Nextclade takes sequence names from the FASTA file and outputs them as is. In the original EPI_ISL_20233322 FASTA from GISAID I get:

>hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025|EPI_ISL_20233322|2025-01-19

i.e. it contains the U+0027 Apostrophe character.

I haven’t checked the GISAID metadata file and don’t know which character it uses, but I verified that if a handcrafted FASTA file contains

Côte d’Ivoire 

with U+2019 Right Single Quotation Mark (as in the official spelling of this country name), Nextclade correctly preserves it in all output files.

The problem you are observing is likely due to a mismatch between the GISAID FASTA file and the GISAID metadata file, i.e. unrelated to Nextclade. Could you please check both of the original files and let me know if that’s the case?
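One possible way to run that check is to print the Unicode code points of the quote-like characters in a name taken from each file. This is only a sketch: the function `describe_quotes` is a made-up helper, and the second sample string is the handcrafted spelling mentioned above, not a value read from GISAID metadata.

```python
import unicodedata


def describe_quotes(text: str) -> list[tuple[str, str]]:
    """List (code point, Unicode name) for each apostrophe or quote mark in text."""
    found = []
    for ch in text:
        # U+0027 has category Po; curly quotes are Pi (initial) or Pf (final).
        if ch == "'" or unicodedata.category(ch) in ("Pi", "Pf"):
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found


# Name from the FASTA header quoted above.
print(describe_quotes("hCoV-19/Cote d'Ivoire/IPCI-DVE-GR0189/2025"))
# Handcrafted spelling with the typographic character.
print(describe_quotes("C\u00f4te d\u2019Ivoire"))
```

Running it on the same record from both files shows directly whether they disagree on the character.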

Yes, you are correct. The GISAID FASTA file does not match the GISAID metadata file. Apologies for burning your time on this.