GISAID nextfasta QC criteria

nadeaus · July 20, 2020, 10:32am

Hi nextstrain folks,

Thanks for the nextmeta and nextfasta download resources on GISAID. Could you point me towards what QC criteria were applied to generate these files? I am reading the QC section on https://clades.nextstrain.org/ and am wondering if either the ‘Nextclade’ or ‘Nextstrain’ criteria apply, or if a third set of criteria were used.

trvrb · July 20, 2020, 10:59pm

These files were generated from the GISAID EpiCoV database using the ncov-ingest pipeline: https://github.com/nextstrain/ncov-ingest. Most of the cleaning happens in this script: https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-gisaid. We also have an extensive “annotations” file to further curate metadata: https://github.com/nextstrain/ncov-ingest/blob/master/source-data/gisaid_annotations.tsv.

rneher · July 21, 2020, 6:45am

As Trevor notes, the QC criteria are applied later in our pipeline. Nextstrain filters out sequences with clustered SNPs (6 in 100 base pairs), divergence that is too high or too low given their date, or more than 3000 Ns. Nextclade flags sequences using similar criteria (it doesn’t have a date, so it can only do a rough divergence check) and starts flagging at 1000 Ns. We are working on harmonizing these.
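For anyone who wants to approximate two of these checks locally, here is a minimal Python sketch covering the N count and the clustered-SNP test. It is not the Nextstrain or Nextclade implementation: the function names, the sliding-window approach, and the assumption that SNPs arrive as a list of genome positions are all illustrative, and "6 in 100 base pairs" is read here as "at least 6 SNPs within any 100 bp window". The divergence check is left out because it needs a date and a clock estimate.

```python
from typing import Iterable

MAX_N = 3000        # Nextstrain threshold quoted in the post above
WINDOW = 100        # window size in base pairs
SNP_THRESHOLD = 6   # assumed to mean "6 or more SNPs in one window"


def count_ns(seq: str) -> int:
    """Count ambiguous 'N' characters, case-insensitively."""
    return seq.upper().count("N")


def has_clustered_snps(positions: Iterable[int],
                       window: int = WINDOW,
                       threshold: int = SNP_THRESHOLD) -> bool:
    """Return True if any `window`-bp stretch holds >= `threshold` SNPs."""
    pos = sorted(positions)
    start = 0
    for end in range(len(pos)):
        # Shrink the window from the left until it spans fewer than `window` bp.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


def passes_qc(seq: str, snp_positions: Iterable[int], max_n: int = MAX_N) -> bool:
    """Hypothetical combined check: not too many Ns and no SNP cluster."""
    return count_ns(seq) <= max_n and not has_clustered_snps(snp_positions)
```

Passing `max_n=1000` would mimic the stricter N threshold mentioned for Nextclade, though Nextclade flags sequences rather than dropping them.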

emmahodcroft · July 21, 2020, 10:26am

Just to add one more bit of clarification: almost all of the QC applies later in the script, but we do only grab ‘long’ sequences from GISAID for the initial ncov-ingest step. So, short SARS-CoV-2 sequences will never appear in the metadata.tsv or sequences.fasta files. I believe we also only take sequences that are >29,000bp here, but it could be slightly shorter. We also take ‘GISAID’s word’ on the length, so if there are lots of gaps or Ns, this isn’t recognised at this point (but those sequences may be kicked out later).
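To illustrate why a length-only filter does not catch sequences that are mostly gaps or Ns, here is a small Python sketch. The function names and the toy threshold in the usage note are hypothetical; this is not code from ncov-ingest, and the real cutoff is whatever transform-gisaid applies.

```python
def passes_length_filter(seq: str, min_length: int) -> bool:
    """Length-only check: every character counts, including N and '-'."""
    return len(seq) > min_length


def unambiguous_length(seq: str) -> int:
    """Count only unambiguous bases (A, C, G, T), case-insensitively."""
    return sum(1 for base in seq.upper() if base in "ACGT")
```

For example, with a toy threshold of 30, the sequence `"ACGT" * 5 + "N" * 20` passes `passes_length_filter` because it is 40 characters long, even though `unambiguous_length` reports only 20 called bases. That kind of sequence would have to be caught by the later N-count filter.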

nadeaus · July 22, 2020, 7:29am

Thanks all for the information! @Emma I think the length criterion is >15,000bp (looking at the script Trevor linked: https://github.com/nextstrain/ncov-ingest/blob/fd67f26312bfc218d471fb8f38df62032d2d7f1c/bin/transform-gisaid).

For downstream QC, is there a way to run the nextclade tool programmatically, or is the drag-and-drop always necessary?

nadeaus · July 22, 2020, 7:44am

Aha, looks like the “diagnostic.py” script in the ncov repo is exactly what I was looking for. In case it helps anyone else: https://github.com/nextstrain/ncov/blob/master/scripts/diagnostic.py

rneher · July 22, 2020, 11:41am

Great. Glad you found it. The diagnostic tool runs over the complete alignment and produces three output files: one with info on every sequence, another with only the sequences with issues, and a third that is meant to be added to the exclude.txt. Let us know if you have additional questions!

guslac · December 15, 2020, 7:05am

nextmeta seems to no longer be available for download, as of a few weeks ago. Did it go away, or did it move to a different location?
