GISAID nextfasta QC criteria

Hi nextstrain folks,

Thanks for the nextmeta and nextfasta download resources on GISAID. Could you point me towards what QC criteria were applied to generate these files? I am reading the QC section on https://clades.nextstrain.org/ and am wondering if either the ‘Nextclade’ or ‘Nextstrain’ criteria apply, or if a third set of criteria were used.

These files were generated from the GISAID EpiCoV database using the ncov-ingest pipeline: https://github.com/nextstrain/ncov-ingest. Most of the cleaning happens in this script: https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-gisaid. We also have an extensive “annotations” file to further curate metadata: https://github.com/nextstrain/ncov-ingest/blob/master/source-data/gisaid_annotations.tsv.

As Trevor notes, the QC criteria are applied later in our pipeline. Nextstrain filters out sequences with clustered SNPs (6 in 100 base pairs), with divergence that is too high or too low for their date, or with more than 3000 Ns. Nextclade flags sequences using similar criteria, but it doesn’t have a date, so it can only do a rough divergence check, and it starts flagging at 1000 Ns. We are working on harmonizing these.
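In case it helps to see two of those criteria concretely, here is a minimal Python sketch of the N-count and clustered-SNP checks. This is not the actual Nextstrain or Nextclade code. The function names are hypothetical, and reading “6 in 100 base pairs” as “6 or more SNPs within any 100 bp span” is an assumption. The divergence-versus-date check is left out because it needs dates and a clock rate.

```python
from typing import Iterable


def count_ns(seq: str) -> int:
    """Count ambiguous 'N' bases in a sequence (case-insensitive)."""
    return seq.upper().count("N")


def has_clustered_snps(snp_positions: Iterable[int],
                       max_snps: int = 6,
                       window: int = 100) -> bool:
    """Return True if max_snps or more SNPs fall within any span of `window` bases.

    Assumption: this is one reading of "6 in 100 base pairs".
    """
    positions = sorted(snp_positions)
    # Slide over every run of max_snps consecutive SNPs and check
    # whether the first and last lie inside one window.
    for i in range(len(positions) - max_snps + 1):
        if positions[i + max_snps - 1] - positions[i] < window:
            return True
    return False


def passes_qc(seq: str, snp_positions: Iterable[int], max_ns: int = 3000) -> bool:
    """Hypothetical combined check: at most max_ns Ns and no SNP cluster."""
    return count_ns(seq) <= max_ns and not has_clustered_snps(snp_positions)
```

Calling `passes_qc(seq, snps, max_ns=1000)` would apply the lower N threshold mentioned for Nextclade, although Nextclade flags sequences rather than dropping them.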

Just to add one more bit of clarification: almost all of the QC is applied later in the pipeline, but we do only grab ‘long’ sequences from GISAID for the initial ncov-ingest step. So short SARS-CoV-2 sequences will never appear in the metadata.tsv or sequences.fasta files. I believe we only take sequences that are >29,000bp here, but the cutoff could be slightly shorter. We also take ‘GISAID’s word’ on the length, so if a sequence has lots of gaps or Ns, this isn’t recognised at this point (but it may be kicked out later).

Thanks all for the information! @Emma I think the length criterion is >15,000bp (looking at the script Trevor linked: https://github.com/nextstrain/ncov-ingest/blob/fd67f26312bfc218d471fb8f38df62032d2d7f1c/bin/transform-gisaid).
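For anyone wanting to reproduce that length cut on their own records, a rough sketch is below. It is not taken from transform-gisaid: the `length` field name and the function name are hypothetical, and the 15,000 threshold is just the value mentioned in this thread. As noted above, it trusts the reported length, so sequences padded with Ns or gaps still pass.

```python
MIN_LENGTH = 15000  # threshold as read from the linked script in this thread


def keep_long_records(records, min_length=MIN_LENGTH):
    """Keep records whose reported length exceeds min_length.

    Assumption: each record is a dict with a 'length' field holding
    the reported sequence length (as an int or a numeric string).
    """
    return [r for r in records if int(r["length"]) > min_length]


records = [
    {"strain": "example/short", "length": "14000"},
    {"strain": "example/long", "length": "29800"},
]
kept = keep_long_records(records)
```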

For downstream QC, is there a way to run the Nextclade tool programmatically, or is the drag-and-drop always necessary?

Aha, looks like the “diagnostic.py” script in the ncov repo is exactly what I was looking for. In case it helps anyone else: https://github.com/nextstrain/ncov/blob/master/scripts/diagnostic.py

Great. Glad you found it. The diagnostic tool runs over the complete alignment and produces three output files: one with info on every sequence, another with only the sequences with issues, and a third that is meant to be added to exclude.txt. Let us know if you have additional questions!