Thanks for the nextmeta and nextfasta download resources on GISAID. Could you point me towards the QC criteria that were applied to generate these files? I am reading the QC section on https://clades.nextstrain.org/ and am wondering whether the ‘Nextclade’ or the ‘Nextstrain’ criteria apply, or whether a third set of criteria was used.
As Trevor notes, the QC criteria are applied later in our pipeline. Nextstrain filters out sequences with clustered SNPs (6 in 100 base pairs), with divergence that is too high or too low given their date, or with more than 3000 Ns. Nextclade flags sequences using similar criteria, but it doesn’t have a date, so it can only do a rough divergence check, and it starts flagging at 1000 Ns. We are working on harmonizing these.
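To make the thresholds above concrete, here is a minimal Python sketch. It is not the actual Nextstrain or Nextclade code: the function names are hypothetical, and it assumes that "6 in 100 base pairs" means at least 6 SNPs inside any 100 bp window, that "more than 3000 Ns" is a strict comparison, and that "starting at 1000 Ns" is inclusive. The date-dependent divergence check is left out.

```python
from bisect import bisect_left


def count_ns(sequence: str) -> int:
    """Count ambiguous 'N' bases, ignoring case."""
    return sequence.upper().count("N")


def has_clustered_snps(snp_positions, window=100, min_in_window=6):
    """Return True if any `window`-bp stretch holds at least `min_in_window` SNPs.

    Assumption: "6 in 100 base pairs" is read as ">= 6 SNPs within 100 bp".
    """
    positions = sorted(snp_positions)
    for i, start in enumerate(positions):
        # Index of the first SNP that falls outside the window [start, start + window).
        j = bisect_left(positions, start + window)
        if j - i >= min_in_window:
            return True
    return False


def qc_flags(sequence: str, snp_positions) -> dict:
    """Apply the N-count and SNP-cluster thresholds described in the post."""
    n_count = count_ns(sequence)
    return {
        "clustered_snps": has_clustered_snps(snp_positions),
        "nextstrain_too_many_ns": n_count > 3000,  # assumed strict: "more than 3000 Ns"
        "nextclade_flags_ns": n_count >= 1000,     # assumed inclusive: "starting at 1000 Ns"
    }
```

With these assumptions, a sequence containing 1500 Ns would be flagged by the Nextclade-style threshold but not removed by the Nextstrain-style one, which is the gap the harmonization is meant to close.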
Just to add one more bit of clarification: almost all of the QC is applied later in the script, but we do only grab ‘long’ sequences from GISAID for the initial ncov-ingest step. So short SARS-CoV-2 sequences will never appear in the metadata.tsv or sequences.fasta files. I believe we only take sequences that are >29,000 bp here, but the cutoff could be slightly shorter. We also take ‘GISAID’s word’ on the length, so if a sequence has lots of gaps or Ns, this isn’t recognised at this point (though such sequences may be kicked out later).
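As a rough illustration of why the reported length alone can be misleading, the sketch below compares a length cutoff with a count of unambiguous bases. The 29,000 bp cutoff is only the value guessed at above, and the function names are hypothetical rather than anything taken from ncov-ingest.

```python
MIN_REPORTED_LENGTH = 29_000  # assumed cutoff; the real value may be slightly lower


def passes_length_filter(reported_length: int, min_length: int = MIN_REPORTED_LENGTH) -> bool:
    """Mimic a filter that trusts the length reported in the metadata."""
    return reported_length > min_length


def unambiguous_length(sequence: str) -> int:
    """Count only A, C, G and T, so that Ns and gaps do not contribute."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


# A sequence that is long on paper but mostly uninformative.
example = "ACGT" * 5000 + "N" * 9500
reported = len(example)                       # 29500
is_kept = passes_length_filter(reported)      # True: passes the length-only check
informative = unambiguous_length(example)     # 20000 usable bases
```

Here the sequence passes the length-only check even though roughly a third of it is Ns, so it would have to be caught by the later QC steps.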
Great, glad you found it. The diagnostic tool runs over the complete alignment and produces three output files: one with info on every sequence, another with only the sequences with issues, and a third that is meant to be added to exclude.txt. Let us know if you have additional questions!