Accession in sequences.fasta, but not on the Nextstrain website

Hi everyone,

I was looking for some accessions in the Nextstrain RSV-A build (Auspice), but cannot find them when filtering for that sample. I downloaded the sequences FASTA and can find them there. One example of such an accession is KU950643. Looking at the downloaded metadata file, it is not obvious to me why the sequence was filtered out. This is the row in the metadata file: KU950643 KU950643.1 KU950643 10/2/2012 North America USA Homo sapiens 4/6/2016 Das et al Das,S.R.,Halpin,R.A.,Shilts,M.,Puri,V.,Akopov,A.,Fedorova,N.,Stockwell,T.,Amedeo,P.,Bishop,B.,Katzel,D.,Schobel,S.,Shrivastava,S.,Hartert,T. 22266 A.D GA2 0 good 45219 1 15225 1 1 1.

Can someone help me understand this?

Thank you!
Thomas

Without looking at the details, my first guess would be subsampling: not all good sequences are included, as that would make the analysis take too long and leave the tree unrepresentative and hard to view.

In the workflow, the subsampling happens here: rule filter (GitHub link)

Other reasons it doesn’t show up could be QC checks, but since your metadata row shows the sample with a good quality score, subsampling is the most likely reason it’s not included.
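To illustrate how subsampling can drop a sequence that passes QC, here is a minimal Python sketch. It is not the workflow's actual code: the `subsample` function, the grouping columns (`country`, `year`), the cap of 5 per group, and the `SEQ...` accessions are all hypothetical and only there to show the idea.

```python
import random
from collections import defaultdict


def subsample(records, group_keys, per_group, seed=0):
    """Keep at most `per_group` randomly chosen records from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        # Records sharing the same values for all group keys land in one group.
        groups[tuple(rec[k] for k in group_keys)].append(rec)
    kept = []
    for members in groups.values():
        if len(members) <= per_group:
            kept.extend(members)
        else:
            kept.extend(rng.sample(members, per_group))
    return kept


# Hypothetical metadata: 50 equally good sequences from one country and year.
records = [
    {"accession": f"SEQ{i:03d}", "country": "USA", "year": 2012, "qc": "good"}
    for i in range(50)
]

kept = subsample(records, group_keys=("country", "year"), per_group=5)
dropped = [r for r in records if r not in kept]
print(len(kept), "kept,", len(dropped), "dropped")
```

In this toy case every dropped record has exactly the same QC status as the kept ones, so nothing in its metadata row would explain why it is missing from the tree.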

The sequences.fasta you’re referring to is probably the output of the ingest pipeline and the input data for the phylogenetic workflow. The input data contains all sequences from GenBank, whereas the workflow filters and subsamples, starting from the full set.
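If you want to check this yourself, a rough sketch like the one below compares the IDs in a FASTA file against the tip names in a tree. I'm assuming the dataset JSON has a top-level `tree` key whose nodes carry `name` and `children`, so please verify that against the file you download. The helper names are made up, and the data here is a small in-memory stand-in (placeholder sequences, and `EXAMPLE1` is a hypothetical accession) rather than the real files.

```python
import io


def fasta_ids(handle):
    """Return the set of sequence IDs (first word after '>') in a FASTA stream."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def tree_tip_names(node):
    """Collect the names of all leaf nodes under `node`, without recursion."""
    names, stack = set(), [node]
    while stack:
        current = stack.pop()
        children = current.get("children", [])
        if children:
            stack.extend(children)
        else:
            names.add(current["name"])
    return names


# Stand-in for sequences.fasta; the sequences are placeholders.
fasta_text = ">KU950643\nACGT\n>EXAMPLE1\nACGT\n"

# Stand-in for a parsed dataset JSON, under the assumed layout.
dataset = {
    "tree": {
        "name": "NODE_0000000",
        "children": [{"name": "EXAMPLE1"}],
    }
}

in_fasta = fasta_ids(io.StringIO(fasta_text))
in_tree = tree_tip_names(dataset["tree"])

# Sequences present in the input data but absent from the displayed tree.
missing_from_tree = sorted(in_fasta - in_tree)
print(missing_from_tree)
```

For the real files you would pass `open("sequences.fasta")` to `fasta_ids` and load the JSON with `json.load` before calling `tree_tip_names`.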

Thank you for the explanation, that makes sense!
