I am trying to see which data are from Canada specifically. Is there a way to visualize how many data points were submitted from Canada, along with detailed info (e.g. number of data points, which city/region, who submitted them, which strains, etc.)? I am new to Nextstrain - any help or suggestions would be much appreciated!
Hi @yoojinepark! Good question; some recent interface improvements by @james hopefully make this slightly more intuitive - here is a tweet we did explaining how to use the new interface to do things like filter by country and other metadata: https://twitter.com/nextstrain/status/1329535390273957894. This kind of filtering is tracked in the url like so: https://nextstrain.org/ncov/north-america?f_country=Canada. It will tell you how many are included within each filtering category, and hovering over sequences in the tree will show who submitted them. Hope this helps - Eli
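For anyone who wants to build these filtered links programmatically, here is a minimal sketch. It assumes filters are passed as `f_<field>=<value>` query parameters, which is what the Canada link above shows; `filtered_url` is a hypothetical helper name, and any field other than `country` is an assumption rather than something confirmed in this thread.

```python
from urllib.parse import urlencode

# Dataset URL taken from the link above.
BASE = "https://nextstrain.org/ncov/north-america"


def filtered_url(base, **filters):
    """Append f_<field>=<value> query parameters to a dataset URL.

    Only f_country is confirmed by the example link; other field
    names are assumptions.
    """
    query = urlencode({f"f_{field}": value for field, value in filters.items()})
    return f"{base}?{query}" if query else base


print(filtered_url(BASE, country="Canada"))
```

Calling it with `country="Canada"` reproduces the link given above.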
Hi @eharkins & Nextstrain team!
First: thanks for making this website! Congratulations on your work!
I’m interested in comparing the latest lineage frequencies between different countries worldwide (I don’t have a specific interest in the tree). Although I have read the tweet, I have several questions about how to achieve that and how to interpret the visualized data.
Depending on how I access the data for a specific country, I get different lineage frequencies. For example, if I want to know the latest frequency of B.1.1.7 in Germany and I filter for Europe > Germany, I get a 21% frequency, while if I filter for Germany directly (dropping down the countries inside Europe and selecting Germany directly) I get a 12% frequency. I guess I’m missing something here. What’s the difference between the two ways of filtering? Which is the best way to filter if I want to compare frequencies among countries all over the world?
What is the denominator of the % for frequencies? On covariants.org this info is provided together with the frequencies, but I can’t find it on Nextstrain.
Is the latest frequency calculated per week, as on covariants.org, or in some other way?
Thank you very much for your time and help!
All the best-
Hi @ECG, Thanks for writing! I’m afraid I’m not 100% clear on what you mean by filtering for German sequences in two different ways. Certainly if you are filtering German sequences on different builds, then these frequencies may change due to subsampling. For example, if you access the Germany build maintained by Neher Lab and filter to Germany, it shows about 27% currently.
If you use the Nextstrain Europe build and filter to Germany, it’s also about 27%.
On the global build this is significantly lower, at about 12% - but the global builds are now extremely subsampled (~4,000 sequences out of >500,000 available), so I’d urge caution when using it to look at country levels - it’s better used just for larger-scale, longer-time-period interpretation.
The denominator of the % for frequencies is the total number of sequences that are visible in the tree given the currently applied filters. If no filters are on, that means it’ll use all the sequences in the tree from the appropriate time-slice. Note that this is different from Covariants.org as CoVariants uses raw sequence data (not a tree) that’s not significantly subsampled, whereas Nextstrain frequencies here reflect the sequences in the tree & thus any subsampling therein.
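To make the denominator point concrete, here is a rough sketch with made-up data. It treats a frequency as a plain count ratio over the currently visible tips; that is a simplification for illustration (it ignores time slices and any smoothing), not a description of how Nextstrain computes things internally. The records and the `lineage_frequencies` helper are hypothetical.

```python
from collections import Counter

# Made-up toy records standing in for the tips of a (subsampled) tree.
tree_tips = [
    {"country": "Germany", "lineage": "B.1.1.7"},
    {"country": "Germany", "lineage": "B.1.177"},
    {"country": "Germany", "lineage": "B.1.1.7"},
    {"country": "Germany", "lineage": "B.1.160"},
    {"country": "France", "lineage": "B.1.160"},
    {"country": "France", "lineage": "B.1.177"},
]


def lineage_frequencies(tips, **filters):
    """Return (frequencies, denominator) for the tips that pass all filters."""
    visible = [t for t in tips if all(t.get(k) == v for k, v in filters.items())]
    denominator = len(visible)  # only the visible tips are counted
    counts = Counter(t["lineage"] for t in visible)
    freqs = {lin: n / denominator for lin, n in counts.items()} if denominator else {}
    return freqs, denominator


all_freqs, all_n = lineage_frequencies(tree_tips)
de_freqs, de_n = lineage_frequencies(tree_tips, country="Germany")
print(all_n, all_freqs["B.1.1.7"])  # denominator is every tip in the toy tree
print(de_n, de_freqs["B.1.1.7"])    # denominator shrinks to the German tips
```

The same lineage gets a different percentage depending on which tips are visible, which is why the same country can look different across builds with different subsampling.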
Frequencies are shown per-week on Nextstrain.org. Note that for CoVariants.org they’re shown per week for the Per Variant plots, but per 2 weeks for the Per Country plots (to smooth jitter).
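As a small illustration of why the bin width matters, the sketch below groups made-up samples into 7-day and 14-day bins and computes one lineage's share in each. The dates, lineages, and the `binned_frequency` helper are all hypothetical, and simple counting is an assumption used only to show the effect of the window size.

```python
from collections import defaultdict
from datetime import date

# Made-up (collection date, lineage) pairs, for illustration only.
samples = [
    (date(2021, 2, 1), "B.1.1.7"),
    (date(2021, 2, 3), "other"),
    (date(2021, 2, 9), "B.1.1.7"),
    (date(2021, 2, 10), "other"),
    (date(2021, 2, 11), "other"),
    (date(2021, 2, 12), "other"),
]


def binned_frequency(samples, lineage, start, bin_days):
    """Share of `lineage` per bin of `bin_days` days, counted from `start`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for collected, lin in samples:
        b = (collected - start).days // bin_days  # bin index for this sample
        totals[b] += 1
        hits[b] += lin == lineage  # True counts as 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}


start = date(2021, 2, 1)
weekly = binned_frequency(samples, "B.1.1.7", start, 7)
biweekly = binned_frequency(samples, "B.1.1.7", start, 14)
print(weekly)
print(biweekly)
```

With these toy numbers the two weekly bins give different values, while the single two-week bin averages over both, which is the smoothing effect mentioned above.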
I hope all this helps!
Indeed, this helps a lot! Sorry if that was not very clear, but you got it right! Thank you very much for your time!
So, if I want to assess the country-level variant distribution, I should use the continental or country build rather than the global build, since subsampling has less of an effect there.
Link between metadata and graph: if I download the metadata and calculate the frequency using only the samples collected in the last week (filtering by “collection date” and not by “submission date”), I should get the same frequencies reported at the last time point of the graph, right? Or should I filter by submission date? Either way, I can’t reproduce the values in the graph, which makes me believe I’m missing something. “Submission date” shows time windows that don’t match the “collection date” timing (e.g., samples listed as submitted 1-2 days ago appear as collected on 4.2.2021).
Time window: I’m a bit confused about the time windows. On the one hand, the frequencies are provided per weekly slice in the graph, but on the other hand I can also filter by “Submission date” and select “last month”. If I filter by “last month”, I still get the per-week time window in the graph, right? So what is the “last month” filter for? Should I avoid the “last month” filter altogether if I just want last week’s frequencies?
Denominators: double-checking that I’m getting it right. If I want to know the variant frequencies, with their denominator, observed in Germany during the last month, I should apply these steps:
- select the Germany build (Neher Lab)
- filter by “last month” → here I get the denominator: 432
- check the latest time slice of the graph → here I get the frequencies (e.g. B.1.1.7: 63% at 13.2.2021 (Pangolin lineage))
This allows me to conclude that, according to the samples collected in Germany in the last month, 63% of 432 samples belonged to the B.1.1.7 variant. Is this correct? Or is this frequency only for the last week?
Given the relevance of this information, I want to make sure I’m using the data correctly. I’m not familiar with phylogenies, which might explain some of my confusion.
Thanks a lot for your patience Emma!