Request for comments: how do you subsample your data?

james · September 12, 2023, 5:03pm

One of the first steps in phylogenetics is collecting the data you want to analyse. This data may come from multiple sources (public repositories, private data, etc.) and is often then pruned down to the specific set used to construct a tree. Selecting the strains to analyse often involves de-duplication, filtering on quality, and downsampling, either randomly or according to certain criteria (e.g. genetic proximity, even sampling across time and space).
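To make those terms concrete, here is a rough Python sketch of one possible scheme: de-duplicate, filter on quality, then sample evenly across time and space. It is purely illustrative and not how any particular tool does it. The function name `subsample`, the record fields (`strain`, `region`, `date`, `length`) and the use of sequence length as the quality measure are all assumptions made up for this example.

```python
import random
from collections import defaultdict


def subsample(records, min_length, per_group, seed=0):
    """De-duplicate, quality-filter, then sample evenly per (region, month).

    Hypothetical record format: dicts with "strain", "region",
    "date" (assumed "YYYY-MM-DD") and "length" keys.
    """
    rng = random.Random(seed)

    # 1. De-duplicate on strain name, keeping the first record seen.
    unique = {}
    for rec in records:
        unique.setdefault(rec["strain"], rec)

    # 2. Quality filter: drop sequences shorter than min_length.
    passed = [r for r in unique.values() if r["length"] >= min_length]

    # 3. Group by region and year-month.
    groups = defaultdict(list)
    for rec in passed:
        groups[(rec["region"], rec["date"][:7])].append(rec)

    # 4. Randomly keep at most per_group records from each group.
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


# Made-up example data, for illustration only.
records = [
    {"strain": "A", "region": "Europe", "date": "2023-01-05", "length": 29800},
    {"strain": "A", "region": "Europe", "date": "2023-01-05", "length": 29800},
    {"strain": "B", "region": "Europe", "date": "2023-01-20", "length": 29750},
    {"strain": "C", "region": "Europe", "date": "2023-01-25", "length": 12000},
    {"strain": "D", "region": "Asia", "date": "2023-02-02", "length": 29900},
]
picked = subsample(records, min_length=27000, per_group=1)
```

Here the duplicate "A" and the short sequence "C" are dropped, and at most one strain is kept per region and month. Real schemes differ in what they group by, how many they keep per group, and whether the choice is random or weighted, which is exactly what we would like to hear about.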

It would really help us if you could describe the way(s) you do this, or how you wish you could do this! There’s no need to write any code; just summarising the approach you take (or would take) would help us better understand the common approaches people are taking. If you take several different approaches (e.g. to answer different questions), feel free to write multiple answers! Please include the pathogen that the desired subsampling scheme applies to and, ideally, the scientific / public health question of interest.

Please respond via this Google form (or feel free to respond here if you prefer).
