Request for comments: how do you subsample your data?

One of the first steps in phylogenetics is collecting the data you want to analyse. This data may come from multiple sources (public repositories, private data, etc.) and is often then pruned down to the specific set used to construct a tree. Selecting the strains to analyse often involves de-duplication, filtering on quality, and downsampling, either randomly or by certain criteria (e.g. genetic proximity, even sampling across time & space).
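To make the terms concrete, here is a rough Python sketch of one possible scheme: de-duplicate by strain name, drop low-quality records, then randomly pick a few strains from each region and month. It is purely illustrative and not how any particular tool does it. The field names (`strain`, `region`, `date`, `quality`), the `subsample` function and its thresholds are all made up for this example.

```python
import random
from collections import defaultdict


def subsample(records, min_quality=0.9, per_group=2, seed=0):
    """De-duplicate, quality-filter, then sample evenly per (region, month).

    `records` is a list of dicts with hypothetical keys:
    "strain", "region", "date" (as "YYYY-MM-DD") and "quality" (0 to 1).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # 1. De-duplicate: keep the first record seen for each strain name.
    unique = {}
    for rec in records:
        unique.setdefault(rec["strain"], rec)

    # 2. Quality filter: drop records below the (assumed) threshold.
    kept = [r for r in unique.values() if r["quality"] >= min_quality]

    # 3. Group by region and year-month for even sampling across space & time.
    groups = defaultdict(list)
    for r in kept:
        groups[(r["region"], r["date"][:7])].append(r)

    # 4. Randomly pick up to `per_group` records from each group.
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled
```

Replacing the random draw in step 4 with a ranking (e.g. by genetic proximity to a focal set) would give a criteria-based scheme instead.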

It would really help us if you could describe the way(s) you do this, or how you wish you could do this! There’s no need to write any code; a summary of the approach you take (or would take) would help us better understand the common approaches people are using. If you take several different approaches (e.g. to answer different questions), feel free to write multiple answers! Please include the pathogen that the desired subsampling scheme applies to and, ideally, the scientific / public health question of interest.

Please respond via this Google form (but feel free to respond here if preferred).
