Subsampling Local DENV dataset based on genetic similarity

Hello,

I hope you are doing well! My goal is to subsample 300 sequences that are genetically similar to a distinct set of sequences, chosen based on the country they were collected in (Colombia, for example). I have gone through a few discussions on subsampling DENV, which suggested that the DENV workflow isn’t suited to accepting custom subsampling schemes just yet. What would be the best alternative way to implement the build/Snakefile below:
denv_col.yaml.txt (1.9 KB)

I only need the fasta file of the subsampled data. Thank you for your help! I can provide the input files through email if needed!


Hello! Subsampling is a big topic, but I think I can point you towards scripts that others have used in the past to do similarity-based subsampling for SARS-CoV-2.

The ncov-europe build isn’t in use anymore, but it shows how subsampling biased by similarity (called “proximity” there) can be done.

This is a script that is used to give each sequence a priority score depending on how closely it’s related to a “focal” set: ncov-europe/scripts/priorities.py at 553615c4c0b46742e103e0db16490deab1add45b · neherlab/ncov-europe · GitHub
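To make the idea concrete, here is a minimal sketch of proximity scoring. It is not the linked script: it assumes the sequences are already aligned to the same length, scores each context strain by minus its distance to the closest focal sequence, and ignores gaps and ambiguous sites. The function names and the toy sequences are made up for illustration.

```python
import csv
import io

# Characters treated as "no information" when comparing sites (a sketch choice).
AMBIGUOUS = {"N", "-", "?"}


def hamming(a, b):
    """Count differing sites between two aligned sequences, skipping gaps/N."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in AMBIGUOUS and y not in AMBIGUOUS
    )


def proximity_priorities(focal, context):
    """Score each context strain as minus its distance to the closest focal sequence."""
    return {
        name: -min(hamming(seq, focal_seq) for focal_seq in focal.values())
        for name, seq in context.items()
    }


def write_priorities(priorities, handle):
    """Write one 'strain<TAB>priority' row per strain, with no header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for strain, score in priorities.items():
        writer.writerow([strain, score])


# Toy, made-up aligned sequences: the focal set is the Colombian samples.
focal = {"COL/1": "ACGTACGT", "COL/2": "ACGTACGA"}
context = {"BRA/1": "ACGTACGA", "VEN/1": "ACGTTCGA", "PER/1": "TTGTNCAA"}

priorities = proximity_priorities(focal, context)
buffer = io.StringIO()
write_priorities(priorities, buffer)
print(buffer.getvalue())
```

Strains closer to the focal set get higher (less negative) scores, and the output is a two-column, headerless, tab-delimited table. For a real dataset you would read the alignment from FASTA and write to a file instead of an in-memory buffer.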

If you go through the workflow, you might be able to figure out how the priorities file is generated and used: ncov-europe/scripts/add_priorities_to_meta.py at 553615c4c0b46742e103e0db16490deab1add45b · neherlab/ncov-europe · GitHub

There’s nothing magic here; you could create these things from scratch, but prior art is often useful.

The “priorities” are used in augur filter to non-randomly subsample.

You can search on GitHub for augur filter and --priority, and you should find them used together in workflows.

Here’s the help text for --priority:

--priority

tab-delimited file with list of priority scores for strains (e.g., “<strain>\t<priority>”) and no header.

When scores are provided, Augur converts scores to floating point values, sorts strains within each subsampling group from highest to lowest priority, and selects the top N strains per group where N is the calculated or requested number of strains per group.
Higher numbers indicate higher priority. Since priorities represent relative values between strains, these values can be arbitrary.
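
The selection rule in that help text can be sketched in a few lines. This is an illustration of the described behaviour, not Augur's code: the function name, the toy groups, and the choice to rank strains without a score last are assumptions of the sketch.

```python
from collections import defaultdict


def top_n_per_group(groups, priorities, n):
    """Pick the n highest-priority strains within each subsampling group."""
    by_group = defaultdict(list)
    for strain, group in groups.items():
        by_group[group].append(strain)

    selected = {}
    for group, strains in by_group.items():
        # Scores are converted to floats; a strain with no score ranks last
        # (an assumption of this sketch).
        ranked = sorted(
            strains,
            key=lambda s: float(priorities.get(s, "-inf")),
            reverse=True,
        )
        selected[group] = ranked[:n]
    return selected


# Toy, made-up data: group by country, keep at most 2 strains per group.
groups = {"BRA/1": "Brazil", "BRA/2": "Brazil", "BRA/3": "Brazil", "VEN/1": "Venezuela"}
priorities = {"BRA/1": "0", "BRA/2": "-5", "BRA/3": "-2", "VEN/1": "-1"}

picked = top_n_per_group(groups, priorities, n=2)
print(picked)
```

Because only the ordering within a group matters, negative distances work fine as scores.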