Is there a way to keep the oldest sample among identical sequences?

During subsampling, is there a way or a parameter we can set so that, among identical sequences, the sample with the earliest collection date is always kept? Or is it something we should do before feeding data to Nextstrain? Thanks!

Is there a reason why you do not want to keep identical sequences?

It’s quite common for many samples to have identical sequences if a lot of cases are sequenced.

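augur filter does not compare the sequences themselves, so it cannot detect identical ones. You would have to script something yourself before running it, e.g. hash each sequence and then, among the sequences with identical hashes, keep only the one you want (here, the one with the earliest collection date).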
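A minimal sketch of that hashing idea in plain Python, in case it helps. This is not part of augur; the helper names (parse_fasta, keep_earliest_per_sequence) and the tiny example data are made up for illustration, and it assumes you already have a collection date for every strain and that "identical" means an exact string match after uppercasing.

```python
import hashlib
from datetime import date


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_earliest_per_sequence(fasta_text, collection_dates):
    """Return one strain name per distinct sequence: the earliest-dated one."""
    best = {}  # sequence hash -> (collection date, strain name)
    for name, seq in parse_fasta(fasta_text):
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        candidate = (collection_dates[name], name)
        # Tuples compare by date first, then by name to break ties.
        if digest not in best or candidate < best[digest]:
            best[digest] = candidate
    return sorted(name for _, name in best.values())


# Hypothetical example data: A and B are identical, B was collected earlier.
example_fasta = """>A
ACGT
>B
acgt
>C
ACGA
"""
example_dates = {
    "A": date(2020, 3, 5),
    "B": date(2020, 2, 1),
    "C": date(2020, 4, 1),
}

kept = keep_earliest_per_sequence(example_fasta, example_dates)
print(kept)  # ['B', 'C']
```

The resulting list of names could then be used to cut down the FASTA and metadata before they go into the Nextstrain workflow. Note that exact hashing treats sequences that differ only by an N, an ambiguous base, or a gap as different, so you may want to decide how to handle those first.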

We’re trying to do representative subsampling for phylodynamics analysis. Very good to know, thanks!!

Thanks. So if every case is sequenced only once, you should not exclude identical sequences - or does the method not work if you have multiple identical sequences?

It’s a Bayesian analysis, so it cannot handle all that much data XD
