Thank you for the detailed analysis, @trs! That’s a perfect summary of the filter logic.
The current behavior is a bug. My intention was for the randomness of strain selection within groups to be based entirely on random priorities that are generated from the random seed argument and assigned to each strain as it is added to the queue.
You can test a minimal version of what I expected to happen with the following code:
```python
from collections import defaultdict
import numpy as np

seed = 3
random_generator = np.random.default_rng(seed)

# Each missing key gets the next value drawn from the seeded generator.
priorities = defaultdict(random_generator.random)

strains = {"A", "B", "C", "D"}
for strain in strains:
    print(priorities[strain])

print(priorities)
```
This code will always assign the same random priorities to the same strains. But, as you noted, this only works because the strain order happens to be maintained here. If the same strains are iterated in a different order, each strain draws a different value from the random stream and gets a different priority.
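For example, here is a minimal sketch that uses an explicit list to control iteration order (same seed and strains as above, just reversed), showing how the assignment of priorities shifts:

```python
from collections import defaultdict
import numpy as np

# Each defaultdict lookup draws the next value from the seeded stream,
# so priorities depend entirely on the order in which strains appear.
random_generator = np.random.default_rng(3)
priorities = defaultdict(random_generator.random)

for strain in ["D", "C", "B", "A"]:
    print(strain, priorities[strain])
```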
I can recreate @thisismynextusername’s issue with the following commands in the ncov repo:
```sh
# Generate one set of random strains.
augur filter \
  --metadata data/example_metadata.tsv \
  --group-by region \
  --sequences-per-group 5 \
  --output-strains test_1.txt \
  --subsample-seed 3

# Generate another set of random strains with the same seed.
# These should be the same strains.
augur filter \
  --metadata data/example_metadata.tsv \
  --group-by region \
  --sequences-per-group 5 \
  --output-strains test_2.txt \
  --subsample-seed 3

# Compare the two strain sets.
diff -u <(sort -k 1,1 test_1.txt) <(sort -k 1,1 test_2.txt)
```
The two sets should be identical, but they are not, because the strains in the original sets are iterated in a non-deterministic order (Python randomizes string hashes per process, so set iteration order can differ between runs).
One simple fix would be to order the strains prior to adding them to the heap like so:
```diff
diff --git a/augur/filter.py b/augur/filter.py
index 5a15377..4475327 100644
--- a/augur/filter.py
+++ b/augur/filter.py
@@ -1428,10 +1428,11 @@ def run(args):
         if queues_by_group is None:
             queues_by_group = {}

-        for strain, group in group_by_strain.items():
+        for strain in sorted(group_by_strain.keys()):
             # During this first pass, we do not know all possible
             # groups will be, so we need to build each group's queue
             # as we first encounter the group.
+            group = group_by_strain[strain]
             if group not in queues_by_group:
                 queues_by_group[group] = PriorityQueue(
                     max_size=sequences_per_group,
```
This change resolves the issue; I can always get the same strains with the same subsample seed.
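To see why sorting works, here is a minimal, self-contained sketch of the pattern (independent of augur's `PriorityQueue`; the `assign_priorities` helper is hypothetical): sorting the strains before assigning seeded priorities makes the result independent of the input set's iteration order.

```python
from collections import defaultdict
import numpy as np

def assign_priorities(strains, seed):
    """Assign a seeded random priority to each strain in sorted order."""
    random_generator = np.random.default_rng(seed)
    priorities = defaultdict(random_generator.random)
    for strain in sorted(strains):
        priorities[strain]  # Draw the next value from the seeded stream.
    return dict(priorities)

# Sorting makes the priorities identical regardless of input order.
assert assign_priorities({"A", "B", "C", "D"}, seed=3) == assign_priorities({"D", "C", "B", "A"}, seed=3)
```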
The same sort of issue appears in two other places in the filter module when we use the `--subsample-max-sequences` argument. A copy of the same non-deterministic loop through strains occurs in this case, and sorting the strains prior to adding them to the queue, as above, solves that part of the issue. Additionally, the queue sizes in this case are randomly drawn from a Poisson distribution, and that random generator does not currently use the user-provided seed. So even if strains are added to queues in the same order, the sizes of the queues can randomly differ between runs and produce different results. Passing the random seed to the function that creates the queues and sorting the groups prior to looping through them fixes this issue, as in the sketch below.
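As a rough illustration (not the actual augur implementation; `queue_sizes_by_group` and its parameters are hypothetical), seeding the generator and sorting the groups makes the Poisson draws reproducible:

```python
import numpy as np

def queue_sizes_by_group(groups, mean_size, random_seed=None):
    """Draw a Poisson-distributed queue size per group, reproducibly."""
    random_generator = np.random.default_rng(random_seed)
    # Sorting the groups guarantees that each group consumes the same
    # position in the seeded random stream on every run.
    return {
        group: random_generator.poisson(mean_size)
        for group in sorted(groups)
    }

# The same seed always produces the same size per group, regardless of
# the iteration order of the input set.
assert (
    queue_sizes_by_group({"Asia", "Europe", "Africa"}, mean_size=5, random_seed=3)
    == queue_sizes_by_group({"Europe", "Africa", "Asia"}, mean_size=5, random_seed=3)
)
```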
I’ve posted a PR that should resolve these issues. Thank you for catching this issue, @thisismynextusername!