Augur filter --subsample-seed reproducible example

I’m trying to get reproducible examples to help myself understand nextstrain better, but I can’t seem to get the augur filter command to return the same results twice. Am I specifying the seed incorrectly, or is something else going on?

augur filter \
--sequences results/filtered_example-data.fasta.xz \
--subsample-seed 3 \
--metadata results/sanitized_metadata_example-data.tsv.xz \
--sequence-index results/combined_sequence_index.tsv.xz \
--include defaults/include.txt \
--exclude defaults/exclude.txt \
--group-by country year month \
--sequences-per-group 20 \
--output results/global/sample-global.fasta \
--output-strains results/global/sample-global.txt 2>&1 | tee logs/subsample_global.txt

When I run the exact same command back-to-back, each run produces different results.

I believe we intend the output to be stable in this case, but indeed I can reproduce this behaviour. @jlhudd may be able to weigh in here on expected behaviour and where a bug might lie.

(I initially thought [filter]: Probalistic sampling doesn't use the given RNG seed · Issue #770 · nextstrain/augur · GitHub might be implicated, but I don’t think the probabilistic sampling code path comes into play here, and I’m not even sure that issue as-described represents an actual bug.)

Reading the code a bit more, I think I see another potential issue, though can’t say for sure it’s real without some actual debugging and running the code.

Here’s the potential reproducibility issue:

  • Our PriorityQueue implementation used for each group prefers the most recently added entries when there are ties in priority.
  • Strains are added to each queue by iterating over keys in a dict, group_by_strain. In practice in CPython 3.6+ (and guaranteed in Python 3.7+), dict keys are returned in insertion order.
  • The dict of strains group_by_strain is constructed by get_groups_for_subsampling(), which builds it up by iterating over the strains passed in as an argument.
  • The strains passed as an argument, in this case seq_keep, is constructed by apply_filters(), and is a set, which is inherently unordered. The iteration order of a set is arbitrary and non-deterministic run-to-run.
  • Thus, strains are added to each queue in a non-deterministic order, meaning ties are resolved non-deterministically.

As a workaround, try setting PYTHONHASHSEED in your environment when running augur filter. This makes the output stable in my testing.

I’m not sure yet what we should do to address this, if anything, in Augur itself.

1 Like

Thank you for the detailed analysis, @trs! That’s a perfect summary of the filter logic.

The current behavior is a bug. My intention was for the randomness of strain selection in groups to be entirely based on random priorities that use the random seed argument and assigned to each strain when it is added to the queue.

You can test a minimal version of what I expected to happen with the following code:

from collections import defaultdict
import numpy as np

seed = 3
random_generator = np.random.default_rng(seed)
priorities = defaultdict(random_generator.random)

strains = {"A", "B", "C", "D"}
for strain in strains:


This code will always assign the same random priorities to the same strains. But, as you noted, this only works because the strain order happens to be maintained here. If we define the strains set with the same values in a different order, we get different priorities per strain.

I can recreate @thisismynextusername’s issue with the following commands in the ncov repo:

# Generate one set of random strains.
augur filter \
  --metadata data/example_metadata.tsv \
  --group-by region \
  --sequences-per-group 5 \
  --output-strains test_1.txt \
  --subsample-seed 3

# Generate another set of random strains with the same seed.
# These should be the same strains.
augur filter \
  --metadata data/example_metadata.tsv \
  --group-by region \
  --sequences-per-group 5 \
  --output-strains test_2.txt \
  --subsample-seed 3

# Compare the two strain sets.
diff -u <(sort -k 1,1 test_1.txt) <(sort -k 1,1 test_2.txt)

The two sets should be identical, but they are not because of the non-deterministic order of the strains in the original sets.

One simple fix would be to order the strains prior to adding them to the heap like so:

diff --git a/augur/ b/augur/
index 5a15377..4475327 100644
--- a/augur/
+++ b/augur/
@@ -1428,10 +1428,11 @@ def run(args):
                     if queues_by_group is None:
                         queues_by_group = {}

-                    for strain, group in group_by_strain.items():
+                    for strain in sorted(group_by_strain.keys()):
                         # During this first pass, we do not know all possible
                         # groups will be, so we need to build each group's queue
                         # as we first encounter the group.
+                        group = group_by_strain[strain]
                         if group not in queues_by_group:
                             queues_by_group[group] = PriorityQueue(

This change resolves the issue; I can always get the same strains with the same subsample seed.

The same sort of issue appears two other places in the filter module when we use the --subsample-max-sequences argument. A copy of that same non-deterministic loop through strains occurs in this case. Sorting strains prior to adding them to the queue as above solves the issue. Additionally, the queue sizes in this case are randomly selected from a Poisson distribution. This random generator does not currently use the user-provided seed, so even if strains are added to queues in the same order, the sizes of the queues can be randomly different between runs and produce different results. Passing the random seed to the function that creates the queues and sorting the groups prior to looping through them fixes this issue.

I’ve posted a PR that should resolve these issues. Thank you for catching this issue, @thisismynextusername!

1 Like