Question about augur filter

Dear mentors,

I have encountered a new question regarding the usage of augur filter commands and would greatly appreciate your guidance.

My objective is to randomly subsample sequences from Bangladesh for May, June, July, and August, selecting 1, 4, 13, and 4 sequences for those months respectively. I tried the syntax below, but the part --sequences-per-group 1 4 13 4 seems to be the problem. I did not find any guidance on this scenario in the instructions; the documentation I found only mentions selecting one sample per month from each country. Is it possible to use augur filter to specify a different fixed number of sequences for specific months within a country? If so, could anyone help me correct the syntax below? Thanks a lot!!

augur filter \
  --sequences HA_globalref.fasta \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & ((month == 5) | (month == 6) | (month == 7) | (month == 8))" \
  --group-by country month \
  --sequences-per-group 1 4 13 4 \
  --output subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

Hi @Emma316,

This is a great question that’s relevant to my current work.

--sequences-per-group takes a single number, which is used as the size for every group.

There are ongoing discussions to support varying group sizes, but currently this is beyond the capabilities of a single augur filter call. It can be done using multiple calls to augur filter. Something like this:

# 1 sequence from month 5
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 5)" \
  --subsample-max-sequences 1 \
  --output-strains subsampled_strains_month5.txt

# 4 sequences from month 6
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 6)" \
  --subsample-max-sequences 4 \
  --output-strains subsampled_strains_month6.txt

# 13 sequences from month 7
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 7)" \
  --subsample-max-sequences 13 \
  --output-strains subsampled_strains_month7.txt

# 4 sequences from month 8
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 8)" \
  --subsample-max-sequences 4 \
  --output-strains subsampled_strains_month8.txt

# Combine samples
augur filter \
  --sequences HA_globalref.fasta \
  --metadata HA_metadata_globalref_updated.tsv \
  --exclude-all \
  --include subsampled_strains_month5.txt \
            subsampled_strains_month6.txt \
            subsampled_strains_month7.txt \
            subsampled_strains_month8.txt \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv
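
If the list of months grows, the per-month calls above could be wrapped in a small shell helper. This is only a sketch under a few assumptions: the function names (subsample_month, subsample_all, combine_samples) and the AUGUR override variable are made up for illustration and are not part of Augur, and the metadata is assumed to have the same country and month columns used in the queries above. Only augur filter options that already appear in this thread are used.

```shell
# Sketch: one augur filter call per month, each with its own sample size.
# AUGUR is a hypothetical override; set AUGUR=echo to print the commands
# instead of running them.
METADATA="HA_metadata_globalref_updated.tsv"
SEQUENCES="HA_globalref.fasta"

# subsample_month MONTH COUNT
# Writes up to COUNT randomly chosen Bangladesh strain names for MONTH.
subsample_month() {
  local month="$1" count="$2"
  "${AUGUR:-augur}" filter \
    --metadata "$METADATA" \
    --query "(country == 'Bangladesh') & (month == $month)" \
    --subsample-max-sequences "$count" \
    --output-strains "subsampled_strains_month${month}.txt"
}

# subsample_all MONTH:COUNT [MONTH:COUNT ...]
# Runs subsample_month once for each MONTH:COUNT pair.
subsample_all() {
  local pair
  for pair in "$@"; do
    subsample_month "${pair%%:*}" "${pair##*:}"
  done
}

# combine_samples STRAIN_FILE [STRAIN_FILE ...]
# Keeps only the strains listed in the given files.
combine_samples() {
  "${AUGUR:-augur}" filter \
    --sequences "$SEQUENCES" \
    --metadata "$METADATA" \
    --exclude-all \
    --include "$@" \
    --output-sequences subsampled_sequences.fasta \
    --output-metadata subsampled_metadata.tsv
}

# Usage with the counts from the question:
#   subsample_all 5:1 6:4 7:13 8:4
#   combine_samples subsampled_strains_month5.txt subsampled_strains_month6.txt \
#                   subsampled_strains_month7.txt subsampled_strains_month8.txt
```

Running with AUGUR=echo first is a cheap way to check the generated queries and file names before any files are written.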

Hi Victorlin,

Well received! Thanks a lot for your information and your work! It’s very helpful!

Great question and very useful answer @victorlin! I’ve been struggling to apply multiple filters myself. I can do this for ncov using the configfile, but is it possible to achieve something similar to the code below using multiple calls to augur filter directly in the Snakefile? Specifically I want to subsample strains that are closest to a specific country.

      min_date: "--min-date 1900-01-01"
      query: --query "(country == '{country}')"
      min_date: "--min-date 1900-01-01"
      group_by: "country year month"
      max_sequences: 1
      sampling_scheme: "--probabilistic-sampling"
      query: --query "(country != '{country}')"
        type: "proximity"
        focus: "country"

Hi @jonr,

The proximity-based sampling in your config is implemented as part of the ncov Snakemake workflow. I haven’t tried it myself, but if you want to use it in another Snakemake workflow, you should be able to copy the relevant rules, functions, and scripts referenced in main_workflow.smk and modify the rule inputs and outputs.

We have future plans to make this available in an Augur command, but no clear timeline.

– Victor

The proximity filtering is really nice. Thanks for the links!