Question about augur filter

Emma316 · March 29, 2024, 9:59am

Dear mentors,

I have encountered a new question regarding the usage of augur filter commands and would greatly appreciate your guidance.

My objective is to randomly subsample sequences from Bangladesh for the months of May, June, July, and August, specifically aiming to select 1, 4, 13, and 4 sequences for each month respectively. I attempted to use the following syntax; however, it seems there is an issue with the part --sequences-per-group 1 4 13 4. Upon reviewing the instructions, I did not find any guidance on this specific scenario. The documentation I found only mentions the capability to select any one sample per month from each country. Is it possible to use Augur filter to specify a fixed number of sequences for specific months within a country? If yes, could anyone help me to correct the below syntax? Thanks a lot!!

augur filter \
  --sequences HA_globalref.fasta \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & ((month == 5) | (month == 6) | (month == 7) | (month == 8))" \
  --group-by country month \
  --sequences-per-group 1 4 13 4 \
  --output subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv

victorlin · March 29, 2024, 6:06pm

Hi @Emma316,

This is a great question that’s relevant to my current work.

--sequences-per-group takes a single number which is the size used for all groups.

There are ongoing discussions to support varying group sizes, but currently this is beyond the capabilities of a single augur filter call. It can be done using multiple calls to augur filter. Something like this:

# 1 sequence from month 5
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 5)" \
  --subsample-max-sequences 1 \
  --output-strains subsampled_strains_month5.txt

# 4 sequences from month 6
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 6)" \
  --subsample-max-sequences 4 \
  --output-strains subsampled_strains_month6.txt

# 13 sequences from month 7
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 7)" \
  --subsample-max-sequences 13 \
  --output-strains subsampled_strains_month7.txt

# 4 sequences from month 8
augur filter \
  --metadata HA_metadata_globalref_updated.tsv \
  --query "(country == 'Bangladesh') & (month == 8)" \
  --subsample-max-sequences 4 \
  --output-strains subsampled_strains_month8.txt

# Combine samples
augur filter \
  --sequences HA_globalref.fasta \
  --metadata HA_metadata_globalref_updated.tsv \
  --exclude-all \
  --include subsampled_strains_month5.txt \
            subsampled_strains_month6.txt \
            subsampled_strains_month7.txt \
            subsampled_strains_month8.txt \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv
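
The same steps can also be written as a loop, which may be easier to maintain if the months or counts change. This is a minimal sketch rather than a tested recipe: the file names, country, and month/count pairs come from the commands above, while the variable names, the `run` helper, and the `DRY_RUN` switch are illustrative assumptions. Like the original query, it assumes the metadata has a numeric `month` column. By default it only prints the commands (shell quoting is not preserved in the printed form); set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: one augur filter call per month, then one call to combine them.
# Variable names, run(), and DRY_RUN are hypothetical helpers for illustration.

METADATA="HA_metadata_globalref_updated.tsv"
SEQUENCES="HA_globalref.fasta"
COUNTRY="Bangladesh"
TARGETS="5:1 6:4 7:13 8:4"   # month:count pairs from the question

# DRY_RUN=1 (default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

strain_files=""
for target in $TARGETS; do
  month="${target%%:*}"   # text before the colon
  count="${target##*:}"   # text after the colon
  out="subsampled_strains_month${month}.txt"

  # Randomly pick up to $count strains for this month.
  run augur filter \
    --metadata "$METADATA" \
    --query "(country == '$COUNTRY') & (month == $month)" \
    --subsample-max-sequences "$count" \
    --output-strains "$out"

  strain_files="$strain_files $out"
done

# Combine: exclude everything, then force-include the sampled strains.
# $strain_files is left unquoted on purpose so it splits into separate paths.
run augur filter \
  --sequences "$SEQUENCES" \
  --metadata "$METADATA" \
  --exclude-all \
  --include $strain_files \
  --output-sequences subsampled_sequences.fasta \
  --output-metadata subsampled_metadata.tsv
```

Note that `--subsample-max-sequences` sets an upper bound: if a month has fewer matching records than requested, that month simply contributes fewer strains.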

Emma316 · March 30, 2024, 4:37am

Hi Victorlin,

Well received! Thanks a lot for your information and your work! It’s very helpful!

jonr · April 17, 2024, 7:51am

Great question and very useful answer @victorlin! I’ve been struggling to apply multiple filters myself. I can do this for ncov using the configfile, but is it possible to achieve something similar to the code below using multiple calls to augur filter directly in the Snakefile? Specifically I want to subsample strains that are closest to a specific country.

  subsampling-scheme:
    country:
      min_date: "--min-date 1900-01-01"
      query: --query "(country == '{country}')"
    related:
      min_date: "--min-date 1900-01-01"
      group_by: "country year month"
      max_sequences: 1
      sampling_scheme: "--probabilistic-sampling"
      query: --query "(country != '{country}')"
      priorities:
        type: "proximity"
        focus: "country"

victorlin · April 18, 2024, 9:58pm

Hi @jonr,

The proximity-based sampling in your config is implemented as part of the ncov Snakemake workflow. I haven’t tried this myself, but if you want to use it in another Snakemake workflow, you should be able to copy the relevant rules/functions/scripts referenced in main_workflow.smk and modify the rule inputs/outputs.

We have future plans to make this available in an Augur command, but no clear timeline.

– Victor

jonr · April 23, 2024, 9:22am

The proximity filtering is really nice. Thanks for the links!
Jon
