Help with Build

I have a build where I want to focus on a specific range of dates.

builds:
    #Wave one focused build
  waveone:
    subsampling_scheme: waves-scheme # use a custom subsampling scheme defined below
    country: Zambia
    min_date: 2020-05-13
    max_date: 2020-10-05
    
filter:
  zambia: #when deprecated - remove this line to nest the below to filter:
    min_length: 5000 # Allow shorter genomes. Parameter used to filter alignment.
    skip_diagnostics: True # skip diagnostics (which can remove genomes) for this input

# STAGE 2: Subsampling parameters
subsampling:
  waves-scheme:
    # filter each dataset for each build
    allFromzambia:
      #exclude: "--exclude-where 'country!={country}'"
      min_date: "--min-date {min_date}"
      max_date: "--max-date {max_date}"
      
    allFromworldwide:
      exclude: "--exclude-where 'country={country}'"
      min_date: "--min-date {min_date}"
      max_date: "--max-date {max_date}"
      
    worldwideglobalBackground:
      exclude: "--exclude-where 'country={country}'"
      group_by: year month
      seq_per_group: 5

The JSON output file from this build contains entries throughout 2020 and 2021. I looked closer: there are fewer than 500 Zambia sequences within the date range, but the augur filter output:

--output results/waveone/sample-allFromzambia.fasta

has almost 700 sequences. Clearly I am doing something wrong or misunderstanding something, but I am unsure where to start. Any help is greatly appreciated.

Thanks in advance!

Hi @dbridges! It looks like a couple of small details in the config file could explain the issues you’ve described.

Regarding the number of Zambia sequences making it into the subsampled FASTA, you’ll want to make sure the exclude: line in the allFromzambia section of the subsampling scheme is not commented out. As written, that section takes all sequences collected in the given date range regardless of country, which is why sample-allFromzambia.fasta holds more sequences than Zambia has in that window. When I run this config with that line uncommented on full GISAID data from Dec 9, I find 320 Zambia sequences in that date range.
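
For reference, here is roughly what the allFromzambia section would look like with that line restored. This is only a fragment to merge into the existing subsampling: block, and it reuses the keys and the filter string already present in the config above; nothing new is introduced.

```yaml
subsampling:
  waves-scheme:
    allFromzambia:
      # Restored line: drop every record whose country is not the build's country
      exclude: "--exclude-where 'country!={country}'"
      min_date: "--min-date {min_date}"
      max_date: "--max-date {max_date}"
```

With the build's country set to Zambia, the {country} placeholder should make this filter keep only Zambia records inside the date window.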

The output from 2021 is coming from the worldwideglobalBackground section of the subsampling scheme, which defines neither min_date nor max_date, so it samples from every year and month in the data. You may want to update this section to match the date ranges in the others. When I made these changes to the config and ran it locally, I didn’t observe any records collected after October 10, 2021 in my subsampled metadata.
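
A sketch of that change, assuming the {min_date} and {max_date} placeholders resolve for this section the same way they do for the other two (they come from the waveone build entry). As before, this is a fragment to merge into the existing subsampling: block rather than a second copy of it.

```yaml
subsampling:
  waves-scheme:
    worldwideglobalBackground:
      exclude: "--exclude-where 'country={country}'"
      group_by: year month
      seq_per_group: 5
      # Added (assumption: same placeholders as the sections above),
      # so the background is limited to the build's date range
      min_date: "--min-date {min_date}"
      max_date: "--max-date {max_date}"
```

If you would rather keep some background from outside the window, you could leave out one of the two bounds; that is a design choice, not something the workflow requires.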

One way to test this out is to update your config file, delete the directory named results/waveone, and then run the workflow again. If you do this, do you still see the same odd results?

Thanks @jlhudd for taking the time to explain this. It seems obvious now, and it appears to be working as expected with some test data. Now trying it out with a much larger dataset!