Hi, I'm confused about the format for custom subsampling, and the Nextstrain docs are fairly sparse on this. Could you look at my builds.yaml file and tell me whether the build will give me what I want?
My input files contain data from California (the first sequences) and from all over the USA (the remaining sequences). I want to build a tree with N_1 CA sequences, plus N_2 sequences subsampled from the rest of the USA that are genetically similar to the CA sequences. I'm not sure whether what I'm doing is right.
Here is the text of my build file with N_1 = N_2 = 500. It's just a slightly modified version of the "custom-county" subsampling section in the ncov/example/builds.yaml file. Will this give me what I want?
inputs:
  - name: shared-id-gisaid-data
    metadata: data/shared-id-gisaid_metadata.tsv
    sequences: data/shared-id-gisaid.fasta.gz
builds:
  shared-id-gisaid-build:
    subsampling_scheme: custom-division
    region: North America
    country: USA
    division: California
subsampling:
  custom-division:
    focal:
      group_by: "division"
      max_sequences: 500
      query: --query "(country == '{country}') & (division == '{division}')"
    related:
      group_by: "division"
      max_sequences: 500
      exclude: "--exclude-where 'division={division}'"
      priorities:
        type: "proximity"
        focus: "focal"
files:
  auspice_config: "my_profiles/example/my_auspice_config.json"
  description: "my_profiles/example/my_description.md"
  include: "defaults/include.txt"
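
To make clear what I'm hoping the build does, here is a toy Python sketch of the selection logic I have in mind. It is only an illustration of the intent, not how augur or the ncov workflow actually implements subsampling or proximity priorities. The record layout, the `pick_focal_and_related` function, and the use of plain Hamming distance are all my own assumptions for the example.

```python
# Toy illustration only: NOT the augur / ncov implementation.
# Record layout, function names, and the distance measure are hypothetical.

def hamming(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def pick_focal_and_related(records, country, division, n_focal, n_related):
    """Pick up to n_focal records from the focal division, then up to
    n_related records from other divisions that are closest to any focal one."""
    focal = [
        r for r in records
        if r["country"] == country and r["division"] == division
    ][:n_focal]
    if not focal:
        return focal, []

    # Everything outside the focal division is a candidate "related" sequence.
    candidates = [r for r in records if r["division"] != division]

    def distance_to_nearest_focal(r):
        return min(hamming(r["seq"], f["seq"]) for f in focal)

    # Smaller distance means more genetically similar to the focal set.
    related = sorted(candidates, key=distance_to_nearest_focal)[:n_related]
    return focal, related


# Tiny made-up dataset, just to exercise the logic.
records = [
    {"strain": "CA-1", "country": "USA", "division": "California", "seq": "AAAA"},
    {"strain": "CA-2", "country": "USA", "division": "California", "seq": "AAAT"},
    {"strain": "NY-1", "country": "USA", "division": "New York", "seq": "AATT"},
    {"strain": "TX-1", "country": "USA", "division": "Texas", "seq": "GGGG"},
    {"strain": "WA-1", "country": "USA", "division": "Washington", "seq": "AAAT"},
]

focal, related = pick_focal_and_related(records, "USA", "California", 2, 2)
print([r["strain"] for r in focal])
print([r["strain"] for r in related])
```

In my real build the two limits would both be 500, and what I want to know is whether the `focal` / `related` blocks in the config above end up selecting roughly this.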