Hi @gvestal! There are a few ways you could subsample from multiple inputs, each with its own subsampling parameters.
For context, the ncov workflow aggregates all inputs into a single sequences FASTA file and a single metadata TSV file and then subsamples from these aggregated inputs. When the workflow merges multiple metadata inputs, it adds a one-hot encoded column per input to the merged metadata, so you can easily select records from a specific subset of inputs during subsampling. For an example of this, see the subsampling parameters in the guide for using multiple inputs.
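For example, assuming your builds config names the inputs `na_subsampling` and `global_subsampling` (the names the queries below rely on), the merged metadata would look roughly like this sketch (strain names and column values here are purely illustrative):

```
strain            date        na_subsampling  global_subsampling
USA/CA-123/2021   2021-01-05  yes             no
England/456/2021  2021-01-10  no              yes
```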
So, given that context, there are three different ways I can think of to achieve what you’ve described.
Option 1. Define the three separate inputs as you’ve done in your example above and define a custom subsampling scheme that separately selects records from the North American and global inputs. Your scheme could look like the example in the multiple inputs guide (although here I use the query option to select records I want instead of excluding records I don’t want):
```yaml
subsampling:
  custom-scheme:
    # Select a fixed number of random sequences from the North American input.
    north-america-subset:
      # The column name identifying the dataset comes from the input's name
      # in the builds config above.
      query: --query "na_subsampling == 'yes'"
      max_sequences: 1000
    # Select a fixed number of random sequences from the global GISAID input.
    global-subset:
      query: --query "global_subsampling == 'yes'"
      max_sequences: 500
```
The benefit of this approach is that you don’t need to know how the North American subsampling was defined; you can just sample randomly from it. The disadvantage is that the North American input and the full GISAID input are redundant with each other: the former is a subset of the latter. This means the workflow will waste a lot of disk I/O merging these two inputs into a single FASTA and metadata file when you could just as easily start from the full GISAID input alone. Another issue with this approach is that the workflow will align, mask, and filter every sequence from all inputs before it merges them, which can take a long time even with many CPUs.
Option 2. Start with only the full GISAID input and select your own North American and global subsets. With this approach, you can skip the extra time and disk required to merge multiple inputs and jump right to subsampling. On the other hand, you have to define your own subsampling logic to get North American sequences. You can use the Nextstrain subsampling scheme as a starting point, but that scheme uses some custom Snakemake rules to select early/late subsets of data. You would need to include those rules in your workflow or define a simpler scheme. Like Option 1, this approach will still align, mask, and filter all GISAID sequences.
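As a minimal sketch of what a simpler scheme could look like, assuming you only need a random sample per region (this relies on the standard region column in the GISAID metadata, and the sample sizes are placeholders):

```yaml
subsampling:
  custom-scheme:
    # Select North American records directly from the standard "region"
    # column in the full GISAID metadata.
    north-america-subset:
      query: --query "region == 'North America'"
      max_sequences: 1000
    # Select a random sample from everywhere else for global context.
    global-subset:
      query: --query "region != 'North America'"
      max_sequences: 500
```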
Option 3. Subsample data prior to running the workflow. You can follow our guide for curating subsampled data from the full GISAID database and produce a single set of subsampled sequences and metadata to pass to the ncov workflow. This approach requires your subsampling logic to live outside of the ncov workflow, but you could imagine putting that logic in a shell script or even writing your own Snakemake rules that you inject into the workflow. This approach only needs to align, mask, and filter the subsampled data you select, so it will be much faster than the other two approaches.
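As a rough sketch of what that outside-the-workflow logic could look like with augur filter (file names and sample sizes are placeholders, and this assumes a recent augur release with the --output-metadata option; see the curation guide for the full details):

```sh
# Select a random North American subset from the full GISAID download
# before handing the results to the ncov workflow as a single input.
augur filter \
  --metadata gisaid_metadata.tsv \
  --sequences gisaid_sequences.fasta \
  --query "region == 'North America'" \
  --subsample-max-sequences 1000 \
  --output-metadata subsampled_metadata.tsv \
  --output-sequences subsampled_sequences.fasta
```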
If you want to get started as quickly as possible, Option 1 requires the least upfront work on your part but the most computational time to complete. Option 3 requires a little extra upfront work but runs the fastest.
If you want to set up a workflow configuration that will be easy to maintain in the long term, either Option 2 or an automated version of Option 3 would be the best choice.