I am doing some research on COVID-19. I ran a Nextstrain build for the USA, but the result contains far fewer sequences than the original data. By my count, the input data contains about 700,000 sequences from the United States (USA), yet the Auspice visualization shows only around 300-500 sequences.
In case you are wondering where the 700K figure came from: I ran Nextclade, obtained a TSV file (nextclade.tsv), and used pandas to get the approximate number of sequences.
How can I get Nextstrain to include all of the sequences? I have attached my build and config files. What should I change to get that output?
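For reference, a count like this can be sketched with pandas roughly as follows. This is only a minimal sketch, not the exact script used: the function name `count_sequences` is made up for illustration, and it assumes the table has a `country` column with the value `USA` (as the ncov metadata TSV does). A plain nextclade.tsv may not carry that column, in which case the country would have to be derived from the sequence name instead.

```python
import io

import pandas as pd


def count_sequences(tsv, column="country", value="USA", chunksize=100_000):
    """Count rows of a tab-separated file whose `column` equals `value`.

    `tsv` may be a path (gzip is inferred from a .gz extension) or a
    file-like object. The column name and value are assumptions.
    """
    total = 0
    # Read in chunks so that a very large file does not have to fit in memory.
    for chunk in pd.read_csv(
        tsv, sep="\t", usecols=[column], dtype=str, chunksize=chunksize
    ):
        total += int((chunk[column] == value).sum())
    return total


# Tiny in-memory stand-in for the real file (made-up rows, illustration only).
sample = io.StringIO(
    "strain\tcountry\n"
    "USA/A/2021\tUSA\n"
    "USA/B/2021\tUSA\n"
    "England/C/2021\tUnited Kingdom\n"
)
print(count_sequences(sample))
```

With the real data, the same call would take the metadata path from the build file, e.g. `count_sequences("data/metadata_gisaid.tsv.gz")`.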
**Build File :**
# This is where we define which builds we'd like to run.
# This example includes 4 separate builds, ranging from the regional (global) to location (county) level.
# You can comment-out, remove, or add as many builds as you'd like.
# Each build needs a name, a defined subsampling process, and geographic attributes used for subsampling.
# Geography is specified by build attributes (e.g., `region`, `country`, `division`, `location`) that are referenced from subsampling schemes.
# The default config file, `./defaults/parameters.yaml` has reasonable default subsampling methods for each geographic resolution.
# These subsample primarily from the area of interest ("focus"), and add in background ("contextual") sequences from the rest of the world.
# Contextual sequences that are genetically similar to (hamming distance) and geographically near the focal sequences are heavily prioritized.
# In this example, we use these default methods. See other templates for examples of how to customize this subsampling scheme.
# Define input files.
inputs:
  - name: example-data
    metadata: data/metadata_gisaid.tsv.gz
    sequences: data/sequences_gisaid.fasta.gz
builds:
  # This build focuses on the entire U.S.
  # with a build name that will produce the following URL fragment on Nextstrain/auspice:
  # /ncov/north-america/usa
  north-america_usa_new:
    region: North America
    country: USA
    # Here, USA is in North America
# Here, you can specify what type of auspice_config you want to use
# and what description you want. These will apply to all the above builds.
# If you want to specify specific files for each build - you can!
# See the 'example_advanced_customization' builds.yaml
files:
  auspice_config: "my_profiles/usa_build_new/my_auspice_config.json"
  description: "my_profiles/usa_build_new/my_description.md"
**Config File :**
#####################################################################################
#### NOTE: head over to `builds.yaml` to define what builds you'd like to run. ####
#### (i.e., datasets and subsampling schemas) ####
#####################################################################################
# This analysis-specific config file overrides the settings in the default config file.
# If a parameter is not defined here, it will fall back to the default value.
configfile:
- defaults/parameters.yaml # Pull in the default values
- my_profiles/usa_build/builds.yaml # Pull in our list of desired builds
# Set the maximum number of cores you want Snakemake to use for this pipeline.
cores: 8
# Always print the commands that will be run to the screen for debugging.
printshellcmds: True
# Print log files of failed jobs
show-failed-logs: True