Regarding Build for USA - Missing Data

I am doing some research on COVID-19. I ran a Nextstrain build for the USA, and I am getting a very small number of sequences compared to the original data. By my calculations, there are about 700,000 sequences from the United States, but the auspice visualization shows only around 300-500 sequences.

If you are wondering where the 700K came from: I ran nextclade, obtained a TSV file (nextclade.tsv), and used pandas to get the approximate number of sequences.
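For reference, a count like this can be reproduced with a few lines of pandas. This is a minimal sketch, not the exact script used: it assumes the strain name is in a `seqName` column and that US strains contain `/USA/` in their names, and `count_matching_rows` is a hypothetical helper name. It reads the file in chunks so that a multi-gigabyte TSV does not have to fit in memory at once.

```python
import io

import pandas as pd


def count_matching_rows(tsv, column, pattern, chunksize=100_000):
    """Count rows whose `column` contains the literal text `pattern`.

    The file is read in chunks, and only the one column is loaded.
    """
    total = 0
    reader = pd.read_csv(
        tsv, sep="\t", usecols=[column], dtype=str, chunksize=chunksize
    )
    for chunk in reader:
        # na=False treats missing names as non-matching.
        total += int(chunk[column].str.contains(pattern, regex=False, na=False).sum())
    return total


# Tiny made-up table standing in for nextclade.tsv (illustration only).
demo = io.StringIO(
    "seqName\tclade\n"
    "hCoV-19/USA/CA-1/2021\t21A\n"
    "hCoV-19/India/X-1/2021\t21A\n"
    "hCoV-19/USA/NY-2/2021\t20I\n"
)
print(count_matching_rows(demo, "seqName", "/USA/"))
```

On the real file, the call would be `count_matching_rows("nextclade.tsv", "seqName", "/USA/")`, with the column name and pattern adjusted if the strain names are laid out differently.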

How can I get nextstrain to process all of the sequences? I have attached my build and config files. What should I change to get that output?

**Build File:**

# This is where we define which builds we'd like to run.
# This example includes 4 separate builds, ranging from the regional (global) to location (county) level.
# You can comment-out, remove, or add as many builds as you'd like.

# Each build needs a name, a defined subsampling process, and geographic attributes used for subsampling.
# Geography is specified by build attributes (e.g., `region`, `country`, `division`, `location`) that are referenced from subsampling schemes.

# The default config file, `./defaults/parameters.yaml` has reasonable default subsampling methods for each geographic resolution.
# These subsample primarily from the area of interest ("focus"), and add in background ("contextual") sequences from the rest of the world.
# Contextual sequences that are genetically similar to (hamming distance) and geographically near the focal sequences are heavily prioritized.

# In this example, we use these default methods. See other templates for examples of how to customize this subsampling scheme.
 
# Define input files.
inputs:
  - name: example-data
    metadata: data/metadata_gisaid.tsv.gz
    sequences: data/sequences_gisaid.fasta.gz

builds:

  # This build focuses on the entire U.S.
  # with a build name that will produce the following URL fragment on Nextstrain/auspice:
  # /ncov/north-america/usa
  north-america_usa_new:
    region: North America
    country: USA
    # Here, USA is in North America
    
# Here, you can specify what type of auspice_config you want to use
# and what description you want. These will apply to all the above builds.
# If you want to specify specific files for each build - you can!
# See the 'example_advanced_customization' builds.yaml
files:
  auspice_config: "my_profiles/usa_build_new/my_auspice_config.json"
  description: "my_profiles/usa_build_new/my_description.md"

**Config File:**

#####################################################################################
#### NOTE: head over to `builds.yaml` to define what builds you'd like to run. ####
#### (i.e., datasets and subsampling schemas)  ####
#####################################################################################

# This analysis-specific config file overrides the settings in the default config file.
# If a parameter is not defined here, it will fall back to the default value.

configfile:
  - defaults/parameters.yaml # Pull in the default values
  - my_profiles/usa_build/builds.yaml # Pull in our list of desired builds

# Set the maximum number of cores you want Snakemake to use for this pipeline.
cores: 8

# Always print the commands that will be run to the screen for debugging.
printshellcmds: True

# Print log files of failed jobs
show-failed-logs: True

Hi @vrmarathe – how many sequences are in your input files (data/metadata_gisaid.tsv.gz, data/sequences_gisaid.fasta.gz)? A nextstrain workflow (“build”) is based off the data in them, so if you want to analyse the entire US data, you’ll need to get hold of that and supply it as inputs to the workflow. We can’t provide this data from GISAID, but have included documentation on how to obtain it. We are able to supply the data available on GenBank.
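A quick way to answer the "how many sequences" question is to count the records directly. The sketch below assumes gzip-compressed inputs like the ones named in the build file; `count_fasta_records` and `count_metadata_rows` are hypothetical helper names, and the metadata count assumes a single header line.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a gzipped FASTA file by counting '>' header lines."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_metadata_rows(path):
    """Count data rows in a gzipped TSV file, excluding one header line."""
    with gzip.open(path, "rt") as handle:
        return max(sum(1 for _ in handle) - 1, 0)
```

For the files in this thread, the calls would be `count_fasta_records("data/sequences_gisaid.fasta.gz")` and `count_metadata_rows("data/metadata_gisaid.tsv.gz")`. If the two numbers differ a lot, the sequences and metadata are probably out of sync.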

Hi @james, the data contains all the sequence data and the metadata that I downloaded from GISAID: the original MSA_full.fasta file and the metadata from the GISAID website. Should I download from GenBank and do some processing instead? The original metadata file is around 2-3 GB and the original FASTA file is around 90 GB.

Great - so just to double check, the data you downloaded from GISAID (MSA_full.fasta) is now present in the files data/metadata_gisaid.tsv.gz and data/sequences_gisaid.fasta.gz, as specified in the builds.yaml file you originally included?

I noticed one thing looking at your attached files: the config.yaml specifies a builds.yaml file in the folder my_profiles/usa_build; however, the builds.yaml file you included looks as if it may be in my_profiles/usa_build_new.

Hi @james, I changed that and ran the new build, and it's missing sequences. Is there anything else I can do?

Could you post the snakemake output here? Or scan it for messages as to why strains are getting filtered out at various steps?
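Scanning a long log by eye is tedious, so here is a small sketch of how it could be automated. The keyword list is an assumption about how the filtering steps word their summaries, and `filtering_messages` is a hypothetical helper, so adjust both to match what actually appears in your log.

```python
import re

# Assumed keywords; change these to match the wording in your own log.
KEYWORDS = re.compile(r"dropped|filtered|passed all filters", re.IGNORECASE)


def filtering_messages(log_lines):
    """Return (line_number, text) pairs for lines that mention filtering."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_lines, start=1)
        if KEYWORDS.search(line)
    ]


# Made-up lines for illustration only, not real workflow output.
sample = [
    "step one started",
    "12 strains were dropped for some reason",
    "step two started",
]
for number, text in filtering_messages(sample):
    print(number, text)
```

With a real log, pass an open file instead of the sample list, e.g. `with open(log_path) as handle: filtering_messages(handle)`, where `log_path` points at the file under `.snakemake/log/`.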

Hi @james, I am pasting the snakemake output.
Google Drive: Sign-in.

I could even do a Google Meet Video call or Zoom meeting to solve the issue.

Hi @james, could I change anything in the default parameters file (parameters.yaml) to get all strains?

Hi @vrmarathe - could you make that snakemake output public? The link you provided requires specific access.

Hi @james, there was an error in the settings that caused it to take the wrong input for the analysis. When it builds the tree using iqtree with only the USA data, it runs into memory problems.

[Wed Oct 27 03:30:49 2021]
Job 6: Building tree

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt

[Wed Oct 27 03:31:47 2021]
Error in rule tree:
jobid: 6
output: results/north-america_usa_with_default/tree_raw.nwk
log: logs/tree_north-america_usa_with_default.txt (check log file(s) for error message)
shell:

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/tree_north-america_usa_with_default.txt:

Complete log: /home/vishwajeet/data/ncov/.snakemake/log/2021-10-27T032002.362655.snakemake.log

ERROR: Shell exited from fatal signal SIGKILL when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Command output was:
/bin/bash: line 1: 183525 Killed iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
The OS may have terminated the command due to an out-of-memory condition.

Building a tree via:
iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

ERROR: TREE BUILDING FAILED
Please see the log file for more details: results/north-america_usa_with_default/aligned-delim.iqtree.log

Building original tree took 4454.085117816925 seconds
[Wed Oct 27 04:45:06 2021]
Error in rule tree:
jobid: 6
output: results/north-america_usa_with_default/tree_raw.nwk
log: logs/tree_north-america_usa_with_default.txt (check log file(s) for error message)
shell:

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/tree_north-america_usa_with_default.txt:

ERROR: Shell exited from fatal signal SIGKILL when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Command output was:
/bin/bash: line 1: 183525 Killed iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
The OS may have terminated the command due to an out-of-memory condition.

LOG FILE:
IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: yan2-computer2 (AVX, 251 GB RAM)
Command: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4
Seed: 111503 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Wed Oct 27 03:41:17 2021
Kernel: AVX - 8 threads (24 CPU cores detected)

Reading alignment file results/north-america_usa_with_default/aligned-delim.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences

I have some questions:

1. Is there a way to skip generating the tree and go straight to the mutation step of the process, i.e. the AA mutations and nucleotide mutations? I only need the mutations for my research.
2. I asked this question before, when I ran the process for the entire GISAID dataset. The VM I am using has around 250 GB of RAM and 5 TB of HDD. Currently, I am using the USA parameter in the builds.yaml file for filtering, and it still runs out of memory. Is there a way to solve this?
3. Could we do a Zoom meeting or a Google Meet to solve the problem with nextstrain? We could schedule a time, which would make it easier for me to get the workflow running.
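On question 1, one possible tree-free route is to tally mutations straight from the nextclade.tsv mentioned at the top of the thread. This is only a sketch under assumptions: it expects tab-separated output with comma-separated `substitutions` and `aaSubstitutions` columns, `tally_mutations` is a hypothetical helper, and the mutations are relative to the reference sequence rather than the per-branch mutations a tree-based step would infer.

```python
import csv
import io
from collections import Counter


def tally_mutations(tsv_handle, column):
    """Count how many sequences carry each mutation listed in `column`."""
    counts = Counter()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        cell = (row.get(column) or "").strip()
        if cell:
            # Each cell is assumed to be a comma-separated list of mutations.
            counts.update(m.strip() for m in cell.split(",") if m.strip())
    return counts


# Made-up rows for illustration only.
demo = io.StringIO(
    "seqName\tsubstitutions\taaSubstitutions\n"
    "seq1\tC241T,A23403G\tS:D614G\n"
    "seq2\tA23403G\tS:D614G,N:R203K\n"
)
for mutation, n in tally_mutations(demo, "aaSubstitutions").most_common():
    print(mutation, n)
```

Because rows are streamed one at a time, memory use stays small even for hundreds of thousands of sequences.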