Regarding Build for USA - Missing Data

vrmarathe · October 19, 2021, 8:47am

I am doing some research on COVID-19. I ran a Nextstrain build for the USA, but the result contains far fewer sequences than the original data. By my count, about 700,000 sequences are from the United States (USA), yet the Auspice visualization shows only around 300-500 sequences.

In case you are wondering where the 700K figure came from: I ran Nextclade, obtained a TSV file (nextclade.tsv), and used pandas to get the approximate number of sequences.
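For reference, a count like this can also be reproduced directly from a metadata table. The sketch below is not the code used in this post: it uses only the Python standard library rather than pandas, and it assumes a tab-separated table with a header row and a `country` column in which US sequences are labelled `USA`. The function names are made up for illustration.

```python
import csv
import gzip
import io
from collections import Counter


def count_by_column(lines, column="country"):
    """Count rows per value of `column` in a tab-separated table with a header."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[column] for row in reader)


def count_by_column_in_file(path, column="country"):
    """Same as count_by_column, but reads a plain or gzip-compressed TSV file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return count_by_column(handle, column)


# Tiny in-memory table standing in for the real metadata (values are invented).
sample = io.StringIO(
    "strain\tcountry\n"
    "USA/CA-1/2021\tUSA\n"
    "USA/NY-2/2021\tUSA\n"
    "England/X-3/2021\tUnited Kingdom\n"
)
counts = count_by_column(sample)
print(counts["USA"])
```

On the real data, `count_by_column_in_file("data/metadata_gisaid.tsv.gz")["USA"]` would give the number of US rows, provided the column name and label match the assumptions above.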

How can I get Nextstrain to include all of the sequences? I have attached my build and config files. What should I change to get the full output?

**My Build File:**

# This is where we define which builds we'd like to run.
# This example includes 4 separate builds, ranging from the regional (global) to location (county) level.
# You can comment-out, remove, or add as many builds as you'd like.

# Each build needs a name, a defined subsampling process, and geographic attributes used for subsampling.
# Geography is specified by build attributes (e.g., `region`, `country`, `division`, `location`) that are referenced from subsampling schemes.

# The default config file, `./defaults/parameters.yaml` has reasonable default subsampling methods for each geographic resolution.
# These subsample primarily from the area of interest ("focus"), and add in background ("contextual") sequences from the rest of the world.
# Contextual sequences that are genetically similar to (hamming distance) and geographically near the focal sequences are heavily prioritized.

# In this example, we use these default methods. See other templates for examples of how to customize this subsampling scheme.
 
# Define input files.
inputs:
  - name: example-data
    metadata: data/metadata_gisaid.tsv.gz
    sequences: data/sequences_gisaid.fasta.gz

builds:

  # This build focuses on the entire U.S.
  # with a build name that will produce the following URL fragment on Nextstrain/auspice:
  # /ncov/north-america/usa
  north-america_usa_new:
    region: North America
    country: USA
    # Here, USA is in North America
    
# Here, you can specify what type of auspice_config you want to use
# and what description you want. These will apply to all the above builds.
# If you want to specify specific files for each build - you can!
# See the 'example_advanced_customization' builds.yaml
files:
  auspice_config: "my_profiles/usa_build_new/my_auspice_config.json"
  description: "my_profiles/usa_build_new/my_description.md"

**Config File:**

#####################################################################################
#### NOTE: head over to `builds.yaml` to define what builds you'd like to run. ####
#### (i.e., datasets and subsampling schemas)  ####
#####################################################################################

# This analysis-specific config file overrides the settings in the default config file.
# If a parameter is not defined here, it will fall back to the default value.

configfile:
  - defaults/parameters.yaml # Pull in the default values
  - my_profiles/usa_build/builds.yaml # Pull in our list of desired builds

# Set the maximum number of cores you want Snakemake to use for this pipeline.
cores: 8

# Always print the commands that will be run to the screen for debugging.
printshellcmds: True

# Print log files of failed jobs
show-failed-logs: True

james · October 19, 2021, 8:04pm

Hi @vrmarathe – how many sequences are in your input files (data/metadata_gisaid.tsv.gz, data/sequences_gisaid.fasta.gz)? A nextstrain workflow (“build”) is based off the data in them, so if you want to analyse the entire US data, you’ll need to get hold of that and supply it as inputs to the workflow. We can’t provide this data from GISAID, but have included documentation on how to obtain it. We are able to supply the data available on GenBank.
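One quick way to answer this question is to count the header lines in the sequences file. The following is a minimal sketch, assuming a standard FASTA layout in which every record starts with a `>` header line. The helper names are hypothetical.

```python
import gzip
import io


def count_fasta_records(lines):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_records_in_file(path):
    """Count records in a plain or gzip-compressed FASTA file, streaming line by line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fasta_records(handle)


# Small in-memory example: two records, one of them wrapped over two lines.
example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nACGT\n")
n_records = count_fasta_records(example)
print(n_records)
```

Calling `count_fasta_records_in_file("data/sequences_gisaid.fasta.gz")` would report how many sequences the workflow actually receives. Because the file is streamed, this should work on very large inputs, although decompressing them will take a while.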

vrmarathe · October 19, 2021, 8:49pm

Hi @james, the data contains all the sequences and the metadata, which I downloaded from GISAID: the original MSA_full.fasta file and the metadata file from the GISAID website. Should I download from GenBank instead and do some processing? The original metadata file is around 2-3 GB and the original FASTA file is around 90 GB.

james · October 19, 2021, 9:08pm

Great - so just to double check, the data you downloaded from GISAID (MSA_full.fasta) is now present in the files data/metadata_gisaid.tsv.gz and data/sequences_gisaid.fasta.gz, as specified in the builds.yaml file you originally included?

I noticed one thing while looking at your attached files: the config.yaml specifies a builds.yaml file in the folder my_profiles/usa_build, whereas the builds.yaml file you included looks as if it may be in my_profiles/usa_build_new.

vrmarathe · October 19, 2021, 9:33pm

Hi @james, I changed that and ran the new build, and it is still missing sequences. Is there anything else I can do?

james · October 19, 2021, 9:37pm

Could you post the snakemake output here? Or scan it for messages as to why strains are getting filtered out at various steps?

vrmarathe · October 19, 2021, 9:43pm

Hi @james, I am pasting the snakemake output here:
Google Drive: Sign-in.

I could also do a Google Meet or Zoom video call to work through the issue.

vrmarathe · October 22, 2021, 9:05pm

Hi @james, could I change anything in the default parameters file (parameters.yaml) to get all strains?

james · October 24, 2021, 9:26pm

Hi @vrmarathe - could you make that snakemake output public? The link you provided requires specific access.

vrmarathe · October 27, 2021, 6:54pm

Hi @james, there was an error in my settings: the workflow was taking the wrong input for the analysis. Now, when it builds the tree with IQ-TREE using only the USA data, it runs into memory problems.

[Wed Oct 27 03:30:49 2021]
Job 6: Building tree

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt

[Wed Oct 27 03:31:47 2021]
Error in rule tree:
jobid: 6
output: results/north-america_usa_with_default/tree_raw.nwk
log: logs/tree_north-america_usa_with_default.txt (check log file(s) for error message)
shell:

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/tree_north-america_usa_with_default.txt:

Complete log: /home/vishwajeet/data/ncov/.snakemake/log/2021-10-27T032002.362655.snakemake.log

ERROR: Shell exited from fatal signal SIGKILL when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Command output was:
/bin/bash: line 1: 183525 Killed iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
The OS may have terminated the command due to an out-of-memory condition.

Building a tree via:
iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

ERROR: TREE BUILDING FAILED
Please see the log file for more details: results/north-america_usa_with_default/aligned-delim.iqtree.log

Building original tree took 4454.085117816925 seconds
[Wed Oct 27 04:45:06 2021]
Error in rule tree:
jobid: 6
output: results/north-america_usa_with_default/tree_raw.nwk
log: logs/tree_north-america_usa_with_default.txt (check log file(s) for error message)
shell:

    augur tree             --alignment results/north-america_usa_with_default/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/north-america_usa_with_default/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_north-america_usa_with_default.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/tree_north-america_usa_with_default.txt:

ERROR: Shell exited from fatal signal SIGKILL when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
Command output was:
/bin/bash: line 1: 183525 Killed iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/north-america_usa_with_default/aligned-delim.iqtree.log
The OS may have terminated the command due to an out-of-memory condition.

LOG FILE:
IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: yan2-computer2 (AVX, 251 GB RAM)
Command: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 8 -s results/north-america_usa_with_default/aligned-delim.fasta -m GTR -ninit 10 -n 4
Seed: 111503 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Wed Oct 27 03:41:17 2021
Kernel: AVX - 8 threads (24 CPU cores detected)

Reading alignment file results/north-america_usa_with_default/aligned-delim.fasta … Fasta format detected
Alignment most likely contains DNA/RNA sequences

I have some questions:

1. Is there a way to skip generating the tree and go straight to the step that produces the mutations, i.e. the amino acid (AA) and nucleotide mutations? I only need the mutations for my research.
2. I asked this before, when I ran the process on the entire GISAID dataset. The VM I am using has around 250 GB of RAM and a 5 TB HDD. Currently, I am using the USA parameter in the builds.yaml file for filtering, and it still runs out of memory. Is there a way to solve this?
3. Could we set up a Zoom or Google Meet call to work through the problem with Nextstrain? We could schedule a time, which would make it easier for me to get the workflow running.
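On the first question, one possible route that avoids tree building is to read the mutations from the nextclade.tsv file mentioned at the top of the thread. This is only a sketch under stated assumptions: it assumes that file has Nextclade's `seqName`, `substitutions`, and `aaSubstitutions` columns, with mutations stored as comma-separated strings, and the function name is hypothetical. It is not part of the ncov workflow.

```python
import csv
import io


def mutations_per_sequence(lines):
    """Map each sequence name to its nucleotide and amino-acid substitutions."""
    reader = csv.DictReader(lines, delimiter="\t")
    result = {}
    for row in reader:
        result[row["seqName"]] = {
            # Empty cells mean "no substitutions", so drop empty strings.
            "nuc": [m for m in row["substitutions"].split(",") if m],
            "aa": [m for m in row["aaSubstitutions"].split(",") if m],
        }
    return result


# In-memory stand-in for nextclade.tsv (sequence names are invented).
sample = io.StringIO(
    "seqName\tsubstitutions\taaSubstitutions\n"
    "USA/CA-1/2021\tC241T,A23403G\tS:D614G\n"
    "USA/NY-2/2021\t\t\n"
)
muts = mutations_per_sequence(sample)
print(muts["USA/CA-1/2021"]["aa"])
```

For a file with hundreds of thousands of rows, it may be preferable to process each row as it is read instead of collecting everything in one dictionary.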
