Help with a phylogenetic tree for dengue

Hi

This is already my second topic on this discussion forum. I have a project coming up, and the questions that have come up are: can I build a phylogenetic tree for dengue, and which library do I have to use?

I also want to raise a concern about the clade frequencies panel in the phylogenetic trees I have created. I don’t know why, because I have around 3000 sequences, but the frequencies panel can only be displayed for the final 8 months of the project. The panel looks like this:

But when I limit the date range to a period outside those 8 months, I don’t see the frequencies, and the panel looks like this:

Does anyone know whether I have to change something in the workflow, or elsewhere in the run, to get frequencies for all periods of the project rather than just one?

We have a dengue analysis workflow at https://github.com/nextstrain/dengue/, which may be a useful starting point for you.

There’s a problem with the example data in the link you posted: it is corrupted or damaged. Does anyone know where I can find example data for a dengue repository covering serotypes 1 to 4?

@guillermoNR How are the files in https://github.com/nextstrain/dengue/tree/main/example_data corrupted or damaged? The files seem to be fine to me and uncompress with unzstd or zstdcat without error.

Hi @trs.

The problems with the dengue virus phylogenetic tree persist, and I have some questions for you. I have tried copying the Snakefile and downloading the example data as you mentioned. I think my previous answer to your first reply was not worded well: I know the files aren’t corrupted, but when I download them directly from GitHub, the result looks something like this:

But now I understand that I have to uncompress the files to view them properly. What I want now is to build the phylogenetic tree for dengue sequences published in Colombia over the last 30 years. Some of the sequences have a genome status of partial and others are complete. The problem at the moment is this: I copied the Snakefile from github/nextstrain/dengue and downloaded the example data from that page, but the problems persist. I think it is because I don’t know for certain whether I need more files to do my work than these: the metadata and sequences I collected on dengue virus sequencing in Colombia, the Snakefile, and a copy of the Snakemake files from the ncov directory.
I tried running a build that I created for this tree, and I am also attaching a picture of the results from my latest attempts.

I need help and some explanations. Thank you in advance for your replies and your help.

@guillermoNR Ah, ok, that’s a lot of helpful information to understand what you’re doing. Thanks!

You’ll need all (or mostly all) of the files in the https://github.com/nextstrain/dengue repository. That you downloaded just a few of the files is the reason why you see the “Missing input files” errors in your second screenshot.

Probably the easiest way to download the whole repository is by navigating to https://github.com/nextstrain/dengue, clicking the green “Code” button, and clicking “Download ZIP”. This is a snapshot of the repository at the time of download. Extracting the ZIP file produces a dengue-main folder on your computer.

You can also download the repository using the git command, if you have that, with:

git clone https://github.com/nextstrain/dengue

This downloads the entire history of the repository, and it would be possible for you in the future to fetch updates we make to the repository using other git commands. This produces a dengue folder on your computer.

Once you have a folder with a complete copy of the repository, you can verify it works for you and then start making modifications if you need any.

Regarding your screenshot of the unintelligible metadata_all.tsv.zst file you downloaded, that looks to me like the file was not uncompressed with unzstd first. Notepad is attempting to show as text data that’s not text. You can either manually decompress the file to turn the compressed metadata_all.tsv.zst into a plain text metadata_all.tsv, or let the Snakemake workflow do that and find the uncompressed copies in results/.

I’ll also tag in @quietjen as she’s been working on our dengue workflow recently and may be able to offer more help.

Hi @trs @quietjen

Thomas’s last reply gave me a lot to work with, and it is a partial solution for the tree I want to visualize. Here is a picture of the tree produced by the phylogenetic analysis:

I confirmed that at least some of the sequences I can see are in my metadata and FASTA files, BUT I don’t know why the tree only includes 20 samples from Colombia when I provided around 700 samples from the country, published at https://www.bv-brc.org/view/Taxonomy/12637#view_tab=overview. I’m concerned because I don’t know whether the program is dropping many of my samples during subsampling because their genome status is partial rather than complete, or for some other reason.

(It can’t be a problem with the subsampling scheme, because I have already configured those parameters, as this picture shows.)

PS: I uploaded a picture showing my metadata file so you can confirm that I am talking about 700+ sequences.

Thanks a lot to Thomas for the last reply. I am close, I hope. Thanks in advance.

@guillermoNR Glad you’ve made progress!

It seems like you’re using an adaptation of our ncov workflow (nextstrain/ncov on GitHub, the Nextstrain build for novel coronavirus SARS-CoV-2) for dengue? Can you share your full Snakefile, config files, and metadata file? As uploads, not as screenshots.

It’ll be hard to say what’s going on without being able to inspect those files.

OK, I’m trying to upload the full files in the comment, but it shows me a message like this:

I should also mention that I’m using the same Snakefile that is on GitHub for dengue.

How can I send you the full files?

Thanks Thomas

To share, you can zip the inputs into a folder and share via any file sharing platform: Google drive, we share. Or send it via email to hello@nextstrain.org if it needs to be private.

Ah, sorry about that upload restriction. I’ve changed the settings to be more permissive, as we do need folks to be able to upload arbitrary files here from time to time. :slight_smile: Can you try again?

No problem, @trs

Metadata
dengue.metadata.prueba1.csv (145.3 KB)

Fasta
dengue.prueba2.fasta (1.8 MB)

Build.yaml
builds.yaml (2.8 KB)

Snakefile
Snakefile (8.8 KB)

Ah, this will be the issue then, and what you uploaded confirms it. Unlike our ncov workflow, the dengue workflow’s Snakefile doesn’t use a configuration file (builds.yaml). So your input data files (TSV and FASTA) and subsampling schemes aren’t being used by the workflow.

I expect the Colombian sequences you found in the tree you generated are from NCBI GenBank, by way of our curated dengue data files that are downloaded at the start of the workflow.

The dengue workflow currently isn’t set up well to accept custom input files or custom subsampling schemes, although it’s on our list of things to do.

Oh right, that’s very instructive. Now I know what has to happen to build the phylogenetic tree I’m interested in. Thanks for all your help.

Finally, I have one last question: when I build the phylogenetic tree with the current Snakefile, can I subsample the tree? For example, can I configure it so that it doesn’t show all of the 1500+ sequences that the Snakefile downloads?

when I build the phylogenetic tree with the current Snakefile, can I subsample the tree?

If you want to subsample the phylogenetic tree, subsampling would happen in the dengue Snakefile at the augur filter step.

At this step, for example, to subsample to only Colombian sequences, modify the --exclude-where option:

--exclude-where country!="Colombia" region=? date=?

You can find more information about augur filter subsampling in the augur filter documentation.
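To make the effect of those conditions concrete, here is a small plain-Python sketch of the selection logic. It is only an illustration, not augur's implementation: the function name `excluded` and the three sample rows are made up, and it uses exact string comparison, which may differ from how augur compares values.

```python
def excluded(row, conditions):
    """Return True if any condition excludes this metadata row.

    "column!=value" excludes rows whose column differs from value;
    "column=value" excludes rows whose column equals value.
    """
    for condition in conditions:
        if "!=" in condition:
            column, value = condition.split("!=", 1)
            if row.get(column) != value:
                return True
        else:
            column, value = condition.split("=", 1)
            if row.get(column) == value:
                return True
    return False


# Hypothetical rows, shaped like the metadata discussed in this thread.
rows = [
    {"strain": "A", "country": "Colombia", "region": "South_America", "date": "2009"},
    {"strain": "B", "country": "Brazil", "region": "South_America", "date": "2010"},
    {"strain": "C", "country": "Colombia", "region": "?", "date": "2011"},
]
conditions = ["country!=Colombia", "region=?", "date=?"]

kept = [row["strain"] for row in rows if not excluded(row, conditions)]
print(kept)
```

Because the comparison is made against the values in the `country` column, the spelling in the condition has to match the metadata, which uses `Colombia`.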

If your goal is to use only the sequences you uploaded here, then I recommend modifying your metadata file to be comma-separated or tab-separated (instead of semicolon-separated).

head -n3 dengue.metadata.prueba1.csv
genome_id;strain;genus_id;ncbi_id;superkingdom;kingdom;phylum;class;order;family;genus;species;type;genome_status;region;country;region;location;date
11053.106;DENV-1/CO/SAN/LV122007/2009;11053;10239;Orthornavirae;Kitrinoviricota;Flasuviricetes;Amarillovirales;Flaviviridae;Flavivirus;dengue;Dengue virus;I;Partial;South_America;Colombia;Santander;Floridablanca;2009
11053.10603;DENV-1/CO/SAN/LV121086/2010;11053;10239;Orthornavirae;Kitrinoviricota;Flasuviricetes;Amarillovirales;Flaviviridae;Flavivirus;dengue;Dengue virus;I;Partial;South_America;Colombia;Santander;Floridablanca;2010
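One possible way to do that conversion is with Python's standard `csv` module, sketched below. Note that the header above lists `region` twice; the sketch renames the repeat to `region_2`, which is an arbitrary choice made here for illustration. The function names and the shortened sample row are hypothetical.

```python
import csv
import io


def dedupe(names):
    """Make column names unique by suffixing repeats (region, region_2, ...)."""
    seen = {}
    unique = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


def semicolon_to_tsv(source, target):
    """Copy semicolon-separated rows from `source` to `target` as tab-separated."""
    reader = csv.reader(source, delimiter=";")
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    writer.writerow(dedupe(next(reader)))  # header row, with repeats renamed
    for row in reader:
        writer.writerow(row)


# Shortened sample with the same duplicate `region` column as the real file.
sample = io.StringIO(
    "strain;region;country;region;location;date\n"
    "DENV-1/CO/SAN/LV122007/2009;South_America;Colombia;Santander;Floridablanca;2009\n"
)
converted = io.StringIO()
semicolon_to_tsv(sample, converted)
print(converted.getvalue())
```

For the real file, the same function can be called with two file objects opened with `newline=""`, reading `dengue.metadata.prueba1.csv` and writing to an output name of your choice.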

I also recommend modifying your sequences FASTA file so that the strain name is the only thing listed in each header:

head -n 3 dengue.prueba2.fasta 
>accn|KX901653   Dengue virus 1 isolate DENV-1/CO/SAN/LV122007/2009 polyprotein gene, partial cds.   [Dengue virus 1 DENV-1/CO/SAN/LV122007/2009 | 11053.10600]
atgcgatgtgtgggaataggcaacagagacttcgttgaaggactgtcaggagcaacgtgg
gtggacgtggtattggagcatggaagctgcgtcaccaccatggcaaaaaataaaccaaca
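A rough Python sketch of that header clean-up follows. It assumes every header looks like the single example above, with the strain name as the last word before the ` | ` inside the trailing square brackets, and that strain names contain no spaces; headers in another layout raise an error instead of being guessed at. The function names are hypothetical.

```python
import re

# Assumed header layout, based only on the one example shown above:
#   >accn|ACCESSION   description   [Organism name STRAIN | genome_id]
BRACKET_RE = re.compile(r"\[([^\]|]+)\|[^\]]*\]\s*$")


def strain_from_header(header):
    """Return the last word before ' | ' in the trailing [...] block, or None."""
    match = BRACKET_RE.search(header)
    if match is None:
        return None
    words = match.group(1).split()
    return words[-1] if words else None


def rename_headers(lines):
    """Yield FASTA lines with each header reduced to '>strain'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            strain = strain_from_header(line)
            if strain is None:
                raise ValueError(f"cannot find a strain name in: {line}")
            yield ">" + strain
        else:
            yield line


sample = [
    ">accn|KX901653   Dengue virus 1 isolate DENV-1/CO/SAN/LV122007/2009 "
    "polyprotein gene, partial cds.   "
    "[Dengue virus 1 DENV-1/CO/SAN/LV122007/2009 | 11053.10600]\n",
    "atgcgatgtgtgggaataggcaacagagacttcgttgaaggactgtcaggagcaacgtgg\n",
]
renamed = list(rename_headers(sample))
print(renamed[0])
```

Matching on the strain name rather than the genome ID also avoids a mismatch visible in the excerpts: the metadata lists `11053.106` for this strain, while the FASTA header lists `11053.10600`.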

From there, you can try different augur filter commands on your dataset.

# Subsample to only Dengue Type I
augur filter \
  --metadata dengue.metadata.prueba1.csv \
  --sequences dengue.prueba2.fasta \
  --exclude-where type!="I" \
  --output filtered.fasta

grep ">" filtered.fasta | head 

As mentioned, the dengue Snakefile is currently not written for tiered subsampling of the kind you have listed in your builds.yaml:

  • dengue.metadata.prueba1.csv: keep all, focal samples
  • Colombia context: proximal focal subsampling
  • global context: proximal focal subsampling

But you may be able to use the augur subcommands to get equivalent functionality. Tiered subsampling is a more complex beast than what is currently implemented in the dengue Snakefile.
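As a rough illustration of the tiered idea (keep every focal sample, then add a capped number of context samples per group), here is a plain-Python sketch. It is not how the ncov workflow or augur implements it: it ignores genetic proximity entirely and just samples randomly within (country, year) groups. The function name, parameters, and sample rows are all hypothetical. In augur itself, the grouping part roughly corresponds to the `--group-by` and `--sequences-per-group` options of augur filter.

```python
import random
from collections import defaultdict


def tiered_subsample(rows, focal_strains, per_group, seed=0):
    """Keep every focal row, plus up to `per_group` context rows per (country, year)."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same rows
    kept = [row for row in rows if row["strain"] in focal_strains]

    # Group the non-focal (context) rows by country and year.
    groups = defaultdict(list)
    for row in rows:
        if row["strain"] not in focal_strains:
            groups[(row["country"], row["date"][:4])].append(row)

    # Sample at most `per_group` rows from each context group.
    for key in sorted(groups):
        members = groups[key]
        kept.extend(rng.sample(members, min(per_group, len(members))))
    return kept


# Hypothetical rows: two focal Colombian samples plus context from elsewhere.
rows = [
    {"strain": "CO-1", "country": "Colombia", "date": "2009"},
    {"strain": "CO-2", "country": "Colombia", "date": "2010"},
    {"strain": "BR-1", "country": "Brazil", "date": "2010-03-01"},
    {"strain": "BR-2", "country": "Brazil", "date": "2010-07-15"},
    {"strain": "BR-3", "country": "Brazil", "date": "2010-11-30"},
    {"strain": "PE-1", "country": "Peru", "date": "2011"},
]
selected = tiered_subsample(rows, focal_strains={"CO-1", "CO-2"}, per_group=2)
print(len(selected))
```

The names of the selected strains could then be written to a text file, one per line, and used to pick the matching sequences in a later step.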
