Trouble with SARS_CoV_2

alfredobrunoc · January 28, 2025, 8:21pm

Hi mentors, I’m trying to make a tree with augur. I have my sequences in the data folder and I have my metadata. However, when I execute it with:

nextstrain build . --cores 1 --configfile config.yaml

I get this error:

ERROR: TreeTime.reroot -- ERROR: unsupported rooting mechanisms or root not found

ERROR from TreeTime: This error is most likely due to a problem with your input data.
Please check your input data and try again. If you continue to have problems, please open a new issue including
the original command and the error above: <https://github.com/nextstrain/augur/issues/new/choose>

augur refine is using TreeTime version 0.11.1

15.99 TreeTime.reroot: with method or node: Wuhan-Hu-1/2019

The workflow also seems to apply a filter: the masked_filtered file contains only one sequence, but my analysis includes more than 2300 sequences. Could you please help me?

Best regards,

victorlin · January 28, 2025, 11:01pm

Hi @alfredobrunoc,

It’s likely that the root is not present after filtering/sampling. Could you share your config.yaml file or even better, the rest of the workflow code? We have a SARS-CoV-2 workflow, however it does not have a “masked_filtered” file so I assume you are using a custom workflow.
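One quick way to test this is to look for the configured root name among the FASTA headers of the sequences that survive filtering. The sketch below is only an illustration: `root_status` is a hypothetical helper, not part of augur or the ncov workflow, and it assumes the sequence ID is the first whitespace-delimited token of each header. It uses `difflib` to suggest near matches, such as names that differ only by a `-` versus a `/`.

```python
import difflib


def root_status(root_name, fasta_text):
    """Return (present, suggestions) for root_name among the FASTA headers.

    Assumption: the sequence ID is the first whitespace-delimited token
    after '>' on each header line.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    present = root_name in ids
    # Only look for similar names when the exact name is missing.
    suggestions = (
        []
        if present
        else difflib.get_close_matches(root_name, ids, n=3, cutoff=0.8)
    )
    return present, suggestions
```

To use it, read the FASTA that is passed to the tree-building step into a string and call `root_status` with the root name from your config. A non-empty list of suggestions hints at a naming mismatch rather than a missing sequence.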

– Victor

victorlin · January 29, 2025, 1:39am

Based on the reference to scripts/construct-recency-from-submission-date.py, I’m guessing you are using the Nextstrain SARS-CoV-2 workflow or at least a modified version of it. Have you tried the tutorial series? It would be helpful to know if those are working.

For the new error, it would be helpful to know what the error details in the log file are.

The config.yaml you provided does not follow the expected format described in the Workflow config file reference. I assume you meant to write something along the lines of

refine:
  root: "Wuhan/Hu-1/2019"

filter:
  min_length: 20000

There may also be an issue with the input data, but I can’t say for sure without specific information about the workflow you are using and the ANDINOS.fasta file.

victorlin · January 29, 2025, 9:34pm

Thanks for providing those files. I was able to reproduce your issue on the latest commit to the ncov workflow (8beaf39). If you look further above in the output, there is a useful error message:

Traceback (most recent call last):
  File "/nextstrain/build/scripts/construct-recency-from-submission-date.py", line 41, in <module>
    node_data['nodes'][strain] = {'recency': get_recency(d['date_submitted'], ref_date)}
  …
ValueError: time data '?' does not match format '%Y-%m-%d'

This is due to ? values under the date_submitted column in the ANDINOS.tsv metadata file. To fix this, you can ensure that all the values under that column are in YYYY-MM-DD format or remove the column entirely to skip steps in the workflow which use that column.
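Both options can be sketched in a few lines of Python. This is a minimal sketch, not part of the workflow: `find_bad_dates` and `drop_column` are hypothetical helper names, and the code assumes the metadata is a tab-separated file with a `strain` column. The date check uses the same `%Y-%m-%d` format that appears in the traceback.

```python
import csv
import io
from datetime import datetime


def find_bad_dates(tsv_text, column="date_submitted", fmt="%Y-%m-%d"):
    """Return (strain, value) pairs whose `column` value does not parse."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    bad = []
    for row in reader:
        value = row.get(column, "")
        try:
            datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            # TypeError covers rows that are shorter than the header.
            bad.append((row.get("strain", ""), value))
    return bad


def drop_column(tsv_text, column="date_submitted"):
    """Return the TSV text with `column` removed (unchanged if absent)."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows or column not in rows[0]:
        return tsv_text
    idx = rows[0].index(column)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row[:idx] + row[idx + 1:])
    return out.getvalue()
```

Running `find_bad_dates` on the contents of the metadata file lists the rows that would trigger the error. Writing the output of `drop_column` to a new file gives a copy of the metadata without the column. Editing the file this way, rather than in a spreadsheet program, avoids any automatic reformatting of the other columns.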

victorlin · January 29, 2025, 10:05pm

Also, I noticed that your files may contain sensitive data. You can edit the post to remove the files now that we have determined the problem.

Let me know if you have any other issues!

alfredobrunoc · January 30, 2025, 5:04am

Dear Victor,

Could you please send me back the metadata with the dates formatted as YYYY-MM-DD? I tried to fix it myself, but I encountered some difficulties. Additionally, I would like to have a tree that includes dated branches and incorporates the Nextstrain graph that shows the different clades over time.

Thank you very much!

victorlin · January 30, 2025, 5:30pm

Hi @alfredobrunoc,

Removing the date_submitted column allows the workflow to run successfully. You should keep the date column to infer the time tree. The default tree with your config file should already show a tree colored by clade over time. Can you try again with just removing the date_submitted column?

– Victor

victorlin · January 30, 2025, 10:53pm

The message

ERROR: Alignment must have at least 3 sequences

means there is likely an issue with one of the earlier steps removing your sequences from the analysis. Can you share the information in the files logs/filtered_default-build.txt, logs/subsample_default-build_all.txt, and the config.yaml you are using? It works fine for me with this one:

# Format reference: <https://docs.nextstrain.org/projects/ncov/page/reference/workflow-config-file.html>

inputs:
  - name: reference_data
    metadata: data/ANDINOS.tsv
    sequences: data/ANDINOS.fasta

# GenBank data includes "Wuhan-Hu-1/2019" which we use as the root for this build.
refine:
  root: "Wuhan/Hu-1/2019"

filter:
  min_length: 20000

Those files should be safe to share, but I still see potentially sensitive data in the files you have shared so far. I suggest editing the posts to remove these files. Let me know if you need help with that.

alfredobrunoc · January 31, 2025, 12:05am

Dear Victor

Thank you very much for your response. Please find attached the files and my config.yaml, which is outside the data directory but inside ncov.

subsample_default-build_all.txt (141 Bytes)

filtered_default-build.txt (511 Bytes)

Best regards,

alfredobrunoc · January 31, 2025, 12:05am

and this is the config.yaml
config.yaml (362 Bytes)

victorlin · January 31, 2025, 12:37am

The issue is apparent from the filtered_default-build.txt file:

2293 of these were dropped because they were earlier than 2019.92 or missing a date

The values in the date column have changed since your first post. You should keep them in YYYY-MM-DD format. For example, the Wuhan/Hu-1/2019 sequence should have date = 2019-12-26, but in your last upload it has changed to 26-12-19, and the same has happened to all the other sequences. It should work if you restore the original date column.
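If the original file is no longer available, the reformatted values can in principle be converted back. The sketch below is a hypothetical helper (`restore_iso_date` is not part of augur) and it assumes every changed value is day-month-two-digit-year, as in the 26-12-19 example; check that assumption against a few known dates before trusting the output. Values that match neither pattern, such as ambiguous dates like 2020-XX-XX, are returned unchanged.

```python
from datetime import datetime


def restore_iso_date(value, mangled_fmt="%d-%m-%y"):
    """Convert a DD-MM-YY value back to YYYY-MM-DD.

    Assumption: reformatted values are day-month-two-digit-year.
    Values already in YYYY-MM-DD form are normalized; anything else
    is returned unchanged.
    """
    try:
        return datetime.strptime(value, "%Y-%m-%d").strftime("%Y-%m-%d")
    except ValueError:
        pass
    try:
        return datetime.strptime(value, mangled_fmt).strftime("%Y-%m-%d")
    except ValueError:
        return value
```

Restoring the original column remains the safer route, because a two-digit year and a day-first order both have to be guessed.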

alfredobrunoc · January 31, 2025, 2:11am

Thank you very much, Victor. It seems that Excel was automatically converting the dates to another format, which is why the workflow would not run. After I opened the file with OpenOffice instead, it ran. Now, how can I view the results? In the results folder I see several directories (custom-build, default-build, translations) and more files within each one. How can I visualize the tree with colors, the lineage distribution graph over time, and the main results? Thank you once again for all your patience and help.

alfredobrunoc · January 31, 2025, 2:18pm

Dear Victor,

I used nextstrain view auspice/ and was able to see the results. However, in the Frequencies panel (colored by PANGO Lineage), all the lineages are displayed in grey. How can I get them to be colored by lineage? Currently, only 23I appears in color, and I would like to show the diversity of the lineages in Andean countries.

Thank you!

victorlin · January 31, 2025, 6:43pm

I am slightly confused. Are you using PANGO Lineage or Nextstrain Clade? 23I is a Nextstrain Clade. In either case, non-grey colors are assigned to values defined in the file defaults/color_ordering.tsv. It is a large file, but some of the relevant lines are:

…
clade_membership	23I (BA.2.86)
clade_membership	24A (JN.1)
clade_membership	24B (JN.1.11.1)
…
pango_lineage	JN.1.4
pango_lineage	JN.1.4.1
pango_lineage	JN.1.4.4
…

Note that the metadata values must match exactly for a color to be applied. For example, a pango_lineage value of “JN.1.4 (consensus call)” will not match “JN.1.4”.
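To find out which of your values will be shown in grey, you can compare them against the color ordering file. The following is a minimal sketch under stated assumptions: `values_without_color` is a hypothetical helper, and the file is assumed to hold one tab-separated field and value pair per line, as in the excerpt above, with blank lines and `#` comment lines ignored.

```python
def values_without_color(color_ordering_text, field, metadata_values):
    """Return the sorted metadata values that have no exact entry for `field`.

    Assumption: each relevant line is '<field><TAB><value>'.
    """
    defined = set()
    for line in color_ordering_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) >= 2 and parts[0] == field:
            defined.add(parts[1])
    # Exact string comparison, mirroring the exact-match rule.
    return sorted(set(metadata_values) - defined)
```

Called with the contents of defaults/color_ordering.tsv, a field name such as `pango_lineage`, and the values from the matching metadata column, it returns the values that have no color assigned.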

alfredobrunoc · January 31, 2025, 10:24pm

Dear Victor, I am now using the GISAID download, but when I look at the clades or lineages since 2020, only one color appears, and that worries me. The result is shown here: http://127.0.0.1:4000/ncov/default-build
I attach files in case you can help me. One of the most important things I want to show, using colors, is the different clades over time in the countries, but this is difficult for me. How can I fix it?

victorlin · January 31, 2025, 10:34pm

Can you attach a screenshot? It’s hard to tell what you are seeing. The link you provided will only work on your computer.

I think it’s best not to share GISAID files publicly. I suggest removing them from your post.

alfredobrunoc · January 31, 2025, 11:47pm

COLOR BY.pdf.docx (1.2 MB)

alfredobrunoc · January 31, 2025, 11:49pm

I would like to see the different clades by colour from 2020

victorlin · February 1, 2025, 12:31am

What you are seeing is a reflection of the values in your metadata file for the sequences on the tree.

If this is not what you are expecting, that means your data is being filtered down to just a single clade/lineage. This could be due to a mismatch between the metadata and sequence files. You should check that each ID has an entry in both the metadata TSV and the sequences FASTA. It could also be due to the sampling method, if your config file has changed since the last time it was shared.
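The ID check can be done with a short script. This is a minimal sketch, not workflow code: `fasta_ids`, `metadata_ids`, and `compare_ids` are hypothetical helpers, and it assumes the metadata ID column is named `strain` and that the sequence ID is the first whitespace-delimited token of each FASTA header.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Collect sequence IDs (first token after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def metadata_ids(tsv_text, id_column="strain"):
    """Collect non-empty IDs from the given column of a TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column] for row in reader if row.get(id_column)}


def compare_ids(fasta_text, tsv_text, id_column="strain"):
    """Return (only_in_fasta, only_in_metadata) as sorted lists."""
    seq = fasta_ids(fasta_text)
    meta = metadata_ids(tsv_text, id_column)
    return sorted(seq - meta), sorted(meta - seq)
```

Two empty lists mean every ID appears in both files. Any ID that appears in only one list points to a sequence or metadata row that has no counterpart in the other file.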

Overall, it would be easier for others to debug issues if you share all files in the logs/ folder. These should be safe to share.

alfredobrunoc · February 1, 2025, 1:14am

Which of the log files should I share with you? There are too many, and I can’t upload them all.

export_default-build.txt (62.5 KB)
emerging_lineages_default-build.txt (2.2 KB)
