The most basic of build help


I’m fairly new to manipulating scripts, and I would appreciate some guidance on the builds and parameters YAMLs.
I followed a tutorial to create a custom profile in my_profiles and edited the builds YAML.

I’m not exactly sure which entries are required and which are optional. Also, where do I adjust the parameters for all the tools? (For example, I do not want any entries filtered, so I don’t really need any ‘exclude’ parameters.) I imagine that if I do not specify some parameters, they fall back to default values.

As an example,
I uploaded entries onto GISAID, so that will be my starting point for this Nextstrain analysis, and I’d appreciate a guide on how to download from there. I know there are several ways (tar, or extracted fasta and metadata), but I’m not sure if there is a difference between them. I’ll have roughly 80 entries, and do not need to filter them in any way. They are all from the same country, just different states. The end result should be an Auspice map showing strains in each area.

The error message I get has something to do with all entries being filtered out due to ambiguous date in any. I imagine this is because of default values, and because my dates when submitted to GISAID were dd-mm-yyyy, whereas Nextstrain needs yyyy-mm-dd?

tl;dr: what is the minimum I need in the builds YAML, and where do I change values in the parameters, such as removing the need for filtering?

many thanks

Hi @omarkr - I suggest starting with the tutorials for running SARS-CoV-2 builds in nextstrain. Specifically, the data preparation section explains how to download data from GISAID for analysis in Nextstrain.

I’m not exactly sure which entries are required and which are optional.

The simplest starting point would be a YAML as follows:

inputs:
  - name: example-data # change as needed
    metadata: data/example_metadata.tsv # change as needed
    sequences: data/example_sequences.fasta # change as needed

This will create a default build with no subsampling. The tutorials should provide examples of how to customise the build as desired.
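On the date error mentioned in the question: Augur expects dates written as YYYY-MM-DD, so if the metadata really does hold dd-mm-yyyy values, one option is to rewrite that column before running the build. Below is a minimal Python sketch, assuming a tab-separated metadata file whose date column is named date. The function name normalise_dates is hypothetical and not part of any Nextstrain tool, and it is worth checking the downloaded metadata first, since the dates may already be in the right format.

```python
import csv
import io
from datetime import datetime


def normalise_dates(tsv_text, column="date"):
    """Return TSV text with dd-mm-yyyy values in `column` rewritten as yyyy-mm-dd.

    Assumes every row has a value in `column`; values that do not parse
    as dd-mm-yyyy (including dates already in yyyy-mm-dd) are left as-is.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        try:
            parsed = datetime.strptime(row[column], "%d-%m-%Y")
            row[column] = parsed.strftime("%Y-%m-%d")
        except ValueError:
            pass  # not dd-mm-yyyy, so keep the original value
        writer.writerow(row)
    return out.getvalue()
```

To use it, read the metadata file into a string, pass it through normalise_dates, and write the result to the path given under metadata in the builds YAML.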

Please get back in touch with any specific questions!

So I sort of managed to go through the general tutorial and apply my data to it, all the way to Auspice, which is good.

But I’m wondering if there’s anything I gain from attempting the SARS-CoV-2 tutorial on my SARS-CoV-2 data, or is the output the same whichever path I choose here?

The inputs you provide (sequences + metadata) will determine which genomes are in the final auspice visualisation. If you have specific sequences you wish to analyse, then you’ll need to provide these as inputs; otherwise you may be better off viewing some of the datasets listed on which subsample the entire dataset depending on which geographical area is of interest.

I notice that the web Nextclade gives the option to download auspice.json files. Is there a way to feed this JSON from web Nextclade to, but with just the sequences we provided to Nextclade? Basically, the phylogeny that Auspice generates from that JSON is far too populated if I just want to know the relationship between my sequences and a few references.

I don’t believe so. While you could write a script to prune out the sequences in the JSON which you didn’t provide, I’d worry about the accuracy of inferences drawn from that data. (This depends on your data and what you want to know, it may be good enough for some questions.) The best approach would be to run this data through the nCoV workflow.

P.S. I would suggest adding in some contextual (background) sequences to preserve the overall structure of the tree, but without knowing what data you are analysing it’s hard to say more.
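For reference, the pruning script mentioned above could be sketched roughly as follows in Python. This assumes the downloaded file follows the Auspice v2 layout, with a single nested tree object in which each node has a name and internal nodes carry a children list. The function name prune_tree is hypothetical. Internal nodes left with a single child are kept rather than collapsed, and all node attributes remain those inferred on the full tree, which is exactly why the caveat about accuracy applies.

```python
def prune_tree(node, keep):
    """Return a pruned copy of `node` holding only tips whose name is in `keep`.

    Returns None when no tip below `node` is wanted. The input is not modified.
    """
    children = node.get("children")
    if not children:
        # A tip: keep it only if it was one of the requested strains.
        return dict(node) if node.get("name") in keep else None

    kept = []
    for child in children:
        pruned_child = prune_tree(child, keep)
        if pruned_child is not None:
            kept.append(pruned_child)
    if not kept:
        return None  # nothing wanted below this internal node

    pruned = dict(node)  # shallow copy so the original tree stays intact
    pruned["children"] = kept
    return pruned
```

A typical use would be to load the file with json.load, replace data["tree"] with prune_tree(data["tree"], keep), where keep is a set containing your strain names plus any references you want to retain, and write the result back out with json.dump.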