I want to run the analysis for Colombia and the 6 global regions, but in the final result Colombia and the regions appear without South America.
Any advice? This is the file I use.
Hi @juan_dc – I think this is the same issue as this recent post: Losing country info in final build and should be fixed by removing the region: South America from your builds declaration. (Your subsampling scheme doesn’t use this value so it should have no detrimental effects.)
Hi, I tried deleting the region line and in the final phase I find this error:
augur traits is using TreeTime version 0.8.6
Assigned discrete traits to 1286 out of 1286 taxa.
NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.
ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.
[Thu May 12 15:19:08 2022]
Error in rule traits:
jobid: 32
output: results/global_prueba3_1/traits.json
log: logs/traits_global_prueba3_1.txt (check log file(s) for error message)
shell:
augur traits --tree results/global_prueba3_1/tree.nwk --metadata results/global_prueba3_1/metadata_adjusted.tsv.xz --output results/global_prueba3_1/traits.json --columns country division --confidence --sampling-bias-correction 5.0 2>&1 | tee logs/traits_global_prueba3_1.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/traits_global_prueba3_1.txt:
augur traits is using TreeTime version 0.8.6
Assigned discrete traits to 1286 out of 1286 taxa.
NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.
ERROR: 300 or more distinct discrete states found. TreeTime is currently not set up to handle that many states.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-05-12T145042.133201.snakemake.log
The error is fixed if I add the region line back, but then I lose South America again.
When you specify a region in the build (as above), the metadata is modified such that a sample from outside South America (e.g. “France”) has its country changed to the corresponding region (e.g. “Europe”). This has the effect of reducing the set of values for the country key which you are performing a DTA on (via augur traits).
When you remove region: South America you don’t modify the metadata (good) but this results in lots of countries which you are performing DTA on (bad). In this case, I’d recommend removing "country" from the list of traits to run DTA on; I’d also remove division (which will probably have more demes than country). Reconstructing region might be ok.
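To see which columns are safe to pass to augur traits, one could count the distinct values per column in the adjusted metadata. This is a minimal sketch, assuming a tab-separated metadata table with region, country and division columns. The function names and the toy table are hypothetical, and the 300 threshold is taken from the error message above. How augur traits itself counts states (for example, how it treats missing values) may differ slightly.

```python
import csv
import io

# Limit taken from the TreeTime error message quoted above.
TREETIME_MAX_STATES = 300


def count_states(tsv_text, columns):
    """Return {column: number of distinct non-empty values} for a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = {col: set() for col in columns}
    for row in reader:
        for col in columns:
            value = (row.get(col) or "").strip()
            if value:
                seen[col].add(value)
    return {col: len(values) for col, values in seen.items()}


def usable_traits(counts, limit=TREETIME_MAX_STATES):
    """Keep only the columns whose number of distinct states is below the limit."""
    return [col for col, n in counts.items() if n < limit]


# Toy metadata (hypothetical values), standing in for metadata_adjusted.tsv.
example = (
    "strain\tregion\tcountry\tdivision\n"
    "A\tEurope\tFrance\tParis\n"
    "B\tSouth America\tColombia\tAntioquia\n"
    "C\tSouth America\tColombia\tBogota\n"
    "D\tSouth America\tPeru\tLima\n"
)

counts = count_states(example, ["region", "country", "division"])
print(counts)
print(usable_traits(counts))
```

For the real file (metadata_adjusted.tsv.xz) the text could be read with Python's lzma.open(path, "rt") before passing it to count_states.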
So we do want region: South America in the build declaration, as that will change the metadata of non-South-American countries to be their region (continent) as shown on the second screenshot. Note that this happens after subsampling.
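The adjustment described above can be illustrated roughly as follows. This is only a sketch of the described behaviour, not the workflow's actual script: the function name and the sample records are hypothetical, and only the country field is handled.

```python
def adjust_country(records, focal_region):
    """Return copies of the records where samples outside focal_region
    have their country replaced by their region."""
    adjusted = []
    for record in records:
        record = dict(record)  # copy so the input list is left untouched
        if record["region"] != focal_region:
            record["country"] = record["region"]
        adjusted.append(record)
    return adjusted


# Hypothetical samples for illustration.
samples = [
    {"strain": "A", "region": "Europe", "country": "France"},
    {"strain": "B", "region": "South America", "country": "Colombia"},
    {"strain": "C", "region": "South America", "country": "Peru"},
]

adjusted = adjust_country(samples, "South America")
for record in adjusted:
    print(record["strain"], record["country"])
```

Note that under this rule every country inside the focal region keeps its own name, so other South American countries are not merged into a single "South America" value.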
I’m not sure why South American countries are being filtered out; it looks to me like the global part of your subsampling scheme should include these. (The reason Colombian samples are included is due to the country part of the scheme.)
You could add the following to your subsampling scheme, but it’s more of a hack than understanding what’s actually not working as expected:
region:
  group_by: "division year month"
  max_sequences: 500
  exclude: "--exclude-where 'region!={region}'"
This is the result I got when adding the lines. In other attempts I have obtained similar results, but the South American region is the only one I can’t get represented.
My point is that in South America you only see Colombia. The rest of the countries (Argentina, Peru, Ecuador…) must be represented in a circle that corresponds to the region of South America, as in the other regions of the world (North America, Europe…).
I hope you can help me, I have tried different configurations and I have not achieved my goal.
As for the lines, that is because I am using a test dataset. Once I am clear about how to perform my analysis, I will run it on my real dataset.
the rest of the countries (Argentina, Peru, Ecuador…) must be represented in a circle that corresponds to the region of South America, as in the other regions of the world (North America, Europe…)
This isn’t possible with the current workflow but it shouldn’t be too hard for you to do this.
Option 1. Modify the adjust_metadata_regions rule to use the country instead of the region, and adjust the underlying script accordingly so that any sample not in Colombia has its metadata changed appropriately.
Option 2. Use a script to change your metadata before running the pipeline so that the country field is as you desire. Note that this will slightly change the subsampling algorithm, as your current scheme groups by “country” (which you will have replaced with the corresponding region).
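A rough sketch of what Option 2 could look like, assuming an uncompressed tab-separated metadata file with region and country columns. The function name, its default argument and the toy table are hypothetical, and only the country field is rewritten; division may need the same treatment depending on what you want to display.

```python
import csv
import io


def collapse_countries(tsv_in, tsv_out, focal_country="Colombia"):
    """Copy a metadata TSV, replacing country with region for every
    sample whose country is not focal_country."""
    reader = csv.DictReader(tsv_in, delimiter="\t")
    writer = csv.DictWriter(
        tsv_out,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        if row["country"] != focal_country:
            row["country"] = row["region"]
        writer.writerow(row)


# Toy metadata (hypothetical values) held in memory for illustration.
source = io.StringIO(
    "strain\tregion\tcountry\n"
    "A\tEurope\tFrance\n"
    "B\tSouth America\tColombia\n"
    "C\tSouth America\tPeru\n"
)
result = io.StringIO()
collapse_countries(source, result)
print(result.getvalue())
```

With real files the same function can be called on handles from open("metadata.tsv", newline="") and open("metadata_collapsed.tsv", "w", newline=""); both file names here are placeholders.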