Hi, I’m working on a story about what insights genetic epidemiology might offer into the growing outbreak in Haiti. The first sequences from the country were published last week, all sampled before Feb 2021. I understand newer sequences, including B.1.1.7 and P.1 samples, will be released soon and I’d like to have something to interpret them with.
In order to get the Haiti specific sequences (there are ~31) on GISAID I go to downloads, custom selection, search for “Haiti” and download the sequences.
I do not see the links for “nextmeta” or “nextfasta” shown in the documentation, so I emailed GISAID to ask about this.
So I downloaded the “Region-specific Auspice source files” for “Global” and “North America” and ran the pipeline against the “Global” files just to get a running build. However, I am not sure whether or how these files are subsampled, which of course makes anything downstream uncertain (and I note that only a few of the Haiti samples are in this set).
I used “example_multiple_inputs” as a template, with one input being the set of all samples from Haiti, plus 2000 proximity samples from the worldwide set and worldwide background samples grouped by year/month with five samples per group.
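To make the background rule concrete, here is a rough Python sketch of what "grouped by year/month with five samples per group" amounts to. It is only an illustration of the idea, not the workflow's actual code: the function name `subsample_by_month` and the `strain`/`date` record keys are assumptions, and records without a usable year and month are simply skipped.

```python
import random
from collections import defaultdict


def subsample_by_month(records, per_group=5, seed=0):
    """Pick up to `per_group` records from each (year, month) bucket.

    `records` is a list of dicts with assumed keys "strain" and "date"
    (ISO-like, e.g. "2021-01-15"). Ambiguous dates such as "2020" or
    "2020-XX-XX" have no usable month and are skipped.
    """
    rng = random.Random(seed)  # fixed seed so reruns pick the same samples
    groups = defaultdict(list)
    for rec in records:
        parts = rec["date"].split("-")
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        groups[(parts[0], parts[1])].append(rec)

    picked = []
    for key in sorted(groups):  # iterate months in chronological order
        bucket = groups[key]
        picked.extend(rng.sample(bucket, min(per_group, len(bucket))))
    return picked
```

Months with fewer than five sequences contribute everything they have, so sparsely sampled periods stay represented while heavily sequenced months are thinned out.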
This gives me something that looks reasonable… the main takeaway that I can see is that the main cluster of sequences has no near ancestor in time from another country… and the most similar sequences are fairly widely distributed geographically… so I would tentatively interpret this as in-country transmission over the winter peak and, generally and unsurprisingly, not enough sequencing to conclude much else about the transmission chains.
It’s not wildly exciting, but is there somewhere that I can share the methodology/results/interpretation for critique? I’d like to develop a reasonable baseline process that I can use going forward as more sequences
Would be great to see this, and we’d be happy to provide advice on the results once we’ve seen it! The easiest way to share a one-off dataset like this is via our community data sharing which operates through GitHub. This also allows you to share the snakemake file you used which can be helpful.
Since I don’t have access to the complete GISAID file, I’m downloading the sequences from Haiti explicitly and combining them with the nextregions/global dataset. Does this seem like a reasonable build?
Some additional things might be to add in the files for North and South America as well as specific downloads for other nations in the Caribbean… though really it would be nicer to just have the data in one place.
My main goal right now is to get familiar with the process. One observation - perhaps not that telling - is the biggest cluster of connected cases in the Haiti dataset branches pretty early in the pandemic from other sequences in the dataset… though perhaps that’s an artifact of the various sampling biases. Mainly it seems like there’s not enough data to reconstruct very much except that community transmission was happening at the actual time the sequences were taken.
Hopefully the sequences with B.1.1.7 and P.1 will be released soon… things got very bad, very fast.
Since I don’t have access to the complete GISAID file, I’m downloading the sequences from Haiti explicitly and combining them with the nextregions/global dataset. Does this seem like a reasonable build?
Yes - this approach is probably the easiest way to analyse the Haitian samples in a global context, and the builds.yaml seems to be correctly configured to select all Haiti samples (I’m not sure why there are 28 in your dataset but 31 on GISAID, but the others may have been filtered out at various QC steps in the pipeline).
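If it helps to pin down which sequences went missing, a small sketch like the one below compares strain names between two tab-separated metadata files. The helper names are made up for illustration, and the `strain` column name is an assumption: adjust it to whatever your metadata actually uses (a raw GISAID download may label the column differently).

```python
import csv


def read_strains(path, column="strain"):
    """Return the set of strain names found in a TSV metadata file.

    The column name is an assumption; pass a different one if needed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[column] for row in csv.DictReader(handle, delimiter="\t")}


def dropped_strains(downloaded, built):
    """List strains present in the download but absent from the build."""
    return sorted(set(downloaded) - set(built))
```

Calling `dropped_strains(read_strains(download_path), read_strains(build_path))` with your own two file paths would list the names to look for in the pipeline's filtering logs.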
Mainly it seems like there’s not enough data to reconstruct very much
Unfortunately, with only ~30 sequences and none since Feb 1st, I think this is the case.
So we have forty or so more sequences now, mostly sampled at about the same time in the middle of the current outbreak. I’m told that the national lab is waiting for some more to be processed at external sites.
My guess is that the number of samples is enough to say that there has been in-country transmission since ~March, as the model says.
Is the genetic diversity (mutations seem well distributed over time) consistent with a relatively steady spread over that period? And yes… still not too much sequencing, but hopefully more on the way.
One reason I’m asking is that there was very little indication in March/April/early May that anything was going wrong… in fact people were writing articles about what a mystery it was that cases/deaths were so low. I’d like to consider, retrospectively, whether more genetic sequencing might have helped the country prepare better, and whether it might do so in the future (I am working on an article on the topic).