Hi - I see a few questions related to this same issue, but none that have been answered, so bumping again. When running a basic global build, I keep getting an error in the augur filter step saying 'all samples have been dropped! Check filter rules and metadata file format.' It then deletes the filtered.fasta file and exits. I've looked at all the log files and they're almost all empty, so the actual issue is very difficult to diagnose. I've used many different subsets of sequences, both my own and some downloaded directly from GISAID, as well as metadata in the format downloaded directly from GISAID. I've been able to build trees successfully in the past, but would get this error seemingly at random, and now I get it every time. Happy to post other results/logs if that would be helpful, but almost everything is deleted when the program exits, so there isn't much to show. Any help would be much appreciated! Thank you!
Hi @aeroder – could you post the output that Snakemake prints at the filtering step? It may give us some clues.
Thank you for sharing the filter log, @aeroder. My best guess is that there is a mismatch between strain names in the sequence and metadata inputs, but without access to those input data, I can’t be sure.
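In the meantime, one quick way to check for that kind of mismatch yourself is to diff the FASTA headers against the metadata's strain column. This is just a sketch with toy file names (substitute your own inputs); the toy files are created inline so the example is self-contained:

```shell
# Create a tiny toy FASTA and metadata file so the example runs as-is.
printf '>Wuhan/WH01/2019\nACGT\n>USA/DC-HP00054/2020\nACGT\n' > toy_sequences.fasta
printf 'strain\tdate\nWuhan/WH01/2019\t2019-12-26\n' > toy_metadata.tsv

# Extract and sort the strain names from each file.
grep '^>' toy_sequences.fasta | sed 's/^>//' | sort > seq_strains.txt
cut -f 1 toy_metadata.tsv | tail -n +2 | sort > meta_strains.txt

# Strains present in the sequences but missing from the metadata.
comm -23 seq_strains.txt meta_strains.txt
```

In this toy example, the last command prints the USA strain, since it is absent from the metadata.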
We just released a new version of Augur (11.2.0) that improves the filter report and always prints the full report to the logs even if all samples have been dropped. Are you able to upgrade your Augur installation so you can re-run this filter step and post the improved report here?
If you’ve installed Augur with pip, you can upgrade with:
python3 -m pip install --upgrade nextstrain-augur
If you are running your analysis with the Nextstrain CLI and Docker, you can get the latest Augur by updating to the latest Docker image.
If you are running your ncov workflow with snakemake --use-conda, you can update the conda environment file to reference nextstrain-augur==11.2.0 (instead of 11.1.2).
If you have installed Augur with Bioconda, the latest version should be available by tonight or tomorrow morning (new Bioconda packages require manual human approval while the other approaches above do not). You’ll be able to upgrade with:
conda activate nextstrain
conda update --all
The update seemed to fix the issue! If I get it again, I’ll post the filter message here. Thank you!
Okay, after one successful run, I'm getting the same error again. I'm attaching a screenshot of the error message, which says that all samples were dropped because there is no sequence data. However, when I grep for the sequence names in the FASTA file and the metadata file, I've confirmed that a number of them are exactly the same. I'm sure there is something small I'm missing, but any help would be appreciated! Thank you!!
@aeroder, would you mind sharing the output of the following command (you may need to use cat -A if you're on a Linux system)?
cat -e 031121-usethis.fasta | grep 'USA/DC-HP00054/2020'
I’m wondering if the issue is related to whitespace characters that Augur isn’t handling properly, since your strain names look fine in the metadata and sequence data.
This is the result that I got. I downloaded these sequences directly from GISAID, so I'm not sure if the issue arises when I append my sequences to the GISAID file or if it is already in the GISAID sequences when I download them.
Ok, that looks like we’d expect. I should have asked this at the same time, but what do you see for the metadata with a similar command?
cat -e metadata-4.tsv | grep 'USA/DC-HP00054/2020'
As another test, could you generate a sequence index for the sequences and search for the same sample there? The index should take ~10 minutes to build.
# Build the sequence index.
augur index --sequences 031121-usethis.fasta --output sequence_index.tsv

# Search for a specific sample, showing whitespace characters.
cat -e sequence_index.tsv | grep 'USA/DC-HP00054/2020'
Behind the scenes, Augur creates a set of strains from the metadata and a set of strains from sequence data. It calculates the set of available strains as the intersection of the metadata and sequence strain sets. Since these sets consist only of strain names, the bug must be related to how we’re parsing those names from the current data. Now, my best guess is that there’s an additional whitespace character in the metadata, since the sequence id looks fine.
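To make that concrete, here is a toy sketch (not Augur's actual code) of the intersection logic, showing how a single stray whitespace character makes a strain vanish from the available set even though the names look identical by eye:

```python
# Toy model of the strain-set intersection described above; the trailing
# space in the metadata entry is deliberate.
metadata_strains = {"Wuhan/WH01/2019", "USA/DC-HP00054/2020 "}
sequence_strains = {"Wuhan/WH01/2019", "USA/DC-HP00054/2020"}

# Only exact string matches survive the intersection.
available_strains = metadata_strains & sequence_strains
print(available_strains)  # the USA strain is dropped despite "matching" by eye
```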
Here are the results of the cat command on the metadata file:
And here is the result on the sequence index (as a note, the sequence index only took about 15 seconds to build).
Ok, I have one final idea, since cat -e doesn't do everything I thought it did on OS X. To show non-printing characters (including tabs), we need to use cat -vet. Can you share the output of the following commands?
# Search metadata.
cat -vet metadata-4.tsv | grep "USA/DC-HP00054/2020"

# Search sequences.
cat -vet 031121-usethis.fasta | grep "USA/DC-HP00054/2020"

# Search sequence index.
cat -vet sequence_index.tsv | grep "USA/DC-HP00054/2020"
I’m sorry this is so complicated! Whitespace is the bane of bioinformatics. Here is an example output from my computer for the metadata (tabs are shown as ^I):

$ cat -vet data/example_metadata.tsv | grep Wuhan/WH01/2019
Wuhan/WH01/2019^Incov^IEPI_ISL_406798^ILR757998^I2019-12-26^IAsia^IChina^IHubei^IWuhan^IAsia^IChina^IHubei^Igenome^I29866^IHuman^I44^IMale^IGeneral Hospital of Central Theater Command of People's Liberation Army of China^IBGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Academy of Medical Sciences & General Hospital of Central Theater Command of People's Liberation Army of China^IWeijun Chen et al^Ihttps://www.gisaid.org^I?^I2020-01-30$
No problem - I really appreciate you helping me diagnose this!
Here is the output of those three commands.
I don’t see any obvious whitespace aside from the tabs. I’ve also tried looking in both files for blank lines but don’t see any. I get the ‘all samples dropped’ error seemingly at random (I’m sure it isn’t actually random, but I can never identify what exactly triggers it).
Interesting…everything looks good unless somehow we have an issue processing Windows-style line endings.
One more thing to check is the header row of the metadata file. What do you get when you run this command?
cat -vet metadata-4.tsv | head -n 1
Thanks, @aeroder! I’m a little stumped now, since none of the most logical explanations I can think of seem to explain the bug. I’m going to experiment with metadata that have Windows-style line endings just to confirm that isn’t the issue, but I’ll follow up with you about maybe getting a copy of your data to see if I can reproduce the problem locally.
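As a toy illustration of the kind of mismatch CRLF endings can cause (this is just a sketch, not Augur's parser):

```python
# A Windows-style line keeps its carriage return when only "\n" is stripped.
crlf_line = "USA/DC-HP00054/2020\t2020-03-01\r\n"

fields = crlf_line.rstrip("\n").split("\t")
print(repr(fields[-1]))            # '2020-03-01\r' -- the \r silently survives
print(fields[-1] == "2020-03-01")  # False, so an exact-match lookup fails
```

Crucially, commands like grep and cat -e would still show this line as looking normal, which is consistent with everything checking out so far.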
Basically the same issue here (macOS Catalina), where all sequences are filtered out. The filter rules and log files are not informative. I can provide the TSV if you wish.
Hi @underscore, it would be great if you could provide FASTA and metadata that are producing this issue. You can message me directly to work out the transfer, if you cannot share these data publicly.
AFAIK the problem was never with the FASTA files but with the metadata, where I found various issues (not explained at Preparing your data — Nextstrain documentation). For example:
- “XX” in the date field is not allowed
- the minimum number of columns is 12
- the 2 Wuhan sequences always need to be included
- I had to add dummy dates

After correcting these errors, the filter issue suddenly disappeared.
So strange. None of those issues should affect the filtering step, but this is helpful to know. It sounds like we need a way to validate metadata and sequence index files independently from the workflow.

For our own team’s notes, we could consider adding a metadata subcommand to the existing augur validate command. A simpler immediate step would be to implement a validation schema for metadata and apply Snakemake’s validate function to each metadata input at the beginning of the workflow.