"Could not determine delimiter" error with metadata files

I am running in Docker on Windows and I’m fairly new to Nextstrain, but I have run through the Zika tutorial and a flu example build successfully. Now I’m trying to modify the example flu build and I’m running into an issue that I can’t seem to diagnose.

This is the error I see:

Error in rule join_metadata:
    jobid: 15
    input: data/h3n2/metadata_ha.tsv, data/h3n2/metadata_na.tsv
    output: data/h3n2/metadata_joined.tsv
    log: logs/join_metadata_h3n2.txt (check log file(s) for error details)
    conda-env: /nextstrain/build/.snakemake/conda/dae81987b7ec4d339b11532c7759e4d4_
    shell:

        python3 scripts/join_metadata.py             --metadata data/h3n2/metadata_ha.tsv data/h3n2/metadata_na.tsv             --segments ha na             --segment-columns accession             --how outer             --output data/h3n2/metadata_joined.tsv 2>&1 | tee logs/join_metadata_h3n2.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

And this is the corresponding log:

Traceback (most recent call last):
  File "/nextstrain/augur/augur/io/metadata.py", line 469, in _get_delimiter
    return csv.Sniffer().sniff(file.readline(), "".join(valid_delimiters)).delimiter
  File "/usr/local/lib/python3.10/csv.py", line 187, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nextstrain/build/scripts/join_metadata.py", line 28, in <module>
    segment_metadata = read_metadata(segment_metadata_file)
  File "/nextstrain/augur/augur/io/metadata.py", line 79, in read_metadata
    "sep": _get_delimiter(metadata_file, delimiters),
  File "/nextstrain/augur/augur/io/metadata.py", line 473, in _get_delimiter
    raise InvalidDelimiter from error
augur.io.metadata.InvalidDelimiter

I ran into a very similar problem once in the past, and it turned out to be related to this issue. At that time, it was easy to fix by manually changing the file extensions from .txt to .tsv. This time, my files already have .tsv extensions. I tried reading all my metadata files into R as TSVs and that worked fine, so I’m not sure what the issue is now. Any help would be appreciated! Thank you.

Hi @maryj, thanks for providing those details and sorry you’re running into this issue. Would it be possible for you to share the metadata file, or at least a few lines from it? I suspect that it is using something other than the default delimiters, which are comma and tab. I’m not familiar with R, but a quick search shows that data.table’s fread function uses a more expansive set of default delimiters (comma, tab, space, pipe, semicolon, and colon), which could explain why it works in R but not with read_metadata here.

I just sent you a message with the data. If it’s useful information, the R readr function read_tsv specified the delimiter as “\t” for my metadata files.

Thanks for sending that! I see the metadata file has a single column, strain. In the case of single-column tabular files, the delimiter needs to be explicitly set. This explains why it works in R, where read_tsv explicitly sets the delimiter to \t.

read_metadata currently doesn’t allow explicitly setting a delimiter. This has worked fine for us because all our metadata files have multiple columns, and delimiter detection is possible.
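To illustrate why detection fails on a single-column file, here’s a minimal sketch using csv.Sniffer, which the traceback above shows read_metadata relying on (the example metadata strings are hypothetical):

```python
import csv

# A hypothetical single-column metadata file: neither "," nor "\t"
# appears on the header line, so the sniffer has nothing to detect.
single_column = "strain\nA/Texas/1/2020\n"

try:
    csv.Sniffer().sniff(single_column.splitlines()[0], ",\t")
except csv.Error as error:
    print(error)  # Could not determine delimiter

# With a second column present, detection succeeds:
two_columns = "strain\tdate\nA/Texas/1/2020\t2020-01-01\n"
dialect = csv.Sniffer().sniff(two_columns.splitlines()[0], ",\t")
print(repr(dialect.delimiter))  # '\t'
```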

I’m not sure what the use case for a single-column metadata file is. I suspect that you’ll want to update the metadata file with more columns (FAQ: How do I prepare metadata?) for further analysis.

Yes, thanks for pointing this out! I had a full metadata file, and when I created separate HA and NA metadata files for the flu sequences, they didn’t keep all the columns and I somehow didn’t clock that as a problem. However, when I use the full metadata, I get another error that I don’t understand either:

Error message:

Error in rule join_metadata:
    jobid: 15
    input: data/h3n2/metadata_ha.tsv, data/h3n2/metadata_na.tsv
    output: data/h3n2/metadata_joined.tsv
    log: logs/join_metadata_h3n2.txt (check log file(s) for error details)
    conda-env: /nextstrain/build/.snakemake/conda/dae81987b7ec4d339b11532c7759e4d4_
    shell:

        python3 scripts/join_metadata.py             --metadata data/h3n2/metadata_ha.tsv data/h3n2/metadata_na.tsv             --segments ha na             --segment-columns accession             --how outer             --output data/h3n2/metadata_joined.tsv 2>&1 | tee logs/join_metadata_h3n2.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Log:

Traceback (most recent call last):
  File "/nextstrain/build/scripts/join_metadata.py", line 41, in <module>
    segment_metadata.loc[:, [segment] + args.segment_columns].rename(columns=segment_columns),
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1256, in _getitem_tuple
    return self._getitem_tuple_same_dim(tup)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 924, in _getitem_tuple_same_dim
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6133, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['accession'] not in index"

Is it looking for an accession number column? I don’t think I had that data field when running the flu example build, so I wasn’t expecting it to look for that. Thanks for your help on this.

Yes, it seems to be looking for a column named accession. In the example build, this is specified in the FASTA ID headers, which get converted to metadata columns in rule parse.
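For reference, here’s a minimal pandas sketch of what’s happening in the KeyError above (the metadata values are hypothetical; only the column names come from your error):

```python
import pandas as pd

# Hypothetical metadata frame that has a "strain" column
# but no "accession" column.
metadata = pd.DataFrame({"strain": ["A/Texas/1/2020", "A/Peru/2/2020"]})

try:
    # Selecting a mix of present and missing columns with .loc
    # raises a KeyError naming the missing ones.
    metadata.loc[:, ["strain", "accession"]]
except KeyError as error:
    print(error)
```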

I’m not very familiar with this workflow, but it seems that you shouldn’t need to manually create separate HA and NA metadata files – just separate FASTA files. You may have to tweak the fasta_fields in the builds.yaml file if your FASTA ID header format differs from the format used in the example build. An example of the latter:

>A/Tehran/1219/2012|flu|EPI446792|2012-01-28|2016-09-03|west_asia|iran|tehran|tehran|undetermined|?|other_database_import|?|?

accession is the third entry in fasta_fields, so EPI446792 would be the accession here.
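A quick sketch of that mapping (illustrative only, not the actual parse code; all it assumes is the example header above and accession being the third field):

```python
header = ("A/Tehran/1219/2012|flu|EPI446792|2012-01-28|2016-09-03|"
          "west_asia|iran|tehran|tehran|undetermined|?|other_database_import|?|?")

# rule parse splits the pipe-delimited FASTA ID and pairs each piece
# with the corresponding name in fasta_fields.
fields = header.split("|")

# accession is the third entry in fasta_fields, i.e. index 2:
accession = fields[2]
print(accession)  # EPI446792
```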

Let me know if this doesn’t make sense, and I can loop in someone who’s more familiar with this workflow.