Followed data prep instructions, nextstrain fails

Hi,

I followed the instructions for preparing metadata in

https://docs.nextstrain.org/projects/ncov/en/latest/analysis/data-prep.html

but nextstrain fails. The example data in ncov/data runs fine though with the command

snakemake --use-conda --cores 16 --profile my_profiles/example -p

I’m wondering if someone could look at my input files and the error messages to figure out how to get my data running. Please respond to my email:

nelsonerikd@gmail.com

Thanks

Hi @enelson, could you provide some details of how “nextstrain fails”? For instance, the snakemake output, contents of relevant log files. We’d be happy to take a look, and doing so in a public forum may end up helping others who run into similar issues.

Hi James, Thanks for responding. Here is the snakemake command I used and
the output (see below). The log file for the error is attached

(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ snakemake --use-conda --cores 16 --profile my_profiles/ucla_test -p
Config file defaults/parameters.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.

Job stats:
job                         count    min threads    max threads
------------------------  -------  -------------  -------------
add_branch_labels               1              1              1
all                             1              1              1
ancestral                       1              1              1
clades                          1              1              1
distances                       1              1              1
emerging_lineages               1              1              1
export                          1              1              1
finalize                        1              1              1
include_hcov19_prefix           1              1              1
logistic_growth                 1              1              1
mask                            1              1              1
mutational_fitness              1              1              1
recency                         1              1              1
refine                          1              1              1
rename_emerging_lineages        1              1              1
tip_frequencies                 1              1              1
traits                          1              1              1
translate                       1              1              1
tree                            1              8              8
total                          19              1              8


Select jobs to execute...


[Thu Dec  9 12:14:03 2021]
Job 7:
        Mask bases in alignment results/north-america-usa-california-ucla/aligned.fasta
          - masking 100 from beginning
          - masking 50 from end
          - masking other sites: 21987 21846

        python3 scripts/mask-alignment.py --alignment results/north-america-usa-california-ucla/aligned.fasta --mask-from-beginning 100 --mask-from-end 50 --mask-sites 21987 21846 --mask-terminal-gaps --output results/north-america-usa-california-ucla/masked.fasta 2> logs/mask_north-america-usa-california-ucla.txt




[Thu Dec  9 12:14:03 2021]
Job 28: Use metadata on submission date to construct submission recency field

        python3 scripts/construct-recency-from-submission-date.py --metadata results/north-america-usa-california-ucla/metadata_adjusted.tsv.xz --output results/north-america-usa-california-ucla/recency.json 2>&1 | tee logs/recency_north-america-usa-california-ucla.txt



Activating conda environment: /groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c
Activating conda environment: /groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c
Traceback (most recent call last):
  File "scripts/construct-recency-from-submission-date.py", line 41, in <module>
    node_data['nodes'][strain] = {'recency': get_recency(d['date_submitted'], ref_date)}
  File "scripts/construct-recency-from-submission-date.py", line 7, in get_recency
    date_submitted = datetime.strptime(date_str, '%Y-%m-%d').toordinal()
  File "/groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '?' does not match format '%Y-%m-%d'

[Thu Dec  9 12:14:11 2021]
Error in rule recency:
    jobid: 28
    output: results/north-america-usa-california-ucla/recency.json
    log: logs/recency_north-america-usa-california-ucla.txt (check log file(s) for error message)
    conda-env: /groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c
    shell:

        python3 scripts/construct-recency-from-submission-date.py --metadata results/north-america-usa-california-ucla/metadata_adjusted.tsv.xz --output results/north-america-usa-california-ucla/recency.json 2>&1 | tee logs/recency_north-america-usa-california-ucla.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/recency_north-america-usa-california-ucla.txt:
Traceback (most recent call last):
  File "scripts/construct-recency-from-submission-date.py", line 41, in <module>
    node_data['nodes'][strain] = {'recency': get_recency(d['date_submitted'], ref_date)}
  File "scripts/construct-recency-from-submission-date.py", line 7, in get_recency
    date_submitted = datetime.strptime(date_str, '%Y-%m-%d').toordinal()
  File "/groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/groups/wyman/users/erik/ncov/.snakemake/conda/e0dc933a74c997ac1f5e5b47adb5e33c/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '?' does not match format '%Y-%m-%d'



[Thu Dec  9 12:14:12 2021]
Finished job 7.
1 of 19 steps (5%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /groups/wyman/users/erik/ncov/.snakemake/log/2021-12-09T121359.157926.snakemake.log
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$


P.S. The nextstrain site blocked my .txt attachment.

Thanks - those error messages provide a clue to debug this:

Job 28: Use metadata on submission date to construct submission recency field
...
get_recency(d['date_submitted'], ref_date)}
...
ValueError: time data '?' does not match format '%Y-%m-%d'

Can you examine the values for date_submitted in your metadata and check they are all in the format of YYYY-MM-DD (e.g. 2021-12-10)? (If the value is unknown then you can leave this as an empty string.)
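One way to run that check is sketched below. It is only an illustration: the function name `find_bad_dates` and the sample rows are hypothetical, and it assumes a tab-separated metadata file with a header row. It uses the same `datetime.strptime(..., '%Y-%m-%d')` call that appears in the traceback, and it treats empty values as acceptable.

```python
import csv
import io
from datetime import datetime


def find_bad_dates(handle, column="date_submitted", fmt="%Y-%m-%d"):
    """Return (line_number, strain, value) for every value in `column`
    that is neither empty nor parseable with `fmt`."""
    bad = []
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        return bad  # column is absent, so there is nothing to check
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        value = (row[column] or "").strip()
        if value == "":
            continue  # empty strings are allowed
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((line_number, row.get("strain", ""), value))
    return bad


# Hypothetical sample data; a real file would be opened with open(path, newline="").
example = (
    "strain\tvirus\tdate\tregion\tdate_submitted\n"
    "A/1\tncov\t2021-11-01\tNorth America\t2021-11-15\n"
    "A/2\tncov\t2021-11-02\tNorth America\t?\n"
    "A/3\tncov\t2021-11-03\tNorth America\t\n"
)
bad = find_bad_dates(io.StringIO(example))
print(bad)  # only the row containing "?" is reported
```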

Hi James, I don’t have values for date_submitted. I only include values for the required fields (i.e., strain, virus, date, region) plus country. For everything else I enter a question mark, ?.

Also, the date field is in the correct format, YYYY-MM-DD.

I don’t have values for date_submitted

For everything else I enter a question mark, ?

the date field is in the correct format YYYY-MM-DD

Does your metadata file have a “date_submitted” column? If so, the values there need to be either YYYY-MM-DD or empty strings. They cannot be “?”.

I see, let me try this. What time zone are you located in?

I generate the file with a C++ script. I tried using the null character '\0' for that column, but it didn’t work. The error reads, in part:

pandas.errors.ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead

What should I use here?
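An empty value in a tab-separated file is simply nothing between two tabs (or between the last tab and the newline), so the generating script can write no character at all for that column. For a file that already exists, a minimal clean-up sketch in Python is shown below. The function name `clean_metadata` and the sample rows are hypothetical, and it assumes plain tab-separated text with no quoted fields.

```python
def clean_metadata(raw_text, column="date_submitted", placeholders=("?",)):
    """Remove NUL bytes and blank out placeholder values in one column
    of tab-separated text. Other columns are left untouched."""
    lines = raw_text.replace("\0", "").splitlines()
    if not lines:
        return ""
    header = lines[0].split("\t")
    if column not in header:
        return "\n".join(lines) + "\n"
    idx = header.index(column)
    cleaned = [lines[0]]
    for line in lines[1:]:
        fields = line.split("\t")
        if idx < len(fields) and fields[idx].strip() in placeholders:
            fields[idx] = ""  # an empty string marks an unknown value
        cleaned.append("\t".join(fields))
    return "\n".join(cleaned) + "\n"


# Hypothetical sample: one row uses "?", the other a NUL byte.
raw = (
    "strain\tvirus\tdate\tregion\tdate_submitted\n"
    "A/1\tncov\t2021-11-01\tNorth America\t?\n"
    "A/2\tncov\t2021-11-02\tNorth America\t\0\n"
)
cleaned = clean_metadata(raw)
print(repr(cleaned))
```

After cleaning, both rows end with a tab followed directly by the newline, which is how an empty last column looks on disk.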

I just used the “date” value in place of the “date_submitted” value. I'm not sure what this does, but the program at least finishes without error messages now.

I’ll check the output and get back to you. I have a number of questions about the metadata fields.

Hi James, thanks for the help yesterday. The output helped us notice some problems with the data that was sent to us. I had more questions, but I think I’ll start another ticket to go into detail on those. However, I am still wondering if you could clear something up for me: if I only include the required columns / header labels (strain, virus, date, region) in my metadata file, will the code still generate output, or do I need to do something extra, like include a flag of some sort? The reason I ask is that I tried this but still got an error (i.e., nextstrain failed to complete).

It should work with only those minimal fields (as long as your config doesn’t refer to another column which is missing e.g. “country”). What error did you get when running with only those fields?
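A quick way to confirm which columns a metadata file actually provides is sketched below. The helper name `missing_columns` is hypothetical. The four required names are the ones listed in this thread, and any extra names (such as "country") stand in for whatever columns your own config refers to.

```python
import io

# Required column names as listed in this thread.
REQUIRED = ("strain", "virus", "date", "region")


def missing_columns(handle, needed=REQUIRED):
    """Return the names in `needed` that are absent from the TSV header line."""
    header = handle.readline().rstrip("\r\n").split("\t")
    return [name for name in needed if name not in header]


# Hypothetical header containing only the minimal fields.
header_only = "strain\tvirus\tdate\tregion\n"
print(missing_columns(io.StringIO(header_only)))
print(missing_columns(io.StringIO(header_only), REQUIRED + ("country",)))
```

The first call reports nothing missing. The second reports "country", which is the situation described above where a config refers to a column the file does not have.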

I just used the “date” value in place of the “date_submitted” value

We use the date_submitted field for some QC heuristics to identify problematic strains. It is probably preferable to leave this column out entirely rather than duplicate the “date” column values.

Thanks. I ran into another issue today. When I try to re-run a job with slightly different build parameters, I get messages saying that nothing needs to be done and the job is complete :slight_smile:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Complete log: /groups/wyman/users/erik/ncov/.snakemake/log/2021-12-14T093109.809309.snakemake.log

I tried using the --rerun-incomplete flag, but it still doesn’t completely rerun the job. How do I fix this?

Here is the screen output using the flag --rerun-incomplete:

(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ snakemake --use-conda --cores 16 --profile my_profiles/ucal_test -p --nocolor --rerun-incomplete &
[1] 7009
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ Config file defaults/parameters.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
WARNING: No valid subsampling scheme is defined for build 'north-america-usa-california-ucal'. Skipping subsampling and using all available samples.
WARNING: No valid subsampling scheme is defined for build 'north-america-usa-california-ucal'. Skipping subsampling and using all available samples.

(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ Nothing to be done (all requested files are present and up to date).
Complete log: /groups/wyman/users/erik/ncov/.snakemake/log/2021-12-14T113851.635022.snakemake.log

[1]+ Done snakemake --use-conda --cores 16 --profile my_profiles/ucal_test -p --nocolor --rerun-incomplete
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$

If the final output files which snakemake is being asked to generate already exist, then there is “Nothing to be done (all requested files are present and up to date).” The snakemake FAQ is a helpful resource for starting out with snakemake, which can be confusing. In this case, adding --forceall should force all rules to rerun.

Thanks, that is very useful to know. I did search for how to do this, but my search terms must have been way off.

Hi James, just one last note: somehow, removing the date_submitted column didn’t work. It seems I need to include it for some reason. Instead of entering the “date” value like I did before, would it be OK to leave this column value empty, i.e., just hit “return” ('\n')?

Yes – things should work if the values of that column are empty, i.e. "". (Things should also work without that column, so this is a :bug:.)

Hi James, actually, I’m not sure if this is a bug. I just realized that there was a tab after the last field of my header. Things still seem to run with the false date_submitted though.
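For anyone hitting the same thing: a tab after the last header field adds one extra column whose name is an empty string. A minimal sketch for spotting this is shown below. The helper name `empty_header_fields` is hypothetical, and it assumes the header is the first line of a tab-separated file.

```python
def empty_header_fields(header_line):
    """Return the zero-based positions of header fields that are empty
    or contain only whitespace."""
    names = header_line.rstrip("\r\n").split("\t")
    return [i for i, name in enumerate(names) if name.strip() == ""]


# Hypothetical headers: the first has a trailing tab, the second does not.
with_trailing_tab = "strain\tvirus\tdate\tregion\t\n"
clean_header = "strain\tvirus\tdate\tregion\n"
print(empty_header_fields(with_trailing_tab))
print(empty_header_fields(clean_header))
```

A non-empty result for the first line of the metadata file means the header has a stray tab (or two tabs in a row) that should be removed before running the workflow.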