Hi @enelson, could you provide some details of how “nextstrain fails”? For instance, the snakemake output or the contents of any relevant log files. We’d be happy to take a look, and doing so in a public forum may end up helping others who run into similar issues.
Thanks - those error messages provide a clue to debug this:
Job 28: Use metadata on submission date to construct submission recency field
...
get_recency(d['date_submitted'], ref_date)}
...
ValueError: time data '?' does not match format '%Y-%m-%d'
Can you examine the values for date_submitted in your metadata and check they are all in the format of YYYY-MM-DD (e.g. 2021-12-10)? (If the value is unknown then you can leave this as an empty string.)
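One way to run that check is a small script along these lines. This is only a sketch: the function name `find_bad_dates` and the example rows are made up for illustration, and it assumes a tab-separated metadata file with a `strain` column. It uses the same `%Y-%m-%d` format string that appears in the error message and treats an empty value as "unknown".

```python
import csv
import io
from datetime import datetime


def find_bad_dates(tsv_text, column="date_submitted", fmt="%Y-%m-%d"):
    """Return (strain, value) pairs whose `column` is neither empty nor a parseable date."""
    bad = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        value = (row.get(column) or "").strip()
        if not value:
            continue  # an empty string means "unknown" and is acceptable
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((row.get("strain"), value))
    return bad


# Hypothetical metadata, for illustration only
example = (
    "strain\tvirus\tdate\tregion\tdate_submitted\n"
    "sampleA\tncov\t2021-11-30\tNorth America\t2021-12-10\n"
    "sampleB\tncov\t2021-12-01\tNorth America\t?\n"
    "sampleC\tncov\t2021-12-02\tNorth America\t\n"
)
print(find_bad_dates(example))
```

To check a real file, read its contents into a string first and pass that in place of `example`.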
Hi James, I don’t have values for date_submitted, I only include values for
the *required fields* (i.e., strain, virus, date, region) plus country. For
everything else I enter a question mark, ?.
I generate the file with a C++ script. I tried using the null character ‘\0’ for that column, but it didn’t work. The error reads, in part:
pandas.errors.ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead
I just used the “date” value in place of the “date_submitted” value. I’m not sure what this does, but the program at least finishes without error messages now.
I’ll check the output and get back to you. I have a number of questions about the metadata fields.
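For anyone hitting the same two errors, a possible cleanup step is sketched below. It is an assumption-laden example, not part of the ncov workflow: the helper name `blank_unknown_dates` is hypothetical, it expects tab-separated text, and it only touches the one named column. It strips NUL bytes and replaces a `?` placeholder in `date_submitted` with an empty string, which was described earlier in the thread as acceptable for unknown values.

```python
import csv
import io


def blank_unknown_dates(tsv_text, column="date_submitted", placeholders=("?",)):
    """Remove NUL bytes and blank out placeholder values in a single column."""
    text = tsv_text.replace("\0", "")  # NUL bytes are what triggered the pandas ParserError
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if column in row and (row[column] or "").strip() in placeholders:
            row[column] = ""  # empty string stands for "unknown"
        writer.writerow(row)
    return out.getvalue()


# Hypothetical rows, for illustration only
raw = "strain\tdate\tdate_submitted\nA\t2021-12-01\t?\nB\t2021-12-02\t\0\n"
cleaned = blank_unknown_dates(raw)
print(repr(cleaned))
```

Other columns are left as they are, since a `?` there may be intentional.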
Hi James, thanks for the help yesterday. The output helped us notice some problems with the data that was sent to us. I had more questions, but I think I’ll start another ticket to go into detail on those. However, I am still wondering if you could clear something up for me: if I only include the required columns / header labels (strain, virus, date, region) in my metadata file, will the code still generate output, or do I need to do something extra, like include a flag of some sort? The reason I ask is that I tried this but still got an error (i.e., nextstrain failed to complete).
It should work with only those minimal fields (as long as your config doesn’t refer to another column which is missing e.g. “country”). What error did you get when running with only those fields?
I just used the “date” value in place of the “date_submitted” value
We use the date_submitted field for some QC heuristics to identify problematic strains. It is probably preferable to leave this column out entirely rather than duplicate the “date” column values.
Thanks. I ran into another issue today. When I try to re-run a job with slightly different build parameters, I get messages saying that there is nothing to be done and the job is complete:
Building DAG of jobs…
Nothing to be done (all requested files are present and up to date).
Complete log: /groups/wyman/users/erik/ncov/.snakemake/log/2021-12-14T093109.809309.snakemake.log
I tried using the --rerun-incomplete flag, but it still doesn’t completely rerun the job. How do I fix this?
Here is the screen output using the flag --rerun-incomplete:
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ snakemake --use-conda --cores 16 --profile my_profiles/ucal_test -p --nocolor --rerun-incomplete &
[1] 7009
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ Config file defaults/parameters.yaml is extended by additional config specified via the command line.
Building DAG of jobs…
WARNING: No valid subsampling scheme is defined for build ‘north-america-usa-california-ucal’. Skipping subsampling and using all available samples.
WARNING: No valid subsampling scheme is defined for build ‘north-america-usa-california-ucal’. Skipping subsampling and using all available samples.
(nextstrain) enelson@igi-biotite:/groups/wyman/users/erik/ncov$ Nothing to be done (all requested files are present and up to date).
Complete log: /groups/wyman/users/erik/ncov/.snakemake/log/2021-12-14T113851.635022.snakemake.log
If the final output files which snakemake is being asked to generate already exist, then there is “Nothing to be done (all requested files are present and up to date).” The snakemake FAQ is a helpful resource for starting out with snakemake, which can be confusing. In this case, adding --forceall should force all rules to rerun.
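To illustrate the general idea behind that message, here is a deliberately simplified toy model of a timestamp-based freshness check. It is not Snakemake's actual implementation, and the function name `outputs_up_to_date` and the file names are invented for the example. Under this simplified model, changing only build parameters does not alter any file timestamps, so the outputs still look current, which is consistent with the behaviour described above.

```python
import os
import tempfile


def outputs_up_to_date(inputs, outputs):
    """Toy check: True if every output exists and is at least as new as every input."""
    if not all(os.path.exists(p) for p in outputs):
        return False
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output >= newest_input


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "metadata.tsv")  # hypothetical input file
    out = os.path.join(tmp, "result.json")   # hypothetical output file
    for path in (src, out):
        open(path, "w").close()

    # Output newer than input: nothing to be done
    os.utime(src, (1000, 1000))
    os.utime(out, (2000, 2000))
    fresh = outputs_up_to_date([src], [out])

    # Input modified after the output was written: a rerun would be needed
    os.utime(src, (3000, 3000))
    stale = outputs_up_to_date([src], [out])

print(fresh, stale)
```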
Hi James, just one last note: somehow, removing the date_submitted column didn’t work. It seems like I need to include it for some reason. Wouldn’t just hitting “return” (i.e., ‘\n’) when I enter this column value be OK, instead of entering the “date” value like I did before?
Hi James, actually, I’m not sure if this *is* a bug. I just realized that
there was a tab after the last field of my header. Things still seem to run
with the false date_submitted though.
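A stray trailing tab like this is easy to miss by eye. The sketch below shows one way to spot it; the helper name `empty_header_fields` is hypothetical and it assumes a tab-separated header line. It reports the positions of any header fields that are empty, which is what a tab after the last column name produces.

```python
def empty_header_fields(header_line, delimiter="\t"):
    """Return the zero-based positions of empty column names in a header line."""
    names = header_line.rstrip("\r\n").split(delimiter)
    return [i for i, name in enumerate(names) if not name.strip()]


# Hypothetical headers, for illustration only
with_trailing_tab = "strain\tvirus\tdate\tregion\tcountry\t\n"
clean_header = "strain\tvirus\tdate\tregion\tcountry\n"

print(empty_header_fields(with_trailing_tab))
print(empty_header_fields(clean_header))
```

For a real file, pass in its first line, for example the result of `readline()` on the opened metadata file.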