Proximity confusion

I feel that I’m somehow misunderstanding the proximity filter.

(Running a fully updated version of the Nextstrain software, using a GitHub repository pull from this week.)

What I’m trying to do is hand Nextstrain a small number of pre-defined sequences and have it spit out a tree of genetically similar sequences. For this trial run, I took the GISAID database, filtered it to USA-only samples, and marked three samples (all from the same exposure event, all Delta variant sequences from mid-June 2021) with a new column in the metadata. Then I asked Nextstrain to construct a build with 100 sequences in these samples’ proximity:

subsampling:
  target:
    testfocus:
      exclude: "--exclude-where 'testcolumn!=focus'"
    testproximity:
      max_sequences: 100
      priorities:
        type: "proximity"
        focus: "testfocus"

First I had a problem where it was ignoring Wuhan/Hu-1, but not Wuhan/WH01, both of which were being included in the testfocus subsampling. So I manually removed WH01 from my metadata to prevent this issue. The build ran without complaint.

However, I would presume that with >15,000 Delta variants in the dataset, and my requesting the 100 closest genetic matches, all 100 would naturally be Delta variants. What I get instead is mostly 20B and 20C, with only a single Delta sequence mixed in:

[screenshot of the resulting tree from the “target” build]

What am I doing wrong? When I look at proximity_testfocus.tsv, I see a numerical score for each sequence. The ones with the lowest ‘distances’ are the sequences I probably want, right? There are fifty samples with a distance of 0, 1, or 2.

When I look at priority_testfocus.tsv and sort by the second column, I get priority scores between -1.01 and -6174.04. Here is where I become more unsure, because when I search this list for the low-distance sequences mentioned above, here’s what I get:
-6170.12
-6171.09

It seems like better matches have more negative values?

When I open up sample-testproximity.txt to see what samples were picked by the algorithm, here are the priority scores of the first five on the list:

-53.15
-53.12
-52.39
-52.22
-47.62

So I am just thoroughly confused about what the algorithm is doing and what I could be doing wrong.

Thank you for your time.

Hi @mbosmeny! Thank you for writing up such a detailed explanation of what you’re trying to accomplish and what isn’t working. This helped me figure out a possible explanation for the pattern you’re seeing.

First, the distances that you reported from the proximities file look reasonable. Although the script that calculates these distances is not straightforward to read, there shouldn’t be any unexpected, random, or nonlinear behavior from this step.
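Conceptually, the proximity for each candidate is just the number of nucleotide differences to its nearest focal sequence. Here is a minimal Python sketch of that idea (an illustration only, not the actual script, which also masks problematic sites and handles ambiguous bases more carefully):

def pairwise_distance(seq_a: str, seq_b: str) -> int:
    """Count mismatched positions, ignoring ambiguous (N) and gap bases."""
    ignored = {"N", "-"}
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignored and b not in ignored
    )

def proximity(candidate: str, focal_sequences: list[str]) -> int:
    """Distance to the nearest focal sequence; lower means closer."""
    return min(pairwise_distance(candidate, focal) for focal in focal_sequences)

# A candidate identical to one of the focal sequences has distance 0.
print(proximity("ACGT", ["ACGA", "ACGT"]))  # -> 0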

The priorities script is more complicated in that it penalizes sequences with high distances or many gap bases. It also applies a “crowding penalty” that prevents too many closely related sequences from getting pulled into an analysis. This latter penalty seems like the likely reason why only a couple of Delta strains make it into your outputs, as most candidates after the first few would get penalized.

This effect is likely compounded beyond what we’d normally notice because you only have three focal sequences. Candidate contextual sequences are grouped by focal sequence name and the crowding penalty increases linearly (by the counter variable i in the code linked above) with each of the >15000 contextual sequences that are considered for those three focal sequences.
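To make that concrete, here is a rough Python sketch of the scoring logic as I described it above (an approximation for illustration; the actual scripts/priorities.py differs in details such as the gap penalty and the exact grouping):

CROWDING_PENALTY = 0.01  # the default value discussed below

def assign_priorities(candidates_by_focal, crowding_penalty=CROWDING_PENALTY):
    """candidates_by_focal maps a focal strain name to a list of
    (candidate strain name, distance) pairs."""
    priorities = {}
    for focal, candidates in candidates_by_focal.items():
        # Closest candidates first; the counter i grows with each candidate.
        for i, (strain, distance) in enumerate(sorted(candidates, key=lambda c: c[1])):
            # With tens of thousands of candidates per focal sequence, the
            # crowding term can dwarf a distance of 0, 1, or 2.
            priorities[strain] = -distance - crowding_penalty * i
    return priorities

With --crowding-penalty 0, the score reduces to the negated distance, which is why I would expect very similar priorities for most candidates in that case.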

To test this idea, you can try running the priorities script manually and setting the crowding penalty to 0 like so (this example is copied from the Nextstrain test suite, but you can replace the paths here with your own files):

python3 scripts/priorities.py \
  --sequence-index results/combined_sequence_index.tsv.xz \
  --proximities results/europe/proximity_region.tsv \
  --crowding-penalty 0 \
  --output results/europe/priorities_region.tsv

If the crowding penalty is the cause of the issue, you should get a list of very similar priorities for most candidate contextual strains. You should also end up with more Delta strains in your final build.

The output file produced by the priorities script is designed for use with the --priority argument of augur filter. That filter command sorts priority values in descending order before selecting strains in each group, so higher numbers correspond to higher priorities. This means a value of -47.62 represents a higher priority than a value of -53.15, for example.
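If you want to double-check that ordering against your own files, here is a minimal sketch that sorts your priority_testfocus.tsv the same way (assuming a headerless, two-column TSV of strain name and priority):

import csv

with open("priority_testfocus.tsv") as handle:
    rows = [(strain, float(score)) for strain, score in csv.reader(handle, delimiter="\t")]

# Higher (less negative) values are selected first within each group,
# so -47.62 outranks -53.15.
for strain, score in sorted(rows, key=lambda row: row[1], reverse=True)[:5]:
    print(strain, score)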

Would you be able to try out the priorities script command above and let us know if that resolves the issue? If so, we should consider parameterizing the crowding penalty or perhaps setting the default to 0. @rneher may have more thoughts here and can correct anything I’ve gotten wrong in my explanation above.

Thanks, @jlhudd! I haven’t looked at this script in a while, but your description matches my recollection. The crowding and gap penalties are meant to enrich for high-quality sequences and to prevent too strong a focus on particular genotypes. They were implemented with the assumption that we later subsample in a spatio-temporal manner, so priorities are evaluated separately in each group of the subsampling process.


From my understanding, we select a focal set of sequences, then a priority score based on proximity is calculated for each of the other sequences in the input data. The crowding penalty, if applied, will decrease the priority scores for sequences on branches that have too many closely related sequences.

If the focal set is on the same clade:

With --crowding-penalty 0, many sequences closely related to the focal sequences (in blue) show up on the branch:

[tree screenshot]

With the default --crowding-penalty (0.01 here), the branch is trimmed to omit some of the closely related samples (they have high proximity but are penalized for crowding):

[tree screenshot]

If the focal set is scattered across different clades:

With --crowding-penalty 0:

[tree screenshot]

With the default:

[tree screenshot]

So if the goal is to pull in as many samples closely related to the focal set as possible, it helps to set --crowding-penalty 0. But be aware that pulling in only closely related samples will bias the molecular clock rate estimate, so it’s good to add more context (such as temporal subsampling).

Attaching the builds.yaml for all the plots above for reference. The focal sequences are listed in include.txt; the root is supplied as a separate input so it can be handled outside of include.txt. --crowding-penalty 0 was set here.

inputs:
  - name: "root_ref"
    metadata: "data/references_metadata.tsv"
    sequences: "data/references_sequences.fasta"
  - name: "gisaid"
    metadata: "data/metadata_gisaid.tsv"
    sequences: "data/sequences_gisaid.fasta"

builds:
  aspen:
    country: USA
    division: California
    location: Santa Clara County
    region: North America
    subsampling_scheme: group_plus_context
    title: 'Test run'

## Auxiliary files
files:
  include: "my_profiles/include.txt"   # where the focal set is defined

## Parameters
refine:
  keep_polytomies: True
skip_travel_history_adjustment: True

## Subsampling schemas
subsampling:
  group_plus_context:
  
    root:
      exclude: "--exclude-where 'root_ref!=yes'"
      
    focal:
      exclude: "--exclude-all"
      
    closest:
      max_sequences: 100
      priorities:
        type: "proximity"
        focus: "focal"

    group:
      group_by: "year month"
      max_sequences: 200
      query: --query "(location == '{location}') & (division == '{division}')"
      priorities:
        type: "proximity"
        focus: "focal"

    state:
      group_by: "location year month"
      max_sequences: 100
      query: --query "(location != '{location}') & (division == '{division}')" # exclude add'l samples from {location}
      priorities:
        type: "proximity"
        focus: "focal"

    country:
      group_by: "division year month"
      max_sequences: 50
      query: --query "(division != '{division}') & (country == '{country}')" # exclude add'l samples from CA
      priorities:
        type: "proximity"
        focus: "focal"

    international:
      group_by: "region year month"
      max_sequences: 20
      query: --query "(country != '{country}')" # exclude add'l samples from USA
      priorities:
        type: "proximity"
        focus: "focal"

(For people who read my previous post, where the crowding penalty didn’t seem to do much: that was because the input dataset was too small. With the default penalty of 0.01 per candidate, a few hundred candidates shift priorities by at most a few units, which was too small a change to dramatically reorder the priorities, so very similar samples still ended up on the trees.)


Thank you for this excellent analysis, @dlu! This should be a section in our workflow documentation.

These results also support the idea of exposing the crowding penalty as a workflow configuration parameter, even if most people will never use it. This would be a minor change to the workflow, but it would enable analyses like the one @mbosmeny was originally attempting, which are likely to become more common as folks create variant-specific builds.

I’ve created an issue in the ncov repository for this new workflow parameter.
