Last updated: 2022-09-09

General

How is the data curated?

We analyse the previous 3 months of sequences available in NCBI GenBank as of the ‘alignment updated’ date given in the report. We use the preprocessed files made available by Nextstrain following Open Data principles. Sequences are deduplicated as far as possible using their names and additional information in the metadata, and those without complete dates (that is, recording only the month and/or year) are removed.
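
A minimal sketch of the deduplication and date-filtering step, assuming the standard Nextstrain metadata column names (strain, date); the actual pipeline code may differ:

```r
library(dplyr)

# Nextstrain-style metadata with 'strain' and 'date' columns (assumed for illustration)
metadata <- read.delim("metadata.tsv", stringsAsFactors = FALSE)

cleaned <- metadata %>%
  distinct(strain, .keep_all = TRUE) %>%          # drop duplicate sequence names
  filter(grepl("^\\d{4}-\\d{2}-\\d{2}$", date))   # keep only complete YYYY-MM-DD dates
```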

Sequences are aligned with Mafft v7.487 using the --addfragments and --keeplength options to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) and remove insertions. Any sequences with less than 95% coverage of the ORFs (i.e. >5% gaps) are removed, and the Spike is then extracted and translated for further analysis.
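
The coverage filter and Spike extraction can be sketched in R with Biostrings as below. This is an illustration rather than the exact pipeline: the gap fraction is computed over the whole genome rather than per ORF, and the file names are assumptions (the Spike coordinates 21563-25384 are the standard Wuhan-Hu-1 positions).

```r
library(Biostrings)

# Mafft output: reference-length alignment with insertions removed
aln <- readDNAStringSet("aligned.fasta")

# Remove sequences with more than 5% gaps or Ns
# (simplified to whole-genome coverage rather than the per-ORF check described above)
gap_counts <- letterFrequency(aln, letters = c("-", "N"))
aln <- aln[rowSums(gap_counts) / width(aln) <= 0.05]

# Extract and translate Spike (S gene coordinates 21563-25384 in Wuhan-Hu-1);
# gaps are recoded as N so ambiguous codons translate to X
spike_nt <- chartr("-", "N", subseq(aln, start = 21563, end = 25384))
spike_aa <- translate(spike_nt, if.fuzzy.codon = "solve")
```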

How often is the alignment/report updated?

The alignment is updated weekly, usually on Fridays, with the associated reports generally available the same evening. The dates of alignment update and when the report was compiled are given at the top of each report.

Why does the number of sequences taper towards the present?

It takes time for a sample to be processed: the swab must be taken, tested to see if it is positive, received by the sequencing lab, sequenced, bioinformatically processed and uploaded to online databases. Turnaround time for all of these steps varies from lab to lab for a number of reasons, including sequencing technology, capacity, and the current number of Covid-19 cases in the area. Typical turnaround time in the US is about 2 weeks.

To help mitigate the effects of low numbers of sequences, we include rolling 4-week averages where appropriate.
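
For example, a trailing 4-week rolling mean of weekly lineage proportions can be computed as in the sketch below (the numbers are made up for illustration; this is not the report's exact code):

```r
# Weekly proportion of a given lineage (illustrative values only)
weekly <- data.frame(
  week       = seq(as.Date("2022-06-03"), by = "week", length.out = 8),
  proportion = c(0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.70)
)

# Trailing 4-week rolling mean; the first three weeks have no complete window (NA)
weekly$rolling_4wk <- as.numeric(stats::filter(weekly$proportion, rep(1 / 4, 4), sides = 1))
```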

Where can I find country-specific data?

Breakdowns for countries with more than 200 sequences after cleaning the alignment can be found in the regional reports.

How are the reports compiled?

The reports are written in Rmarkdown in RStudio, and compiled using knitr. The sequence data are read in and processed using the Biostrings and ape packages, with other processing done using the tidyverse. The plots, tables and interactive elements are made possible by ggplot2, patchwork, plotly, kableExtra, DT, sparkline and htmltools.
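
As a generic illustration (the file name below is hypothetical, not the actual report source), a report is rendered from R with:

```r
# rmarkdown::render() drives knitr to execute the Rmd and produce the HTML report
rmarkdown::render("lineages_of_interest.Rmd", output_format = "html_document")
```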

Lineages of interest report

Why do some variants of concern (VOCs) appear to represent 0% in the proportion of sequences?

This could be for several reasons, including:

  • the VOC has been displaced by another circulating variant (or variants), and is now only found at low frequency,
  • the VOC has not yet arrived, or has only recently arrived, in a given region,
  • there is a lack of surveillance in a given region, so the VOC may be circulating there but has not yet been identified through sequencing.

Are variants of concern (VOCs) and variants of interest (VOIs) ever downgraded?

Yes, variants can be up- or downgraded according to new evidence indicating that the risk to public health has changed. One example is Epsilon (B.1.427/B.1.429), which was initially a VOI in the US, later upgraded to a VOC due to the impact on neutralisation by some Emergency Use Authorisation (EUA) therapeutics (e.g. bamlanivimab), and has since been downgraded to a Variant Being Monitored (VBM). The ECDC denotes such variants as ‘de-escalated’, terminology which we follow in the report.

Are consensus sequences for just the last three months?

Yes, the consensus sequences are calculated from the current 3-month alignment. This means they reflect the consensus of the currently circulating sequences of each lineage, which is useful for examining temporal and regional shifts.
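
A simplified sketch of how a per-lineage consensus can be derived from the aligned Spike amino-acid sequences (the file name and the simple most-frequent-residue rule are assumptions for illustration, not necessarily the report's exact method):

```r
library(Biostrings)

# Aligned Spike amino-acid sequences for one lineage from the current 3-month alignment
spike_aa <- readAAStringSet("spike_BA.5_aligned.fasta")   # hypothetical file name

# Per-site amino-acid counts, then take the most frequent residue at each site
counts    <- consensusMatrix(spike_aa)
consensus <- paste(rownames(counts)[apply(counts, 2, which.max)], collapse = "")
```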

Can consensus sequences for the same VOC be different between regions?

They can, if a local sublineage has acquired mutation(s) and subsequently spread to become the majority in that region. Often these sublineages will be given their own Pango lineage classification, depending on the epidemiology.

What are the Xs in consensus sequences?

Xs are given in consensus sequences where all of the sequences have an unknown amino acid, denoted X, at that location, and thus are generally found for lineages where there are only a few sequences in the alignment. Sites with missing data are shown in grey in the corresponding consensus sequence plots for easy reference.

Mutations of interest report

Are mutations listed only if they are currently circulating?

Yes. A site must have at least 100 sequences with non-reference amino acids in the current alignment to be included in the report.
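
A sketch of this filter, assuming an aligned set of Spike amino-acid sequences and the Wuhan-Hu-1 Spike as reference (the file names, and the choice to ignore unknown residues, are assumptions; handling of deletions is not shown):

```r
library(Biostrings)

spike_aa <- readAAStringSet("spike_aligned.fasta")    # aligned Spike amino-acid sequences
ref_aa   <- strsplit(as.character(readAAStringSet("wuhan_hu_1_spike.fasta")[[1]]), "")[[1]]

# Per-site count of sequences carrying a non-reference amino acid (unknown 'X' ignored)
aa_mat  <- as.matrix(spike_aa)                        # sequences x sites character matrix
non_ref <- sweep(aa_mat, 2, ref_aa, FUN = "!=") & aa_mat != "X"
sites_in_report <- which(colSums(non_ref) >= 100)
```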

What is the antibody binding profile, and how is it calculated?

The antibody binding profile indicates the contribution of NTD or RBD sites to recognition by known antibodies with resolved spike-antibody complex structures deposited in the Protein Data Bank (PDB) as of 4th May 2022.

For a given site and a given antibody, the interaction score \(w\), which characterizes the interaction between the site and the antibody (adapted from Bai et al., 2019), was defined as: \[w=\frac{1}{2}\left( \frac{n_c}{\langle {n_c} \rangle} +\frac{n_{nb}}{\langle {n_{nb}} \rangle} \right)\] where \(n_c\) is the number of contacts with the antibody (i.e. the number of non-hydrogen antibody atoms within 4 Å of the site); \(n_{nb}\) is the number of neighboring antibody residues; \(\langle {n_c} \rangle\) is the mean number of contacts \(n_c\) and \(\langle {n_{nb}} \rangle\) is the mean number of neighboring antibody residues \(n_{nb}\) across all epitope sites. A weight of 1.0 thus corresponds to the average interaction across all epitope sites. Neighboring residue pairs were identified by Delaunay tetrahedralization of the side-chain centers of residues (\(C_{\alpha}\) is counted as a side-chain atom, and pairs further than 8.5 Å apart were excluded).
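
As a worked illustration of the formula for a single antibody (the counts below are made up; in practice they come from the structural analysis of the PDB complexes):

```r
# Per-epitope-site counts for one antibody (illustrative values only)
n_c  <- c(12, 5, 20, 8)   # non-hydrogen antibody atoms within 4 Angstroms of each site
n_nb <- c(3, 1, 4, 2)     # neighboring antibody residues for each site

# Interaction score w: each count scaled by its mean over epitope sites, then averaged,
# so a site with average contacts and neighbors scores 1.0
w <- 0.5 * (n_c / mean(n_c) + n_nb / mean(n_nb))
```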

As there are multiple RBD-targeting antibodies, we sum the interaction scores of a given site over all RBD-targeting antibodies. We then normalize the summed score (min-max scaling; minimum, 0; maximum, the 99th percentile of the summed scores) and use it as the antibody binding profile. The same procedure is repeated for the NTD-targeting antibodies to obtain the antibody binding profile for all NTD sites.
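
The normalisation step might look like the following sketch (the values are illustrative, and capping scores above the 99th percentile at 1 is an assumption about how the scaling is applied):

```r
# Interaction scores summed over all RBD-targeting antibodies, one value per RBD site
w_sum <- c(0.2, 1.5, 3.8, 0.0, 7.4, 2.1)   # illustrative values only

# Min-max scaling with minimum 0 and maximum at the 99th percentile of the summed scores
p99     <- as.numeric(quantile(w_sum, 0.99))
profile <- pmin(w_sum / p99, 1)
```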

Quickhull (Barber et al., 1996) was used for the tetrahedralization, and Biopython's PDB module (Hamelryck and Manderick, 2003) was used to handle the protein structures.