Last updated: 2022-09-09
We analyse the previous 3 months of sequences available in NCBI GenBank as of the ‘alignment updated’ date given in the report. We use the preprocessed files made available by Nextstrain following Open Data principles. Sequences are deduplicated as far as possible by name and additional information in the metadata, and those missing accurate dates (that is, only recording month and/or year) are removed.
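As an illustration of the date filter described above, here is a minimal Python sketch. It assumes collection dates are stored as strings and that a complete date has the form YYYY-MM-DD; the function name has_complete_date and the example values are hypothetical, and this is not the code used to build the reports.

```python
from datetime import datetime


def has_complete_date(value):
    """Return True only if `value` parses as a full year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        # Year-only, year-month, masked or missing dates end up here.
        return False
    return True


# Hypothetical metadata values, for illustration only.
dates = ["2022-06-15", "2022-06", "2022", "2022-06-XX"]
kept = [d for d in dates if has_complete_date(d)]
```

Only the first value survives the filter, since the others record a year or a year and month without a day.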
Sequences are aligned with Mafft v7.487 using the --addfragments and --keeplength options to align to the reference sequence (Wuhan-Hu-1, GISAID accession EPI_ISL_402125) and remove insertions. Any sequences with less than 95% coverage of the ORFs (i.e. >5% gaps) are removed, and the Spike extracted and translated for further analysis.
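The coverage filter can be sketched roughly as follows. This is an illustrative Python example rather than the pipeline's own code: it assumes each aligned sequence has already been restricted to the ORF region, and it treats both gap characters and N as missing, which is an assumption on our part. The names coverage and filter_by_coverage are hypothetical.

```python
def coverage(aligned_seq, missing="-Nn"):
    """Fraction of alignment columns that hold a called base."""
    if not aligned_seq:
        return 0.0
    n_missing = sum(1 for base in aligned_seq if base in missing)
    return 1.0 - n_missing / len(aligned_seq)


def filter_by_coverage(records, threshold=0.95):
    """Keep the sequences whose coverage is at least `threshold`."""
    return {name: seq for name, seq in records.items()
            if coverage(seq) >= threshold}


# Toy alignment of 100 columns (made-up data).
example = {
    "seq_ok": "ACGT" * 25,                # no gaps
    "seq_gappy": "ACGT" * 23 + "-" * 8,   # 8 gaps, so 92% coverage
}
kept = filter_by_coverage(example)
```

Here seq_gappy falls below the 95% threshold and is dropped, while seq_ok is kept.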
The alignment is updated weekly, usually on Fridays, with the associated reports generally available the same evening. The dates of alignment update and when the report was compiled are given at the top of each report.
It takes time for a sample to be processed: the swab must be taken, tested to see if it is positive, received by the sequencing lab, sequenced, bioinformatically processed and uploaded to online databases. Turnaround time for all these steps varies from lab to lab for a number of reasons, including sequencing technology, capacity, and the current number of Covid-19 cases in the area. Typical turnaround time in the US is about 2 weeks.
To help mitigate the effects of low numbers of sequences, we include rolling 4-week averages where appropriate.
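A rolling 4-week average can be computed along these lines. The sketch below assumes a trailing window (the current week plus the three before it) and averages over whatever weeks are available at the start of the series; the reports may handle window alignment and edges differently. The function name and the counts are made up for illustration.

```python
def rolling_mean(values, window=4):
    """Trailing mean over up to `window` of the most recent values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Made-up weekly sequence counts.
weekly_counts = [10, 20, 30, 40, 50]
smoothed = rolling_mean(weekly_counts)
# smoothed == [10.0, 15.0, 20.0, 25.0, 35.0]
```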
Breakdowns for countries with more than 200 sequences after cleaning the alignment can be found in the regional reports.
The reports are written in Rmarkdown in RStudio, and compiled using knitr. The sequence data are read in and processed using the Biostrings and ape packages, with other processing done using the tidyverse. The plots, tables and interactive elements are made possible by ggplot2, patchwork, plotly, kableExtra, DT, sparkline and htmltools.
This could be for several reasons, including:
Yes, variants can be up- or downgraded according to new evidence indicating that the risk to public health has changed. One example is Epsilon (B.1.427/B.1.429), which was initially a VOI in the US, later upgraded to a VOC due to the impact on neutralisation by some Emergency Use Authorisation (EUA) therapeutics (e.g. bamlanivimab), and has since been downgraded to a Variant Being Monitored (VBM). The ECDC denotes such variants as ‘de-escalated’, terminology which we follow in the report.
Yes, the consensus sequences are calculated according to the current 3-monthly alignment. This means they reflect the consensus of the currently circulating sequences of the lineage, useful for looking at temporal and regional shifts.
They can, if a local sublineage has acquired mutation(s) and subsequently spread to become the majority in that region. Often these sublineages will get their own pango lineage classification, depending on the epidemiology.
Xs are given in consensus sequences where all of the sequences have an unknown amino acid, denoted X, at that location, and thus are generally found for lineages where there are only a few sequences in the alignment. Sites with missing data are shown in grey in the corresponding consensus sequence plots for easy reference.
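To show how an X can end up in a consensus sequence, here is a small Python sketch. It assumes a simple per-column majority rule that ignores unknown residues and falls back to X only when every sequence has X at that position. The majority rule and the function name consensus are our assumptions for illustration, not a description of the report's exact procedure.

```python
from collections import Counter


def consensus(sequences, unknown="X"):
    """Per-column majority residue; `unknown` only if nothing else is seen."""
    result = []
    for column in zip(*sequences):
        known = [aa for aa in column if aa != unknown]
        if not known:
            # Every sequence is unknown at this site.
            result.append(unknown)
        else:
            result.append(Counter(known).most_common(1)[0][0])
    return "".join(result)


# Three toy sequences of a lineage with few members (made-up data).
toy = ["MXV", "MXA", "LXV"]
cons = consensus(toy)
# cons == "MXV"
```

The middle position stays X because none of the three sequences has a called amino acid there.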
Yes. A site must have at least 100 sequences with non-reference amino acids in the current alignment to be included in the report.
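The inclusion rule can be sketched as below. This Python example assumes the sequences are translated and aligned to the same length as the reference, and that unknown residues (X) do not count as non-reference; whether deletions count is not stated above, and in this sketch they do. The function name variable_sites and the toy data are hypothetical.

```python
from collections import Counter


def variable_sites(reference, sequences, min_count=100, unknown="X"):
    """1-based positions with at least `min_count` non-reference residues."""
    counts = Counter()
    for seq in sequences:
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa and aa != unknown:
                counts[pos] += 1
    return sorted(pos for pos, n in counts.items() if n >= min_count)


# Toy data: position 2 differs in 100 sequences, position 3 in only 99.
reference = "MFV"
sequences = ["MFV"] * 50 + ["MLV"] * 100 + ["MFA"] * 99
sites = variable_sites(reference, sequences)
# sites == [2]
```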
The antibody binding profile indicates the contribution of NTD or RBD sites in the recognition of known antibodies that have resolved spike-antibody complex structures deposited in the Protein Data Bank (PDB) as of 4th May 2022.
For a given site and a given antibody, the interaction score \(w\), which characterizes the interaction between the site and the antibody (improved based on Bai et al, 2019), was defined as: \[w=\frac{1}{2}\left( \frac{n_c}{\langle {n_c} \rangle} +\frac{n_{nb}}{\langle {n_{nb}} \rangle} \right)\] where \(n_c\) is the number of contacts with the antibody (i.e. the number of non-hydrogen antibody atoms within 4 Å of the site); \(n_{nb}\) is the number of neighboring antibody residues; \(\langle {n_c} \rangle\) is the mean number of contacts \(n_c\) and \(\langle {n_{nb}} \rangle\) is the mean number of neighboring antibody residues \(n_{nb}\) across all epitope sites. A weight of 1.0 is attributed to the average interaction across all epitope sites. Neighboring residue pairs were identified by Delaunay tetrahedralization of side-chain centers of residues (\(C_{\alpha}\) is counted as a side chain atom, and pairs further than 8.5 Å were excluded).
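As a worked example of the formula above, the following Python sketch computes \(w\) for every epitope site of a single antibody from precomputed contact and neighbor counts. Extracting those counts from a PDB structure is not shown. The function name and the counts are made up for illustration.

```python
def interaction_scores(contacts, neighbours):
    """w = 0.5 * (n_c / mean(n_c) + n_nb / mean(n_nb)) for each epitope site."""
    sites = list(contacts)
    mean_c = sum(contacts[s] for s in sites) / len(sites)
    mean_nb = sum(neighbours[s] for s in sites) / len(sites)
    return {
        s: 0.5 * (contacts[s] / mean_c + neighbours[s] / mean_nb)
        for s in sites
    }


# Made-up counts for three epitope sites of one antibody.
n_c = {417: 10, 484: 20, 501: 30}   # antibody atoms within 4 Å
n_nb = {417: 2, 484: 4, 501: 6}     # neighboring antibody residues
w = interaction_scores(n_c, n_nb)
# w == {417: 0.5, 484: 1.0, 501: 1.5}
```

The scores average to 1.0 across the three sites, matching the statement that a weight of 1.0 corresponds to the average interaction.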
As there are multiple RBD-targeting antibodies, we sum the interaction score for the given site to all RBD-targeting antibodies. Then we normalize the summed score (min-max scaling; minimum, 0; maximum, 99th percentile of the summed score) and use it as the antibody binding profile. The same procedure is repeated for the NTD-targeting antibodies to get the antibody binding profile for all NTD sites.
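The summing and scaling step might look like the sketch below. Two details are assumptions on our part: the 99th percentile is computed by linear interpolation between order statistics, and scaled values above the percentile are clipped to 1. The function names and the scores are hypothetical.

```python
def percentile(values, q):
    """q-th percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def binding_profile(scores_by_antibody):
    """Sum per-site scores over antibodies, then scale into [0, 1]."""
    summed = {}
    for scores in scores_by_antibody:
        for site, w in scores.items():
            summed[site] = summed.get(site, 0.0) + w
    upper = percentile(list(summed.values()), 99)
    if upper <= 0:
        return {site: 0.0 for site in summed}
    # Minimum fixed at 0, maximum at the 99th percentile; clipping is assumed.
    return {site: min(max(w / upper, 0.0), 1.0) for site, w in summed.items()}


# Made-up interaction scores for two RBD-targeting antibodies.
antibody_a = {417: 0.5, 484: 1.0}
antibody_b = {484: 1.0, 501: 0.5}
profile = binding_profile([antibody_a, antibody_b])
```

Site 484 is contacted by both antibodies, so its summed score is the largest and it scales to 1.0; the other two sites receive equal, lower values.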
Quickhull (Barber et al, 1996) was used for the tetrahedralization and Biopython PDB (Hamelryck and Manderick, 2003) to handle the protein structure.