The raw data and the associated meta data were downloaded from GISAID periodically. The following computational pipeline was executed to automatically produce a dated phylogeny using a range of bioinformatic tools. We endeavored to eliminate human intervention in the automated pipeline so that results are more reproducible.

Genome annotation

We first compute MD5 checksums for all input sequences to determine whether certain samples have a completely identical haplotype (differences in missing data are counted as a distinct haplotype in this step). For any newly added sequence, we first ran blastx to determine the rough positions of the protein-coding sequences (CDS). ORF1 contains a ribosomal slippery site, which shifts the coding frame. The query peptide probe was thus broken into two parts at this site for ORF1, and those two segments are treated as two separate genes for the purpose of the pipeline. Following BLASTX, Genewise was used to align the reference protein sequence to the nucleotide sequences to determine the correct reading frame.

Any samples containing a truncated protein prediction, or a split protein annotation by Genewise were excluded. The rational is that most errors occurring in the consensus sequence produced by 3rd generation sequencing technologies contain insertion/deletions, but rarely substitutions. Such frame-shifting errors may suggest an overall low-quality of the sequence. Our stringent cutoff excluded approximately 39.7% of the raw input sequences.

Multiple sequence alignment

We concatenated each protein-coding gene, as well as non-coding regions (as a single partition), into separate files. TranslatorX + MAFFT were used to align the protein-coding DNA sequences, using a codon-aware mode. Non-coding regions were aligned with MAFFT. Resulting alignments were cleaned with GBlocks to remove any ambiguous regions. At this stage, we again computed md5 checksums for all cleaned sequences, and further collapsed any identical haplotypes for the following steps.

Partitioned Maximum-Likelihood tree search

Aligned partitions were concatenated back into a single full length alignment, with partitioning information recorded. Each codon positions in each gene were placed in a separate initial partition, plus a single partition for the non-coding regions, resulting in a total of 34 initial partitions. IQTREE was used to perform a partition scheme optimization, followed by a search of the Maximum-Likelihood phylogeny using the optimized partitioning scheme. We then further optimized branch lengths by using RAxML on the final tree, using the partition scheme inferred by iqtree.

Preliminary rooted tree inferred by root-to-tip regression

Trees inferred by IQTree are unrooted (we chose reversible-time substitution models). In order to provide BEAST with a reasonable starting rooted tree, we use Root-to-tip regression implemented by the rtt function in R. Briefly, a randomly chosen root was inserted on the unrooted tree, after which all root-to-tip branch lengths can be computed. These lengths are then correlated to the sample collection date. Under the assumption of a strict molecular clock, there should be a strong positive linear correlation. The root which produces the lowest root-mean-squared error was chosen.

For BEAST search, a prior distribution of the molecular clock rate (in the unit of substitutions/site/year) is needed. We approximated this distribution by taking random samples collected >2 months apart, and computed the difference of their root-to-tip branch lengths, and the collection dates. Rate is computed by Δ(Branch length)/Δ(Date). The mean and standard deviation of this distribution were specified as the prior for BEAST clock.

BEAST dated phylogeny inference with tip date sampling

BEAST v1.10.4 was used to infer a dated phylogeny with a relaxed clock (logNormal) model and tip-date sampling. Priors for the relaxed clock model was as described in the previous section, with between-clade variation rate prior set to the same value as the rate standard deviation prior. Tip date distribution was modeled as a normal distribution with a mean 5 days prior to the sample collection date, which corresponds to the average incubation period of the disease, and a standard deviation of 2.5 days, which corresponds to the maximal incubation period of 14 days. We used the SkyGrid model as the tree shape prior with 100 (full dataset) or 50 (without outgroup) population size parameters.

The analysis was run either with or without bat and pangolin outgroup sequences.

We ran 3 independent MCMC chains for each analysis. For every half hour, we examine the chains to check 1) whether each chain has reached stationary state based on parameter traces, analyzed by the Gelman Rubin statistics in the VMCMC package. and if so 2) do any of the 2 chains converge, based on PRSF statistics < 1.1. The following parameters were used to determine convergence: joint, prior, likelihood, treeModel.rootHeight, age(root), treeLength, skygrid.precision, ucld.mean, ucld.stdev.

When a run is converged, the good chains were merged with burn-ins excluded and summarized with TreeAnnotator. The output was converted to the Auspice browser json format for display.