C-19: Exploring a more robust method to forecast infection spread

Updated: May 11, 2020

Combining reported death, case detection and mobility statistics into a holistic mathematical model provides a more robust forecast of COVID-19 infections. This could help in understanding the progression of the virus and in scenario planning for future interventions. Analysis of death data yields reasonably good predictability of future deaths and allows infection growth to be inferred, whereas case detection data often yields limited value in the same regard. Case detection data is often riddled with insufficiency (gaps) and time lags, resulting in poorly fitted trend lines, and has also been observed to be strongly correlated with testing volumes. Utilising all three datasets concurrently yields a best-of-breed forecast model that leverages each one's unique advantage.

This article is part of our COVID-19 daily projections report, and was previously titled "A folly of trusting data quantity over quality?"

Visit: https://www.agility.asia/covid for more up to date information.

Outline & introduction

Using case detection statistics alone as the primary indicator of how countries are doing vis-a-vis COVID-19 infection growth raises multiple issues, not least the potential underestimation of the actual number of infected persons, often caused by significant delays in testing.

Underlying this issue are often major constraints in the availability of test resources, leading to gaps in testing capacity and significant delays between first contraction and detection:

  1. A study in Hubei established a median testing delay of 10–12 days from initial infection [The Epidemiological Characteristics of an Outbreak of 2019, CDC China]

  2. Test kits are limited and often, by the time they are available, numbers have increased exponentially, leaving health authorities to play a desperate game of catch-up

The result may be a misguided basis for decision making if case detection figures alone are used.

Meanwhile, two additional statistical datasets provide crucial complementary information that could further improve the ability to detect the spread of the virus:

  1. Reported death statistics offer possibly more definitive and timely information versus case detection due to less uncertainty in reporting delays and lower volume of information (quality over quantity)

  2. Mobility data provides timing and magnitude information on movement controls and lock-downs, which translates directly into forecasts of the future taper in infections

Objectives of this analysis are therefore to:

  1. Utilise 3 statistical data sources: case detections, reported deaths, and mobility information to build a more holistic picture of COVID-19 infection trajectory from country to country

  2. Offer objective commentary on how well current testing efforts have been performing in a given country, in terms of timeliness (measuring infections as and when they happen) and sufficiency (measuring every infected person as opposed to only a subset)

  3. Develop a continuously evolving and robust forecast of COVID-19 spread and deaths, as crucial input to decision making on future interventions and pathways


European countries have generally been far behind in testing. This is partly due to their earlier outbreaks compared with the rest of the world, but also to their delayed action in implementing social distancing and lock-downs, which allowed infections to grow and outpace test resources (insufficient testing). In such scenarios it is also only natural to focus on dispensing care to the sick, leading to detection at a mature stage of disease development (late testing).

Deaths have now started to grow unchecked in multiple countries, indicating that infections have already surged and making it even harder for testing and detection to keep up. This is worsened by the fact that all infected persons will eventually need to be tested for as long as they remain infectious if further spread is to be prevented. Deaths will continue where mitigation steps fail: the pipeline is filled with infected persons, and it is a matter of time before more deaths come.

Asian countries across the South and Southeast are generally faring better, given earlier action to implement some form of social distancing and lock-down. As such, testing is more on track and potentially able to keep up with infections, which are growing at a slower pace and from a lower base. Some countries stand out as best in class in testing and, with that, in controlling infection growth: almost all Chinese provinces excluding Hubei, South Korea and Singapore are a few examples.

Mapping the epidemiological unknowns of COVID-19. It bears noting that, at this point, many of the epidemiological characteristics of the COVID-19 pandemic have yet to be fully mapped and quantified. The novelty of the virus means that there is insufficient knowledge of how it affects people of different ages and backgrounds, as well as of other crucial questions: what the median timelines are for critical phases, what the asymptomatic prevalence could be, and what the true mortality rate from the virus is. It is also unknown whether these characteristics vary from country to country, possibly due to the predominance of different viral strains. The inter-relation between these factors is mapped in the graphic below.

Challenges arise when epidemiological characteristics are uncertain. Focusing on mortality rates: multiple factors affect the death statistics from which mortality is estimated, and many more affect infections, which are formally determined via confirmed case detections. As mentioned, the underlying drivers for each of these have yet to be definitively understood and quantified. For example, it is still not known whether environmental factors play a role in symptom development and mortality from COVID-19, or whether test insufficiency and delays are a significant enough constraint that a large proportion of the infected population may be going totally undetected due to weak or no symptoms.

To illustrate one of many problems arising from this: there is potentially a circular debate as to whether the true mortality rate of COVID-19 infection is in fact constant*, so that the differences in case fatality measured from country to country indicate errors on the testing side; or whether testing is adequate (albeit perhaps not perfect), so that the mortality or death computation is in error or merely reflects true differences in how the disease affects different people. (*Notwithstanding ‘minor’ differences due to age and prevalence of underlying medical conditions.) In short, one cannot determine one side of this equation without first conclusively quantifying the other.

Asymptomatic carriers. An extremely concerning prospect is that of asymptomatic transmission of the virus, given that this group of individuals would not naturally be tested under typical testing regimes. A key question that remains to be answered is therefore what the prevalence of asymptomatic carriers is from one country to the next. A recent study in the US measured an astounding 88% asymptomatic prevalence amongst 400+ participants in a homeless shelter in Boston, whereas a Japanese study measured 50% prevalence among the 600+ passengers of the Diamond Princess cruise ship (although a follow-up test measurement confirmed only 18% asymptomatic cases vs. this initial figure). Many more studies in the available literature acknowledge the existence and relatively high prevalence of asymptomatic carriers of COVID-19.

Computing the true infection envelope. We believe that there is strong merit in computing a theoretical infection envelope for each country: one that by definition precedes case detection and reported death statistics, and indicates the true number of infected persons at any given time. Given a reasonable estimate of the true mortality rate (~0.7–1.2%), we can expect the true daily infection line to peak at a higher figure than peak case detection, given that recorded case fatality rates range between 2% and 15%. We term this relative difference the ‘case detection error’; it ranges from 1x to 20x depending on the country.
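
As a rough illustration of the ‘case detection error’, the sketch below divides a recorded case fatality rate by an assumed true mortality rate. It assumes deaths are counted reliably, so that the same deaths are explained either by detected cases or by true infections. The function name and the pairing of the example inputs are illustrative assumptions; only the ranges themselves come from the text.

```python
def case_detection_error(case_fatality_rate: float, true_mortality_rate: float) -> float:
    """Ratio of estimated true infections to detected cases.

    Assumes deaths are reported reliably, so that
    deaths = CFR * detected cases = mortality * true infections,
    which gives true / detected = CFR / mortality.
    """
    if true_mortality_rate <= 0:
        raise ValueError("true_mortality_rate must be positive")
    return case_fatality_rate / true_mortality_rate


# Illustrative pairings taken from the ends of the ranges quoted in the text.
low = case_detection_error(0.02, 0.012)   # CFR 2%, assumed mortality 1.2%
high = case_detection_error(0.15, 0.007)  # CFR 15%, assumed mortality 0.7%
print(f"case detection error: {low:.1f}x to {high:.1f}x")
```

With these inputs the ratio runs from a little under 2x to a little over 20x, broadly in line with the range quoted above.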

Current lockdown measures. In general, blunt measures such as social distancing and movement control, i.e. ‘lock-downs’, seem to have been the only effective means of tapering infections and deaths. We note a very strong correlation between changes in mobility and the decline in transmission rate from country to country; over the month of April 2020 this has been the single biggest driver of tapering infections and deaths from COVID-19. While a huge relief, it is a worrying sign of things to come, as economies sit at a standstill waiting for the virus to pass or a vaccine to be introduced.

It is expected that two things, (1) faster and more comprehensive testing and (2) the use of technology for tracking and tracing infected persons, will need to be used effectively in tandem if any country hopes to regain normalcy soon after these earlier waves pass. We sincerely hope that countries lucky enough to have mitigated a massive death crisis through early lockdown measures are actively thinking about this as a step forward.


Lockdowns should probably have been the primary advice. Looking back at what has transpired over the past weeks, it is highly questionable whether testing (and track + trace) as the primary strategy to overcome the spread of COVID-19 was ever going to work. It would have been reasonable to expect the general guidance from global health authorities (e.g. WHO) to be for countries to lead with a lock-down and social distancing strategy before deploying testing in full force. On balance, this would arguably have led to a more palatable outcome over the long run, for the reasons below.

COVID-19 spreads at tremendous speed. It has an R-naught >2, asymptomatic and pre-symptomatic transmission, a long incubation period of up to 14 days, and infection concentrated in the upper respiratory tract. The SARS-CoV-2 virus itself is hardy enough to last up to 3 days on solid surfaces like metal and plastic, and can be aerosolised, with individual particle size in the order of ~100nm. All of these make COVID-19 extremely transmissible.

Importance of speed in testing is understated. The mismatch between the exponential growth of COVID-19 infections and the expected measurement lag further compounds the issue. It is well documented that even in Hubei, where the virus first broke out, the median time between infection and detection via testing was an estimated 10–12 days. With an infection doubling time of 2–4 days in most countries, this means that tests taken today would already underestimate infections by a factor of 2² (4x), 2³ (8x) or 2⁴ (16x) by the time the test results are tallied up. The result is thus a highly predictable failure to catch up.
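
The arithmetic behind these factors can be sketched as follows, assuming pure exponential growth with a constant doubling time over the whole testing lag. The function name and the specific lag and doubling-time pairings are illustrative assumptions, not figures from the analysis itself.

```python
def lag_underestimate(delay_days: float, doubling_time_days: float) -> float:
    """Factor by which infections have grown during the testing lag.

    Assumes infections double every `doubling_time_days` throughout the lag,
    so the number of doublings is delay / doubling time.
    """
    if doubling_time_days <= 0:
        raise ValueError("doubling_time_days must be positive")
    return 2.0 ** (delay_days / doubling_time_days)


# Illustrative pairings of testing lag and doubling time (both in days).
for delay, doubling_time in [(10, 4), (12, 4), (12, 3)]:
    factor = lag_underestimate(delay, doubling_time)
    print(f"{delay}-day lag, doubling every {doubling_time} days: ~{factor:.1f}x")
```

Each additional doubling that elapses during the lag doubles the underestimate, which is why shortening the lag matters more than the absolute number of tests.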

Test-kit shortages have plagued many countries. Constraints in test kit availability as infections surge through the population make matters worse. Not all countries had created assays by the time testing was due to start, and the required PCR test kits are well known to be in short supply. Testing also creates queues of patients to be swabbed and processed, further exacerbating the time lag. South Korea, for example, has to date administered some 461,000+ tests at a confirmation yield of ~2.2% (as at 5th April 2020): a highly ‘inefficient’ and time-consuming exercise against the backdrop of a fast-growing and potent virus spread. And this is a country that manufactures its own PCR test kits and had, comparatively speaking, ample supply at the ready as the pandemic took off. The authorities had also co-opted telco companies in the country early on to deploy mobile technology to help with surveillance, monitoring and contact tracing. The reality for the rest of us is perhaps that South Korea should be considered the exception rather than the rule in its ability to resist a total lock-down; emerging economies may not be well resourced enough to enjoy the same privilege.

Leading to potentially underestimated infections. As a result, and based on our analysis, reliance on testing data alone often leads to a vast underestimate of the actual total infections in each country, often by a factor of 5–20. While not factually incorrect, as these figures represent the actual number of detected cases, they understate the magnitude of infection spread and may lead the general public to perceive a smaller problem than actually exists. We believe it would be better if estimates of the actual figure were made and communicated transparently to the public from time to time, as this may encourage greater vigilance by each person in playing their part to reduce the spread of infections.

While it is unclear what the eventual and definitive median estimates for the duration of COVID-19 incubation, infectiousness, recovery and death are, it is plausible that the time from infection to death is similar to the testing delay (i.e. the delay in verifying an infected person after initial exposure), given that most countries prioritise testing of the sick. As such, testing for infections may not yield an advantage over analysing death statistics; a perverse yet meaningful insight that should be an invaluable takeaway for those in decision-making positions.

Testing is however still very necessary. This is by no means a criticism of testing itself, or a proposal to do away with it. Testing remains of great use for authorities to identify cluster networks where infections are growing so that interventions can be appropriately targeted and deployed, and for epidemiologists and scientists to study generational evolution and mutations of the virus strain. Rather, we simply believe it would have been appropriate for most countries to have enforced social distancing and lockdowns earlier, to buy time for testing to be rolled out at a manageable pace, with eventual draconian monitoring and control measures perhaps to come in time, notwithstanding challenges arising from societal non-acceptance. The three strategies can then, in time, be synchronised with one another at a manageable pace in line with the resources available.

Use deaths, case detection and mobility changes to model forecasts. Lastly, the core argument of this write-up: analysis of death data yields reasonably good predictability of future deaths and allows infection growth to be inferred, whereas case detection data often yields limited value in the same regard. Case detection data is often riddled with insufficiency (gaps) and time lags, resulting in poorly fitted trend lines, and has also been observed to be strongly correlated with testing volumes, i.e. the more persons tested within a given day, the higher the case detection volume. This is a perverse artefact that leads to bad decision making if the data is taken at face value: the phrase ‘garbage in, garbage out’ comes to mind. Those making key decisions should at least be made aware of the gaps and lags in data collection and, similarly, of the intricacies of modelling exponential growth where time sensitivities are involved. This is perhaps not a matter where mistakes can be afforded.

The method we outline here, which is used in our daily projection reports, combines data from three sources: case detection, death statistics and mobility changes. Each dataset provides unique inputs that complement the overall information used to model the forecast projections for a particular country. The following graphic helps explain these data sources, why and how they are used, and what output is derived from them.

It should also be noted that reported death statistics are not without their own shortcomings. Several countries, namely the UK, France and Indonesia, have cited data insufficiency issues in reporting deaths that occur outside of the healthcare system, i.e. within communities such as care homes, where the recorded causes of death are not fully robust; these have now been termed ‘excess deaths’. However, the results of curve fitting with current death datasets (daily and cumulative) are promising and yield good forecasting ability, suggesting that reporting errors are systematic or relatively low in volume compared with official records.

Notes on analysis

All analysis uses a mortality rate of ~1.2% (initial estimate based loosely on Chinese provincial data excluding Hubei). Exceptions are Guangdong (known figure ~0.7%), Hubei (where detection lagged, leading to a higher mortality of 4–5%), Korea (assumed 1%), and Singapore (assumed 0.8%). Note that assumptions will continue to be revised from time to time based on available data.

Three generalised models were each assessed to seek an agreeable fit with recorded data: a symmetrical bell distribution, a Weibull distribution, and a numerical distribution based on a non-continuous transmission factor. While the first two distributions provided a reasonably good fit at the rising edge of the growth exponent, the latter was eventually chosen for its ability to represent specific interventions such as social distancing and lock-downs, and it ultimately provided the best fit to the data available. For modelling purposes, infection growth takes the form of three phases:

  1. a rising exponent formulated as a geometric progression, utilising initial death growth to represent the transmission factor,

  2. a gradual decline in this exponent to a factor <1, representing a process of locking down and social distancing; this is also synchronised with reported mobility changes,

  3. a constant long-term factor of <1, representing the latent eventual transmission in each country as it seeks a long-term strategy to manage the virus spread. As an initial assumption, Hubei data is used (~0.91), but this is revised on an ongoing basis based on reported statistics from each country.
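
The three phases above can be sketched as a day-by-day geometric progression. This is a minimal illustration and not the model used in the reports: the linear shape of the decline, the initial factor of 1.25, the lockdown timing and all function and parameter names are assumptions; only the long-term factor of ~0.91 comes from the text.

```python
def transmission_factor(day, initial_factor, lockdown_start, lockdown_days,
                        long_term_factor=0.91):
    """Day-on-day multiplier for new infections across the three phases."""
    if day < lockdown_start:
        # Phase 1: constant factor > 1 (geometric growth).
        return initial_factor
    if day < lockdown_start + lockdown_days:
        # Phase 2: assumed linear decline towards the long-term factor.
        progress = (day - lockdown_start) / lockdown_days
        return initial_factor + progress * (long_term_factor - initial_factor)
    # Phase 3: constant long-term factor < 1.
    return long_term_factor


def daily_infections(seed, days, **phase_params):
    """New infections per day, each day scaled from the previous one."""
    series = [float(seed)]
    for day in range(1, days):
        series.append(series[-1] * transmission_factor(day, **phase_params))
    return series


# Illustrative inputs only.
curve = daily_infections(seed=10, days=60, initial_factor=1.25,
                         lockdown_start=20, lockdown_days=14)
peak_day = curve.index(max(curve))
print(f"daily infections peak on day {peak_day}")
```

Daily infections keep rising until the factor drops below 1, so the peak falls partway through the lockdown phase rather than at its start.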

A cohort approach is applied to each incoming wave of new daily infected cases, with each case having one of 3 outcomes on the following day: continuing to be infected, recovery or death.

A median estimate for time from detection to death of 14 days is used across all country models.
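
A minimal sketch of this cohort bookkeeping is given below. It makes two simplifying assumptions that are not stated in the text: every cohort resolves after a single fixed delay rather than a distribution, and recoveries are booked on the same day as deaths. Applying the 14-day figure from the day a cohort enters the model is likewise an assumption, and all names are illustrative.

```python
def project_cohorts(new_infections, mortality_rate=0.012, days_to_outcome=14):
    """Follow each daily cohort until it resolves into deaths or recoveries.

    Returns three lists of equal length: cases still infected at the end of
    each day, deaths on each day, and recoveries on each day.
    """
    active, deaths, recoveries = [], [], []
    currently_infected = 0.0
    for day, cohort in enumerate(new_infections):
        currently_infected += cohort
        # The cohort that entered `days_to_outcome` days ago resolves today.
        resolved = new_infections[day - days_to_outcome] if day >= days_to_outcome else 0.0
        died = resolved * mortality_rate
        currently_infected -= resolved
        deaths.append(died)
        recoveries.append(resolved - died)
        active.append(currently_infected)
    return active, deaths, recoveries


# Illustrative input: a constant 100 new infections per day for 20 days.
active, deaths, recoveries = project_cohorts([100.0] * 20)
print(f"deaths on day 14: {deaths[14]:.1f}, still infected on day 19: {active[19]:.0f}")
```

Feeding in a modelled daily infection curve instead of the constant series gives a projected death curve that lags infections by the chosen delay.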

All country models assume a lockdown or partial lockdown within March and April 2020 (excluding China); this is illustrated by a visible decline in growth of the daily infected bell curve. This may not represent all cases but has been the path taken in nearly every country, as the non-action path naturally leads to unthinkable levels of infection and death, modelled by a simple exponent. The specific assumptions on timing and effectiveness of these lockdown actions do, however, need refinement based on actual results in each country, and as such we will be continuously updating our input assumptions. Future forecasts may not be perfectly modelled, and should be read as indicative only.

Additional data from Apple and Google’s COVID-19 Mobility Reports (which we started using from 17 April onwards) provides a direct comparison with the model transmission factor and is used to further refine assumptions. This data is available at T-2 relative to the date of our reports, and hence gives an almost ‘present day’ view of physical movement by the population (i.e. mobility information).

While this approach is a departure from conventional SIR epidemiological mathematical modelling, the approximation is shown to be more than adequate for the purposes of computation where total infected << total population; with the added elegance of not having to solve several ODEs in the process.
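
For readers who want the link to SIR spelled out, the following derivation is a sketch in standard textbook notation; the symbols β, γ, N and f are not taken from the reports. When total infected is much smaller than total population, S ≈ N and the infection equation decouples into simple exponential growth, which in daily steps is a geometric progression.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \qquad
\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I \\
\frac{dI}{dt} &\approx (\beta - \gamma)\, I \quad \text{when } S \approx N \\
I(t) &\approx I(0)\, e^{(\beta - \gamma) t} \\
I_{t+1} &\approx f\, I_t, \qquad f = e^{\beta - \gamma} \quad (t \text{ in days})
\end{aligned}
```

Under the same approximation new infections per day are proportional to I, so they follow the same factor f. Letting β change over time corresponds to a transmission factor that varies from day to day.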

COVID-19 statistics used are compiled by Johns Hopkins University. Link: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
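
As a hedged sketch of how such data might be prepared, the snippet below sums a wide-format time-series table by country and converts cumulative counts to daily counts. It assumes the column layout of the repository's global time-series files (Province/State, Country/Region, Lat, Long, then one column per date). The sample rows are made up for illustration and the function names are our own.

```python
import csv
import io
from collections import defaultdict

FIXED_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def country_totals(csv_text):
    """Sum cumulative counts per country and date across provinces."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        country = row["Country/Region"]
        for column, value in row.items():
            # Skip metadata columns and empty cells; the rest are date columns.
            if column is None or column in FIXED_COLUMNS or not value:
                continue
            totals[country][column] += int(float(value))
    return {country: dict(by_date) for country, by_date in totals.items()}


def daily_from_cumulative(cumulative):
    """Convert a cumulative series into day-on-day increments."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


# Made-up sample rows in the assumed layout (not real figures).
sample = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
North,Exampleland,10.0,20.0,1,3,6
South,Exampleland,11.0,21.0,0,2,2
,Otherland,30.0,40.0,0,0,1
"""

totals = country_totals(sample)
daily = daily_from_cumulative(list(totals["Exampleland"].values()))
print(totals["Exampleland"], daily)
```

The same functions can be pointed at the text of a downloaded file; fetching it is left out here so that the example runs without network access.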

Mobility data is sourced from Apple and Google’s respective COVID-19 Mobility Reports.

