Available data

The Simulacrum is a synthetic dataset based on datasets held on the National Disease Registration Service (NDRS) Cancer Analysis System (CAS) at NHS England. It has a similar data structure and similar statistical properties to the real cancer datasets, but it is made up entirely of synthetic patients and contains no private patient information.

The Simulacrum can be used to explore data, scope research hypotheses, and develop code that can later be run on CAS data to produce analysis outputs. To learn more about the Simulacrum, please refer to our Background section.

Two versions of the Simulacrum are available for download from the website:

Simulacrum v1.2.0 

  • This release includes patient and tumour information for diagnosis years 2013 to 2017, together with systemic anti-cancer therapy (SACT, e.g. chemotherapy) treatment data.

Simulacrum v2.1.0  

  • This is the latest release. It covers newer diagnosis years, 2016 to 2019, and adds radiotherapy and genomic testing data.

Typical data items found in the Simulacrum:  

  • Synthetic patient characteristics, such as age, sex and ethnicity; 
  • Synthetic tumour characteristics, such as cancer site, staging and pathology information; 
  • The vital status of each synthetic patient, i.e., whether the patient has died or been censored, which can be used for survival analysis; 
  • Details about the systemic anti-cancer therapy (SACT) treatments, e.g. chemotherapy, that synthetic patients receive, such as which drugs were prescribed or the patient's weight at the start of a regimen;
  • Details about radiotherapy treatments, such as which part of the body was treated and at what dose (only in Simulacrum v2);
  • Details about somatic genomic tests, such as which tumour genes were tested and test outcomes, i.e., normal vs abnormal result (only in Simulacrum v2). 
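
As a rough illustration of how the vital status item can feed a survival analysis, the sketch below derives a follow-up time and an event flag for two made-up synthetic patients. The field names (`patient_id`, `diagnosis_date`, `vital_status`, `vital_status_date`) and the status codes `"A"` (alive, censored) and `"D"` (died) are hypothetical placeholders chosen for this example, not the Simulacrum's own field names or codes; the data dictionaries list the real ones.

```python
from datetime import date

# Toy records with HYPOTHETICAL field names and status codes.
# These are not real Simulacrum fields or rows.
patients = [
    {"patient_id": 1, "diagnosis_date": date(2016, 3, 1),
     "vital_status": "D", "vital_status_date": date(2017, 3, 1)},
    {"patient_id": 2, "diagnosis_date": date(2018, 1, 1),
     "vital_status": "A", "vital_status_date": date(2018, 1, 31)},
]


def follow_up(record):
    """Return (days from diagnosis to vital status date, died flag)."""
    days = (record["vital_status_date"] - record["diagnosis_date"]).days
    died = record["vital_status"] == "D"  # False means the patient is censored
    return days, died


# Map each patient to their follow-up time and event indicator.
results = {r["patient_id"]: follow_up(r) for r in patients}

for patient_id, (days, died) in results.items():
    status = "died" if died else "censored"
    print(f"patient {patient_id}: {days} days of follow-up, {status}")
```

Pairs of follow-up time and event flag like these are the usual input to survival estimators such as Kaplan-Meier.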

For a detailed description of all the data fields within each version, please see our Simulacrum v1.2.0 data dictionary and Simulacrum v2.1.0 data dictionary, both downloadable from our Library page.

A breakdown of characteristics of the available Simulacrum v1.2.0 and v2.1.0 releases can be found in the comparison chart below. 

Characteristic                                Simulacrum v1.2.0   Simulacrum v2.1.0
Diagnosis years                               2013-2017           2016-2019
Synthetic patients                            2,200,626           1,871,605
Synthetic tumours                             2,371,281           1,995,570
Synthetic patients with SACT                  366,266             352,372
Synthetic patients with RTDS                  --                  413,169
Synthetic patients with genomic testing data  --                  94,908
Years of SACT data                            2012 onwards        2012-2022
Years of RTDS data                            --                  2012-2022
Years of genomics data                        --                  2016-2019
SACT regimens                                 730,472             781,389
SACT cycles                                   2,442,037           2,741,674
SACT drug administrations                     6,385,828           7,662,030
RTDS episodes                                 --                  656,560
RTDS prescriptions                            --                  657,648
RTDS exposures                                --                  13,201,531
Genomic tests                                 --                  255,728
Non-melanoma skin cancer diagnoses (C44)      607,619             514,517
Breast cancer diagnoses (C50)                 226,406             187,204
Prostate cancer diagnoses (C61)               201,785             179,478
Lung cancer diagnoses (C34)                   169,118             156,927
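
The counts in the comparison chart can be combined into simple ratios, such as the average number of synthetic tumours per synthetic patient or the share of synthetic patients with SACT data. The sketch below uses only figures copied from the chart; the variable and function names are illustrative.

```python
# Counts copied from the comparison chart above.
releases = {
    "v1.2.0": {"patients": 2_200_626, "tumours": 2_371_281,
               "patients_with_sact": 366_266},
    "v2.1.0": {"patients": 1_871_605, "tumours": 1_995_570,
               "patients_with_sact": 352_372},
}


def summarise(counts):
    """Derive two ratios from the raw counts of one release."""
    return {
        "tumours_per_patient": counts["tumours"] / counts["patients"],
        "sact_share": counts["patients_with_sact"] / counts["patients"],
    }


summary = {version: summarise(counts) for version, counts in releases.items()}

for version, s in summary.items():
    print(f"Simulacrum {version}: "
          f"{s['tumours_per_patient']:.3f} tumours per patient, "
          f"{s['sact_share']:.1%} of patients with SACT data")
```

Ratios like these give a quick sense of how the two releases differ in scale before either one is downloaded.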