Background
What is the Simulacrum?
The Simulacrum is a collection of synthetic datasets that imitates patient records held by the National Disease Registration Service (NDRS), NHS England. It was developed by our synthetic data team at Health Data Insight CiC (HDI), with support from AstraZeneca (AZ) and IQVIA.
The datasets in the Simulacrum are made up entirely of artificial cancer patients and their tumour diagnoses, cancer treatments and genetic information. While the artificial records look realistic, they do not contain any real patient information, so they cannot be used to identify a real person.
NDRS cancer data and patient confidentiality
NDRS collects data on all cancers diagnosed in England and links it with other data from NHS England. The linked data is stored as a large and complex database in the Cancer Analysis System (CAS). It contains highly valuable and detailed information about cancer patients, their chemotherapy and radiotherapy treatments and genomic tests. It has the potential to be used by researchers, academics, the pharmaceutical industry and others to answer important questions about cancer and to conduct research that can lead to improved patient outcomes.
However, such data also contains confidential patient information, which must be protected, so it is not readily available to the public. The data can only be accessed if it has first been anonymised, unless specific legal and ethical approvals are in place. To access it, a researcher must apply to NHS England’s Data Access Request Service (DARS) for a specific data release. Without being able to explore the data first, it can be difficult for a researcher to know whether it will be useful for answering their specific questions, or what data to request.
Where did the idea for the Simulacrum come from?
The idea for the Simulacrum came from a conversation between Jem Rashbass (HDI’s CEO), Hilary Wilderspin (an HDI Board Director) and Matt Williams (a clinical oncologist and research fellow at Imperial College London) over coffee at Store Street Espresso in late 2014. Matt’s idea was that it would be useful to have a publicly available but completely anonymous version of the data held by the NDRS. Users could write queries against this version which, following appropriate approvals, could then be run against the real data, with only fully anonymised outputs released. This would ultimately enable research without ever giving users direct access to the data.
Simulacrum properties and use
The Simulacrum was therefore created to enable anyone to use cancer data that looks like the real thing to ask questions about cancer, without compromising patient confidentiality. Access to this synthetic data does not require many of the essential controls that are needed to allow access to the real patient data held by NDRS.
The Simulacrum has a similar data structure to the real data in the CAS and maintains many of the statistical properties of the original data with a high degree of accuracy. Thus, it can be used to learn about the structure of the data, formulate hypotheses and write code for running analyses before making a formal request to NDRS to analyse the real data.
However, it does have limitations: more complex statistical properties are captured less well. This means that the more complex the query run on the synthetic data, the more approximate the results will be compared with the real data. It is therefore important that the Simulacrum alone is not used to make epidemiological inferences or clinical decisions.
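The following toy sketch illustrates why a more complex query can be less reliable on synthetic data. It does not show how the Simulacrum is generated. It assumes a deliberately naive synthesiser that shuffles each column independently, applied to made-up records. All names here (make_toy_real, naive_synthesise, treated_rate, and the "age" and "treated" columns) are hypothetical and are not taken from the CAS or the Simulacrum.

```python
import random


def make_toy_real(n, rng):
    """Build made-up 'real' records in which older patients are treated less often."""
    rows = []
    for _ in range(n):
        age = rng.randint(30, 90)
        p_treated = 0.9 if age < 70 else 0.3  # arbitrary illustrative values
        rows.append({"age": age, "treated": rng.random() < p_treated})
    return rows


def naive_synthesise(rows, rng):
    """Shuffle each column independently.

    Assumption for illustration only: this keeps every single-column
    distribution exactly, but destroys the link between columns.
    """
    ages = [r["age"] for r in rows]
    treated = [r["treated"] for r in rows]
    rng.shuffle(ages)
    rng.shuffle(treated)
    return [{"age": a, "treated": t} for a, t in zip(ages, treated)]


def treated_rate(rows, min_age=None):
    """Proportion treated, optionally restricted to patients aged min_age or over."""
    subset = [r for r in rows if min_age is None or r["age"] >= min_age]
    return sum(r["treated"] for r in subset) / len(subset)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
real = make_toy_real(5000, rng)
synthetic = naive_synthesise(real, rng)

# Simple query: one column at a time.
simple_real = treated_rate(real)
simple_synth = treated_rate(synthetic)

# More complex query: depends on how two columns relate to each other.
complex_real = treated_rate(real, min_age=70)
complex_synth = treated_rate(synthetic, min_age=70)

print("overall treated rate:", simple_real, simple_synth)
print("treated rate, age 70+:", complex_real, complex_synth)
```

In this sketch the overall treated rate is identical in both datasets by construction, while the rate among patients aged 70 and over diverges, because the shuffle has removed the relationship between age and treatment. A real synthetic dataset preserves far more structure than this, but the same principle applies: results from queries that depend on many interacting variables should be checked against the real data.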