The idea for the Simulacrum came from a conversation over coffee at Store Street Espresso in late 2014. Matt Williams suggested to Jem Rashbass and Hilary Wilderspin that it would be useful to have a publicly available version of the data model of the cancer dataset held securely by the National Cancer Registration and Analysis Service (NCRAS) and procured by the National Disease Registration Service, which is part of Public Health England (PHE). His suggestion was that a publicly available model would allow users to write queries that, following appropriate approvals, could then be tested against the real data held in PHE. Over time, users would contribute to a communal code repository.
Researchers and others need access to data to answer important questions about diseases like cancer, but if real data is used, patient confidentiality must be protected. The Simulacrum was therefore created with synthetic data so that anyone who wants to ask questions about cancer can do so on a dataset that looks a lot like the real thing but never compromises patient confidentiality. This removes the need for many of the essential controls that are required to access the real patient data in PHE.
The Simulacrum looks like the real cancer data within NCRAS, but does not contain any real patient information. Anyone can use it to learn more about cancer in England without compromising patient privacy. Also, because we have kept the data model the same as the real one in PHE, the Simulacrum can be used to write and test queries that (with the right permissions and ethical approval) could be run on the real data.
NCRAS collects data on all cancers diagnosed in England. This is then linked with other data from the NHS to create a large and complex database that is held in the Cancer Analysis System (CAS). The Simulacrum has been designed to mimic some of the data held on the CAS.
Although the data is synthetic, the Simulacrum maintains most of the properties of the original data with a high degree of accuracy. There are limitations, however: the more complex the query, the more approximate the results. Because the data model (but not the data) is the same as the real model in the Cancer Analysis System in PHE, researchers can use the Simulacrum to plan and test their hypotheses before making a formal request to PHE to analyse the real data.
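As a rough illustration of that workflow, the sketch below writes a query once against a shared data model and runs it on a small synthetic database. The table name `tumour`, its columns, the sample rows and the helper functions are all hypothetical, chosen only for this example; they are not the actual Simulacrum or CAS layout.

```python
import sqlite3

# Hypothetical schema for illustration only; not the real Simulacrum/CAS tables.
SCHEMA = """
CREATE TABLE tumour (
    patient_id INTEGER,
    site_code TEXT,
    diagnosis_year INTEGER
)
"""

# The query depends only on the data model, not on which data sits behind it.
QUERY = """
SELECT site_code, COUNT(*) AS n
FROM tumour
WHERE diagnosis_year = ?
GROUP BY site_code
ORDER BY site_code
"""


def make_synthetic_db():
    """Build an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [(1, "C50", 2015), (2, "C50", 2015), (3, "C34", 2015), (4, "C34", 2016)]
    conn.executemany("INSERT INTO tumour VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def count_by_site(conn, year):
    """Count tumours per site code for one diagnosis year."""
    return conn.execute(QUERY, (year,)).fetchall()


synthetic = make_synthetic_db()
result = count_by_site(synthetic, 2015)
print(result)
```

Because `count_by_site` only assumes the schema, the same function could in principle be pointed at any connection that exposes the same data model, which is the property the Simulacrum relies on.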
Lora Frayling speaking at the HDR UK Synthetic Data Special Interest Group, December 2020.