Frequently Asked Questions
About the project
What is the Simulacrum?
The Simulacrum is synthetic data which imitates some of the data held securely by the National Cancer Registration and Analysis Service (NCRAS) within Public Health England (PHE). NCRAS collects data on all cancers diagnosed in England, and this data is then linked with other datasets from the NHS to create a database called the Cancer Analysis System (CAS). The simulated data in the Simulacrum mimics some of the CAS data, but because it is synthetic data the Simulacrum contains no real patient information. Therefore, the Simulacrum can be used to test hypotheses and gather information using data that has the “look and feel” of the cancer data within Public Health England without ever compromising patient confidentiality.
Because we have kept the data model the same as the real one in PHE, the Simulacrum can also be used to write and test queries that (with the right permissions and ethical approval) could be run on the real data.
The Simulacrum was developed by Health Data Insight, with support from AstraZeneca and IQVIA.
Why was the Simulacrum developed?
Cancer affects one in two people in their lifetime, and if we are to improve the diagnosis, treatment and survival of individuals, we need to use data to understand the disease. Public Health England collects cancer data on everyone diagnosed with cancer in England, and this data is used by the NHS, health care professionals and many others to benefit patients. However, the original data is personal and therefore confidential, and access is, quite rightly, tightly controlled. This can create a barrier for researchers and others trying to develop solutions that will benefit patients.
The Simulacrum has been created to protect patient confidentiality but at the same time make it possible for anyone who needs to ask questions on cancer data to do so.
Why do researchers want to access the data held by NCRAS?
The National Cancer Registration and Analysis Service (NCRAS) holds a wealth of data about all cancers diagnosed in England. There are up to 1,000 data items for each tumour diagnosed, including information on demographics, staging, pathology, treatment and hospital use, making it the largest dataset on cancer in the world. It is therefore a valuable resource for researchers who want to learn more about the disease and how best to treat it. This might include evaluating the effectiveness of new cancer drugs or treatment regimens, or planning how and where services are delivered.
How should I use the Simulacrum?
Are there plans to expand the Simulacrum?
Protecting patient confidentiality
The Simulacrum was built to facilitate research based on data held by the National Cancer Registration and Analysis Service in PHE while protecting patient confidentiality. By using synthetic data in place of the real data, researchers can work with data that has the look and feel of the real data and follows the same data model, without any risk to patient confidentiality.
How can you be sure that you cannot identify a real patient in the Simulacrum?
If I do not want my data to be included in the Simulacrum, what can I do about this?
Your data is not in the Simulacrum.
The team building the Simulacrum only ever used anonymous data that was provided with the approval of PHE’s Office for Data Release and was made by pooling data from more than 50 similar cases. Because the original data is completely anonymous, PHE has released examples of the original data used to build the Simulacrum on data.gov.uk. The synthetic data in the Simulacrum therefore contains no real patient data, and even the synthetic patients we have created will not mimic any one individual.
However, if you are a cancer patient, and you do not wish for your data to be used in PHE’s National Cancer Registration and Analysis Service, you can ask PHE to remove all of your details from the cancer registry at any time. This will not affect your treatment or care. For details of how to opt out, please visit: https://www.ndrs.nhs.uk/national-disease-registration-service/patients/opting-out/
Did AstraZeneca and IQVIA see any individual patient data during the development of the Simulacrum?
Was any individual patient identifiable data shared with HDI, AstraZeneca or IQVIA in the development of the Simulacrum?
Design and use of the simulated data
How is the synthetic data generated?
The simulated data is generated using a machine learning algorithm. A detailed description of the methodology will be published in 2019. For a technical description, please visit our library page.
How often will the Simulacrum be updated to include more recent diagnoses?
Who is able to use the Simulacrum and for what purposes?
The data model (not the data – which is synthetic) in the Simulacrum is the same as the original one in PHE. Once a user has refined their query using the Simulacrum, they can make a formal request to Public Health England’s Office for Data Release to have their queries run on the real data. In this way, the Simulacrum can be used to assist research for public health, epidemiology, commissioning and service planning.
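As a rough illustration of this workflow, the sketch below develops a query against a small in-memory table and checks that it runs before it would be submitted anywhere. The table name `tumour`, the columns `site_code` and `diagnosis_year`, and the rows themselves are all hypothetical and invented for this example; they are not taken from the Simulacrum or the CAS data model, so consult the published data dictionary for the real names.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration;
# they are NOT taken from the Simulacrum or CAS data dictionaries.
QUERY = """
    SELECT site_code, COUNT(*) AS n_tumours
    FROM tumour
    WHERE diagnosis_year = ?
    GROUP BY site_code
    ORDER BY site_code
"""


def build_synthetic_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tumour ("
        "tumour_id INTEGER PRIMARY KEY, "
        "site_code TEXT, "
        "diagnosis_year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tumour (site_code, diagnosis_year) VALUES (?, ?)",
        [("C34", 2015), ("C50", 2015), ("C50", 2015), ("C18", 2014)],
    )
    conn.commit()
    return conn


def tumours_by_site(conn, year):
    """Run the query under development and return (site_code, count) rows."""
    return conn.execute(QUERY, (year,)).fetchall()


conn = build_synthetic_db()
rows = tumours_by_site(conn, 2015)
for site_code, n_tumours in rows:
    print(site_code, n_tumours)
```

Because the query text is kept separate from the data it runs on, the same text could in principle be included in a request to the Office for Data Release once it behaves as intended on the synthetic data.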
How robust is the data for clinical research purposes?
Is the Simulacrum relevant for use beyond the UK?
The Simulacrum is entirely synthetic data and is therefore available to researchers anywhere in the world. It was built from anonymous data derived originally from the population of England. Access to the original cancer data collected by Public Health England is managed by PHE’s Office for Data Release.
Will this approach be extended for use in other disease areas, for example cardiovascular and metabolic diseases?
Will synthetic data be viable for use with regulators and market access decision-makers?
Can I use my preferred analytical package with the Simulacrum?
Can I publish any results generated directly from the Simulacrum?
Information about project sponsors
The Simulacrum is a joint project between Health Data Insight CIC (HDI), AstraZeneca (AZ) and IQVIA. The project started in January 2016, and staff from all three organisations worked together to develop the statistical and technical elements of the build. The parties involved in this initiative firmly believe that improving access to this data will directly benefit patients by improving health outcomes.
What are the roles of AZ and IQVIA on the project?
AZ and IQVIA co-funded the development of the Simulacrum pilot.
At no time were AZ or IQVIA given access to patient identifiable data.
What is the role of Public Health England (PHE) in the project?
What is the role of Health Data Insight CIC on the project?
HDI owns the intellectual property and was responsible for managing the testing and development of the Simulacrum.