Frequently Asked Questions

About the project

What is the Simulacrum?

The Simulacrum is synthetic data which imitates some of the data held securely by the National Cancer Registration and Analysis Service (NCRAS) within Public Health England (PHE).  NCRAS collects data on all cancers diagnosed in England, and this data is then linked with other datasets from the NHS to create a database called the Cancer Analysis System (CAS).  The simulated data in the Simulacrum mimics some of the CAS data, but because it is synthetic data the Simulacrum contains no real patient information. Therefore, the Simulacrum can be used to test hypotheses and gather information using data that has the “look and feel” of the cancer data within Public Health England without ever compromising patient confidentiality.

Because we have kept the data model the same as the real one in PHE, the Simulacrum can also be used to write and test queries that (with the right permissions and ethical approval) could be run on the real data.
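As a rough illustration of this idea, the sketch below builds two tiny in-memory SQLite databases that share one schema and runs the same query against both. The table name `tumour`, its columns and every row are hypothetical stand-ins invented for this example; they are not the Simulacrum's or the CAS's actual data model. The only point is that a query written against one database runs unchanged against another with the same data model.

```python
import sqlite3

# Hypothetical schema for illustration only: NOT the real Simulacrum or CAS model.
SCHEMA = """
CREATE TABLE tumour (
    tumour_id INTEGER PRIMARY KEY,
    site_code TEXT NOT NULL,
    diagnosis_year INTEGER NOT NULL
)
"""

# The query being developed. It depends on the schema, not on the rows.
QUERY = """
SELECT site_code, COUNT(*) AS n_tumours
FROM tumour
WHERE diagnosis_year = ?
GROUP BY site_code
ORDER BY site_code
"""


def make_database(rows):
    """Create an in-memory database with the shared schema and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO tumour (tumour_id, site_code, diagnosis_year) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def count_by_site(conn, year):
    """Run the shared query and return (site_code, n_tumours) tuples."""
    return conn.execute(QUERY, (year,)).fetchall()


# Invented rows: one set plays the synthetic data, the other a second
# database that happens to share the same schema.
synthetic_rows = [(1, "C34", 2015), (2, "C50", 2015), (3, "C50", 2015), (4, "C18", 2014)]
other_rows = [(10, "C34", 2015), (11, "C34", 2015), (12, "C61", 2015)]

synthetic_counts = count_by_site(make_database(synthetic_rows), 2015)
other_counts = count_by_site(make_database(other_rows), 2015)

print(synthetic_counts)
print(other_counts)
```

The query text is identical in both calls; only the connection changes. The counts differ because the rows differ, which is the same reason results from synthetic data are approximate.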

The Simulacrum was developed by Health Data Insight, with support from AstraZeneca and IQVIA.

Why was the Simulacrum developed?

Cancer affects one in two people in their lifetime, and if we are to improve the diagnosis, treatment and survival of individuals, we need to use data to understand the disease. Public Health England collects cancer data on everyone diagnosed with cancer in England, and this data is used by the NHS, health care professionals and many others to benefit patients. However, the original data is personal and therefore confidential, and access is, quite rightly, tightly controlled. This can create a barrier for researchers and others trying to develop solutions that will benefit patients.

The Simulacrum has been created to protect patient confidentiality but at the same time make it possible for anyone who needs to ask questions on cancer data to do so.

Why do researchers want to access the data held by NCRAS?

The National Cancer Registration and Analysis Service (NCRAS) holds a wealth of data about all cancers diagnosed in England. There are up to 1,000 data items for each tumour diagnosed, including information on demographics, staging, pathology, treatment and hospital use, making it the largest dataset on cancer in the world. It is therefore a valuable resource for researchers who want to learn more about the disease and how best to treat it. Uses might include evaluating the effectiveness of new cancer drugs or treatment regimens, or planning how and where services are delivered.

How should I use the Simulacrum?

Download the Simulacrum and query the data to conduct your research. The more complex the queries, the more approximate the results. The Simulacrum data is synthetic and therefore not completely accurate, so it is not suitable for clinical decisions.

Are there plans to expand the Simulacrum?

Yes. PHE holds other datasets that could be included in the Simulacrum. Before these can be used to expand the current version, PHE will need to produce and release anonymous datasets.

Protecting patient confidentiality

The Simulacrum was built to facilitate research based on data held by the National Cancer Registration and Analysis Service in PHE while protecting patient confidentiality. By using synthetic data in place of the real data, researchers can work with data that has the look and feel of the real data and keeps the same data model, without any risk to patient confidentiality.

How can you be sure that you cannot identify a real patient in the Simulacrum?

The Simulacrum data is designed to model and mimic many of the statistical properties and linked relationships of the original cancer data in PHE but never contains any actual patient data. To understand how patient privacy was protected during the development of the Simulacrum please visit our development page, and for a technical description, please visit our library section.

If I do not want my data to be included in the Simulacrum, what can I do about this?

Your data is not in the Simulacrum.

The team building the Simulacrum only ever used anonymous data that was provided with the approval of PHE’s Office for Data Release and was made from data pooling more than 50 similar cases. Because the original data is completely anonymous, PHE has released examples of the original data used to build the Simulacrum on data.gov.uk. The synthetic data in the Simulacrum therefore has no real patient data in it, and even the synthetic patients we have created will not mimic any one individual.

However, if you are a cancer patient, and you do not wish for your data to be used in PHE’s National Cancer Registration and Analysis Service, you can ask PHE to remove all of your details from the cancer registry at any time.  This will not affect your treatment or care.  For details of how to opt out, please visit: https://www.ndrs.nhs.uk/national-disease-registration-service/patients/opting-out/

Did AstraZeneca and IQVIA see any individual patient data during the development of the Simulacrum?

The Simulacrum was built entirely from anonymous data, and no individual identifiable data was ever shared with AZ or IQVIA.

Was any individual patient identifiable data shared with HDI, AstraZeneca or IQVIA in the development of the Simulacrum?

No. PHE anonymised the original NCRAS data used to generate the Simulacrum before it was provided to HDI; examples of the data used are available on data.gov.uk. This was then shared with HDI following approval from the Office for Data Release who then created the data cubes used for simulation.

Design and use of the simulated data

How is the synthetic data generated?

The simulated data is generated using a machine learning algorithm. A detailed description of the methodology will be published in 2019. For a technical description, please visit our library page.

How often will the Simulacrum be updated to include more recent diagnoses?

The Simulacrum will be updated on an annual basis to include more recent diagnoses.

Who is able to use the Simulacrum and for what purposes?

The Simulacrum is entirely synthetic data and is available for anyone to use. Because it only approximates the original data, results from the Simulacrum should not be used for clinical decisions.

The data model (not the data – which is synthetic) in the Simulacrum is the same as the original one in PHE.  Once a user has refined their query using the Simulacrum, they can make a formal request to Public Health England’s Office for Data Release to have their queries run on the real data. In this way, the Simulacrum can be used to assist research for public health, epidemiology, commissioning and service planning.

How robust is the data for clinical research purposes?

The data contained within the Simulacrum is synthetic and should never be used to make clinical decisions. The more complex the queries, the more approximate the results. Researchers who wish to run their analyses on the real data can make a formal request to Public Health England’s Office for Data Release.

Is the Simulacrum relevant for use beyond the UK?

The Simulacrum is totally synthetic data and is therefore available to researchers anywhere in the world.  The data in the Simulacrum was built from anonymous data derived originally from the population of England.  Access to the original cancer data collected by Public Health England is managed by PHE’s Office for Data Release.

Will this approach be extended for use in other disease areas, for example cardiovascular and metabolic diseases?

This is potentially possible, but it requires high-quality, well-structured and detailed original datasets.

Will synthetic data be viable for use with regulators and market access decision-makers?

The data contained within the Simulacrum is synthetic; we therefore do not recommend submitting analyses based purely on the Simulacrum to regulatory agencies or market access decision-makers without appropriate caveats. Researchers who wish to run their analyses on the real data can make a formal request to Public Health England’s Office for Data Release.

Can I use my preferred analytical package with the Simulacrum?

The Simulacrum is released as downloadable microdata in flat-file or database format so that users can work with it in their preferred analytical package.
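For instance, a flat-file extract can be read with nothing more than a language's standard library. The sketch below uses Python's built-in csv module on a small in-memory sample. The column names and values are invented for illustration and are not the Simulacrum's actual file layout.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a downloaded flat file.
# Column names are hypothetical, not the Simulacrum's real layout.
SAMPLE_CSV = """patient_id,sex,diagnosis_year
1,F,2014
2,M,2015
3,F,2015
4,F,2015
"""


def count_by_column(handle, column):
    """Count how many rows take each value in the named column."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


year_counts = count_by_column(io.StringIO(SAMPLE_CSV), "diagnosis_year")
sex_counts = count_by_column(io.StringIO(SAMPLE_CSV), "sex")

print(dict(year_counts))
print(dict(sex_counts))
```

With a downloaded file, the same function should work on a handle from `open(path, newline="")`, where `path` is wherever you saved the file. Note that the csv module returns every value as a string, so numeric columns need converting before arithmetic.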

Can I publish any results generated directly from Simulacrum?

Yes, you can publish your results from the Simulacrum. By doing so, you accept that your published results are based on synthetic data. If you wish to run your queries on the real data, you can make a request to Public Health England’s Office for Data Release. We ask that anyone publishing Simulacrum results acknowledges the Simulacrum project. Please visit our acknowledgments page for more information.

Information about project sponsors

The Simulacrum is a joint project between Health Data Insight CIC, AstraZeneca (AZ) and IQVIA. The project started in January 2016, and staff from all three organisations worked together to develop the statistical and technical elements of the build. The parties involved in this initiative firmly believe that improving access to this data will directly benefit patients by improving health outcomes.

What are the roles of AZ and IQVIA on the project?

AZ and IQVIA provided data science, oncology and technology expertise and, along with academic and research colleagues, were responsible for ensuring that the simulated database could be used to answer relevant and important questions. AZ and IQVIA tested the prototype of the Simulacrum before it was released.

AZ and IQVIA co-funded the development of the Simulacrum pilot.

At no time were AZ or IQVIA given access to patient identifiable data.

What is the role of Public Health England (PHE) in the project?

PHE is responsible for the National Cancer Registration and Analysis Service (NCRAS) that collects the sensitive confidential cancer information. PHE recognises the benefits of using data for the public good but also takes its responsibility for protecting confidentiality very seriously and is committed to optimising the use of public data. As such, PHE has no direct role in the development of the Simulacrum, though the PHE Office for Data Release did provide the anonymous data that enabled the project to progress.

What is the role of Health Data Insight CIC on the project?

Health Data Insight (HDI) CIC is a social enterprise that was created to develop technical solutions that deliver benefits to patients, the NHS and the public. As a Community Interest Company, HDI is required to ensure that all the value produced from this work is captured for the social good. The Simulacrum is therefore freely available for anyone to use under an MIT License.

HDI owns the intellectual property and was responsible for managing the testing and development of the Simulacrum.