Protecting patient confidentiality

The Simulacrum was built to facilitate research based on data held by the National Disease Registration Service in NHS England while protecting patient confidentiality. By using synthetic data in place of the real data, researchers can work with data that has the look and feel of the real data and the same structure, without any risk to patient confidentiality.

Information about project sponsors

The Simulacrum was developed by Health Data Insight CIC (HDI) in partnership with AstraZeneca (AZ) (Simulacrum v1 and v2) and IQVIA (Simulacrum v1). Work started in January 2016, and staff from all organisations worked together to develop the statistical and technical elements of the build. The parties involved in this initiative firmly believe that improving access to this data will directly benefit patients by improving health outcomes.

HDI have been responsible for generating and evaluating the synthetic database, while AZ and IQVIA have guided the development by providing data science, oncology and technology expertise. Additionally, AZ and IQVIA have helped with testing the Simulacrum to ensure it meets researchers’ needs.

Frequently asked questions

i. What is the Simulacrum? 

The Simulacrum is synthetic data which imitates some of the data held securely by the National Disease Registration Service (NDRS) within NHS England. NDRS collects data on all cancers diagnosed in England, which is then linked with other datasets from NHS England to create a database called the Cancer Analysis System (CAS). The simulated data in the Simulacrum mimics some of the CAS data, but is made up entirely of artificial patients and contains no real patient information. Therefore, the Simulacrum can be used to test hypotheses and gather information using data that has the “look and feel” of the cancer data within NHS England without ever compromising patient confidentiality.

Also, because we have kept the data model the same as the real one in NHS England, the Simulacrum can be used to write and test queries that (with the right permissions and ethical approval) could be run on the real data.

The Simulacrum was developed by Health Data Insight, with support from AstraZeneca and IQVIA.

ii. Why was the Simulacrum developed? 

Cancer affects one in two people in their lifetime, and if we are to improve the diagnosis, treatment and survival of individuals, we need to use data to understand the disease. NHS England collects cancer data on everyone diagnosed with cancer in England, and this data is used by the NHS, health care professionals and many others to benefit patients. However, the original data is personal and confidential, so access is, quite rightly, tightly controlled. This can create a barrier for researchers and others trying to develop solutions that will benefit patients.

The Simulacrum has been created to protect patient confidentiality but at the same time make it possible for anyone who needs to ask questions on cancer data to do so. 

iii. Why do researchers want to access the data held by NDRS?

The National Disease Registration Service (NDRS) collects a wealth of data about all cancers diagnosed in England. There are up to 1,000 data items for each tumour diagnosed, including information on demographics, staging, pathology, treatment and hospital use, making it the largest dataset on cancer in the world. It is therefore a valuable resource for researchers who want to learn more about the disease and how best to treat it. This might include evaluating the effectiveness of new cancer drugs or treatment regimens, or planning how and where services are delivered.

iv. How should I use the Simulacrum? 

Download the Simulacrum and query the data to conduct your research. The more complex the queries, the more approximate the results. The Simulacrum data is synthetic and therefore not completely accurate, so it is not suitable for clinical decisions. You can request to have your queries run on the real NDRS data. Get in touch with us at simulacrumdata@healthdatainsight.org.uk or through the Contact Us page to find out more.

v. Are there plans to expand the Simulacrum? 

Currently, we do not have plans to expand the Simulacrum further, although it would certainly be possible. NHS England holds many other datasets that could be included in the Simulacrum and, as the CAS data gets updated, we may wish to update the Simulacrum to reflect any changes.

vi. How can you be sure that you cannot identify a real patient in the Simulacrum? 

The Simulacrum data is designed to model and mimic many of the statistical properties and linked relationships of the original cancer data in NHS England, but it never contains any actual patient data. To understand how patient privacy was protected during the development of the Simulacrum, please visit our development page. For a technical description, please visit our library page.

vii. If I do not want my data to be included in the Simulacrum, what can I do about this? 

Your data is not in the Simulacrum. The Simulacrum is made up of artificial data and records, so no individual in the Simulacrum corresponds to a real person. Any resemblance to a real person is purely coincidental.

The team generated the Simulacrum from anonymous data summaries that were provided with the approval of Public Health England’s Office for Data Release and were made by aggregating data from groups of at least 50 similar patient records. Because the original data summaries are completely anonymous, Public Health England released examples of them for Simulacrum v1 on data.gov.uk. The synthetic data in the Simulacrum therefore has no real patient data in it, and even the synthetic patients we have created will not mimic any one individual.

However, if you are a cancer patient, and you do not wish for your data to be used in NHS England’s NDRS, you can ask them to remove all of your details from the cancer registry at any time.  This will not affect your treatment or care.  For details of how to opt out, please visit: https://digital.nhs.uk/services/national-data-opt-out.

viii. Did AstraZeneca and IQVIA see any individual patient data during the development of the Simulacrum?

The Simulacrum was built entirely from anonymous data, and no individual identifiable data was ever shared with AstraZeneca or IQVIA.

ix. What data was shared with HDI, AstraZeneca and IQVIA in the development of the Simulacrum? 

As part of the partnership agreements in place between HDI and the NDRS, a pseudonymised version of the original patient data was securely accessed by HDI to use for creating the Simulacrum. The pseudonymisation applied meant that direct identifiers, including names and addresses, were removed to reduce the identifiability of the data. At no point was this data shared with AstraZeneca or IQVIA.

Beta testing versions of the synthetic datasets in the Simulacrum were shared with AstraZeneca and IQVIA during development. These testing versions of the synthetic datasets contained no real patient information, posing no risk to patient privacy, and were used to support testing activities and identify improvements that could be made to the synthetic generation algorithm prior to a full public release.

x. How is the synthetic data generated? 

The simulated data is generated using a machine learning algorithm. A detailed description of the methodology will be published in 2019. For a technical description, please visit our library page.

xi. Who is able to use the Simulacrum and for what purposes? 

The Simulacrum is made up of entirely synthetic data and is available for anyone to use. Because it only approximates the original data, results from the Simulacrum should not be used for clinical decisions.

The data structure in the Simulacrum is the same as the original one in NHS England. This means that once a user has refined their query using the Simulacrum, they can make a request to the National Disease Registration Service to have their queries run on the real data. Get in touch with us at simulacrumdata@healthdatainsight.org.uk or through the Contact Us page to find out more. In this way, the Simulacrum can be used to assist research for public health, epidemiology, commissioning and service planning.

xii. How robust is the data for clinical research purposes? 

The data contained within the Simulacrum is synthetic, and it should never be used to make clinical decisions. The more complex the queries, the more approximate the results. Researchers who wish to run their analyses on the real data can make a formal request to the National Disease Registration Service. Get in touch with us at simulacrumdata@healthdatainsight.org.uk or through the Contact Us page to find out more.

xiii. Is the Simulacrum relevant for use beyond the UK? 

The Simulacrum is entirely synthetic data and is therefore available to researchers anywhere in the world. The data in the Simulacrum was built from anonymous data derived originally from the population of England. Access to the original cancer data is managed by NHS England’s Data Access Request Service.

xiv. Will this approach be extended for use in other disease areas, for example cardiovascular and metabolic diseases? 

This is potentially possible, but it requires high-quality, well-structured and detailed original datasets.

xv. Will synthetic data be viable for use with regulators and market access decision-makers? 

The data contained within the Simulacrum is synthetic; we therefore do not recommend submitting analyses based purely on the Simulacrum to regulatory agencies or market access decision-makers without appropriate caveats. Researchers who wish to run their analyses on the real data can make a formal request to the National Disease Registration Service. Get in touch with us at simulacrumdata@healthdatainsight.org.uk or through the Contact Us page to find out more.

xvi. Can I use my preferred analytical package with the Simulacrum? 

The Simulacrum is released as downloadable microdata in flat-file or database format so that users can use their preferred analytical package with the Simulacrum. 
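
As an illustration only, the sketch below shows one way a flat file could be loaded into a local database and queried using Python's standard library. The table name, column names and sample rows are made up for this example and are not the actual Simulacrum schema; in practice you would open one of the downloaded files in place of the in-memory sample.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_file, table_name, connection):
    """Load a CSV file object into a new SQLite table, storing every column as text."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table_name}" ({columns})')
    connection.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    connection.commit()


# Hypothetical sample standing in for a downloaded flat file.
# These column names and values are assumptions, not the real schema.
sample = io.StringIO(
    "PATIENT_ID,SITE_CODE,DIAGNOSIS_YEAR\n"
    "1,C50,2016\n"
    "2,C34,2016\n"
    "3,C50,2017\n"
)

conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(sample, "tumours", conn)

# Count the synthetic records per (hypothetical) site code.
counts = conn.execute(
    "SELECT SITE_CODE, COUNT(*) FROM tumours GROUP BY SITE_CODE ORDER BY SITE_CODE"
).fetchall()
print(counts)
```

The same approach should carry over to other tools: any package that can read delimited text files or connect to a database can be used in a similar way.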

xvii. Can I publish any results generated directly from the Simulacrum?

Yes, you can publish your results from the Simulacrum. By doing so, you accept that your published results are based on synthetic data. If you wish to run your queries on the real data, you can make a request to the National Disease Registration Service. Get in touch with us at simulacrumdata@healthdatainsight.org.uk or through the Contact Us page to find out more. We ask that anyone publishing Simulacrum results acknowledges the Simulacrum project. Please visit our citation page for more information.

xviii. What are the roles of AstraZeneca and IQVIA on the project? 

AstraZeneca (AZ) and IQVIA provided data science, oncology and technology expertise and, along with academic and research colleagues, were responsible for ensuring that the simulated database could be used to answer relevant and important questions. AZ and IQVIA tested the prototype of the Simulacrum before it was released.

AZ and IQVIA co-funded the development of the Simulacrum v1 pilot. AZ funded the development of Simulacrum v2. 

At no time were AZ or IQVIA given access to patient identifiable data. 

xix. What is the role of Health Data Insight CIC on the project? 

Health Data Insight (HDI) CIC is a social enterprise that has been created to develop technical solutions that deliver benefits to patients, the NHS and the public. As a Community Interest Company, HDI is required to ensure that all the value produced from this work is captured for the social good. The Simulacrum is therefore available freely for anyone to use under an MIT License.

HDI owns the intellectual property and was responsible for managing the testing and development of the Simulacrum.