The day has finally arrived: our synthetic dataset is ready to launch. We have been working on the Simulacrum for the best part of two years and it is now freely available for anyone to download. We are excited to be launching it at a time when real-world data is playing an increasingly large part in our understanding of how different health treatments affect patient outcomes. The Simulacrum gives researchers, and others interested in cancer treatments, the opportunity to conduct pilot studies on a synthetic dataset that mimics the real data held securely by Public Health England.

Why did we develop the Simulacrum?

The National Cancer Registration and Analysis Service (NCRAS) has permission under Section 251 of the NHS Act 2006 to collect data on patients with a diagnosis of cancer. Its absolute responsibility is to protect individual patient confidentiality.

Quite rightly, there are restrictions on the use of this data. There are clear criteria for the conditions under which it can be shared, and PHE’s Office for Data Release (ODR) oversees this process. Accessing this valuable data resource can be time-consuming, as each request has to be carefully assessed and checked to ensure that it meets the criteria for access.

So we asked ourselves – could there be another approach to accessing the data? This led to the development of the Simulacrum.

What is the Simulacrum?

The Simulacrum is completely synthetic – it looks and feels like a real dataset but is not real.

We knew we needed to generate the data while maintaining patient confidentiality. You can read a report about the methodology and mathematical techniques we used to generate the synthetic data here.

What benefits does it bring to you?

Because the Simulacrum is made up of dummy data, it can be made available to anyone and can be used to test hypotheses and run pilot studies before applying for the real data. Users will be able to run queries and ask questions of the synthetic dataset as though they were looking at the real data.

To give you an idea of how we envisage the Simulacrum can be used, here is an example of Freda, an oncology researcher.

Freda is a researcher who wants to study the effects of a new treatment designed for patients with late-stage cancer. She wants to see whether treated patients have better survival, investigate treatment patterns and establish whether there are enough patients for her to set up a full research study.

Thanks to the Simulacrum, it’s now a lot easier. Freda simply downloads the data from simulacrum.healthdatainsight.org.uk and writes code to report on the items she is interested in. This gives her a Simulacrum answer, which is a pretty good estimate of what she might expect from the real data, and shows that her methodology will work on PHE’s cancer data. To get the true answer, which can be used in her study protocol, Freda can get in touch with PHE and request that the same code she has developed be run on the real data. The code will be checked carefully to make sure that the outputs are anonymous and safe, and meet the criteria for release.
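As a rough illustration of the kind of code Freda might write, here is a minimal Python sketch. The column names (`patient_id`, `stage`, `received_treatment`, `survival_days`), the function name and the inline sample rows are all hypothetical and do not reflect the actual Simulacrum table layout; in practice she would read the downloaded files instead. The summary is deliberately naive (it ignores censoring, for example) and is only meant to show the shape of a pilot query.

```python
import csv
import io
from statistics import median

# Hypothetical sample rows standing in for a downloaded Simulacrum file.
# These columns are invented for illustration only.
SAMPLE_CSV = """patient_id,stage,received_treatment,survival_days
1,4,Y,400
2,4,N,250
3,4,Y,520
4,2,N,900
5,4,N,310
6,3,Y,700
"""


def summarise_late_stage(csv_file, late_stages=("4",)):
    """Count late-stage patients and report median survival per treatment group."""
    survival = {"Y": [], "N": []}
    for row in csv.DictReader(csv_file):
        # Keep only late-stage patients with a recognised treatment flag.
        if row["stage"] in late_stages and row["received_treatment"] in survival:
            survival[row["received_treatment"]].append(int(row["survival_days"]))
    return {
        group: {
            "patients": len(days),
            "median_survival_days": median(days) if days else None,
        }
        for group, days in survival.items()
    }


summary = summarise_late_stage(io.StringIO(SAMPLE_CSV))
print(summary)
```

Because the function accepts any file-like object, the same code could in principle be run unchanged on a real extract with matching columns, which mirrors the workflow described above.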

In this example, the Simulacrum will have given Freda faster and more efficient access to the results of queries on the real data. By testing, building and running code on the synthetic dataset, users are able to get an initial answer to their questions. They can then apply to run the same code on the real data.

Want to know more?

To find out more about using the Simulacrum, click here. To read more about our other work, visit Health Data Insight CIC.