Limitations

Simulating multiple types of data while protecting patient privacy is a complex mathematical challenge. The data in the Simulacrum is very accurate for simple queries, but as queries become more complex the results become less reliable.

For example, if you count one field at a time, such as ‘the number of cancers at stage four’, the accuracy of the data is very high and the results are strongly indicative. A more complex query that combines several fields, such as ‘the number of patients with breast cancer diagnosed at stage four who received drug X and survived for more than 90 days’, will be more approximate.
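The difference between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the records, field names (site, stage, drug, survival_days) and values below are made up for the example and are not the real Simulacrum schema.

```python
# Hypothetical records: field names and values are illustrative only,
# not the real Simulacrum schema.
patients = [
    {"site": "breast", "stage": "4", "drug": "X", "survival_days": 120},
    {"site": "breast", "stage": "4", "drug": "Y", "survival_days": 45},
    {"site": "lung", "stage": "4", "drug": "X", "survival_days": 200},
    {"site": "breast", "stage": "2", "drug": "X", "survival_days": 400},
    {"site": "breast", "stage": "4", "drug": "X", "survival_days": 60},
]


def count_stage_four(rows):
    """Simple query: a count that depends on a single field."""
    return sum(1 for r in rows if r["stage"] == "4")


def count_complex_cohort(rows):
    """Complex query: a count that depends on four fields at once."""
    return sum(
        1
        for r in rows
        if r["site"] == "breast"
        and r["stage"] == "4"
        and r["drug"] == "X"
        and r["survival_days"] > 90
    )


simple_count = count_stage_four(patients)
complex_count = count_complex_cohort(patients)
print(simple_count, complex_count)
```

The first count relies on one field only, which is the kind of result the Simulacrum reproduces well. The second relies on four fields being jointly correct for the same record, so on synthetic data it should be treated as approximate.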

The Simulacrum data is synthetic and must not be used to make clinical decisions. However, because the Simulacrum data has the same structure as the real data, it can be used to plan and refine analyses before making a formal request to Public Health England’s Office for Data Release to conduct the same analysis on the real data.

The Simulacrum can be used to:

  • Explore the format of the cancer data and understand the codes and structure used within the Public Health England data model
  • Calculate direct results, which are highly reliable for simple queries, and indicative for more complex queries
  • Develop and test code to select complex cohorts of cancer patients
  • Plan analysis and research before requesting data from Public Health England (PHE)
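As a rough sketch of the first use listed above, the following Python lists the fields present in a small extract and counts how often each code appears in one of them. The records and field names (tumour_id, site_code, stage) are hypothetical and chosen only for illustration; the real field names should be taken from the Simulacrum documentation.

```python
from collections import Counter

# Hypothetical extract: field names and values are illustrative only.
rows = [
    {"tumour_id": 1, "site_code": "C50", "stage": "4"},
    {"tumour_id": 2, "site_code": "C34", "stage": "2"},
    {"tumour_id": 3, "site_code": "C50", "stage": "4"},
]


def field_names(records):
    """Return the sorted set of field names seen across all records."""
    return sorted({name for record in records for name in record})


def code_frequencies(records, field):
    """Count how many records carry each distinct value of one field."""
    return Counter(record[field] for record in records)


print(field_names(rows))
print(code_frequencies(rows, "site_code"))
```

Code written this way against the synthetic data depends only on the structure of the records, so it can be developed and tested before any request for real data is made.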