Synthetic data generation method

The Simulacrum was generated with the aim of capturing important properties of the real data, while ensuring that no individual patient information is replicated.

The method follows several steps:

Data preparation

NDRS data on cancer patients is extracted from the Cancer Analysis System (CAS). This data is already pseudonymised: all directly identifying patient characteristics, such as name, date of birth, address and NHS number, have been removed.

Correlation analysis

Important relationships between patient characteristics in the data are estimated using computational methods and domain knowledge. For example, there is a strong correlation between age and type of cancer, with some cancers found most commonly in certain age groups.

These relationships are identified so they can be replicated in the synthetic data and make the Simulacrum as realistic as possible.
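As an illustration of how the strength of a relationship between two categorical characteristics could be estimated, the sketch below computes Cramér's V for age band against cancer type. This is an assumption made for illustration only: the text does not say which computational methods were used for the Simulacrum, and the function name, labels and toy records are all made up.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V for a list of (x, y) categorical pairs.

    Returns a value between 0 (no association) and 1 (perfect association).
    """
    n = len(pairs)
    joint = Counter(pairs)
    x_totals = Counter(x for x, _ in pairs)
    y_totals = Counter(y for _, y in pairs)

    # Chi-squared statistic: compare observed counts with the counts
    # expected if the two characteristics were independent.
    chi2 = 0.0
    for x, nx in x_totals.items():
        for y, ny in y_totals.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_totals), len(y_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Made-up toy records of (age_band, cancer_type); not real patient data.
toy_records = (
    [("0-39", "A")] * 40 + [("0-39", "B")] * 10
    + [("40+", "A")] * 10 + [("40+", "B")] * 40
)
strength = cramers_v(toy_records)
```

A value close to 1 would suggest a relationship worth preserving in the synthetic data, while a value close to 0 would suggest the two characteristics can be treated as independent.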

Based on these identified relationships, the data is aggregated into groups known as data cubes, which are then used to generate the synthetic data. Each group is always based on at least 50 patient records, ensuring that patient information cannot be directly replicated during the data generation phase. The trade-off is that synthetic records for rarer cancers may look less realistic.

These data cubes are anonymous and safe to release without risk to patient privacy. For Simulacrum v1.0.0, these were publicly released alongside the release of the synthetic data.
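The aggregation step can be pictured with a minimal sketch that counts records per combination of characteristics and keeps only groups that reach the 50-record minimum. The function name, field names and toy records are hypothetical, and dropping small groups is an assumption: the text does not say how groups below the threshold are handled in the actual process.

```python
from collections import Counter

MIN_GROUP_SIZE = 50  # minimum number of records per group, as stated in the text


def build_cube(records, keys, min_size=MIN_GROUP_SIZE):
    """Count records per combination of `keys`, keeping only large enough groups."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    # Assumption: groups below the threshold are simply left out.
    return {cell: n for cell, n in counts.items() if n >= min_size}


# Made-up toy records; not real patient data.
toy = (
    [{"age_band": "60-69", "site": "X"}] * 120
    + [{"age_band": "60-69", "site": "Y"}] * 55
    + [{"age_band": "20-29", "site": "Y"}] * 7  # below the threshold
)
cube = build_cube(toy, keys=("age_band", "site"))
```

In this sketch the group with only 7 records never reaches the cube, which illustrates why rare combinations lose detail in the synthetic data.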

Data generation

The synthetic data is then created from the anonymous data cubes. For each synthetic patient record, the patient, tumour and treatment event characteristics are assigned values by sampling: each value is chosen at random from the data cube that represents that characteristic and its associated relationships in the data.

This ensures that the important relationships are replicated in the synthetic data, while protecting patient privacy: because values are sampled at random, the procedure cannot be reversed to recreate real patient information.
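The sampling idea can be sketched as follows, assuming a single hypothetical data cube of counts keyed by (age band, tumour site). One characteristic is drawn first, and the next is drawn conditional on it, so the relationship between the two carries over. The cube contents, names and two-step ordering are illustrative assumptions, not the actual Simulacrum implementation.

```python
import random

# Hypothetical data cube: (age_band, site) -> number of records in the group.
cube = {
    ("60-69", "X"): 120,
    ("60-69", "Y"): 55,
    ("70-79", "X"): 80,
}


def sample_patient(cube, rng):
    """Draw one synthetic record: age band first, then site given age band."""
    # Marginal counts for age band, summed over sites.
    age_totals = {}
    for (age, _site), n in cube.items():
        age_totals[age] = age_totals.get(age, 0) + n
    age = rng.choices(list(age_totals), weights=list(age_totals.values()))[0]

    # Counts for site, restricted to the sampled age band.
    sites = {site: n for (a, site), n in cube.items() if a == age}
    site = rng.choices(list(sites), weights=list(sites.values()))[0]
    return {"age_band": age, "site": site}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = [sample_patient(cube, rng) for _ in range(1000)]
```

Every sampled record is a combination that exists in the cube, in roughly the proportions the counts imply, yet no record corresponds to a particular real patient.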

For a more technical description of the methods used, you can read our white paper or watch the talk given by Lora Frayling at the HDR UK Synthetic Data Special Interest Group in December 2020.