Development

The Simulacrum was developed by Health Data Insight CiC (HDI), a Cambridge-based social enterprise, with support from AstraZeneca and IQVIA.

The Simulacrum was generated synthetically through mathematical and computational analysis of tens of thousands of anonymous extracts from the original NCRAS data. It is a collection of linked data tables with the same structure, and the same number and types of characteristics, as the original data. It also maintains the properties of the original data with a high degree of accuracy, making it a realistic and valuable resource about cancer.

The Simulacrum was generated in several steps:

  1. NCRAS data on cancer patients was pulled together and anonymised by Public Health England before it was made available to HDI.
  2. This data was divided into groups known as data cubes. The data cubes are completely anonymous and have been safely released to the public by Public Health England.
  3. Using an algorithm, the strong relationships between pairs of patient characteristics were identified. For example, there is a strong relationship between age and type of cancer, with certain cancers found most commonly in certain age groups. These relationships were identified so they could be replicated in the synthetic data to make the Simulacrum as realistic and accurate as possible.
  4. Patient records which showed these strong relationships were gathered into groups of no fewer than 50. This was done to improve accuracy and add extra protection for privacy. Because patients with very rare cancers cannot form a group of this size, it was not possible to create synthetic versions of them.
  5. The new synthetic data was then created by Health Data Insight using the data cubes. For each synthetic patient record, a value was randomly chosen from the relevant data cube to represent a particular characteristic. This process was repeated for each new characteristic simulated. This random sampling further protects patient privacy because the procedure cannot be reversed to recreate real patient information.
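
As a rough illustration of steps 4 and 5, the sketch below counts records into a small data cube, drops any group with fewer than 50 records, and then builds synthetic records by drawing one characteristic at a time, weighted by the cube counts. It is a simplified sketch and not the method HDI actually used: the field names `age_band` and `site`, the toy input records, and the helper functions `build_cube` and `sample_record` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

MIN_GROUP_SIZE = 50  # "groups of no fewer than 50", as described in step 4


def build_cube(records, min_size=MIN_GROUP_SIZE):
    """Count records per (age_band, site) cell and drop cells below min_size."""
    counts = Counter((r["age_band"], r["site"]) for r in records)
    return {cell: n for cell, n in counts.items() if n >= min_size}


def sample_record(cube, rng):
    """Build one synthetic record, one characteristic at a time."""
    # First characteristic: draw an age band from the cube's marginal counts.
    marginal = Counter()
    for (age_band, _site), n in cube.items():
        marginal[age_band] += n
    ages = list(marginal)
    age_band = rng.choices(ages, weights=[marginal[a] for a in ages], k=1)[0]

    # Second characteristic: draw a site from the cells sharing that age band.
    options = [(s, n) for (a, s), n in cube.items() if a == age_band]
    site = rng.choices(
        [s for s, _ in options], weights=[n for _, n in options], k=1
    )[0]
    return {"age_band": age_band, "site": site}


# Toy input records, invented for this example (not real patient data).
records = (
    [{"age_band": "60-69", "site": "lung"}] * 60
    + [{"age_band": "20-29", "site": "testicular"}] * 55
    + [{"age_band": "20-29", "site": "rare_x"}] * 3  # too few to keep
)

cube = build_cube(records)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = [sample_record(cube, rng) for _ in range(100)]
```

In this toy run the three `rare_x` records fall below the threshold and never appear in the cube, so no synthetic record can carry that value. Each synthetic record is drawn from aggregate counts only and is not derived from any single input record.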

For a more technical description of the methods used, please visit the library page.