User Guidance

Please read our User Guide before analysing Simulacrum. It provides: 

  • A description of the structure of Simulacrum v2.1.0 and underlying CAS data
  • A description of the properties of the Simulacrum v2.1.0 and where they might differ from those of the CAS data  
  • Guidance on the writing of code that will produce reliable outputs when run on the CAS data. 

For each data table, it gives: 

  • Descriptions of each table in the dataset  
  • The data quality of the dataset in the CAS
  • The caveats of the synthetic version of the dataset
  • Examples of SQL queries that can be used for analysis.

Below are some more examples of how to query Simulacrum v2.1.0 using SQL. For additional examples, please also see NCRAS’s guide to writing SQL queries for the CAS database.

Query Examples

Patient counts can easily be derived from the SIM_AV_ tables.  Below are some typical feasibility queries which can be used to get patient counts. 

Example 1: Getting patients with cancers of the breast

For this query we can simply query the SIM_AV_TUMOUR table. Here the patientid can be used to count distinct patients. 

SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR 

WHERE SITE_ICD10R4_O2_3CHAR_FROM2013=’C50′ 

This query gives us all patients who had tumours arising in the anatomical site of breast as denoted by the high-level ICD code C50.

Example 2: Getting patients with a certain morphology 

The SIM_AV_TUMOUR table also contains morphology codes, so we can further classify breast cancers into distinct histological types.  For example: 

SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR 

WHERE SITE_ICD10R4_O2_3CHAR_FROM2013=’C50′ 

AND MORPH_ICD10_O2= ‘8500’ 

AND BEHAVIOUR_ICD10_O2=’3′ 

The above code will count all patients with breast cancers who have a ductal carcinoma of unspecified type, which comprise the largest group of invasive breast cancers.  

If the codes are not known in advance, you can use the data to see which ones are used the most. 

For example: 

SELECT MORPH_ICD10_O2|| ‘/’ || BEHAVIOUR_ICD10_O2 as morphology, COUNT(DISTINCT PATIENTID) as N 

FROM SIM_AV_TUMOUR 

WHERE SITE_ICD10R4_O2_3CHAR_FROM2013 = ‘C50’ 

GROUP BY MORPH_ICD10_O2|| ‘/’ || BEHAVIOUR_ICD10_O2 

ORDER BY N DESC 

This gives the following result (top 5 rows shown): 

MORPHOLOGY  N 
8500/3  137860 
8520/3  23728 
8480/3  3994 
8522/3  3961 
8211/3  2516 

Example 3: Getting patients who have received systemic therapy 

The SIM_SACT… tables contain data on patients who have received systemic anti-cancer therapy.  If we are interested in treatment patterns for patients with specific characteristics, such as having a specific tumour morphology or stage, then we need to link between SIM_AV and SIM_SACT datasets. This can be different depending on whether you are using Simulacrum v1.2.0 or Simulacrum v2.1.0, as linkage between the SIM_AV_TUMOUR and SIM_SACT_REGIMEN is different in both. 

For example, to look at top 10 regimens for patients who have been diagnosed with lung cancers in Simulacrum v2.1.0, one can link directly from AV_TUMOUR to SACT_REGIMEN using the following code: 

Top level regimens are given in the SIM_SACT_REGIMEN table, and a patient could have multiple cycles of the same regimen, which are given in the SIM_SACT_CYCLE table.

SELECT R.BENCHMARK_GROUP, COUNT(DISTINCT T. PATIENTID) as N 

FROM SIM_AV_TUMOUR T INNER JOIN SIM_SACT_REGIMEN R ON R.ENCORE_PATIENT_ID=T. PATIENTID 

WHERE T.SITE_ICD10R4_O2_3CHAR_FROM2013 = ‘C34’ 

GROUP BY R.BENCHMARK_GROUP 

ORDER BY N DESC 

LIMIT 10 

This returns: 

BENCHMARK_GROUP  N 
CARBOPLATIN + ETOPOSIDE  10621 
PEMBROLIZUMAB  6000 
CISPLATIN + VINORELBINE  4682 
CARBOPLATIN + PEMETREXED  4357 
CARBOPLATIN + VINORELBINE  4134 
CARBOPLATIN + GEMCITABINE  3963 
CISPLATIN + PEMETREXED  2426 
ATEZOLIZUMAB  2369 
CISPLATIN + ETOPOSIDE  1605 
PEMETREXED  1568 

For the same query written on Simulacrum v1.2.0, one needs to apply a different linkage between tables, as there is not a direct linkage between AV_TUMOUR and SACT_REGIMEN. Instead, linkage is as followed: 

  • Begin with av_tumour to for defining the tumour cohort 
  • Link to av_patient by patientid to get the linknumber 
  • Link to sact_patient to get merged_regimen_id 
  • Link to sact_regimen to get regimen information, e.g. benchmark group.

The SQL code to do this would be: 

SELECT R.BENCHMARK_GROUP, COUNT(DISTINCT T. PATIENTID) as N 

FROM SIM_AV_TUMOUR T  

INNER JOIN SIM_AV_PATIENT P ON T.PATIENTID=P.PATIENTID 

INNER JOIN SIM_SACT_PATIENT SP ON SP.LINK_NUMBER = P.LINKNUMBER 

INNER JOIN SIM_SACT_REGIMEN R ON R.MERGED_PATIENT_ID = SP. MERGED_PATIENT_ID 

WHERE T.SITE_ICD10_O2_3CHAR = ‘C34’ 

GROUP BY R.BENCHMARK_GROUP 

ORDER BY N DESC 

LIMIT 10