User Guidance
Please read our User Guide before analysing Simulacrum. It provides:
- A description of the structure of Simulacrum v2.1.0 and underlying CAS data
- A description of the properties of the Simulacrum v2.1.0 and where they might differ from those of the CAS data
- Guidance on the writing of code that will produce reliable outputs when run on the CAS data.
For each data table, it gives:
- Descriptions of each table in the dataset
- The data quality of the dataset in the CAS
- The caveats of the synthetic version of the dataset
- Examples of SQL queries that can be used for analysis.
Below are some more examples of how to query Simulacrum v2.1.0 using SQL. For additional examples, please also see NCRAS’s guide to writing SQL queries for the CAS database.
Query Examples
Patient counts can easily be derived from the SIM_AV_ tables. Below are some typical feasibility queries which can be used to get patient counts.
Example 1: Getting patients with cancers of the breast
For this query we can simply query the SIM_AV_TUMOUR table. Here the patientid can be used to count distinct patients.
SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR WHERE SITE_ICD10R4_O2_3CHAR_FROM2013=’C50′
This query gives us all patients who had tumours arising in the anatomical site of breast as denoted by the high-level ICD code C50.
Example 2: Getting patients with a certain morphology
The SIM_AV_TUMOUR table also contains morphology codes, so we can further classify breast cancers into distinct histological types. For example:
SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR WHERE SITE_ICD10R4_O2_3CHAR_FROM2013=’C50′ AND MORPH_ICD10_O2= ‘8500’ AND BEHAVIOUR_ICD10_O2=’3′
The above code will count all patients with breast cancers who have a ductal carcinoma of unspecified type, which comprise the largest group of invasive breast cancers.
If the codes are not known in advance, you can use the data to see which ones are used the most.
For example:
SELECT MORPH_ICD10_O2|| ‘/’ || BEHAVIOUR_ICD10_O2 as morphology, COUNT(DISTINCT PATIENTID) as N FROM SIM_AV_TUMOUR WHERE SITE_ICD10R4_O2_3CHAR_FROM2013 = ‘C50’ GROUP BY MORPH_ICD10_O2|| ‘/’ || BEHAVIOUR_ICD10_O2 ORDER BY N DESC
This gives the following result (top 5 rows shown):
MORPHOLOGY | N |
8500/3 | 137860 |
8520/3 | 23728 |
8480/3 | 3994 |
8522/3 | 3961 |
8211/3 | 2516 |
Example 3: Getting patients who have received systemic therapy
The SIM_SACT… tables contain data on patients who have received systemic anti-cancer therapy. If we are interested in treatment patterns for patients with specific characteristics, such as having a specific tumour morphology or stage, then we need to link between SIM_AV and SIM_SACT datasets. This can be different depending on whether you are using Simulacrum v1.2.0 or Simulacrum v2.1.0, as linkage between the SIM_AV_TUMOUR and SIM_SACT_REGIMEN is different in both.
For example, to look at top 10 regimens for patients who have been diagnosed with lung cancers in Simulacrum v2.1.0, one can link directly from AV_TUMOUR to SACT_REGIMEN using the following code:
Top level regimens are given in the SIM_SACT_REGIMEN table, and a patient could have multiple cycles of the same regimen, which are given in the SIM_SACT_CYCLE table.
SELECT R.BENCHMARK_GROUP, COUNT(DISTINCT T. PATIENTID) as N FROM SIM_AV_TUMOUR T INNER JOIN SIM_SACT_REGIMEN R ON R.ENCORE_PATIENT_ID=T. PATIENTID WHERE T.SITE_ICD10R4_O2_3CHAR_FROM2013 = ‘C34’ GROUP BY R.BENCHMARK_GROUP ORDER BY N DESC LIMIT 10
This returns:
BENCHMARK_GROUP | N |
CARBOPLATIN + ETOPOSIDE | 10621 |
PEMBROLIZUMAB | 6000 |
CISPLATIN + VINORELBINE | 4682 |
CARBOPLATIN + PEMETREXED | 4357 |
CARBOPLATIN + VINORELBINE | 4134 |
CARBOPLATIN + GEMCITABINE | 3963 |
CISPLATIN + PEMETREXED | 2426 |
ATEZOLIZUMAB | 2369 |
CISPLATIN + ETOPOSIDE | 1605 |
PEMETREXED | 1568 |
For the same query written on Simulacrum v1.2.0, one needs to apply a different linkage between tables, as there is not a direct linkage between AV_TUMOUR and SACT_REGIMEN. Instead, linkage is as followed:
- Begin with av_tumour to for defining the tumour cohort
- Link to av_patient by patientid to get the linknumber
- Link to sact_patient to get merged_regimen_id
- Link to sact_regimen to get regimen information, e.g. benchmark group.
The SQL code to do this would be:
SELECT R.BENCHMARK_GROUP, COUNT(DISTINCT T. PATIENTID) as N FROM SIM_AV_TUMOUR T INNER JOIN SIM_AV_PATIENT P ON T.PATIENTID=P.PATIENTID INNER JOIN SIM_SACT_PATIENT SP ON SP.LINK_NUMBER = P.LINKNUMBER INNER JOIN SIM_SACT_REGIMEN R ON R.MERGED_PATIENT_ID = SP. MERGED_PATIENT_ID WHERE T.SITE_ICD10_O2_3CHAR = ‘C34’ GROUP BY R.BENCHMARK_GROUP ORDER BY N DESC LIMIT 10