Query Examples

Below are some examples of how to query the Simulacrum data using SQL.  For additional examples, please see Public Health England’s guide to writing SQL queries for the CAS database.

Patient Counts

Patient counts can easily be derived from either the SIM_AV_ or SIM_SACT_ tables independently.  Below are some typical feasibility queries which can be used to get patient counts.

Example 1: Getting patients with cancers of the breast

For this query we can simply query the SIM_AV_TUMOUR table, for convenience the patientid has been included in this table to facilitate counts of distinct patients.

SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR

WHERE SITE_ICD10_O2_3CHAR=’C50′

This query gives us all patients who had tumours arising in the anatomical site of breast as denoted by the high-level ICD code C50.x.

Example 2: Getting patients with a certain morphology

The SIM_AV_TUMOUR table also contains morphology codes so we can further classify breast cancers into distinct histological types.  For example:

SELECT COUNT(DISTINCT PATIENTID) FROM SIM_AV_TUMOUR

WHERE SITE_ICD10_O2_3CHAR=’C50′

AND MORPH_ICD10_O2= ‘8500’

AND BEHAVIOUR_ICD10_O2=’3′

The above code will count all patients with breast cancers who have a ductal carcinoma of unspecified type; which comprise the largest group of invasive breast cancers.  For looking up the right morphology codes for your question please see http://codes.iarc.fr/codegroup/2.

If the codes are not known in advance, you can use the data to see which ones are used the most.tFor example:

SELECT COUNT(DISTINCT PATIENTID) as N, MORPH_ICD10_O2|| ‘/’ || BEHAVIOUR_ICD10_O2 as morphology

FROM SIM_AV_TUMOUR

WHERE SITE_ICD10_O2 LIKE ‘C50%’

GROUP BY morphology

ORDER BY N DESC

This gives the following result (top 5 rows shown):

N Morphology
98776 8500/3
16679 8520/3
2890 8480/3
2762 8522/3
2279 8140/3

Example 3: Getting patients who have received systemic therapy

The SIM_SACT… tables contain data on patients who have received systemic anti-cancer therapy.  If you are interested in broad treatment patterns and do not need morphology or other data items contained in the SIM_AV… registry tables then it is fairly straightforward to find treatment patterns for a particular tumor site.  Top level regimens are given in the SIM_SACT_REGIMEN table, and a patient could have multiple cycles of the same regimen and details of these are given in the SIM_SACT_CYCLE table.  The primary diagnosis of the tumour if given in the SIM_SACT_TUMOUR table and this can be linked to the above to get top regimens for a given tumour site.

 For example to look at top 10 regimens for patients who have been diagnosed with lung cancers:

SELECT R.BENCHMARK_GROUP, COUNT(DISTINCT R.MERGED_PATIENT_ID) as N

FROM SIM_SACT_REGIMEN R INNER JOIN SIM_SACT_TUMOUR T ON R.MERGED_TUMOUR_ID=T.MERGED_TUMOUR_ID

WHERE T.PRIMARY_DIAGNOSIS LIKE ‘C34%’

GROUP BY R.BENCHMARK_GROUP

ORDER BY N DESC

LIMIT 10

This returns:

BENCHMARK_GROUP N
CARBOPLATIN + ETOPOSIDE 6165
GEMCARBO 4319
CARBOPLATIN + PEMETREXED 3884
CISPLATIN + PEMETREXED 3545
CISPLATIN + VINORELBINE 2825
CARBOPLATIN + VINORELBINE 2298
PEMETREXED 1547
CISPLATIN + GEMCITABINE 1195
ERLOTINIB 1441
DOCETAXEL 1567

 

BENCHMARK_GROUP is the field which contains the regimen name used in most analyses.  The SIM_SACT_REGIMEN tables also contains details of height and weight at start of regimen, treatment intent, if the regimen was part of chemoradiation and if the regimen was part of a clinical trial.  Note: The numbers above represent all lung cancer patients of all stages so regimens included could also be used for neo-adjuvant or adjuvant treatment as part of other treatment modalities.

Linking SIM_AV and SIM_SACT datasets

If we do want to look at regimens for specific tumor morphologies or for specific stages we need to link the SIM_SACT_PATIENT table to the SIM_AV_PATIENT table via the linknumber field.  This is the only way to link the simulated SACT tables with the simulated registry tables.  NOTE: ‘tumourid’ in the SIM_AV_TUMOUR table and ‘merged_tumour_id’ in the SIM_SACT_TUMOUR table do not match and should not be used to link these two datasets together.  They are different ids in the real CAS data also.

Here is an example if we want to count all urothelial cancer patients who have received systemic therapy (i.e. have at least 1 record in SACT):

SELECT COUNT(DISTINCT AVP.PATIENTID) AS NUM_PATIENTS

FROM

(SIM_AV_TUMOUR AVT JOIN SIM_AV_PATIENT AVP ON AVT.PATIENTID=AVP.PATIENTID)

WHERE AVP.LINKNUMBER IN

(SELECT DISTINCT LINK_NUMBER FROM SIM_SACT_PATIENT)

AND AVT.SITE_ICD10_O2_3CHAR IN (‘C64′,’C65′,’C66′,’C67′,’C68’)

AND AVT.MORPH_ICD10_O2=’8120′

AND AVT.BEHAVIOUR_ICD10_O2=’3′

Note: Here we are specifying a morphology code of ‘8120/3’ for urothelial carcinoma and the multiple anatomical sites where it arises in the body (C64-C68).  This count can give an indication as to the number of patients with the specific histology etc. that have had systemic anti-cancer therapy.  The actual number of patients will most likely be higher in the real CAS data.