# Artificial intelligence-based clustering and characterization of Parkinson’s disease trajectories

### Multivariate time series analysis identifies three patient clusters with distinct progression profiles

By clustering the time series data of 407 de novo PD patients from PPMI (267 male, 140 female) using our previously published artificial intelligence-based VaDER approach11, we identified three groups of PD patients with distinct progression profiles (Supplementary Section S1, Fig. S1). The clustering was conducted based on the multivariate progression of six key clinical assessments of PD symptoms over the course of up to 60 months: the MDS-UPDRS 1, 2, and 3 (off treatment)12, tremor dominant score (TD), postural instability and gait disorder score (PIGD), and the Epworth sleepiness scale (ESS).

The three resulting clusters contained ‘moderate’-progressors (n = 230), ‘fast’-progressors (n = 53), and ‘slow’-progressors (n = 124). Table 1 provides summary statistics of patients from each cluster at study baseline. We found significant differences between the average age at study baseline of slow progressors and each of the two other subtypes (t-test ‘slow’ versus ‘fast’, p < 0.013; ‘slow’ versus ‘moderate’, p < 0.019; ‘moderate’ versus ‘fast’, p > 0.32). In contrast, no significant difference was observed in the elapsed time from initial diagnosis to study baseline (pairwise U-tests between all three clusters, p > 0.3), or in the distribution of Hoehn and Yahr stages ($$\chi^2$$-test, p > 0.15). With respect to MDS-UPDRS scores at study baseline, we found a significant difference in MDS-UPDRS 1 between the ‘fast’ cluster and each of the other two clusters (U-test, ‘slow’ versus ‘fast’, p < 0.01; ‘moderate’ versus ‘fast’, p < 0.001; ‘slow’ versus ‘moderate’, p > 0.59). For MDS-UPDRS 2, the only significant difference was observed between ‘moderate’ and ‘fast’ progressors (U-test, ‘moderate’ versus ‘fast’, p < 0.025; ‘slow’ versus ‘fast’, p > 0.14; ‘slow’ versus ‘moderate’, p > 0.34). We identified no significant difference in MDS-UPDRS 3 scores (pairwise U-test for all clusters, p > 0.69). Furthermore, we detected no significant differences in the distribution of biological sex ($$\chi^2$$-test, p > 0.15) or in the start of symptomatic therapy (Fig. S2).
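The pairwise U-tests reported above can be illustrated with a minimal sketch. It assumes a two-sided Mann-Whitney U test with a normal approximation and tie correction (no continuity correction); the exact test settings used in the study are not stated here, and the scores below are hypothetical values, not PPMI data.

```python
import math
from itertools import combinations


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the U statistic of sample x and an approximate p-value.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value


# Hypothetical baseline scores per cluster (illustration only, not PPMI data).
clusters = {
    "slow": [3, 4, 5, 5, 6, 7],
    "moderate": [3, 4, 4, 5, 6, 8],
    "fast": [7, 8, 9, 9, 10, 12],
}
results = {
    (a, b): mann_whitney_u(clusters[a], clusters[b])
    for a, b in combinations(clusters, 2)
}
for (a, b), (u, p) in results.items():
    print(f"{a} versus {b}: U = {u:.1f}, p = {p:.3f}")
```

For small samples an exact test would be preferable to the normal approximation; the sketch only shows how the three pairwise comparisons are organised.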

The mean univariate progression trajectories of these clusters along with their 95% confidence intervals are depicted in Fig. 1. Although the clustering was conducted on multiple outcome measures, we observed a clear separation of clusters across all selected variables except for the TD score between ‘fast’ and ‘moderate’ progressors. While the ‘fast’ and ‘moderate’ progressing subtypes displayed a clear increase of symptoms from baseline onwards over the covered 60-month interval, ‘slow’-progressors experienced almost no significant symptom worsening across scores until month 24.

### Characterisation of PD clusters suggests longitudinal differences in dopaminergic deficiency

The differences in motor symptom progression rates across subtypes (Fig. 1) were mirrored by significant differences in the age-adjusted trajectories of DaTSCAN measurements, which were available until month 48: the rate of loss of specific-binding ratio (SBR) signal in the caudate region was significantly lower for the cluster exhibiting ‘slow’ progression than for both the ‘fast’ and ‘moderate’ progressing clusters (signal loss of −0.0033 SBR unit/month, 95% CI [−0.0055, −0.0011], p = 0.004 compared to the ‘fast’ group, and of −0.0019 SBR unit/month, 95% CI [−0.0032, −0.0003], p = 0.01 compared to the ‘moderate’ group). No significant difference in SBR was observed between the ‘fast’ and ‘moderate’ progressing groups (details in Supplementary Section S3). The difference in rate of dopaminergic loss between the ‘fast’ and the ‘slow’ progressing clusters was seen equally in the ipsilateral (signal loss of −0.0034 SBR unit/month, 95% CI [−0.0056, −0.0008], p = 0.008) and the contralateral (signal loss of −0.0032 SBR unit/month, 95% CI [−0.0057, −0.0008], p = 0.007) sides of the caudate region. In contrast, the difference in rate of progression between the ‘moderate’ and the ‘slow’ progressing subtypes was stronger in the contralateral side (signal loss of −0.0022 SBR unit/month, 95% CI [−0.0038, −0.0006], p = 0.006) than in the ipsilateral side (signal loss of −0.0016 SBR unit/month, 95% CI [−0.0030, +0.0002], p = 0.07) of the caudate region. No significant difference in SBR rates was observed in the putamen, and changes in the striatum were intermediate between those observed in the caudate and the putamen.

### Machine learning revealed associations between clusters and underlying biology

To discover further associations between the identified progression clusters and clinical as well as biomarker and genetic variables, we developed machine learning models based on patients’ baseline visit data. Additionally, we built a second version of these models that included the 3-month follow-up data, both in the form of raw values and of change relative to baseline values. The variables included in the models comprised demographic and clinical data, including MDS-UPDRS item-level data (86 variables at baseline; 217 including the 3-month follow-up), CSF biomarkers (amyloid beta, phosphorylated tau, total tau), blood serum transcriptomic data (7 variables), 3472 SNPs derived through a linkage disequilibrium analysis of an initial set of 145 PD-associated SNPs obtained from DisGeNET13, and brain region-specific DaTSCAN (5 variables). We also calculated burden scores for biological pathways stemming from KEGG14, Reactome15, and NeuroMMSig16 (36, 10, and 12 pathways, respectively). These scores were based on the SNP data of each respective patient and described the amount of genetic variation affecting a pathway (see Methods section for details). A full list of all variables is presented in the Supplementary Spreadsheet.
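To make the idea of a per-patient pathway burden score concrete, the following is a minimal sketch under a simplifying assumption: the score is taken to be the fraction of non-reference alleles carried across the SNPs mapped to a pathway. This definition, the function name `pathway_burden`, and the SNP identifiers are hypothetical; the definition actually used in the study is given in the Methods section and may differ.

```python
from typing import Iterable, Mapping


def pathway_burden(genotypes: Mapping[str, int], pathway_snps: Iterable[str]) -> float:
    """Fraction of non-reference alleles across a pathway's genotyped SNPs.

    genotypes maps a SNP identifier to the patient's non-reference allele
    count (0, 1, or 2). SNPs without a genotype are skipped.
    ASSUMPTION: this scoring rule is illustrative, not the study's definition.
    """
    counts = [genotypes[snp] for snp in pathway_snps if snp in genotypes]
    if not counts:
        return 0.0
    return sum(counts) / (2 * len(counts))


# Hypothetical patient genotype and pathway-to-SNP mapping.
patient = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
example_pathway = ["snp_a", "snp_c", "snp_x"]  # snp_x was not genotyped
score = pathway_burden(patient, example_pathway)
print(f"burden score: {score:.2f}")
```

Normalising by the number of genotyped SNPs keeps scores comparable between pathways of different size, which is one plausible design choice among several.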

The machine learning algorithm of choice was a sparse group LASSO (SGL)17. We developed three distinct models, each discriminating one of the clusters from the respective other two (i.e., one versus rest approach). The significance of the most strongly associated variables was then determined by bootstrapping each model 200 times and investigating whether the resulting confidence intervals (CI) of standardised coefficients contained zero. CIs were Bonferroni-corrected to account for multiple testing. Further methodological details are described in Supplementary Section S4.
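The bootstrap procedure can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the sparse group LASSO is replaced by a simple stand-in estimator (a standardised mean difference between the cluster of interest and the rest), the intervals are percentile intervals, and the data are synthetic. The function names are hypothetical.

```python
import random
import statistics


def effect_sizes(X, y):
    """Stand-in for fitted coefficients (NOT a sparse group LASSO):
    per-variable mean difference, cluster (1) minus rest (0), divided by the SD."""
    out = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        sd = statistics.pstdev(col)
        inside = [v for v, lab in zip(col, y) if lab == 1]
        outside = [v for v, lab in zip(col, y) if lab == 0]
        diff = statistics.fmean(inside) - statistics.fmean(outside)
        out.append(diff / sd if sd > 0 else 0.0)
    return out


def bootstrap_cis(X, y, n_boot=200, alpha=0.05, seed=0):
    """Bonferroni-corrected percentile CIs, one per variable."""
    rng = random.Random(seed)
    n, n_vars = len(X), len(X[0])
    alpha_adj = alpha / n_vars  # Bonferroni: split alpha across variables
    draws = [[] for _ in range(n_vars)]
    while len(draws[0]) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # a resample needs both classes
            continue
        for j, coef in enumerate(effect_sizes([X[i] for i in idx], yb)):
            draws[j].append(coef)
    lo_idx = int(alpha_adj / 2 * n_boot)
    hi_idx = n_boot - 1 - lo_idx
    return [(sorted(d)[lo_idx], sorted(d)[hi_idx]) for d in draws]


# Synthetic one-versus-rest data: variable 0 is informative, variable 1 is noise.
rng = random.Random(1)
y = [1] * 30 + [0] * 30
X = [[3.0 * lab + rng.gauss(0, 1), rng.gauss(0, 1)] for lab in y]
cis = bootstrap_cis(X, y)
excludes_zero = [lo > 0 or hi < 0 for lo, hi in cis]
for j, (lo, hi) in enumerate(cis):
    print(f"variable {j}: CI [{lo:.2f}, {hi:.2f}], significant: {excludes_zero[j]}")
```

With only 200 resamples, percentile intervals cannot resolve the extreme tails that a Bonferroni correction over thousands of variables would require, so this sketch is only meaningful for a small number of variables.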

The resulting models revealed several significant associations between measured variables and progression clusters, which were interpretable from a clinical as well as a biological point of view.

### Progression clusters are associated with distinct symptoms and genetic loci

The coefficients of each machine learning model highlight how specific variables influence the probability that a patient belongs to a particular cluster. For interpretability, we focused on significant positive associations (i.e., variables that increase the chance of belonging to the respective cluster; Fig. 2A–C).

The variable most strongly associated with ‘fast’ PD progression was the presence and severity of hallucinations at the 3-month follow-up visit (NP1HALL m3, 95% CI [3.91, 5.0]), with the increase in experienced hallucinations following in third position (NP1HALL slope, 95% CI [3.07, 3.9]). In fourth position was the increase in postural instability and gait disorder severity over the first 3 months (PIGD slope, 95% CI [2.73, 3.55]). Additionally, ‘fast’ progressing patients experienced more difficulties when rising from a lying or sitting position compared to the other two subtypes (95% CI: NP3RISNG [2.56, 3.63], NP3RISNG m3 [2.16, 2.98], NP2RISE m3 [1.9, 2.65], NP2RISE [1.8, 2.64]). REM sleep behaviour disorder (RBD) was also associated with ‘fast’ progression (95% CI [2.33, 3.24]). Furthermore, several SNPs (rs6783485-LOC105377110, rs1536076-SH3GL2, rs6532194-chromosome 4:89859751, rs11711441-chromosome 3:183103487, and rs591323-LOC105379297) were found to be among the top 20 associated variables for ‘fast’ progression. Notably, all these SNPs were taken from DisGeNET because of their known association with PD according to GWAS. In all cases, the non-reference allele increased the risk of ‘faster’ PD progression.

‘Slow’ PD progression was associated with increasing difficulties when performing the hand movement task of the MDS-UPDRS (NP3HMOV slope 95% CI [2.93, 3.38]). Furthermore, a series of highly associated variables were connected to daytime sleepiness (ESS 95% CI [2.27, 3.06]) and general fatigue (NP1FATG 95% CI [2.16, 2.97]). Patients of the ‘slow’ cluster also suffered more often from anxiety (95% CI: NP1ANXS [2.15, 2.93]; NP1ANXS m3 [0.89, 1.53]) and were the only subtype which showed a significant positive association with depression, although the coefficient remained rather small (geriatric depression scale 95% CI [0.1, 0.65]). Additionally, better semantic fluency was also connected to ‘slower’ disease progression (SFT 95% CI [2.06, 2.84]). With regard to motor symptoms, ‘slow’ progression was associated with rigidity of the ipsilateral extremities at baseline, month 3, and their relative increase in severity (95% CI: NP3RIGL_IL m3 [2.23, 3.09]; NP3RIGL_IL [1.74, 2.54]; NP3RIGU_IL [1.0, 1.61]). Further, we found a significant positive association of the polygenic risk score PGS00012318 and multiple genetic loci with the probability of belonging to the ‘slow’-progressors. SNPs rs17565841 (OCA2) and rs12959200 (chromosome 18:73599819) placed among the top 10 associations (95% CI: [2.11, 2.71], [1.95, 3.05], [1.91, 2.77], respectively). Once again, these SNPs were taken from DisGeNET because of their known association with PD according to GWAS.

For ‘moderate’ disease progression, the strongest association was worsening performance on the eating task of the MDS-UPDRS over the first 3 months (NP2EAT slope 95% CI [2.3, 3.08]). Further, reduced agility in the ipsilateral leg was associated with ‘moderate’ progression (95% CI: NP3LGAG_IL slope [1.79, 2.55]; NP3LGAG_IL m3 [1.36, 2.06]). With rs76904798 (chromosome 12:40220632), rs199347 (GPNMB), rs7702187 (SEMA5A), and rs7617877 (LINC00693), we identified several PD-associated SNPs which raised the probability for patients to belong to the ‘moderate’ subtype.

A comprehensive overview of all variables and their coefficients can be found in the Supplementary Spreadsheet.

While the SGLs were designed to identify variable associations and not to make reliable forecasts, we additionally evaluated their predictive performance. With a cross-validated area under the receiver operating characteristic curve of 0.62, 0.60, and 0.63 for ‘slow’, ‘moderate’, and ‘fast’ progression, respectively, their performance remained limited.
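For reference, the area under the receiver operating characteristic curve equals the probability that a randomly chosen patient from the cluster of interest receives a higher predicted score than a randomly chosen patient from the rest, with ties counted as one half. A minimal sketch of this computation follows; the labels and scores are hypothetical, and in a cross-validated setting the scores would come from held-out folds.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for patients in the cluster of interest, 0 for the rest.
    scores: predicted scores, higher meaning more likely to be in the cluster.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out predictions for a one-versus-rest model.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported values of 0.60 to 0.63 in context.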

### Genetic burden scores connect the heterogeneity in PD progression to biological pathways

Several biological pathways and genes could be associated with the respective clusters (Fig. 2D–F). The ‘fast’ cluster was highly associated with higher genetic burden in the KEGG ‘SNARE vesicle transport’ pathway (95% CI [1.25, 1.92]), the ‘Rap1 signalling’ pathway (95% CI [1.1, 1.71]), and NeuroMMSig’s ‘neurotrophic’ subgraph (95% CI [1.25, 1.92]). The patients of the ‘moderate’ cluster were linked to the ‘cholesterol metabolism’ subgraph (95% CI [1.56, 2.25]) and ‘vascular endothelial growth factor’ subgraph (95% CI [1.42, 2.12]) originating from NeuroMMSig. The ‘vitamin’ and ‘disaccharide metabolism’ subgraphs from NeuroMMSig, and KEGG’s ‘amoebiasis’ pathway were found to be strongly associated with the ‘slow’ progressing cluster (95% CI: [1.6, 2.22], [1.04, 1.66], and [1.14, 1.86], respectively). A list of all mappings between pathways, genes and SNPs can be found in the Supplementary Spreadsheet.

### Identified clusters show differences in response to motor symptom therapy

After observing that potentially different biological pathways were involved in the PD pathology of each cluster, we investigated whether the clusters also differed in their response to symptomatic treatment for motor symptoms. To this end, we selected participants who had initiated Levodopa or Dopamine agonist symptomatic treatment between month 6 and month 9 after baseline and assessed whether progression as measured by MDS-UPDRS 3 score differed by PD cluster. We separately analysed the ‘ON’-state MDS-UPDRS 3 score data, in which patients are examined approximately one hour after taking medication (Fig. 3), and the ‘OFF’-state MDS-UPDRS 3 score data (Fig. S11). As per PPMI protocol, patients were considered to be in the ‘OFF’-state when the last treatment dose was taken at least 6 h before symptoms were assessed19. Methodological details can be found in Supplementary Section S6.

Although all three PD clusters initially responded similarly to symptomatic treatment, stabilising their motor scores in the first 9 months after treatment initiation (i.e. 9–18 months post-baseline; Fig. 3, Fig. S11), patients in the ‘fast’ progressing cluster continued to progress fastest, and at 30 months after baseline (i.e. 21 months after symptomatic treatment initiation) all three clusters differed significantly from each other in their MDS-UPDRS 3 scores in both the ‘ON’ and ‘OFF’ states, i.e. the 95% CIs did not overlap. PD subtypes did not differ according to whether they were prescribed Levodopa (alone or in combination with Dopamine agonist), or Dopamine agonist alone as a first line of PD symptomatic treatment (Table S1). The levodopa equivalent daily dose (LEDD) was obtained for the PPMI participants included in this analysis (Table S2). Only beyond 42 months post-baseline did patients in the ‘fast’ cluster appear to take a higher LEDD than patients in the ‘moderate’ cluster (mean difference at month 54: 186.8, 95% CI [76.2, 267.6], p < 0.01), while no significant difference was found for ‘fast’ versus ‘slow’ or ‘slow’ versus ‘moderate’ progressors (Figure S12).
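The non-overlap criterion for 95% CIs can be sketched as below. The sketch assumes a normal-approximation interval around each cluster mean; the study's intervals come from the models described in Supplementary Section S6 and may be constructed differently. All scores are hypothetical, not PPMI data.

```python
import math
import statistics
from itertools import combinations


def mean_ci(values, z=1.96):
    """Normal-approximation 95% CI for the mean (ASSUMPTION: illustrative only)."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical MDS-UPDRS 3 scores at one visit, per cluster.
scores_by_cluster = {
    "slow": [18, 20, 21, 19, 22, 20],
    "moderate": [26, 27, 25, 28, 27, 26],
    "fast": [33, 35, 34, 36, 32, 35],
}
cis = {name: mean_ci(vals) for name, vals in scores_by_cluster.items()}
all_separated = all(
    not intervals_overlap(cis[a], cis[b]) for a, b in combinations(cis, 2)
)
print(f"all pairwise 95% CIs separated: {all_separated}")
```

Requiring non-overlapping 95% CIs is a conservative criterion: two groups can differ significantly at the 5% level even when their intervals overlap slightly.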