Mehmet Can Ay
PDataViewer: Investigation of Parkinson’s disease landscape and enabling semantic data harmonization through language models
Mehmet Can Ay presents PDataViewer, a web-based application for exploring the data landscape of Parkinson’s disease research, which he developed as part of his Master's thesis.
Background
Parkinson’s disease (PD) poses significant research challenges due to its biological complexity. While cohort studies have advanced our understanding, they often suffer from biases related to selection criteria and from inconsistent variable naming across studies. Effective exploration of PD via cross-cohort analyses requires rigorous data harmonization to enhance the generalizability of findings. Previous initiatives, such as the Observational Medical Outcomes Partnership [1] and the Common Data Element Catalog [2], have made strides but have not fully addressed the specific needs of PD research. Additionally, due to strict data-sharing regulations, such as the European Union General Data Protection Regulation [3] and the Health Insurance Portability and Accountability Act [4], the contents of a cohort dataset remain opaque until access has been granted. For instance, are the collected measurements available for all participants of a cohort at each time point of the study? To overcome these limitations, to align medical data with the Findable, Accessible, Interoperable, Reusable (FAIR) principles [5], and to automate the semantic data harmonization process, we propose the Parkinson’s disease common data model (PASSIONATE) and a web-based application called PDataViewer.
PASSIONATE
To comprehensively analyze the availability of PD cohort studies, we identified widely investigated PD datasets through an exhaustive literature search. All collected data dictionaries were manually checked for variables relevant to a clinical analysis of PD or Parkinsonism. A variable was considered relevant if it has been frequently associated with the diagnosis, progression, or treatment of PD in the existing literature. A unique term, referred to as a reference term, was created for each such variable. The reference terms were then mapped across the investigated cohort studies and against ontologies and Observational Health Data Sciences and Informatics (OHDSI) standardized vocabulary terms to further standardize the terminology. Overall, 741 unique reference terms were defined and mapped to 276 ontology terms and 201 unique OHDSI standardized vocabulary terms. To comply with the FAIR principles, PASSIONATE is publicly available on Zenodo (https://doi.org/10.5281/zenodo.10218362).
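The mapping described above can be pictured as a small lookup structure. The following is only a minimal sketch under stated assumptions, not the actual PASSIONATE schema: the class `ReferenceTerm`, the helper functions, the cohort names, the variable names, and the identifiers are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReferenceTerm:
    """One harmonized variable. Field layout is an assumption for illustration."""
    name: str
    # cohort name -> variable name used in that cohort's data dictionary
    cohort_variables: dict[str, str] = field(default_factory=dict)
    # placeholder identifiers, not real ontology or OHDSI concept IDs
    ontology_ids: list[str] = field(default_factory=list)
    ohdsi_ids: list[str] = field(default_factory=list)


def cohorts_with(term: ReferenceTerm) -> list[str]:
    """Return the cohorts that collect this reference term, sorted by name."""
    return sorted(term.cohort_variables)


def translate(terms: list[ReferenceTerm], source: str, target: str,
              variable: str) -> Optional[str]:
    """Map a variable name from one cohort to its counterpart in another.

    Returns None when the variable is unknown or the target cohort
    does not collect the corresponding reference term.
    """
    for term in terms:
        if term.cohort_variables.get(source) == variable:
            return term.cohort_variables.get(target)
    return None


# Hypothetical example data: two made-up cohorts and one reference term.
example_terms = [
    ReferenceTerm(
        name="Motor examination total score",
        cohort_variables={"CohortA": "updrs3_total", "CohortB": "MOTOR_SUM"},
        ontology_ids=["EX:0000001"],
        ohdsi_ids=["EXAMPLE-CONCEPT-1"],
    ),
]
```

With such a structure, `translate(example_terms, "CohortA", "CohortB", "updrs3_total")` looks up the reference term behind a cohort-specific variable and returns the name used in the other cohort, which is the basic operation a cross-cohort analysis needs.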
PDataViewer
To help researchers select the most appropriate dataset for their objectives, we aim to create an openly accessible web application (https://github.com/SCAI-BIO/PDataViewer) that enables the exploration of key clinical PD datasets. The application is intended to display data availability at the variable and modality levels, as well as the total number of measurements collected per patient in each cohort study (see Figure 1). Additionally, it will show the proportion of measurements collected for specific variables over the study duration. Using the datastew [6] package, we also aim to provide a tool within the web application that automatically harmonizes a user's data dictionary against other studies, PASSIONATE, or ontologies by utilizing various language models.
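The general idea behind embedding-based harmonization can be sketched as a nearest-neighbor search: each variable description is turned into a vector, and the reference term with the highest cosine similarity is proposed as the match. The sketch below does not use the datastew API, and it replaces the language-model embedding with a toy character-trigram count vector so that it runs without any model; the functions `embed`, `cosine`, and `best_match` and the example terms are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a language-model embedding: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(description: str, reference_terms: list[str]) -> tuple[str, float]:
    """Return the most similar reference term and its similarity score."""
    query = embed(description)
    scored = [(term, cosine(query, embed(term))) for term in reference_terms]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference terms, for illustration only.
example_reference_terms = ["Age at baseline", "Body mass index", "Years of education"]
match, score = best_match("Age at baseline visit", example_reference_terms)
```

In a real setting, `embed` would be replaced by a call to a language model, which is what allows descriptions with different wording but the same meaning to land close together; the trigram vectors used here only capture surface similarity.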
Citations
[1] Reich C, Ostropolets A, Ryan P, Rijnbeek P, Schuemie M, Davydov A, et al. OHDSI Standardized Vocabularies—a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc. 2024 Feb 16;31(3):583–90.
[2] Grinnon ST, Miller K, Marler JR, Lu Y, Stout A, Odenkirchen J, et al. National Institute of Neurological Disorders and Stroke Common Data Element Project – approach and methods. Clin Trials. 2012 Jun;9(3):322–9.
[3] Voigt P, Von Dem Bussche A. The EU General Data Protection Regulation (GDPR). Cham: Springer International Publishing; 2017.
[4] Edemekong PF, Annamaraju P, Haydel MJ. Health Insurance Portability and Accountability Act. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2024.
[5] Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.
[6] Salimi Y, Adams T, Ay MC, Balabin H, Jacobs M, Hofmann-Apitius M. On the Utility of Large Language Model Embeddings for Revolutionizing Semantic Data Harmonization in Alzheimer’s and Parkinson’s Disease. 2024. Preprint available at https://doi.org/10.21203/rs.3.rs-4108029/v1.