Aparna Krishnan

Strategies for rapid semi-automated generation of indication-wide biomedical data landscapes

Aparna Krishnan explains her Master's thesis in which she developed a workflow to rapidly generate variable data landscapes of clinical studies available for any indication area, such as 'aging and longevity'.

Background

Biomedical research is evolving rapidly due to immense data growth and technological advancements, but the dispersion of data across numerous specialized databases complicates a comprehensive understanding of diseases [1]. This fragmentation and the lack of awareness of independent datasets for AI model validation lead to biases and poor model generalizability [2]. For instance, the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset has been extensively analyzed, often overshadowing other datasets in Alzheimer's research [3]. Moreover, the process of collecting, characterizing, and qualifying data is tedious and time-consuming [4]. Current platforms like DataMed [5] attempt to integrate biomedical datasets but are often limited by a narrow focus on specific data communities and schemas.

To address this challenge, I developed a semi-automated workflow for rapidly generating variable data landscapes using comprehensive data catalogs of clinical study data. These catalogs provide quick and structured overviews of available datasets in a given indication area – such as aging and longevity or neurodegenerative diseases – helping researchers discover and utilize independent datasets for model validation [6]. I also aimed to establish a primary data dictionary as a foundational structure for organizing and categorizing study data across multiple research modalities. The approach was tested in the domain of aging and longevity, where data-driven insights are essential for understanding health span, disease progression, and interventions to promote longevity [7].

Methods

The workflow is showcased in Figure 1.

1. Data Collection and Preprocessing

This step began with the development of precise research queries aimed at retrieving studies related to ‘Healthy Aging and Longevity’ from PubMed and ClinicalTrials.gov. Due to the vast number of articles available, ranking methods – SCImago Journal Rank (SJR), Best Matching 25 (BM25), BioBERT, and OpenAI embeddings – were employed to determine which studies were most relevant to the formulated queries. The selected documents were then standardized, followed by the collection of full-text PDFs to create a structured corpus.
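To illustrate how a lexical ranker such as BM25 orders candidate studies against a query, here is a minimal sketch of Okapi BM25 in plain Python. It is not the thesis implementation: the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF variant, and the three example abstracts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would use a proper tokenizer)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    query_terms = set(tokenize(query))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (the "+1" variant), which is never negative.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical abstracts, used only to exercise the function.
abstracts = [
    "healthy aging and longevity in a cohort of older adults",
    "deep learning for image segmentation",
    "longevity interventions and health span in aging populations",
]
scores = bm25_scores("healthy aging longevity", abstracts)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)
```

An embedding-based ranker would replace the term-matching score with a similarity between query and document vectors, while the final sort over scores stays the same.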

2. Laser AI Optimization

In this phase, I used Laser AI [8], a software tool developed by the company Evidence Prime that is designed for extracting variables and features from full-text articles using its built-in AI/ML models. The full-text PDFs were uploaded to Laser AI, which was then optimized to extract key features such as study characteristics, variable definitions, and inclusion/exclusion criteria.

3. Data Catalog Generation

The final step focused on cleaning and harmonizing the extracted data to ensure consistency and enhance quality. Datasets were merged to identify overlapping variables, enabling cross-study comparisons. A data dictionary was created to document variable definitions, providing a detailed and standardized reference for the final data catalog.
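The merging step can be pictured as building an index from harmonized variable names to the studies that report them. The sketch below is an assumption-laden illustration, not the workflow's actual code: the study identifiers, the variable names, and the simple normalization rule (trim, lowercase, unify separators) are hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Harmonize a variable name: trim, lowercase, and unify separators."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def variable_overlap(studies):
    """Map each harmonized variable name to the sorted list of studies reporting it."""
    index = defaultdict(set)
    for study_id, variables in studies.items():
        for variable in variables:
            index[normalize(variable)].add(study_id)
    return {variable: sorted(ids) for variable, ids in index.items()}


# Hypothetical extraction output: study identifier -> raw variable names.
extracted = {
    "study_A": ["Age", "Sex", "Grip Strength"],
    "study_B": ["age ", "BMI", "grip-strength"],
    "study_C": ["Sex", "BMI", "Telomere Length"],
}

overlap = variable_overlap(extracted)
# Variables reported by more than one study are candidates for cross-study comparison.
shared = {variable: ids for variable, ids in overlap.items() if len(ids) > 1}
```

In practice, harmonization usually needs more than string normalization, for example synonym lists or mapping to a controlled vocabulary, but the indexing idea is the same.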

Fig. 1: Overall workflow diagram.

Results and Conclusion

The analysis showed that OpenAI embeddings improved precision and relevance in ranking studies compared to traditional metrics like SJR and BM25, achieving a precision of 89% and an F1 score of 85%.
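For readers unfamiliar with these metrics, the following sketch shows how precision, recall, and F1 are computed for a set of retrieved studies against a set of studies judged relevant. The source does not state how the evaluation was set up (for example, top-k cutoffs or score thresholds), so the function and the study identifiers below are purely illustrative.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based precision, recall, and F1 for a retrieval result."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 studies retrieved, 5 judged relevant, 3 in common.
retrieved_ids = ["s1", "s2", "s3", "s4"]
relevant_ids = ["s1", "s2", "s3", "s5", "s6"]
precision, recall, f1 = precision_recall_f1(retrieved_ids, relevant_ids)
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why a ranker needs both high precision and high recall to score well.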

The generated data catalog (Table 1) compiles clinical studies on aging and longevity, offering a structured overview of variables such as study objectives, sample size, endpoints, and population demographics. This facilitates quick insights and comparisons across patient-level studies. A data dictionary consisting of 64 features was also generated. It describes the structure and content of the catalog, defining each feature by its name, definition, data type, format, and additional notes.
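One way to represent a data dictionary entry with the five attributes listed above (name, definition, data type, format, notes) is a small record type. The sketch below is an assumption: the class name and the two example entries are hypothetical and are not taken from the 64 features of the actual dictionary.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DictionaryEntry:
    """One feature of the data dictionary (field names mirror the attributes in the text)."""
    name: str
    definition: str
    data_type: str
    format: str
    notes: str = ""


# Hypothetical entries for illustration only.
entries = [
    DictionaryEntry(
        name="sample_size",
        definition="Number of participants enrolled in the study",
        data_type="integer",
        format="non-negative whole number",
    ),
    DictionaryEntry(
        name="study_objective",
        definition="Primary aim stated by the study authors",
        data_type="string",
        format="free text",
        notes="Taken from the abstract when available",
    ),
]

# Index the entries by feature name for lookup or export.
data_dictionary = {entry.name: asdict(entry) for entry in entries}
```

Keeping every entry in the same fixed shape makes it straightforward to export the dictionary as a table or to map its fields to a common data model later.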

Future improvements could involve validating extractions against gold standards and expanding the workflow to other datasets and clinical domains. Gold standards refer to well-established aging-related datasets, such as the Human Aging and Longevity Landscape (HALL) [9]. The data dictionary can be further mapped to a Common Data Model (CDM) like the Observational Medical Outcomes Partnership (OMOP) for enhanced semantic interoperability, ensuring that data from diverse studies can be meaningfully compared and utilized [10].

Table 1: Comprehensive data landscape sample generated for patient-level studies.

Citations

[1] J. Luo, et al., “Big Data Application in Biomedical Research and Health Care: A Literature Review,” Biomedical Informatics Insights, vol. 8, pp. 1–10, 2016. https://doi.org/10.4137/BII.S31559

[2] N. N. Chu, and H. Gebre-Amlak, “Navigating Neuroimaging Datasets ADNI for Alzheimer's Disease,” IEEE Consumer Electronics Magazine, 2021. https://doi.org/10.1109/MCE.2021.3056872

[3] N. N. Chu, and H. Gebre-Amlak, “Navigating Neuroimaging Datasets ADNI for Alzheimer's Disease,” IEEE Consumer Electronics Magazine, 2021. https://doi.org/10.1109/MCE.2021.3056872

[4] M. Martínez-García, and E. Hernández-Lemus, “Data Integration Challenges for Machine Learning in Precision Medicine,” Frontiers in Medicine, vol. 8, 2022. https://doi.org/10.3389/fmed.2021.784455

[5] T. Cohen, K. Roberts, A. E. Gururaj, X. Chen, S. Pournejati, G. Alter, W. R. Hersh, D. Demner-Fushman, L. Ohno-Machado, and H. Xu, “A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge,” Database, 2017. https://doi.org/10.1093/DATABASE/BAX061

[6] J. Stillerman, T. Fredian, M. Greenwald, and G. Manduchi, “Data Catalog Project - A Browsable, Searchable, Metadata System,” United States, 2022. https://doi.org/10.7910/DVN/5EZSZC

[7] J. J. Carmona, and S. Michan, “Biology of Healthy Aging and Longevity,” Revista de Investigacion Clinica, vol. 68, no. 1, pp. 7–16, 2016.

[8] Evidence Prime, Laser AI. https://www.laser.ai/product

[9] H. Li, et al., “HALL: a comprehensive database for human aging and longevity studies,” Nucleic Acids Research, vol. 52, no. D1, pp. D909–D918, 2024. https://doi.org/10.1093/nar/gkad880

[10] Data Standardization – OHDSI. https://www.ohdsi.org/data-standardization/