CFDE Highlights

Building a Healthier Ecosystem: CFDE Expands with New Data Coordinating Centers and Partnerships

The Common Fund Data Ecosystem (CFDE) program aims to enable broad use of the data generated by the Common Fund's many programs by creating a data ecosystem: the management infrastructure, analytics, applications, and user interfaces needed to work within and across existing Common Fund data sets. To continue developing and growing this ecosystem, the CFDE has now funded two more Common Fund Data Coordinating Centers (DCCs), along with additional partnership projects among DCCs.

Data Coordinating Centers newly engaged with CFDE

A DCC representing the Molecular Transducers of Physical Activity Consortium (MoTrPAC) Program formally joined CFDE in 2022. It will work to integrate data on the molecular changes caused by exercise, collected from both humans and animal models, into CFDE. The DCCs from two new Common Fund (CF) programs, Cellular Senescence Network (SenNet) and Bridge to Artificial Intelligence (Bridge2AI), will also begin to engage in CFDE. Bridge2AI will generate flagship data sets that are ethically sourced, trustworthy, well-defined, accessible, and truly representative of the diversity of the population. SenNet will generate publicly accessible atlases of senescent cells and the molecules they secrete, using data collected from multiple human and model organism tissues, with a particular emphasis on single-cell data.

The addition of MoTrPAC, SenNet, and Bridge2AI will help expand CFDE by contributing a wealth of clinical information while increasing the diversity of data types included within CFDE. They join ten DCCs from the 4D Nucleome (4DN), Extracellular RNA Communication (ExRNA), Gabriella Miller Kids First (Kids First), Glycoscience, Genotype-Tissue Expression (GTEx), The Human BioMolecular Atlas Program (HuBMAP), Illuminating the Druggable Genome (IDG), Library of Integrated Network-based Cellular Signatures (LINCS), Metabolomics, and Stimulating Peripheral Activity to Relieve Conditions (SPARC) programs. Together with the CFDE-Coordination Center (CFDE-CC), these awardee teams are continuing to develop processes for harmonizing basic metadata elements, provide data sets for the CFDE Portal, build a culture of sharing insight and knowledge across DCCs, and contribute to CFDE-wide training and outreach efforts.

New Partnerships among Data Coordinating Centers 

Five new DCC partnership projects have also been funded by the CFDE. These collaborative projects will develop approaches and tools to harmonize data and workflows from multiple Common Fund programs, enabling cross-dataset analysis. The partnerships are meant to enhance DCC-DCC interactions and to demonstrate to the broader scientific community the utility of their data integration tools and approaches for CF datasets. The projects and their partnering DCCs are:

  • Workflow Playbook: Partnering DCCs: ExRNA, Glycoscience, Kids First, LINCS, and Metabolomics

This project will develop an interactive workflow engine that will draw knowledge from across CF DCCs. Several CF DCC tools, APIs, and databases will serve the CFDE Workflow Playbook (CFDE-WP) framework. The CFDE-WP will be a network of DCC microservices served through a user interface, with nodes representing semantic types (for example, gene sets, signatures, diseases, or drugs) and edges representing transformations or visualizations of these objects performed by various tools. Users will be able to upload their own data and analyze it in the context of cross-CF data and tools, dynamically constructing workflows and implementing unique use cases to generate novel hypotheses.
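
The node-and-edge design described above can be pictured with a small sketch. This is an illustration only, not the CFDE-WP implementation: the semantic type names, the tool names, and the `find_workflow` helper are all hypothetical, and a breadth-first search stands in for however the real engine chains tools together.

```python
from collections import deque

# Hypothetical edges: (input semantic type, tool name, output semantic type).
# The tool names are illustrative placeholders, not actual CFDE-WP services.
EDGES = [
    ("gene_set", "enrichment_analysis", "disease"),
    ("gene_set", "signature_search", "signature"),
    ("signature", "drug_lookup", "drug"),
    ("disease", "drug_lookup_by_disease", "drug"),
]


def find_workflow(start, goal, edges=EDGES):
    """Return the shortest list of (tool, output_type) steps from start to goal.

    Uses breadth-first search over semantic types; returns None when no
    chain of tools connects the two types.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for src, tool, dst in edges:
            if src == node and dst not in visited:
                visited.add(dst)
                queue.append((dst, path + [(tool, dst)]))
    return None


if __name__ == "__main__":
    # Chain tools that turn an uploaded gene set into candidate drugs.
    for tool, output_type in find_workflow("gene_set", "drug"):
        print(f"{tool} -> {output_type}")
```

Treating semantic types as nodes means a new tool only has to declare its input and output types to become reachable from existing workflows.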

  • RNA Seq:  Partnering DCCs: GTEx, HuBMAP, Kids First, and SPARC

This project will produce common harmonized RNAseq data resources for the CFDE, along with harmonized processing pipeline(s) for further use, to increase the FAIRness and interoperability of the RNA datasets in the CFDE. This will involve deploying a standard RNAseq pipeline across the DCCs, based on a common GENCODE reference and using a revised aligner to improve detection of rare variant effects, and reprocessing all of the existing accessible bulk-tissue RNAseq reference resources. The harmonized data resource and processing pipeline will be made widely available. We will also harmonize all single-cell RNAseq (scRNAseq) datasets using the HuBMAP pipeline, with the goal of evaluating the results, incorporating any improvements into that pipeline, and producing harmonized scRNAseq data across the CFDE.

  • Data Distillery: Partnering DCCs: 4DN, ExRNA, Glycoscience, GTEx, HuBMAP, IDG, Kids First, LINCS, Metabolomics, and SPARC

This partnership will produce the largest research knowledge graph database yet of integrated NIH project data, with hundreds of millions of experimental and ontological data points and relationships mapped, including from the NIH Unified Medical Language System (UMLS). This knowledge graph will provide a rich and connected data space for biomedical search, analysis, and machine learning applications.

  • Making Gene Regulatory Knowledge FAIR: Partnering DCCs: ExRNA, GTEx, and Kids First

The project will focus on gene regulatory element knowledge as the key “stepping stone” connecting genes, pathways, and regulators in tissue-specific, developmental, and disease contexts. This approach will combine existing information from CFDE DCCs into knowledge that will then be applied to generate more knowledge, thus igniting a virtuous cycle of FAIR knowledge creation. Because most genetic disease risk in humans is attributable to genetic variants impacting regulatory elements, the knowledge gained from this project will be key to interpreting whole genome sequence data from current and future projects.

  • Clinical Observations and Vocabularies (CLOVoc): Partnering DCCs: Kids First, Metabolomics, and SPARC

The goal of the CLOVoc project is to improve the ability to query and integrate across CF datasets for a given disease/phenotype or a clinical profile, allowing secondary analyses that drive insights about health and disease. In CLOVoc I, we developed interoperability across clinical resources in CFDE through a FAIR minimal clinical metadata framework and harmonized Fast Healthcare Interoperability Resources (FHIR) profiles. CLOVoc II will enhance the capabilities developed in CLOVoc I by (i) developing CLOVoc knowledge graph(s) from FHIR profiles, (ii) managing patient data values, and (iii) deploying learning systems using the CLOVoc knowledge graphs and algorithms to demonstrate our use cases. One such use case will be to characterize Type 1 diabetes mellitus, a spectral disorder, into its clinical variants and underlying endotypes. The outputs from this project will be synergistic and complementary with other CFDE efforts.

Learn more about these CFDE awards by visiting the Funded Research page.

Common Fund Data Ecosystem: A New Frontier in Biomedical Research

Innovative collaborations will create useful tools for scientific discovery

The Common Fund Data Ecosystem (CFDE) aims to enable new ways of doing science by creating an ecosystem—the data management infrastructure, analytics, applications, and user interfaces needed to work within and across existing Common Fund data sets. The CFDE took a major step toward creating this resource by launching a set of collaborative projects that bring together eight Common Fund Data Coordinating Centers (DCCs) to help build this functional ecosystem for answering important biological questions, such as uncovering new molecular pathways and illuminating disease mechanisms.

The Common Fund DCCs will contribute a wealth of diverse data sets, spanning basic biology to clinical research, and will work towards making their data more useful alone and in combination with other data sets. The participating DCCs include Extracellular RNA Communication (ExRNA), Gabriella Miller Kids First (Kids First), Genotype-Tissue Expression (GTEx), The Human BioMolecular Atlas Program (HuBMAP), Illuminating the Druggable Genome (IDG), Library of Integrated Network-based Cellular Signatures (LINCS), Metabolomics, and Stimulating Peripheral Activity to Relieve Conditions (SPARC) programs. Their collaborative projects will tackle important challenges in biomedical research and human health, including (but not limited to):

  • Innovative strategies for data-driven treatment planning—coupling drug and small molecule predictions with patient gene activity data to uncover key molecular pathways and help with developing effective treatment strategies, predicting drug responses, identifying the best candidate drugs for specific patients, and tracking disease progression and recovery. Participating DCCs: GTEx, IDG, Kids First, LINCS, Metabolomics
  • New drug targets for pediatric cancer treatments—identifying new potential therapeutic targets for specific types of pediatric cancers by comparing the gene activity differences between tumors and healthy organ tissue. Participating DCCs: GTEx, Kids First, LINCS
  • Novel insights into complex conditions—generating multi-layered organ maps that will incorporate genetic mutations, structural birth defects, and gene activity changes during development, to create a powerful tool for studying complex conditions like Down syndrome. Participating DCCs: ExRNA, HuBMAP, Kids First, SPARC
  • Solutions for working with data in the cloud—exploring new ways to combine data sets and discover solutions for working across independent cloud-based platforms. Participating DCCs: ExRNA, GTEx, HuBMAP, IDG, Kids First, LINCS, Metabolomics, SPARC

Demonstrating the value of these data sets, particularly in combination, will help the research community see what kinds of new research questions can be asked of and answered by the data. CFDE will also make the data more accessible through a cloud-based public web portal. As these exciting projects begin, they hold the potential for opening new doors to scientific discovery and informing innovative approaches to improving human health.

Learn more about these CFDE engagement awards by visiting the Funded Research page.

Enhancing the Utility of Common Fund Data Sets: Maximizing Data Set Usage

To maximize the impact of Common Fund generated data, engage a broader community of end users for wider adoption of these data sets, and obtain feedback to enhance the data resources, the CFDE supported small research projects (R03) encouraging the use of Common Fund data sets. Projects are intended to enable novel and compelling biological questions to be formulated and addressed, and/or to generate cross-cutting hypotheses for future research. Projects currently supported include:

  • Methods to maximize the utility of common fund functional genomic data in multi-ethnic genetic studies. Data sets used: GTEx, 4DN 

This study will maximize the utility of Common Fund functional genomic data for multi-ethnic studies of smoking and drinking addiction. By integrating GTEx, 4DN, and ENCODE data sets with additional non-European functional genomic data sets, this project aims to improve gene expression prediction accuracy across different tissue types and multi-ethnic ancestries.

  • Durable Common Fund Data Interfaces and Tutorials with Bioconductor. Data sets used: 4DN, IDG, GTEx 

Bioconductor/R is a widely used set of tools for analyzing high-throughput genomic data. This project will produce Common Fund datasets (4DN, IDG, GTEx) that the Bioconductor/R environment can use for genomic science, with workspaces on NHGRI's AnVIL.

  • Constructing High-Resolution Ensemble Models of 3D Single-Cell Chromatin Conformations of eQTL Loci from Integrated Analysis of 4DN-GTEx Data towards Structural Basis of Differential Gene Expression. Data sets used: 4DN, GTEx 

This project aims to develop novel computational tools for understanding the relationship between gene expression and gene topology, based on datasets from the 4D Nucleome (4DN) and Genotype-Tissue Expression (GTEx) programs.

  • Deep Phenotyping of 3D Data for Candidate Gene Selection from Kids First Studies. Data sets used: KOMP2, Kids First

This study will examine the relationship between asymmetry and susceptibility to developmental disorders in a model organism (mouse) using quantitative analysis of KOMP2 and Kids First data sets.

  • Using Phosphorylation Signatures of Drug Perturbagens to Identify Exercisemimetic Compounds. Data sets used: LINCS, MoTrPAC

This study will explore whether known compounds can mimic the effects of physical activity. To accomplish this, the PTM signatures database (PTMsigDB) will be significantly expanded using the LINCS data. These signatures will then be correlated with the phosphoproteomic changes induced by physical activity, provided by MoTrPAC, to suggest exercise-mimicking drugs.

  • Using Common Fund Datasets for Xenobiotic Localization. Data sets used: LINCS, IDG, Metabolomics

This project aims to develop a novel platform with computational tools for better understanding the subcellular localization of xenobiotic molecules in the body. The researchers will use IDG, LINCS, and Metabolomics data sets to predict how the subcellular localization of specific xenobiotic molecules can be improved so that they are more efficacious and less toxic.

  • Interrogation and Interpretation of Common Fund Data Sets to Identify Novel Ocular Disease Genes. Data sets used: KOMP2, GTEx

This project aims to identify all mouse retinal disease genes in KOMP2, use GTEx and an aligned database, EyeGEx, to find human gene homologs, and then leverage ocular GWAS, pathway analysis, and literature searches to add biological evidence to a list of novel candidate genes that may impact blindness.

  • Unraveling the Topological Architecture and Phenotypic Contexture of Structural Variation. Data sets used: 4DN, Epigenomics, GTEx, Kids First

This project aims to integrate 4DN, Epigenomics, and GTEx data to provide an architecture (germline variation, genome topology, and chromatin structure) for exploring gene expression in pediatric and adult cancer tumor samples.

  • Using three-dimensional genome structure to refine eQTL detection. Data sets used: 4DN, GTEx 

This project aims to use 4DN and GTEx datasets to generate a list of cis-regulatory element (CRE)-gene linkages to improve the identification of eQTLs by reducing the search space, decreasing computational loads, and increasing the statistical power for eQTL detection.
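
One way to see why a smaller search space can increase statistical power is through multiple-testing arithmetic. The sketch below is illustrative only: it assumes a Bonferroni correction, which the project description does not specify, and the variant and gene counts are hypothetical round numbers, not figures from 4DN or GTEx.

```python
def count_tests(variants_per_gene, n_genes):
    """Number of variant-gene pairs tested for association."""
    return variants_per_gene * n_genes


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold under a Bonferroni correction (assumed here)."""
    return alpha / n_tests


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
N_GENES = 20_000
all_cis = count_tests(5_000, N_GENES)  # every variant in a cis window: 100,000,000 tests
cre_only = count_tests(50, N_GENES)    # only variants in linked CREs: 1,000,000 tests

if __name__ == "__main__":
    print("tests, all cis variants:", all_cis)
    print("tests, CRE-linked only: ", cre_only)
    print("threshold, all cis:     ", bonferroni_threshold(all_cis))
    print("threshold, CRE-linked:  ", bonferroni_threshold(cre_only))
```

With these made-up counts, restricting tests to CRE-linked variants cuts the number of tests 100-fold and relaxes the per-test significance threshold by the same factor, so weaker true associations can still pass.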

  • Investigating Systems Physiology with Multi-Omics Data. Data sets used: MoTrPAC, GTEx

This project will leverage GTEx and MoTrPAC data to create cross-tissue gene expression and protein expression correlations. These correlations will be used to generate testable hypotheses about cross-tissue and cross-organ protein endocrine signals.

To learn more click here.

This page last reviewed on April 17, 2023