The Common Fund Data Ecosystem (CFDE) program aims to enable broad use of the data generated by the Common Fund's many programs by creating a data ecosystem: the management infrastructure, analytics, applications, and user interfaces needed to work within and across existing Common Fund data sets. To continue developing and growing this ecosystem, the CFDE has now funded two more Common Fund Data Coordinating Centers (DCCs) as well as additional partnership projects among DCCs.
Data Coordinating Centers newly engaged with CFDE
The new DCCs joining the CFDE represent the Glycoscience and 4DN programs.
- The Glycoscience DCC will work to integrate glycoscience data from its knowledgebase, GlyGen, into the CFDE. GlyGen incorporates and harmonizes data for glycans, proteins, and glycoproteins from many sources, aiming to map connections among glycans, genes, and proteins.
- The 4D Nucleome (4DN) DCC further expands the CFDE by adding multimodal data, including sequencing-based and imaging data, aimed at understanding how three-dimensional chromosomal interactions affect long-range gene regulation, chromosomal dynamics under perturbation, and non-coding variants in the genome.
The addition of these DCCs and their datasets will expand the CFDE by contributing a wealth of information about nuclear organization and the roles that glycans play in organisms, while increasing the diversity of data types included within the CFDE. They join the eight DCCs initially funded in FY20, which represent the Extracellular RNA Communication (ExRNA), Gabriella Miller Kids First (Kids First), Genotype-Tissue Expression (GTEx), The Human BioMolecular Atlas Program (HuBMAP), Illuminating the Druggable Genome (IDG), Library of Integrated Network-based Cellular Signatures (LINCS), Metabolomics, and Stimulating Peripheral Activity to Relieve Conditions (SPARC) programs. Together with the CFDE-Coordination Center (CFDE-CC), these awardee teams are continuing to develop processes for harmonizing basic metadata elements, provide data sets for the CFDE Portal, build a culture of sharing insight and knowledge across DCCs, and contribute to CFDE-wide training and outreach efforts.
New Partnerships among Data Coordinating Centers
Six new DCC partnership projects have also been funded by the CFDE. These collaborative projects will develop approaches and tools to harmonize data and workflows from multiple Common Fund programs, enabling cross-dataset analysis. The partnerships are meant to enhance DCC-DCC interactions and to demonstrate the utility of their data integration tools and approaches for Common Fund datasets to the broader scientific community. The projects and partnering DCCs are:
- Anatomical Interoperation of Resources: Partnering DCCs: SPARC, HuBMAP
This project will compare the spatial distribution of gene expression in the heart across developmental stages and across health and disease states, which is critical to improving understanding of cardiac pathologies. It will use data from the SPARC and HuBMAP programs, registering tissue architecture, neural and/or vascular tracings, RNA-seq, and other data types against a common coordinate cardiac spatial scaffold.
- Gene Burden Testing: Partnering DCCs: Kids First, HuBMAP
This project will enhance the capabilities of the HuBMAP Knowledge Graph. The aim is to enable HuBMAP and Kids First workflows to run seamlessly on both HuBMAP and Kids First infrastructure, and to establish standards and solutions that point the way to broader workflow interoperability within the CFDE. The Knowledge Graph will make it possible to find and access the data sets relevant to queries such as “do children with congenital disabilities have an overabundance of variants in genes that are expressed in specific cell types in tissues of interest?”
- CFDE Gene Centric Prototype Dashboard: Partnering DCCs: ExRNA, Glycoscience, GTEx, HuBMAP, IDG, Kids First, LINCS, and Metabolomics
This project will develop methods to harmonize gene, protein, and RNA identifiers and generate a cloud workspace that pools gene information from DCCs for use cases. This will involve developing standards for gene landing pages and a gene-centered API, and building a prototype dashboard for gene cards from the DCCs and other resources.
- CLinical Observations and Vocabularies (CLOVoc): Partnering DCCs: Kids First, Metabolomics, SPARC
This project will build FAIR metadata about human clinical data and facilitate interoperability among these datasets. The effort will develop a minimal clinical metadata framework and APIs to facilitate discoverability and interoperability, and will develop FHIR profiles of clinical metadata across the partnering DCCs. The goals are to improve the ability to query across Common Fund datasets for a given disease, phenotype, or clinical profile, and to integrate different datasets so that they are interoperable and reusable for secondary analyses.
- Aggregation and Sharing of Variant-centric Information: Partnering DCCs: ExRNA, GTEx, and Kids First
This project aims to make CFDE variant data FAIR by establishing a framework that derives information about specific variants and regulatory elements from high-volume -omics profiling datasets, to help interpret non-coding variants.
- Toxicology Screening Pipeline: Partnering DCCs: IDG, Kids First, LINCS, and SPARC
This project will develop a pipeline infrastructure that will tag CFDE Portal records for genes, their products, and small-molecule xenobiotics with labels of toxicity potential for reproductive and developmental processes.
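As a rough illustration of the identifier harmonization described for the Gene Centric Prototype Dashboard project above, the Python sketch below maps gene, transcript-independent Ensembl, and protein identifiers onto one canonical gene symbol and pools records by that symbol. It is not the project's implementation: the tiny cross-reference table, the function names `canonical_gene` and `pool_by_gene`, and the `source_*` labels are all hypothetical, and a real system would draw its cross-references from curated resources rather than a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical, hand-written cross-reference table (alias -> gene symbol).
# The Ensembl and UniProt identifiers shown are the public ones for TP53 and EGFR.
XREF = {
    "ENSG00000141510": "TP53",
    "P04637": "TP53",
    "TP53": "TP53",
    "ENSG00000146648": "EGFR",
    "P00533": "EGFR",
    "EGFR": "EGFR",
}


def canonical_gene(identifier):
    """Return the canonical gene symbol for an identifier, or None if unmapped."""
    key = identifier.strip().upper()
    if key.startswith("ENSG"):
        # Drop an Ensembl version suffix such as ".18" before lookup.
        key = key.split(".")[0]
    return XREF.get(key)


def pool_by_gene(records):
    """Group (source, identifier) records by canonical gene symbol.

    Returns a dict of gene -> list of sources, plus the records that
    could not be mapped, so nothing is silently dropped.
    """
    pooled = defaultdict(list)
    unmapped = []
    for source, identifier in records:
        gene = canonical_gene(identifier)
        if gene is None:
            unmapped.append((source, identifier))
        else:
            pooled[gene].append(source)
    return dict(pooled), unmapped


# Illustrative records only; the source labels do not refer to real DCC data.
records = [
    ("source_a", "ENSG00000141510.18"),
    ("source_b", "P04637"),
    ("source_c", "tp53"),
    ("source_a", "ENSG00000146648.17"),
    ("source_d", "XYZ123"),
]
pooled, unmapped = pool_by_gene(records)
print(pooled)
print(unmapped)
```

Returning the unmapped records alongside the pooled ones is a deliberate choice in this sketch: identifiers that fail to resolve are usually the ones that need curation.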
Learn more about these CFDE awards by visiting the Funded Research page.
Innovative collaborations will create useful tools for scientific discovery
The Common Fund Data Ecosystem (CFDE) aims to enable new ways of doing science by creating an ecosystem—the data management infrastructure, analytics, applications, and user interfaces needed to work within and across existing Common Fund data sets. The CFDE took a major step toward creating this resource by launching a set of collaborative projects that bring together eight Common Fund Data Coordinating Centers (DCCs) to help build this functional ecosystem for answering important biological questions, such as uncovering new molecular pathways and illuminating disease mechanisms.
The Common Fund DCCs will contribute a wealth of diverse data sets, spanning basic biology to clinical research, and will work towards making their data more useful alone and in combination with other data sets. The participating DCCs include Extracellular RNA Communication (ExRNA), Gabriella Miller Kids First (Kids First), Genotype-Tissue Expression (GTEx), The Human BioMolecular Atlas Program (HuBMAP), Illuminating the Druggable Genome (IDG), Library of Integrated Network-based Cellular Signatures (LINCS), Metabolomics, and Stimulating Peripheral Activity to Relieve Conditions (SPARC) programs. Their collaborative projects will tackle important challenges in biomedical research and human health, including (but not limited to):
- Innovative strategies for data-driven treatment planning—coupling drug and small molecule predictions with patient gene activity data to uncover key molecular pathways and help with developing effective treatment strategies, predicting drug responses, identifying the best candidate drugs for specific patients, and tracking disease progression and recovery. Participating DCCs: GTEx, IDG, Kids First, LINCS, Metabolomics
- New drug targets for pediatric cancer treatments—identifying new potential therapeutic targets for specific types of pediatric cancers by comparing the gene activity differences between tumors and healthy organ tissue. Participating DCCs: GTEx, Kids First, LINCS
- Novel insights into complex conditions—generating multi-layered organ maps that will incorporate genetic mutations, structural birth defects, and gene activity changes during development, to create a powerful tool for studying complex conditions like Down syndrome. Participating DCCs: ExRNA, HuBMAP, Kids First, SPARC
- Solutions for working with data in the cloud—exploring new ways to combine data sets and discover solutions for working across independent cloud-based platforms. Participating DCCs: ExRNA, GTEx, HuBMAP, IDG, Kids First, LINCS, Metabolomics, SPARC
Demonstrating the value of these data sets, particularly in combination, will help the research community see what kinds of new research questions can be asked of and answered by the data. CFDE will also make the data more accessible through a cloud-based public web portal. As these exciting projects begin, they hold the potential for opening new doors to scientific discovery and informing innovative approaches to improving human health.
Learn more about these CFDE engagement awards by visiting the Funded Research page.
To maximize the impact of Common Fund-generated data, engage a broader community of end users for wider adoption of these data sets, and obtain feedback to enhance the data resources, the CFDE supported small research projects (R03) that encourage the use of Common Fund data sets. Projects are intended to enable novel and compelling biological questions to be formulated and addressed, and/or to generate cross-cutting hypotheses for future research. Projects currently supported include:
- Methods to maximize the utility of Common Fund functional genomic data in multi-ethnic genetic studies. Data sets used: GTEx, 4DN
This study will maximize the utility of Common Fund functional genomic data for multi-ethnic studies of smoking and drinking addiction. By integrating GTEx, 4DN, and ENCODE data sets with other non-European functional genomic datasets, the project aims to improve gene expression prediction accuracy across different tissue types and ancestries.
- Durable Common Fund Data Interfaces and Tutorials with Bioconductor. Data sets used: 4DN, IDG, GTEx
Bioconductor/R is a widely used set of tools for high-throughput genomic data. This project will produce Common Fund datasets (4DN, IDG, GTEx) that the Bioconductor/R environment can use for genomic science with workspaces on NHGRI's AnVIL.
- Constructing High-Resolution Ensemble Models of 3D Single-Cell Chromatin Conformations of eQTL Loci from Integrated Analysis of 4DN-GTEx Data towards Structural Basis of Differential Gene Expression. Data sets used: 4DN, GTEx
This project aims to develop novel computational tools for understanding the relationship between gene expression and gene topology, based on datasets from the 4D Nucleome (4DN) and Genotype-Tissue Expression (GTEx) programs.
- Deep Phenotyping of 3D Data for Candidate Gene Selection from Kids First Studies. Data sets used: KOMP2, Kids First
This study will examine the relationship between asymmetry and susceptibility to developmental disorders in a model organism (mouse) using quantitative analysis of KOMP2 and Kids First data sets.
- Using Phosphorylation Signatures of Drug Perturbagens to Identify Exercisemimetic Compounds. Data sets used: LINCS, MoTrPAC
This study will explore whether known compounds can mimic the effects of physical activity. To accomplish this, the PTM signatures database (PTMsigDB) will be significantly expanded using the LINCS data. These signatures will then be correlated with the phosphoproteomic changes induced by physical activity, as measured by MoTrPAC, to suggest exercise-mimicking drugs.
- Using Common Fund Datasets for Xenobiotic Localization. Data sets used: LINCS, IDG, Metabolomics
This project aims to develop a novel platform with computational tools for a better understanding of the subcellular localization of xenobiotic molecules in the body. The researchers will use IDG, LINCS, and Metabolomics data sets to provide predictions on improving subcellular localization of specific xenobiotic molecules so that they can be more efficacious and less toxic.
- Interrogation and Interpretation of Common Fund Data Sets to Identify Novel Ocular Disease Genes. Data sets used: KOMP2, GTEx
This project aims to identify all mouse retinal disease genes in KOMP2, use GTEx and an aligned database, EyeGEx, to find human gene homologs, and then leverage ocular GWAS studies, pathway analysis, and literature searches to add biological evidence to a list of novel candidate genes that may affect blindness.
- Unraveling the Topological Architecture and Phenotypic Contexture of Structural Variation. Data sets used: 4DN, Epigenomics, GTEx, Kids First
This project aims to integrate 4DN, Epigenomics, and GTEx data to provide an architecture (germline variation, genome topology, and chromatin structure) for exploring gene expression in pediatric and adult cancer tumor samples.
- Using three-dimensional genome structure to refine eQTL detection. Data sets used: 4DN, GTEx
This project aims to use 4DN and GTEx datasets to generate a list of cis-regulatory element (CRE)-gene linkages that improves the identification of eQTLs by reducing the search space, which decreases computational load and increases the statistical power for eQTL detection.
- Investigating Systems Physiology with Multi-Omics Data. Data sets used: MoTrPAC, GTEx
This project will leverage GTEx and MoTrPAC data to compute cross-tissue correlations of gene expression and protein expression. These correlations will be used to generate testable hypotheses about cross-tissue and cross-organ protein endocrine signals.
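The eQTL refinement project listed above argues that a smaller search space yields more statistical power. One way to see why is through multiple-testing correction: fewer variant-gene tests mean a less stringent per-test threshold. The Python sketch below works through a toy case. It is an assumption-laden illustration, not the project's method: the variant positions, gene start sites, CRE intervals, the 500-unit window, and the choice of a Bonferroni correction are all made up for the example.

```python
ALPHA = 0.05  # family-wise error rate assumed for this illustration

# Toy data (hypothetical): variant positions, gene start sites, and
# CRE intervals that 3D contact data are assumed to link to a gene.
variants = [("v1", 100), ("v2", 250), ("v3", 400), ("v4", 900)]
genes = {"geneA": 300, "geneB": 800}
cre_links = [((90, 120), "geneA"), ((380, 420), "geneB")]


def window_pairs(variants, genes, window):
    """Every variant-gene pair whose distance is within a fixed window."""
    return [
        (vid, gene)
        for vid, pos in variants
        for gene, start_site in genes.items()
        if abs(pos - start_site) <= window
    ]


def linked_pairs(variants, cre_links):
    """Only pairs where the variant falls inside a CRE linked to the gene."""
    return [
        (vid, gene)
        for vid, pos in variants
        for (start, end), gene in cre_links
        if start <= pos <= end
    ]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


all_tests = window_pairs(variants, genes, window=500)
cre_tests = linked_pairs(variants, cre_links)

print(len(all_tests), bonferroni_threshold(ALPHA, len(all_tests)))
print(len(cre_tests), bonferroni_threshold(ALPHA, len(cre_tests)))
```

In this toy case the window-based scan tests five pairs and the CRE-restricted scan tests two, so each remaining test faces a per-test threshold of 0.05/2 instead of 0.05/5. The same arithmetic also explains the reduced computational load: fewer pairs means fewer association models to fit.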
This page last reviewed on May 2, 2022