Big Data to Knowledge Program Resources
Centers of Excellence for Big Data Computing
The Big Data to Knowledge (BD2K) Centers of Excellence have developed new approaches, methods, software tools and related resources including publications, data standards, and educational resources to advance Big Data Science in their relevant biomedical area of focus.
Click on the names of the Centers in the list below for more information and direct links to the tools and resources that are available from each Center.
- Big Data for Discovery Science (BDDS)
- Researchers at the BDDS focus on proteomics, genomics, and images of cells and brains collected from patients and subjects across the globe. They enable detection of patterns, trends, and relationships among these data for the efficient large-scale analysis of biomedical data. BDDS offers a variety of tools for data management and processing, genetic association studies, statistical analysis, and utilities for existing frameworks. See the BDDS Tools page.
- BD2K-Library of Integrated Network-based Cellular Signatures Data Coordination and Integration Center (BD2K LINCS-DCIC)
- The BD2K-LINCS DCIC conducts data science research focused on responses of human cells and tissues to perturbing agents such as small molecules. The Center provides access to and analysis of these data by the broader biomedical research community. The Center also develops web-based tools and data standards for integrative data access, visualization, and analysis across the distributed LINCS and BD2K sites and other relevant data sources. See BD2K LINCS-DCIC Resource page.
- Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (CCD)
- CCD develops computational methods known as causal discovery algorithms that can be used to discover causal relationships from a combination of observational data, experimental data, and prior knowledge. CCD offers a software suite for causal discovery from large and complex biomedical data sets. See CCD Tools page.
- Center for Expanded Data Annotation and Retrieval (CEDAR)
- CEDAR is building new web-based technology to make it easier for biomedical scientists to author detailed metadata that describe their experiments completely, adhere to appropriate community-based standards, and incorporate controlled terms that facilitate interoperability with other online data sets. CEDAR provides a combination of tools to aid the community in producing optimal metadata. See CEDAR tools page.
- Center for Mobility Data Integration to Insight (Mobilize)
- The Mobilize Center is analyzing movement data from over 6 million individuals using a smartphone app, revealing new insights about physical activity levels around the world and the factors predictive of these activity levels. The Mobilize Center engages the community in mobility big data efforts and promotes the use of big data analytics in biomedical computational research through a number of resources including software, data sources, and publications. See the Mobilize Center resources page.
- Center for Predictive Computational Phenotyping (CPCP)
- CPCP aims to accelerate the impact of predictive modeling on clinical practice by developing computational and statistical methods and software for a range of computational phenotyping tasks, including extracting relevant phenotypes from complex data sources and predicting clinically important phenotypes before they are exhibited. See CPCP resources page.
- Center of Excellence for Mobile Sensor Data-to-Knowledge (MD2K)
- Researchers at MD2K develop tools to make it easier to gather, analyze, and interpret data from mobile and wearable sensors. The goal is to use those data to reliably quantify physical, biological, behavioral, social, and environmental factors that contribute to health and disease risk. Visit the MD2K page.
- ENIGMA Center for Worldwide Medicine, Imaging, and Genomics (ENIGMA)
- The ENIGMA Center develops computational methods for integration, clustering, and learning from complex biodata types to help identify factors that either resist or promote brain disease, or assist in the diagnosis and prognosis, as well as new mechanisms and drug targets for mental health care. Visit the “ENIGMA-Vis” site for tools to generate interactive plots from genome-wide association studies (GWAS) data and to estimate genetic similarity between user-uploaded datasets and ENIGMA data.
- Heart BD2K, a Community Effort to Translate Protein Data to Knowledge: An Integrated Platform (Heart BD2K)
- The goal of the Heart BD2K Center is to democratize data research to include non-computational scientists and individuals and to apply innovative global community-driven data integration and modeling methods to address challenges involved in the study of protein structure, function, and networks with a focus on cardiovascular research. See Heart BD2K products page.
- KnowEnG, a Scalable Knowledge Engine for Large-Scale Genomic Data (KnowEnG)
- The KnowEnG Center built a computational Knowledge Engine that uses data mining and machine learning techniques to obtain and combine gene function and gene interaction information from disparate genomic data sources. See KnowEnG platform.
- Patient-Centered Information Commons: Standardized Unification of Research Elements (PIC-SURE)
- BD2K Centers Coordination Center (BD2K CCC)
BD2K Centers Resources
Click on the links below to learn more about the resources and accomplishments from each BD2K Center.
- Summer 2017 Special Issue of Biomedical Computation Review
- BD2K Centers WOW stories
- BD2K Centers Top 5 Products
Highlights from all BD2K Centers
BD2K Training and Education
The Big Data to Knowledge (BD2K) Training activities were designed to improve big data skills of biomedical scientists and increase the number of biomedical data scientists. BD2K-funded grants have produced a number of educational resources to strengthen the role of data science in modern biomedical research.
NIH-funded biomedical data science training programs represent a broad range of degree programs, career-development paths, in-person workshops, virtual events, and other unique activities. The BD2K Training Coordination Center (TCC) helps promote and support training and educational activities across the collection of NIH-funded Big Data to Knowledge (BD2K) grants. Learn more about the TCC.
Click on items in the list below for links and descriptions of each resource produced by the TCC.
- Educational Resource Discovery Index (ERuDIte)
- Data Science “RoAD Trip” Program
- The Data Science Rotations for Advancing Discovery (RoAD-Trip) program fosters new collaborations among junior biomedical researchers and senior-level data scientists to address the challenge of translating complex data into new knowledge. Learn More
- The BD2K Guide to the Fundamentals of Data Science Series
- This virtual lecture series on data science features presentations from experts across the country covering the basics of data management, representation, computation, statistical inference, data modeling, and other topics relevant to “big data” in biomedicine. See archived lectures.
Resources available through the Training Coordination Center
BD2K training grants have produced a number of in-person courses, Massive Open Online Courses (MOOCs), workshops, summer training programs, and other activities, which can be accessed through the sunburst, an interactive display of NIH-funded biomedical data science training programs. Explore educational resources from the BD2K training grants through the sunburst.
BD2K Mentored Career Development Award in Biomedical Big Data Science for Clinicians and Doctorally Prepared Scientists (K01)
- Project Tycho: A repository for global health data in a standardized format that is compliant with FAIR guidelines. Project Tycho contains case counts for notifiable conditions for the United States and includes data for dengue-related conditions for 100 countries obtained from the World Health Organization and national health agencies.
- HashtagHealth: A resource that both addresses the dearth of neighborhood data and offers novel characterizations of neighborhoods. Neighborhood indicators include food themes, healthiness of food mentions, frequency of exercise/recreation mentions, metabolic intensity of physical activities, and happiness levels.
- genTB: An analysis tool for translational tuberculosis genomic data that offers a means for sharing, citing, and crediting tuberculosis data and metadata; prediction of resistance from genotype using a machine learning algorithm; geographic data mapping; and a user-friendly statistical analysis tool.
BD2K Open Educational Resources for Biomedical Big Data (R25)
- Oregon Health & Science University (OHSU) Educational Materials: A repository of advanced introductory materials for individuals seeking to learn more about data science to expand their research programs, explore future career paths into data science, and understand and apply knowledge of the application of BD2K concepts in their present jobs.
Resources available through BD2K Training Grants
The BD2K Centers also produced training and educational resources including courses, workshops, webinars, lecture series, summer internships and training programs. Visit the Centers pages below for additional information.
BD2K-Library of Integrated Network-based Cellular Signatures Data Coordination and Integration Center (DCIC)
The BD2K-LINCS DCIC delivers high-quality educational materials through the web, such as Massive Open Online Courses (MOOCs), as well as through mentoring, seminars, and symposia:
- Summer Research Training Program in Biomedical Big Data Science
- Big Data Science with the BD2K-LINCS DCIC MOOC
- Network Analysis in Systems Biology MOOC
- Programming for Big Data Biomedicine Course
- LINCS Data Science Research Webinars
- Big Data Biostatistics PhD Program
- Crowdsourcing Portal
- BD2K-LINCS DCIC YouTube Channel
Center for Causal Discovery (CCD)
The CCD offers courses, workshops, and lectures on causal relationships in big biomedical data:
- Summer Short Course and Datathon on Causal Discovery
- Course on Causality and Statistical Reasoning
- Distinguished Lecture Series
Center for Expanded Data Annotation and Retrieval (CEDAR)
The CEDAR center provides a list of educational resources for metadata training and offers tutorials on using the CEDAR software to create simple templates and metadata records.
The Mobilize Center
The Mobilize Center faculty have created a number of MOOCs and run workshops for individuals interested in data science:
- Mining Massive Datasets
- Statistical Learning
- Convex Optimization
- Rapid Biomedical Knowledge Base Construction from Unstructured Data Workshop
Center for Predictive Computational Phenotyping (CPCP)
The CPCP conducts training activities on data science, predictive models for biomedicine, and computational phenotyping for a broad set of audiences:
Mobile Sensor Data-to-Knowledge (MD2K)
MD2K offers an annual training program to help investigators develop the multidisciplinary skills needed to generate high-quality mHealth research and solutions. Lectures from past training programs, training videos, and webinars on biomedical applications are available on the MD2K website:
The KnowEnG center offers an online resource that hosts prototypes of educational games for teaching sequence alignment, dynamic programming, and phylogenetic tree reconstruction algorithms. Through an R25 program partnership with the University of Illinois Urbana-Champaign, Mayo Clinic, and Fisk University, the KnowEnG center provides underrepresented minority undergraduate students with curricular training and experience in Bioinformatics and Big Data.
- Bioinformatics Seminars
- Professional Development in Bioinformatics and Big Data
- Summer Research Experience
PIC-SURE trains the next generation of biomedical big data scientists through its Summer training program in Biomedical Informatics, and by offering data science and precision medicine graduate-level courses:
- Summer Institute in Biomedical Informatics
- Deep Learning 101 Course
- Data Science & Precision Medicine Courses
Training Resources from the BD2K Centers
Software & Analysis
Big Data to Knowledge (BD2K) supported the development of software tools and methods to tackle data management, transformation, and analysis challenges in areas of high need to the biomedical research community.
Click below for descriptions and direct links to the tools and methods developed under each area of high need.
The data compression/reduction awards are developing solutions for compressing files of various data types, from genomics to structural data.
- MMTF: A new compression format for large structural biology data files, the Macromolecular Transmission Format, enables 100-1000-fold speedup of interactive visualization of 3D structures over the internet.
- GTRAC, MetaCram, smallWig, and ChipWig: A suite of compression algorithms that can dramatically reduce the size of many of the common types of files (SAM, FASTQ, Wig) used in genome sequencing, metagenomics, RNA-seq, and ChIP-seq. Genomic data compression is improved by 10-100-fold using these tools.
- HaMMLET: Software that is able to improve detection of genomic copy number variants in array comparative genomic hybridization experiments.
- LinDen: A tool for constructing and compressing statistical epistasis networks from genome wide association studies. LinDen greatly increases the speed of a complete pairwise epistasis screen by reducing the number of statistical tests performed.
- Chopper: A MATLAB Toolbox used for retrieving Top-K proximities in large real world networks. Chopper yields asymptotically faster convergence in theory, and significantly reduced convergence times in practice.
- TruenoDB: A comprehensive manual for the TruenoDB distributed graph database system for biological networks.
Data Compression/Reduction
The data privacy awards are developing tools that allow multiple individuals to compute on restricted access data sets without removing the encryption.
- PrivaSeq: A toolbase for quantifying and analyzing the leakage of individual-characterizing information, which can be used to link phenotype datasets to genotype datasets and reveal sensitive information in linking attacks.
- PopMedNet: A scalable and extensible open-source informatics platform designed to facilitate the implementation and operation of distributed health data networks.
- PeerSMC: A web-browser-based tool that allows two or more parties to conduct secure multiparty computation.
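Secure multiparty computation, as mentioned for PeerSMC above, can be illustrated with additive secret sharing, one of its simplest building blocks. The sketch below is not PeerSMC's protocol or code; the function names, the modulus, and the three-party setup are assumptions chosen only to show how several parties can learn a total without any one of them seeing another's private value.

```python
import secrets

# Assumed modulus for illustration; all arithmetic is done modulo this prime.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a private value into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; any proper subset of them reveals nothing about the value."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Simulate each party sharing its value, then summing the shares it holds."""
    n = len(private_values)
    # Row i holds the shares produced by party i; column j is what party j receives.
    all_shares = [share(v, n) for v in private_values]
    # Each party adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in all_shares) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


if __name__ == "__main__":
    # Hypothetical case counts held privately by three sites.
    print(secure_sum([12, 30, 7]))
```

In a real deployment the shares would travel over authenticated channels between separate machines; here everything runs in one process purely to make the arithmetic visible.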
Data Privacy
The data provenance awards are generating tools to assign provenance information to biomedical datasets to improve reproducibility of these data for version tracking and citation.
- ProvCaRe: Provenance for Clinical Research and Healthcare (ProvCaRe) is a new framework ontology for data provenance in biomedical Big Data.
Data Provenance
The data visualization awards are making a wide range of large biomedical datasets easier to use and interpret, including brain scan imaging, geo-referenced data, health care systems dynamic data, and genomics data.
- GGV: The Geography of Genetic Variants (GGV) browser is a web services software implementation of EEMS. EEMS is a new and innovative method for visualizing and analyzing population genetics data and other such geo-tagged biomedical data.
- HSD ontology: A novel method for identifying and extracting healthcare systems dynamics (HSD) data, and for integrating these data with "traditional" electronic health record (EHR) data. HSD data take into account the dynamics of the healthcare system when interpreting medical records. (For example, the date when a patient developed a disease can be inferred from when they received a diagnosis, scheduled a doctor visit, or had tests ordered.)
- Caleydo Web: Caleydo Web is a suite of web-based methods and software tools designed to meet current needs for visualization and analysis of complex, heterogeneous biomedical data.
- Vials: Vials is a novel visual analysis tool for analyzing splicing patterns in RNA-seq data.
Data Visualization
The data wrangling awards are developing new methods and tools to improve the utility of big datasets by making them easier to share, integrate, and transform.
- IRRMC: The Integrated Resource for Reproducibility in Macromolecular Crystallography is a public database of X-ray crystallography data, which provides a method for cleaning, collecting, and providing metadata for raw X-ray diffraction datasets.
- Fitmunk: A new program for the automatic building of amino-acid side chains in protein crystal structures.
- MODMatcher: A computational approach to identify and correct sample labeling errors in the multiple types of molecular data that can be used in subsequent integrative analyses.
- ActMiR: A software tool that infers the activity of miRNAs from expression data of target genes.
- AutoEEG and MERCuRY: New methods to process EEG cohort datasets and clinical records, align epileptic events, and identify seizure onset patterns that are of direct impact to clinicians studying epilepsy.
- Mygene.info and myvariant.info: Open source, high-performance, and continuously-updated data application programming interfaces (APIs) for accessing comprehensive, structured gene and variant annotations. The integration of multiple information streams into a community platform for annotating gene and genetic variation data significantly reduces siloing and duplication of effort across multiple databases and their user communities.
- AsterixDB: A data management tool, developed primarily for HIV risk behavioral research, that enables ready access to and use of behavioral and other health-relevant data contained in social media streams.
- geQTL: A sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits.
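As a rough illustration of how a data API such as Mygene.info (listed above) is typically consumed, the sketch below builds a gene query URL. It assumes the public `v3/query` endpoint and its `q`, `fields`, and `species` parameters; the helper name `build_gene_query_url` and the example gene symbol are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed public query endpoint of the MyGene.info service.
MYGENE_QUERY_ENDPOINT = "https://mygene.info/v3/query"


def build_gene_query_url(term, fields=("symbol", "name"), species="human"):
    """Return a query URL asking for selected annotation fields of matching genes."""
    params = {
        "q": term,                   # free-text or fielded query, e.g. a gene symbol
        "fields": ",".join(fields),  # comma-separated list of fields to return
        "species": species,          # restrict hits to one species
    }
    return f"{MYGENE_QUERY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical lookup of the gene symbol CDK2.
    print(build_gene_query_url("CDK2"))
```

Fetching the resulting URL with any HTTP client should return JSON describing the matching genes; consult the service's own documentation for the response layout and usage limits.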
Data Wrangling
Forums for Integrative Phenomics
BD2K supported the development of community-based data and metadata standards. The Forums for Integrative Phenomics combines data across species to illuminate challenges in genomics, human health and disease.
- Phenotype Ontology Reconciliation Effort: A community effort that attempts to reconcile logical definitions across a number of important phenotype ontologies. The outcome of this effort will be an integrated ecosystem of phenotype ontologies that can be leveraged in clinical diagnostics and disease mechanism discovery in humans.
Interactive Digital Media & Crowdsourcing
Big Data to Knowledge (BD2K) supported the development of interactive media tools for analyzing biomedical data via crowdsourcing.
- Fold It: A revolutionary crowdsourcing computer game enabling users to contribute to important scientific research. Users solve puzzles to help researchers find out if humans' pattern-recognition and puzzle-solving abilities make them more efficient than existing computer programs at pattern-folding tasks, informing models of protein structure prediction.
- EyeWire II: A 3D puzzle game that allows players to solve puzzles of neuron configurations to help researchers map the brain.
- Quorum: A flexible gaming platform that will crowdsource the analysis of visual data, such as microscopic images or graphical charts, that are provided by research scientists.
- Cancer Crusade: A game in which players can help improve scientific understanding of combination therapies that fight cancer.
- GraphSpace: An easy-to-use web-based platform that collaborating research groups can use for storing, interacting with, and sharing networks.
- StarGEO: The Search Tag Analyze Resource for the Gene Expression Omnibus (STARGEO) project aims to crowdsource annotations of open genomics big data, allowing users to discover the functional genes and biological pathways that are defective in disease.
This page last reviewed on December 9, 2019