Frequently Asked Questions



FAQs for Kids First Data Sharing

1. Why is data sharing important for the Kids First program?

This program is Congressionally mandated to provide resources that will drive discovery in pediatric research (see Gabriella Miller Kids First Research Act). Datasets and resources generated by this program must be made as broadly shareable and accessible as possible while abiding by informed consent language and protecting participants.

In accordance with NIH’s mission and the Kids First program’s goals, increasing accessibility to data through broad sharing practices empowers researchers and accelerates scientific progress that can lead to improved diagnostic capabilities and targeted therapies.

2. What are the general benefits of data sharing?

Enables data generated for a given study(s) to be used to explore a wide range of additional research questions
Increases statistical power by combining separate datasets and increasing sample size
Allows validation of research results
Promotes innovation of methods and tools for research
Facilitates development of improved therapeutic and diagnostic strategies for patients

3. What is the National Institutes of Health (NIH) Genomic Data Sharing (GDS) policy?

Effective January 25, 2015, the NIH Genomic Data Sharing Policy (NOT-OD-14-124) replaced the NIH GWAS Data Sharing Policy (NOT-OD-07-088). Under terms and conditions consistent with the informed consent provided by individual participants, the GDS policy seeks to make genomic data broadly available to the research community in a timely manner. Information on the NIH Genomic Data Sharing Policy can be found on:

4. What is an Institutional Certification and what role does it play in genomic data sharing?

Individual consent forms signed by study participants are the legal foundation for how genomic data from enrolled participants can be shared through dbGaP. Institutional Certifications assure that:

The Institutional Certification is submitted to a Genomic Program Administrator (GPA) who uses this to register the study in dbGaP and generate a Data Use Certification (DUC). Data Use Limitations reflect the language of the consent form and not PI or IRB preferences. Secondary users and their supporting institution must agree to the conditions of the DUC when applying to access data (see “FAQs for accessing Kids First data” below).

5. What is the process for obtaining an Institutional Certification?

We suggest that applicant PIs obtain Institutional Certifications following these steps:

Download the current NIH Institutional Certification template from: https://osp.od.nih.gov/scientific-sharing/institutional-certifications.
Fill out the first page of the Institutional Certification to include the sites that would contribute samples for sequencing. One document can list multiple sites; alternatively, multiple Institutional Certifications, one for each site, can be submitted.
Provide the Institutional Certification to the IRB, or equivalent body, along with the participant consent forms for each site and any other pertinent information (e.g. protocols), to complete the second and third pages:
1. At the top of the second page, it is anticipated that the individual-level genomic data will be made available through controlled access.
2. The lower section of the second page addresses “genomic summary results (GSR).” This box is to be left unchecked unless unrestricted access to GSR is not permitted due to the study’s designation as “sensitive” by the institution. Please note that a “sensitive” designation is not anticipated to apply to current Kids First studies, and unrestricted access to GSR is anticipated to be appropriate for the majority of Kids First genomic datasets; therefore, GSR from Kids First data would not require controlled access. For additional information see “Update to NIH Management of Genomic Summary Results Access” (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-19-023.html).
On the third page, the IRB, or equivalent body, is to select the appropriate data use limitations (DULs) and/or DUL modifiers based on the language of each site’s consent form. Unless the intent of the consent form language is determined to prohibit specific uses of the data generated from the samples collected from the participants, it is expected that the dataset will be designated as “General Research Use (GRU)”. Please note that cohorts with data use limitations and/or modifiers that impede the ability to access, use, combine, or cross-analyze data will not be prioritized for sequencing by the Kids First program (e.g., datasets consented for disease-specific research only, datasets that require a letter of collaboration (“COL”), or datasets that require local “IRB” approval).
Finally, the Institutional Certification needs to be counter-signed by the applicant PI and the Institution Signing Official who is authorized to enter the institution into a legally binding contract and sign on behalf of the investigator who plans to submit the data to NIH, e.g. Dean, Vice President for Research.

An Institutional Certification must be provided with an application to the Kids First X01 sequencing opportunity; a Provisional Certification is acceptable if there is not adequate time to obtain a full Institutional Certification before submitting the application. However, approval to access the Kids First X01 sequencing capacity is conditional on the submission of a full Institutional Certification covering all samples to be submitted for sequencing. Cohort selection will be based, in part, on the Kids First program’s expectation for broad data sharing (i.e. General Research Use).

6. What are the genomic data sharing expectations for Kids First projects?

Consistent with the NIH Genomic Data Sharing Policy (NOT-OD-14-124), consent forms should contain language that reflects broad sharing of genomic data. Additionally, Kids First takes seriously its responsibility to ensure data can be broadly accessed, used, combined, and/or cross-analyzed across childhood cancer and structural birth defects. Projects that allow for the broadest level of sharing (i.e., “General Research Use” with no additional restrictions) will be prioritized for Kids First support (i.e., the X01 sequencing opportunity). The following data use consent groups and modifiers limit broad data access and impede the ability of the Kids First program to accomplish its goals.

Disease Specific Consent Group: When data use is restricted to a specific disease area, the data cannot be combined with a dataset with a different disease specific data use limitation. Combining and cross-analyzing datasets is a primary goal of Kids First, and therefore datasets that are consented for General Research Use and/or Health/Medical/Biomedical purposes will be prioritized over datasets restricted to Disease Specific use.
IRB modifier: With this box checked, the Requester must provide documentation of their local IRB’s approval for the proposed research when submitting a Data Access Request (DAR). We find that it is rare for consent language to include such a requirement and that this modifier is often included in error. As a reminder, when submitting a Data Access Request, every requester and their institution must agree to the terms of the Data Use Certification (DUC), which verifies that the requesting PI is accredited within the institution, the institution is aware of the project for which the PI is proposing to use the data, and the institution has all appropriate security measures in place to manage and maintain the controlled-access dataset(s) being retrieved. For a sample DUC, see: https://osp.od.nih.gov/wp-content/uploads/Model_DUC.pdf.
COL modifier: This box is checked when the consent form states that collaboration with the original/submitting investigator is required in order to use the dataset; therefore, the Requester must provide a collaboration agreement document in order to be approved for access to the dataset. This can limit the number of end-users who are able to use the dataset.

Please note that under the recent guidance, “Update to NIH Management of Genomic Summary Results Access (NOT-OD-19-023)”, it is anticipated that unrestricted access to Genomic Summary Results will be appropriate for the majority of Kids First genomic datasets (i.e., the new box on page 2 should remain unchecked).

7. Where can I find additional resources about genomic data sharing?

Please refer to the following resources for more information about Genomic Data Sharing, consent language, Institutional Certifications, and the dbGaP registration process:

NIH Office of Science Policy: NIH Genomic Data Sharing: https://osp.od.nih.gov/wp-content/uploads/NIH_GDS_Policy_Overview.pdf
NIH GDS Policy pdf, 4. Informed Consent and 5. Institutional Certification: https://osp.od.nih.gov/wp-content/uploads/NIH_GDS_Policy.pdf
NIH Guidance on Consent for Future Research Use and Broad Sharing of Human Genomic and Phenotypic Data Subject to the NIH Genomic Data Sharing Policy: https://sharing.nih.gov/sites/default/files/flmngr/NIH_Guidance_on_Elements_of_Consent_under_the_GDS_Policy_07-13-2015.pdf.
National Institutes of Health Points to Consider in Drafting Effective Data Use Limitation Statements (Institutional Certifications): https://osp.od.nih.gov/wp-content/uploads/NIH_PTC_in_Developing_DUL_Statements.pdf
NHGRI: The Informed Consent Resource: https://www.genome.gov/27565449/the-informed-consent-resource
Points to Consider for Institutions and Institutional Review Boards in Submission and Secondary Use of Human Genomic Data under the National Institutes of Health Genomic Data Sharing Policy: https://sharing.nih.gov/sites/default/files/flmngr/GDS_Points_to_Consider_for_Institutions_and_IRBs.pdf
Institutional Certification Template (note that Data Use Limitations (DULs) and modifiers must only be selected according to the language of the participant consent forms): https://osp.od.nih.gov/scientific-sharing/institutional-certifications
dbGaP Registration Flow Chart: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

8. Who can I contact for additional information and questions regarding data sharing?

Jaime M. Guidry Auvil, Ph.D.
Genomic Program Administrator (GPA)
Director, NCI Office of Data Sharing
NCIOfficeofDataSharing@mail.nih.gov

Vivian Ota Wang, Ph.D.
Kids First Data Access Committee (DAC) Chair
NCI Office of Data Sharing
KidsFirstDAC@nih.gov

General NIH Genomic Data Sharing questions: GDS@mail.nih.gov

dbGaP (NCBI) helpdesk: dbgap-sp-help@ncbi.nlm.nih.gov

FAQs for PAR-24-082

FAQs for the Discovery of the Genetic Basis of Childhood Cancers and of Structural Birth Defects: Gabriella Miller Kids First Pediatric Research Program (X01 Clinical Trial Not Allowed) (PAR-24-082) Funding Opportunity Announcement (FOA).

We encourage all applicants to view presentation slides and watch the recording from a previous pre-application webinar to help prepare their application.

1. What are some major features of PAR-24-082?

Supports whole genome sequencing (WGS) of existing cohorts to elucidate the genetic (germline or somatic) contribution to childhood cancers and the genetic etiology of structural birth defects.
Whole genome, exome, and transcriptome sequencing may be available for tumor or affected tissue when justified. With justification, complementary sequencing approaches, such as long-read sequencing, may also be proposed for a cohort or a subset of a cohort. Project design will be finalized in discussions among the X01 investigators, the sequencing centers, and NIH program staff.
Cohort participants must have given consent to allow sharing of individual-level sequence and relevant phenotypic data through an NIH-approved repository (see question 4 below). Cohort samples that have consents that allow for broad data sharing (i.e., for General Research Use) are of higher priority, and cohorts with data use limitations that impede program goals will not be prioritized (e.g., datasets consented for disease-specific research only, datasets that require a letter of collaboration (“COL”) for access, or datasets that require local IRB approval for access). For more information, please see our FAQs on data sharing (Genomic Data Sharing FAQ #6).
Cohorts proposed for sequencing must include a minimal amount of associated clinical and phenotypic data sufficient to enable association with genomic variants and analysis. Proposals with rich clinical and phenotypic data that can be shared to facilitate cross-disease research among the pediatric research community will be prioritized.
Investigators with small cohort sizes are encouraged to collaborate with other investigators and pool samples together to increase statistical power.
Investigators who have probands that have previously undergone WGS and who have unsequenced nucleic acids from their parents, siblings, tumor, and/or affected tissue are encouraged to apply to have those samples sequenced.
Kids First is requesting that sample, phenotype, family structure, and data sharing information for the proposed cohorts be provided as "Other Attachments." See question 3 below for a downloadable set of tables that can be used.
The Kids First Data Resource Center will receive and process sequence data generated under this FOA and make genomic and phenotypic data accessible to the research community to facilitate comparative analyses.
This list is not exhaustive. Applicants are strongly encouraged to read the announcement closely and to contact program staff in case of any questions.

2. How will X01 projects/cohorts be selected?

Investigators whose projects are selected for this opportunity will be notified by NIH Kids First program staff with the estimated number of samples approved for sequencing. Since there is no “award” associated with the X01 mechanism, X01 decisions are not finalized by an NIH Institute or Center (IC) Council. Rather, following initial peer review, recommended applications will receive a second level of review by the Common Fund and NIH staff involved in the Kids First Program, and decisions are approved by the NIH Gabriella Miller Kids First Working Group Co-Chairs (https://commonfund.nih.gov/kidsfirst/members). The following will be considered in making cohort selections:

Scientific and technical merit of the proposed project as determined by scientific peer review
Availability of funds
Relevance of the proposed project to program priorities
Value of incorporating the dataset into the Data Resource to empower research among the pediatric research community.
Compliance with resource sharing policies as appropriate and ability to broadly share and use data from the cohort in line with the goals of the program (i.e. combining and cross-analyzing genomic datasets). Kids First program staff reserves the right to not include cohorts that cannot be broadly shared or cross-analyzed with other Kids First datasets.
Program balance: Kids First seeks to ensure that a broad diversity of both childhood cancer studies and structural birth defects studies are well represented. Cohorts representing conditions not previously sequenced under Kids First may be prioritized.
Informative study design and sufficient clinical and phenotypic data.
Availability of samples in a timely manner.
Sample quality in terms of suitability for whole genome sequencing (as well as exome and transcriptome if applicable).

Approval to access the sequencing capacity is conditional on the submission of a completed Institutional Certification covering all samples to be submitted for sequencing. If the document does not meet the Kids First program's expectation for broad data sharing (i.e. General Research Use), another cohort with broader sharing may be selected instead. For more information, please see our FAQs on data sharing (Genomic Data Sharing FAQ #6).

3. What information is required as "Other Attachments"?

Kids First is asking for specific information to be summarized and included as attachments. This is described in the FOA under Section IV. Application and Submission Information under the subheading SF424(R&R) Other Project Information. Applicants must include:

Institutional Certification – Institutional Certifications specify the data use limitations and data use limitation modifiers, as determined by the institution’s IRB based on the informed consent agreed to by the participants.
- In order to obtain the Institutional Certification, you can submit a cover letter that explains the data sharing expectations of the Kids First program (to download cover letter click here), along with the current NIH Institutional Certification template (please leave DULs and DUL modifier blank for your IRB to fill out), consent forms, and any other pertinent information (protocols etc.) to your IRB.
- If the IRB has not completed its review and therefore the institution cannot attest to all of the elements of the formal Institutional Certification, a provisional Institutional Certification is acceptable but the applicant is asked to describe the anticipated data use limitations and data use limitation modifiers. For institutional and/or provisional certifications, please use the current template: https://osp.od.nih.gov/scientific-sharing/institutional-certifications.
Sample Information, including type (e.g., DNA, RNA), tissue source, fixation method (when appropriate), and other details. Please note that DNA from patient-derived cell lines will not be accepted due to the possible introduction of mutations that could confound the identification of disease-causing rare variants.
Description of clinical and phenotypic data that are available to be shared through the Kids First Data Resource. Applications that propose submitting rich phenotypic data sets will be looked upon favorably.
Optional – Family Structure or Pedigrees
Kids First has developed a downloadable table that applicants can use to summarize the samples, phenotype data, and data use limitations (if needed) for the proposed cohort. You can also use the DRC’s "Clinical Phenotype Data Elements" template to create a data dictionary describing the data that will be provided to the Kids First Data Resource Center (DRC) for sharing with the broader research community upon release of the dataset. While applicants are required to provide this information, the use of these templates is optional. Applicants may submit the required information in whatever format meets their individual purposes as long as it provides, at a minimum, the information requested in the FOA.

4. Do the cohorts have to be properly consented before applying for the X01?

Participants in cohorts selected under this FOA must have given consent to allow sharing of individual-level genome sequence and relevant phenotype data through dbGaP or other NIH-approved repositories. Applicants must provide documentation of this by submitting an Institutional Certification (or Provisional Certification with a description of anticipated data use limitations) that covers all sites and samples, as an attachment (see question 3 above).

Cohort samples that have consents allowing for broad data sharing (e.g. for General Research Use with no data use limitation modifiers) will be given highest priority. No funds will be provided for obtaining new consent for existing samples. Consent to re-contact participants for additional phenotyping or collection of additional samples is strongly encouraged. Applicants are required to describe any data use limitations.

For research teams planning to start recruiting cohorts and/or collecting samples for a future application to the X01 program, please see FAQs for Kids First Genomic Data Sharing for more information.

5. What biospecimen information and phenotype data elements are expected?

Certain biospecimen and clinical/phenotype data are expected in order to process and analyze datasets; however, deep phenotyping is preferred. For phenotype data, the following data elements are expected, where available:
sex, race, ethnicity, age at enrollment and/or diagnosis, diagnoses (e.g., type of birth defect, primary tumor type), phenotypes for affected cases and unaffected family members, vital status, age at last known vital status, clinical information, and family medical history (e.g., family history of cancer or birth defects).

For templates and additional resources related to information required or suggested for the cancer projects, visit: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/. You can also view the Kids First DRC's "Clinical Phenotype Data Element" spreadsheet, which describes the minimal expected data.

6. If investigators have already registered a project in dbGaP, and are seeking WGS through Kids First for samples from the same cohort, is a new Institutional Certification required?

As long as the Institutional Certification for the registered project complies with NIH Genomic Data Sharing policy and covers all of the participants whose samples will be sequenced through Kids First, a new certification is not required with the application. However, the Genomic Program Administrator (GPA) may ask for an Institutional Certification using the most recent NIH template (published on November 1, 2018: https://osp.od.nih.gov/scientific-sharing/institutional-certifications/), if needed, prior to registering the study in dbGaP.

7. Is it important to know the source of the DNA for samples being submitted for WGS through Kids First?

It is important to know the source of the DNA for samples provided to Kids First Sequencing Centers. We ask that applicants provide a description of the samples, such as collection site; number of samples included in the study; a detailed inventory of the sources of the DNA (e.g., number of samples from blood, number of samples from saliva); and previous genotyping or sequencing. DNA from fresh/frozen blood or tissue is ideal for sequencing, as DNA from saliva can be contaminated with microbial DNA, which may result in higher costs (and therefore reduce the number of total samples that can be sequenced). Cell lines will not be accepted because they often have significant genomic differences compared to the original germline which could complicate analysis. There are circumstances where studies might include induced pluripotent stem cells (iPSCs), but even then, a normal sample for comparison may be desirable.

8. What file types will be provided by the sequencing center?

The sequencing center will generate Variant Call Format (VCF) and BAM or CRAM files for genomic data. A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data. A CRAM is a compressed version of a BAM. FASTQ files may be provided for RNA transcriptome data.
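To make the SAM layout more concrete, the following is a minimal sketch in Python that splits one alignment line into the eleven mandatory tab-delimited fields defined by the SAM specification. The example read and the names `SamRecord` and `parse_sam_line` are made up for illustration and are not part of any Kids First pipeline. BAM and CRAM files are binary or compressed, so in practice they would be read with a dedicated tool or library rather than by hand.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    """The eleven mandatory columns of a SAM alignment line, in order."""
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines, which start with '@'."""
    if line.startswith("@"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-delimited fields")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10],
    )


# Hypothetical read, for illustration only.
example = "read1\t99\tchr1\t10468\t60\t8M\t=\t10600\t140\tACGTACGT\tFFFFFFFF"
record = parse_sam_line(example)
```

Optional tags that follow the eleventh column are ignored in this sketch.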

9. What is the role of the Kids First Data Resource Center (DRC)?

The goal of the Kids First Data Resource is to accelerate discovery of genetic etiology and shared biologic pathways by building a collection of curated genomic and phenotypic data from Kids First X01 projects and providing a central portal where these data and analysis tools will be readily accessible to the research community. The Kids First DRC is charged with re-processing and “harmonizing” sequence data generated by the sequencing centers and clinical and phenotypic data provided by X01 investigators to facilitate analyses across all Kids First datasets. Sequence data will be re-aligned and re-called with every iteration of the Human Genome Reference. X01 investigators will be able to access the data files generated by the sequencing center, as well as the harmonized version of the data generated by the Data Resource Center.

DRC activities and implementations form an integral part of the emerging landscape of the NIH’s data environments and support the establishment of cross-searchable pediatric data with shared common standards.

Additionally, the DRC roadmap includes building out a collaborative platform for integrating, distributing, and collaborating over higher level analyses. X01 investigators are encouraged to utilize the DRC’s resources and work collaboratively with the Kids First Data Resource Center.

X01 applicants can contact the Kids First Data Resource Center to inquire about current data processing procedures, tools, pipelines, and workspaces available through the data resource:

visit kidsfirstdrc.org, or
email support@kidsfirstdrc.org

10. It seems that no funds will be awarded to investigators but a detailed analytic plan is requested. Given that is the case, are investigators expected to obtain funds to support analysis separately?

There are no direct funds available under the X01 opportunity to support analysis of sequence data or other activities. The request for applicants to provide an analysis plan is intended to increase the likelihood that the samples to be sequenced are of high quality, that the number of specimens is appropriate for the stated aims, and that those submitting X01 applications will be prepared to do the analyses. Those investigators providing the samples are likely to have a significant advantage in conducting analyses, because they are familiar with the cohort, they will be interacting directly with NIH, sequencing centers, and the Kids First Data Resource Center throughout the process, and lastly, because each X01 investigator team has six months of proprietary access to the sequence data before it is released to the public for controlled access via dbGaP. To learn about funding opportunities for supporting data analysis see: Funding Opportunities to Support Data Analyses of Kids First Datasets.

11. Is it possible to submit an application with multiple PIs from different Institutions in order to build an adequate sample size or create a larger, more compelling cohort? Alternatively, is it possible to reach an adequate sample size by adding trios or families with a different childhood cancer or structural birth defect?

Efforts to increase sample number by collaborations across institutions are acceptable and encouraged. Strong justification for the proposed sample size is expected in each application. Increasing sample numbers by aggregating across related conditions is acceptable. However, applicants doing this should be prepared to provide a coherent description of the analyses that will be performed across the aggregated cohort, and it may be easier to do this for sets of samples with related phenotypes or suspected underlying pathways. In addition, investigators should state how aggregating samples won’t slow the process of sending samples to the Kids First Sequencing Center.

Applicants are also encouraged to partner with current X01 recipients to extend existing cohorts. For a full list of current X01 projects, visit: https://commonfund.nih.gov/kidsfirst/x01projects.

12. Is there also a maximum that will be considered? Our combined cohorts for example have nearly 5000 trios.

We encourage the submission of a large number of trios, but ask that the samples be organized into tranches that make analytic/scientific sense to provide flexibility in the review process. The available budget for sequencing services associated with this FOA allows for roughly 4,000 genomes total this cycle. Depending on the quality and number of applications received, the Kids First program management will determine how many total samples each X01 recipient will have approved for sequencing, while taking study design and sample size into consideration.

Additionally, applicants who propose sequencing large numbers of samples should describe their capacity and plan to prepare such a large number of samples for sequencing within the one-year timeframe.
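As a rough illustration of the arithmetic behind tranches, the sketch below assumes that each trio contributes three genomes (a proband and two parents) and treats the roughly 4,000-genome figure as a ceiling shared by all applicants in the cycle. The function name `split_into_tranches` and the tranche size of 500 are hypothetical; the actual allocation is decided by Kids First program management.

```python
from typing import List

GENOMES_PER_TRIO = 3         # assumption: proband plus two parents
CYCLE_BUDGET_GENOMES = 4000  # approximate program-wide figure quoted above


def split_into_tranches(total_trios: int, trios_per_tranche: int) -> List[int]:
    """Return the number of trios in each tranche; the last one may be smaller."""
    if trios_per_tranche <= 0:
        raise ValueError("trios_per_tranche must be positive")
    full, remainder = divmod(total_trios, trios_per_tranche)
    return [trios_per_tranche] * full + ([remainder] if remainder else [])


total_trios = 5000
genomes_requested = total_trios * GENOMES_PER_TRIO
# Upper bound on trios even if a single project received the whole budget.
max_trios_within_budget = CYCLE_BUDGET_GENOMES // GENOMES_PER_TRIO
tranches = split_into_tranches(total_trios, 500)
```

Under these assumptions, 5,000 trios correspond to 15,000 genomes, well above the cycle total, which is why grouping samples into scientifically meaningful tranches gives reviewers room to approve part of a cohort.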

13. Should we propose quality metrics for the genome sequencing?

No, this is not necessary. You should note the quality of the samples being proposed for submission.

14. Do applicants need to describe the capacity to store BAM files?

Applicants are encouraged to make use of the cloud-based workspace that will be provided by the Kids First Data Resource Center. Therefore, local download and processing of data may not be necessary for interacting with Kids First datasets. If your group plans to download data to a local server as part of the data management plan, it is important to make clear that your team has the capacity (including equipment, security infrastructure, and physical resources) at your institution to securely accept and store large data files. If your group plans to make use of cloud-based workspaces, please describe a plan for analyzing data in such spaces. For information about the DRC cloud-based workspaces, visit kidsfirstdrc.org/

Data may be stored/hosted on local or cloud-based platforms. For more information see “NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing (GDS) Policy”

15. Although the maximum project period is 1 year, could one propose to sequence 70 trios now and then add 50 trios next year after additional collections?

All samples must be extracted, properly consented, and ready to send off to the sequencing center shortly after the review date. Please refer to the FOA for a more detailed timeframe.

16. Who is responsible for data deposition?

The sequencing center is responsible for deposition of the sequence data into an NIH-approved data repository (e.g., dbGaP or the Kids First Data Resource). The study Principal Investigator will be responsible for directly submitting the clinical/phenotypic data to the Kids First Data Resource Center.

17. For tumor specimens, is there an opportunity for applying whole genome sequencing (WGS) to DNA extracted from formalin-fixed paraffin-embedded (FFPE) tissue?

Fresh frozen samples for tumors are preferred. However, proposals that include FFPE samples will be accepted. If such a proposal is successful in review, there may be technical issues to resolve before good results can be obtained.

18. What amount and concentration of DNA will be required and what will be the coverage?

Whole genome sequencing (WGS) of germline DNA will be done at 30X mean coverage using paired end sequencing. Depending on the sequencing center’s protocol, tumors may be sequenced at 60X or 30X mean coverage using paired end sequencing combined with whole exome sequencing (WES) and RNA sequencing both at 100X also using paired end sequencing. The NIH and sequencing center staff will work with each project to determine the best coverage and approach for sequencing and analysis of tumors and/or affected tissue.

Amount of DNA/RNA and coverage
Sequencing type	Amount of DNA or RNA required/recommended	Concentration	Coverage	Additional info
WGS (short-read)*	~2 µg DNA	20-50 ng/µl preferred	30X	paired-end reads
WES	275 ng DNA (minimum); 1 µg recommended	20 ng/µl (minimum)	100X, greater than 80% coding exons covered at 20X	paired-end reads
RNA-Seq	750 ng total RNA (minimum); 1 µg recommended	20 ng/µl (minimum)	100X, greater than 40% coding exons covered at 20X	paired-end reads

*Long-read sequencing technologies such as those offered by PacBio or Oxford Nanopore have specific requirements. Contact the sequencing center or program staff for more information.
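The amounts in the table can be checked against a sample's concentration and volume, since total mass is concentration multiplied by volume. The sketch below is one hypothetical way to do that in Python. The names `InputRequirement` and `check_sample` are illustrative, and treating the WGS figures ("~2 µg" and "20-50 ng/µl preferred") as hard minimums is an assumption made only for this example; the sequencing center's own requirements take precedence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InputRequirement:
    min_amount_ng: float
    min_concentration_ng_per_ul: float


# Values taken from the table above. The WGS row is "preferred", not a stated
# minimum, so using it as a threshold here is an assumption.
REQUIREMENTS: Dict[str, InputRequirement] = {
    "WGS": InputRequirement(2000.0, 20.0),
    "WES": InputRequirement(275.0, 20.0),
    "RNA-Seq": InputRequirement(750.0, 20.0),
}


def check_sample(assay: str, concentration_ng_per_ul: float, volume_ul: float) -> List[str]:
    """Return a list of problems; an empty list means the sample passes this check."""
    req = REQUIREMENTS[assay]
    total_ng = concentration_ng_per_ul * volume_ul  # mass = concentration x volume
    problems = []
    if total_ng < req.min_amount_ng:
        problems.append(f"total {total_ng:.0f} ng is below {req.min_amount_ng:.0f} ng")
    if concentration_ng_per_ul < req.min_concentration_ng_per_ul:
        problems.append(
            f"concentration {concentration_ng_per_ul} ng/ul is below "
            f"{req.min_concentration_ng_per_ul} ng/ul"
        )
    return problems


wes_ok = check_sample("WES", 25.0, 20.0)      # 500 ng at 25 ng/ul
rna_low = check_sample("RNA-Seq", 15.0, 40.0)  # 600 ng at 15 ng/ul
```

This check says nothing about sample quality or long-read requirements, which the footnote above leaves to the sequencing center or program staff.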

19. Can I propose alternative or complementary sequencing approaches, such as long-read whole genome sequencing?

Yes. Kids First sequencing centers are equipped to provide advanced sequencing technology; however, some approaches may require higher quality DNA or be more expensive than the typical Kids First pipeline (currently 30X WGS on the NovaSeq platform). With justification, applicants may propose alternative or complementary sequencing for their cohort, or a subset of their cohort, to further inform the value of these technologies for structural birth defects and childhood cancer research. However, this may reduce the number of total samples that can be sequenced for your project within the program’s limited funds. Project design will be finalized in discussions among the X01 investigators, the sequencing centers, and NIH program staff.

20. Are applicants expected to describe how results will be returned to study participants or how incidental findings will be reported?

Decisions about returning individual results and incidental findings to study participants lie with the institution and its IRB and are outlined in the consent form agreed to by participants. NIH does not require that Kids First X01 applicants describe a plan for return of results. Investigators and participants should keep in mind that the technology used to generate sequence data in this program is designed for research purposes, not for identifying clinical results. Communicating clinically meaningful results to participants requires sequencing and analysis by a CLIA-approved laboratory. Since the Kids First program is focused on research and discovery, CLIA sequencing is not provided.

21. Who should I contact for additional questions?

You can email Valerie Cotton at valerie.cotton@nih.gov for additional questions. Please use the subject line: “X01 inquiry.”
Potential applicants may also contact any Program Officer listed in the FOA.

FAQs for X01 Cohort Selection for Sequencing

1. What information should be included on the shipping manifest?

Please include the following information on the shipping manifest (column headers):

Participant ID
Sample ID
Aliquot ID
Composition
Tissue Type
Anatomical Site
Age at Collection
Tumor Descriptor
Analyte Type
Concentration
Volume

You may add any additional fields relevant to the biospecimens and/or the operational needs of shipping (e.g. well/box location).
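Before sending a manifest, it can help to confirm that every column header listed above is present. The following is a minimal sketch, assuming the manifest has been exported as CSV text with the headers in the first row. The function name is hypothetical, and the check covers header names only, not the values in each column.

```python
import csv
import io

# Column headers requested for the shipping manifest (from the list above).
REQUIRED_COLUMNS = [
    "Participant ID", "Sample ID", "Aliquot ID", "Composition", "Tissue Type",
    "Anatomical Site", "Age at Collection", "Tumor Descriptor", "Analyte Type",
    "Concentration", "Volume",
]


def missing_manifest_columns(csv_text):
    """Return the required headers that are absent from the manifest's first row.

    Matching ignores case and surrounding whitespace; extra columns are allowed.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip().lower() for name in header}
    return [col for col in REQUIRED_COLUMNS if col.lower() not in present]


if __name__ == "__main__":
    example = "Participant ID,Sample ID,Aliquot ID,Well Location\nP1,S1,A1,A01\n"
    for column in missing_manifest_columns(example):
        print(f"Missing column: {column}")
```

Additional fields such as a well or box location pass the check unchanged, since only the required headers are tested.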

You may download this spreadsheet as a starting point. Please contact the DRC for recommendations or questions.
Please include support@kidsfirstdrc.org and valerie.cotton@nih.gov when emailing shipping manifests to the sequencing center.

2. What clinical and phenotypic information do X01 investigators need to submit to the DRC in order to be approved for access to the genomic dataset?

It is expected that each X01 group will provide the clinical and phenotypic data described in the original X01 proposal to the DRC for sharing with the broader research community upon release of the dataset. Kids First strongly encourages the submission of detailed/deep clinical and phenotypic data, including longitudinal data and family histories. Please provide the information described in the "Clinical Phenotype Data Element" spreadsheet for the DRC and program staff to review.
The DRC will accept this information in another format, such as the REDCap dbGaP submission files, as long as all the necessary information is provided.

Please contact support@kidsfirstdrc.org to discuss the best format for submitting further information.

Upon receipt of the required information, Kids First NIH program staff will work with the DRC and/or sequencing centers to enable the X01 team to have access to the associated sequence data. Once the investigator team has access to the sequence data, they have six months of proprietary access before it is released to the public.

The DRC is working closely with investigators who have expertise in specific areas to address how to best capture clinical and phenotypic data moving forward. If you have an interest in engaging in this process or providing suggestions, please contact support@kidsfirstdrc.org or visit kidsfirstdrc.org.

3. How will the DRC harmonize phenotypes across Kids First projects? Which data ontologies will be used?

The DRC is leveraging existing community standards to harmonize clinical and phenotypic data, which facilitates searching, analysis, and interoperability with other data efforts. If you are currently collecting phenotypic data or working to map such data to existing standards, we suggest you use one of the following ontologies, since these are what the DRC plans to use for phenotype harmonization:

For structural birth defects: Human Phenotype Ontology (http://human-phenotype-ontology.github.io/)
For childhood cancers: NCI Thesaurus (https://ncit.nci.nih.gov)

Also recommended:

Uberon (https://github.com/obophenotype/uberon) for tissue/anatomy, including but not limited to tumors.
Monarch Disease Ontology (MONDO, http://obofoundry.org/ontology/mondo.html)
ICD-10 (http://www.who.int/classifications/icd/en/)

Other helpful resources:

Ontology Lookup Service: https://www.ebi.ac.uk/ols/index
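When mapping phenotypes to these ontologies, a quick format check on term identifiers can catch copy-and-paste errors before submission. The sketch below assumes the commonly used compact identifier forms (for example, HP: followed by seven digits, NCIT:C followed by digits). It checks format only and does not confirm that a term exists; the Ontology Lookup Service is the place to verify that. The function name is hypothetical.

```python
import re

# Assumed compact-identifier formats for the ontologies suggested above.
CURIE_PATTERNS = {
    "HP": re.compile(r"HP:\d{7}"),
    "NCIT": re.compile(r"NCIT:C\d+"),
    "UBERON": re.compile(r"UBERON:\d{7}"),
    "MONDO": re.compile(r"MONDO:\d{7}"),
}


def ontology_of(term_id):
    """Return the ontology prefix if term_id is well formed, otherwise None."""
    prefix = term_id.split(":", 1)[0]
    pattern = CURIE_PATTERNS.get(prefix)
    if pattern is not None and pattern.fullmatch(term_id):
        return prefix
    return None


if __name__ == "__main__":
    for term in ("HP:0001631", "NCIT:C3270", "HP:1631", "UBERON:0000955"):
        print(term, "->", ontology_of(term))
```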

4. What should X01 investigators include in their acknowledgement statement when publishing research findings from Kids First generated data?

In addition to listing the PHS Accession Number(s) of the datasets used for a particular analysis and the databases from which they are accessible to the research community, X01 investigator teams (i.e. “Contributing Investigator(s)”) are asked to describe support for the project, including NIH grant numbers.

A sample statement for the acknowledgment of Kids First dataset(s) follows:

The results analyzed and [published or shown] here are based in whole or in part upon data generated by Gabriella Miller Kids First Pediatric Research Program (Kids First) projects [PHS accession number(s)], and are accessible through the Kids First Data Resource Portal (kidsfirstdrc.org) and/or dbGaP (www.ncbi.nlm.nih.gov/gap). Kids First was supported by the Common Fund of the Office of the Director of the National Institutes of Health (www.commonfund.nih.gov/KidsFirst). The [sequencing center] was awarded a U24 ([grant number]) to sequence [childhood cancer and/or structural birth defect cohort samples] submitted by investigators through the Kids First program ([X01 grant number]). Additional funds from [funding source and grant number] supported the assembling of the cohorts, the collection of the phenotypic data and samples, and/or data analysis. Contributing investigators include: [names]*.
*If there are many collaborators/consortium members, you can use a ‘corporate authorship’ with a link to a website that lists everyone.

Kids First Sequencing Center Grants
Sequencing Center	Grant Number
BROAD INSTITUTE	U24 HD090743-01
HUDSON-ALPHA INSTITUTE FOR BIOTECHNOLOGY	U24 HD090744-01
BAYLOR COLLEGE OF MEDICINE	3U54HG003273-12S1
WASHINGTON UNIVERSITY	3U54HG003079-12S2

5. What should secondary users (a.k.a. “end-users” or approved data requestors) include in acknowledgement statements when publishing research findings from Kids First generated data?

Secondary users, or “end users”, must acknowledge all datasets used in a publication or analysis by listing all relevant dbGaP PHS Accession Numbers, as well as the URLs of the databases where the datasets were accessed. The Data Use Certification (DUC) agreed to by secondary users outlines how to use and acknowledge each approved dataset.
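When compiling accession numbers for an acknowledgement, a small parser can help catch malformed entries. The sketch below assumes the usual dbGaP study accession layout of "phs" plus six digits, a version (".v"), and a participant set (".p"). The class and function names are hypothetical, and the accession in the usage example is a made-up value, not a real study.

```python
import re
from typing import NamedTuple, Optional


class StudyAccession(NamedTuple):
    study: str            # e.g. "phs000000"
    version: int          # number after ".v"
    participant_set: int  # number after ".p"


# Assumed layout: phs + six digits, then .v<version>.p<participant set>.
_ACCESSION = re.compile(r"(phs\d{6})\.v(\d+)\.p(\d+)")


def parse_accession(text: str) -> Optional[StudyAccession]:
    """Parse a dbGaP-style study accession, or return None if it is malformed."""
    match = _ACCESSION.fullmatch(text.strip())
    if match is None:
        return None
    return StudyAccession(match.group(1), int(match.group(2)), int(match.group(3)))


if __name__ == "__main__":
    for candidate in ("phs000000.v2.p1", "phs00XXX.v1.p1"):
        print(candidate, "->", parse_accession(candidate))
```

Placeholder strings such as phs00XXX.v1.p1 are rejected, which makes it easy to spot template text left in a draft.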

6. Are there opportunities for collaborating with other efforts for functional validation of variants?

KOMP2 is receptive to considering genes of interest identified through X01 analyses as candidates for targeting by KOMP2 Centers. This includes reviewing the literature for existing models, prioritizing specific genes for generating new knockout mice, and mapping resulting phenotypes to animal model ontologies. All KOMP data are publicly available at www.mousephenotype.org. If you are interested in collaborating with KOMP, please contact: KidsFirstKomp@nih.gov
Investigators with projects that fall within the categorical interests of two or more NIH institutes/centers may also consider applying for ORIP’s R21 program announcement for Development of Animal Models and Related Biological Materials for Research (PA-16-141). Investigators considering applying to PA-16-141 are strongly encouraged to consult with ORIP program staff (see Scientific/Research Contacts in Section VII. Agency Contacts) to be advised whether their research plans are appropriate for this FOA.
Researchers interested in exploring the gene by environment interactions of conditions such as craniofacial diseases may be interested in these funding opportunity announcements:
- Mechanistic Studies of Gene-Environment Interplay in Dental, Oral, Craniofacial, and Other Diseases and Conditions (R01) (PAR-19-292)
- Development of Novel and Robust Systems for Mechanistic Studies of Gene-Environment Interplay in Dental, Oral, Craniofacial, and Other Diseases and Conditions (R21) (PAR-19-293).

7. What does the DRC do with the data they manage?

Among other activities described in RFA-RM-16-010, the Kids First DRC is charged with:

processing, structuring, and harmonizing Kids First and associated datasets to improve data FAIRness,
staging Kids First data in the Data Resource for sharing and use by the broader research community,
developing, testing, and deploying software, tools, workflows, pipelines, and resources associated with the platform, and
helping and training researchers to use the platform.

The DRC receives de-identified data from submitting investigators and sequencing centers to perform these operations. Members of the DRC are not permitted to use pre-release data for their own research or discovery efforts and are required to obtain dbGaP approval to pursue research activities with controlled-access datasets. Similar to end-users, DRC members are not permitted to re-distribute controlled-access data through other environments and they may not attempt to identify or contact individual study participants from whom data and/or samples were collected.

FAQs for Small Research Grants to Support Analyses of Gabriella Miller Kids First Pediatric Research Data

1. When is the next receipt date for this opportunity?

"Small Research Grants for Analyses of Gabriella Miller Kids First Pediatric Research Data (R03 Clinical Trial Not Allowed)” has been reissued and follows the standard receipt cycle, with the first receipt date after the “Open Date." Please note that occasionally reissuing FOAs may interfere with a standard receipt date; however, we try our best to avoid this.

2. Do I need to suggest a scientific review group or study section in my cover letter?

No. CSR will assign the study section, and program staff will request that “Kids First” R03 applications be reviewed in the same study section.

3. What are the data sharing expectations for this opportunity?

It is expected that data (including resultant raw, derived, aggregated, and summary data), tools, workflows, and/or pipelines created or used with support from this FOA will be provided to the Kids First Data Resource Center to be shared with the wider scientific community, if not already part of the Data Resource, in a timely manner that would enable other researchers to replicate and build on the analyses for future research efforts.

Applicants may contact the Kids First Data Resource Center (DRC) to learn more about how secondary data and analytical pipelines can be submitted:

visit kidsfirstdrc.org, or
email support@kidsfirstdrc.org

4. What should be included in a data sharing plan?

In the Data and Resource Sharing Plan, applicants should describe the anticipated timeline, formats, and methods of providing the data and other products used or created under this FOA to the Data Resource Center. Examples of resultant data types include variant call files from multi-sample comparisons, plots or graphs of variant associations, lists or tables of gene summaries, network/pathway analysis results, and other summary statistics. Where applicable, applicants should describe how they plan to share any analytical tools, pipelines, or workflows used or created through open access channels (e.g. public GitHub links).

Here are two Example Acceptable Data Sharing Plans for Kids First R03s:

Kids First Example A (external researcher analyzing Kids First X01 genomic datasets)
The proposed research will compare genomic data from two Gabriella Miller Kids First X01 datasets (phs00XXX.v1.p1 and phs00XXX.v2.p2) which will be accessed via the Kids First Data Resource Portal and analyzed in associated cloud-based workspaces after dbGaP approval. Data and documentation related to this analysis will be provided to the Kids First Data Resource Center (DRC) upon acceptance by a journal for publication or sooner, via secure cloud-based transfer and/or other sharing methods in consultation with the DRC. Submitted data types will include resultant multi-sample VCFs as well as spreadsheets of raw and derived input data which will be processed to yield resulting summary tables and graphs for publication. We will also provide documentation to explain the analytical approach and will use pipelines that have been previously published. Code and other information related to these pipelines are currently available at the following open access links: [insert public GitHub links]. We recognize that the DRC may then make these data and documentation available to the research community in line with the appropriate parameters and/or policies, such as data use limitations of the original datasets described in dbGaP. We will cite the dbGaP accession numbers of both genomic datasets used in this analysis in any associated presentations or publication.

Kids First Example B (Kids First X01 investigator analyzing deep phenotypic datasets)
The proposed research aims to analyze phenotypic data from a variety of cohorts that overlap with phenotypes that will be extracted from our Kids First X01 dataset, phs00XXX.v1.p1, but are currently not shared through the Kids First Data Resource. We will mine the Kids First Data Resource Portal for deep phenotypic data and compare those with deep phenotypic data elements collected from patients represented in our X01 dataset to guide priorities for extracting, organizing, curating, and harmonizing data elements that would enable phenotype comparisons across cohorts. The newly extracted phenotypic data will be mapped to Human Phenotype Ontology (https://hpo.jax.org/), where applicable, and all harmonized data will be provided immediately thereafter (estimate: 1 year after award start date) to the Data Resource Center to add to and strengthen the existing data already available in the Kids First Data Resource Portal. All data provided will be de-identified. We will develop tools to facilitate phenotype data mining, extraction, and harmonization procedures and these algorithms and associated metadata and data dictionaries will be provided to the Data Resource Center for sharing with the wider research community. We will mine all datasets available through the portal whose data use limitations allow comparing and combining with datasets of the same phenotype as our project, to include the following:

phs00XXX.v1.p1
phs00YYY.v2.p2
phs00ZZZ.v3.p3
[etc…]

All dbGaP accession numbers and/or other study identifiers will be cited in any associated publication.

5. Since the DRC is charged with processing Kids First data, can we work with them to develop an analysis plan?

Applicants are encouraged to communicate with the Kids First Data Resource Center to avoid redundant efforts and to collaborate on analytical approaches, tool development, or data sharing. To contact them:

visit kidsfirstdrc.org, or
email support@kidsfirstdrc.org

6. If I am proposing to use the R03 to analyze genomic data obtained outside of Kids First, what are the data sharing requirements?

For proposals that aim to co-analyze Kids First data with non-Kids First genomic datasets that are currently accessible through an NIH-approved repository (e.g., dbGaP) or some other public controlled access database (e.g., European Genome-phenome Archive), applicants must describe the database through which the proposed data are accessible to the research community and the details of the dataset including any data use limitations based on the associated consent form.

For proposals that aim to co-analyze Kids First data with non-Kids First genomic datasets that are not currently accessible through an NIH-approved repository (e.g., dbGaP) or some other public controlled access database (e.g., European Genome-phenome Archive), applicants must describe their ability and willingness to submit the individual-level sequence data to an NIH-approved repository (e.g., dbGaP) and provide an associated Institutional Certification using the current NIH template (https://osp.od.nih.gov/scientific-sharing/institutional-certifications/). If the Institutional Certification is not available, provide a Provisional Certification and describe the anticipated data use limitations and associated modifiers separately. If submitting a Provisional Certification with the application, please note that a completed Institutional Certification may be required prior to award. Note that an NIH Institute may consider whether and/or how the external genomic dataset can be shared with the broader research community in line with the goals of Kids First and the mission of the corresponding institute, before making a decision about funding the proposal.

7. Who can I contact for additional information?

Researchers may contact the Program Officer listed on the Notice of Award for the R03 grant, which is available through your NIH eRA user account, or the Scientific/Research Contacts listed in the FOA.

FAQs for Funding Opportunity Announcements to Support Analyses of Kids First Data

1. Aside from the Kids First X01 mechanism, what other funding opportunities are available to support analyses of Kids First datasets?

The NIH issued the following FOAs that may be of interest to the Kids First community:

NIH “Parent” R03
NIH “Parent” R01
NCI: Secondary Analysis and Integration of Existing Data to Elucidate the Genetic Architecture of Cancer Risk and Related Outcomes (Contact: Melissa Rotunno, Ph.D., rotunnom@mail.nih.gov)
- R01
- R21
NIDCR: Notice of Special Interest (NOSI) of NIDCR in Supporting Discovery, Characterization, and Mechanistic Study of Genetic Variants Underlying Dental, Oral, and Craniofacial Diseases and Conditions
NIDCR: Research Grants for Analyses of Existing Genomics Data (R01)
NIDCR: Small Research Grants for Analyses of Existing Genomics Data (R03)
OD: Pilot Projects Enhancing Utility and Usage of Common Fund Data Sets (R03 Clinical Trial Not Allowed)

2. What funding opportunities are available to support variant validation identified from Kids First datasets?

The NIH issued the following FOAs to support variant validation:

3. Are there other funding opportunities that could be relevant to Kids First researchers?

The NIH issued the following FOAs to support pediatric research:

FAQs for Accessing Kids First Data

1. Where can I access Kids First data?

Individual level sequence data (BAM/FASTQ/VCF files) and associated clinical/phenotype data and metadata generated for Kids First cohorts can be accessed through the Kids First Data Resource Portal (to learn more, visit https://kidsfirstdrc.org/resources/). Before accessing individual level genomic sequence data, you will need to submit a Data Access Request through dbGaP for approval from the NIH Kids First Data Access Committee (see FAQ #3 below).

Some genomic datasets from structural birth defect projects are currently stored in the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA), but all datasets can be accessed through the Kids First Data Resource Portal and require dbGaP approval.

2. When will Kids First data be publicly available?

Kids First X01 datasets are scheduled to be released to the public via dbGaP six months after the X01 investigator team receives access to the sequence data. Sometimes this “pre-release period” can be longer than six months due to procedural delays, but the data will not be released prior to six months unless an associated scientific paper has been accepted for publication or early release is specifically requested by the X01 PI.
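To estimate when a dataset is scheduled for release, add six calendar months to the date the X01 team received access. The sketch below is a simple date calculation with hypothetical function names. How a month-end date is handled (clamping to the last day of a shorter month) is an assumption, and as noted above, actual release can be later because of procedural delays or earlier on request.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of shorter months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def scheduled_release(access_date: date, pre_release_months: int = 6) -> date:
    """Earliest scheduled public release, given the date access was granted."""
    return add_months(access_date, pre_release_months)


if __name__ == "__main__":
    print(scheduled_release(date(2024, 1, 15)))
```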

Visit our X01 projects page to see projects that have been released and estimated release dates for pending projects: https://commonfund.nih.gov/kidsfirst/x01projects

3. How do I access Kids First data?

The first step is to find the Kids First data. You can find Kids First datasets at the following links:

Directly search Kids First and other interoperable datasets in the Kids First Data Resource Portal: https://portal.kidsfirstdrc.org
The Kids First X01 projects page lists all projects selected for sequencing (including those that have not yet been released): https://commonfund.nih.gov/kidsfirst/x01projects
Learn more at the Kids First DRC’s Studies & Access Page: https://www.notion.so/d3b/Studies-and-Access-a5d2f55a8b40461eac5bf32d9483e90f

The next step is to submit a Data Access Request (DAR) through dbGaP for each project: https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login. For Tips for Preparing a Successful Data Access Request, visit: https://sharing.nih.gov/sites/default/files/flmngr/Tips%20for%20Preparing%20a%20Successful%20DAR.pdf.

Secondary users and their supporting Institution’s Signing Official and IT Director must agree to the conditions of the Data Use Certification (sample agreement: https://osp.od.nih.gov/wp-content/uploads/Model_DUC.pdf), including any DULs or DUL modifiers pertinent to the requested dataset and the Genomic User Code of Conduct (https://dbgap.ncbi.nlm.nih.gov/aa/Code_of_Conduct.html).

All internal and external collaborators must be listed on the application, with the exception of technicians, graduate students, and postdoctoral fellows who are under the requestor’s direct supervision. External collaborators from other institutions are required to submit separate DAR(s) for approved access to the same dataset(s). The DAR(s) will be reviewed by the NIH Kids First Data Access Committee (DAC), which is run out of the NCI Office of Data Sharing.

To learn more about the dbGaP data access procedure, visit: https://dbgap.ncbi.nlm.nih.gov/aa/dbgap_request_process.pdf and watch a presentation about requesting access to genomic datasets through dbGaP at https://www.youtube.com/watch?v=39cba0gF2tw&t=3s.

Once you have dbGaP approval, you can learn how to push data from the Kids First Data Resource Portal to the CAVATICA analysis platform here: https://kidsfirstdrc.org/support/analyze-data/

4. Who can apply to access individual level sequence data from dbGaP?

For extramural researchers, the Principal Investigator (PI) must be a tenure-track professor, senior scientist, or equivalent to be able to submit a data access request (DAR), and must have a valid NIH eRA Commons account for logging in to the dbGaP system. Please see here for more about how to set up a new eRA Commons account or how to make changes to an existing eRA Commons account.

Data Standards Relevant to Kids First

To maximize comparisons across datasets or studies, facilitate data and platform integration, and foster collaboration, Kids First researchers are strongly encouraged to use standards and resources used within the Kids First program and other existing standards, where applicable:

1. What standards are preferred for facilitating interoperability across clinical and phenotypic data?

NIH encourages clinical research programs and researchers to adopt and use the standardized set of data classes, data elements, and associated vocabulary standards specified in the United States Core Data for Interoperability (USCDI) standards, as they are applicable (NOT-OD-20-146).

NIH encourages the use of data standards including, but not limited to the following:

Standard terminologies and ontologies, such as:
- Mondo Disease Ontology (mondo.monarchinitiative.org)
- Human Phenotype Ontology (hpo.jax.org)
- SNOMED (for conditions and many specialized terms; snomed.org)
- LOINC (for laboratory tests and other measurements; loinc.org)
- Disease Ontology (https://disease-ontology.org/)
- Ontology for Biomedical Investigation (OBI; http://obi-ontology.org/)
- RxNorm (for medications; https://www.nlm.nih.gov/research/umls/rxnorm/index.html)
- NCI Thesaurus (NCIt; https://ncithesaurus.nci.nih.gov/ncitbrowser/)
- Unified Medical Language System (UMLS; https://uts.nlm.nih.gov/uts/umls)
Common Data Models, such as:
- Observational Health Data Sciences and Informatics (or OHDSI, also known as “OMOP”; https://ohdsi.org/)
- PCORnet (https://pcornet.org/data/)
- i2b2 (Informatics for Integrating Biology & the Bedside; www.i2b2.org)
- Fast Healthcare Interoperability Resources (FHIR®) – see below
Common Data Elements, such as those available through the:
- caDSR (https://cdebrowser.nci.nih.gov/)
- PhenX Toolkit (www.phenxtoolkit.org)
- NIH PROMIS and NeuroQol (https://www.healthmeasures.net/)
- The NIH CDE Repository (cde.nlm.nih.gov)
For environmental and/or exposure data:
- Human Health Exposure Analysis Resource (HHEAR) (https://bioportal.bioontology.org/ontologies/HHEAR)
- Exposure Ontology, https://bioportal.bioontology.org/ontologies/EXO
- Environment Ontology: http://www.obofoundry.org/ontology/envo.html

2. What is FHIR and how is it used within Kids First?

NIH encourages researchers to explore the use of the HL7 FHIR® (Fast Healthcare Interoperability Resources) standard to capture, integrate, and exchange clinical data for research purposes and to enhance capabilities to share research data (NOT-OD-19-122). The FHIR® standard may be particularly useful in facilitating the flow of data with electronic health record (EHR)-based datasets, tools, and applications. The Kids First DRC is structuring all Kids First data in a FHIR-based data service to make it more compatible with EHR data and other FHIR-based tools and data models.

To learn more about using FHIR, visit:

FHIR 101 developed by the Kids First DRC: https://github.com/ncpi-fhir/fhir-10
Kids First FHIR Model: https://github.com/kids-first/kf-model-fhir
NIH Cloud Platforms Interoperability (NCPI) FHIR Implementation Guide: https://nih-ncpi.github.io/ncpi-fhir-ig/toc.html
HL7 FHIR Implementation Guide https://www.hl7.org/fhir/implementationguide.html
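To give a sense of what FHIR-structured data looks like, the sketch below builds two generic FHIR resources as JSON: a Patient and a Condition that points to it and is coded with a Human Phenotype Ontology term. This is a simplified illustration using base FHIR resource fields, not the Kids First FHIR model. The identifiers are made up, the helper names are hypothetical, and the code system URI used for HPO is an assumption that should be checked against the implementation guides listed above.

```python
import json

# Assumed code system URI for HPO; confirm against the relevant implementation guide.
HPO_SYSTEM = "http://purl.obolibrary.org/obo/hp.owl"


def make_patient(patient_id, gender):
    """Build a minimal, de-identified FHIR Patient resource as a dict."""
    return {"resourceType": "Patient", "id": patient_id, "gender": gender}


def make_condition(condition_id, patient_id, hpo_code, display):
    """Build a minimal FHIR Condition that references a Patient by id."""
    return {
        "resourceType": "Condition",
        "id": condition_id,
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [{"system": HPO_SYSTEM, "code": hpo_code, "display": display}],
            "text": display,
        },
    }


if __name__ == "__main__":
    patient = make_patient("example-participant-1", "female")
    condition = make_condition(
        "example-condition-1", patient["id"], "HP:0001631", "Atrial septal defect"
    )
    print(json.dumps([patient, condition], indent=2))
```

Keeping the phenotype as a coded term rather than free text is what lets FHIR-based tools search and compare records across datasets.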

3. Which workflows are used by the Kids First DRC to promote genomic harmonization?

DRC Alignment Workflow: https://github.com/kids-first/kf-alignment-workflow
DRC Genomic Harmonization & Joint Trio Calling Workflow: https://github.com/kids-first/kf-jointgenotyping-workflow
DRC Somatic Variant Calling Workflow: https://github.com/kids-first/kf-somatic-workflow
Dockerized pipelines: https://hub.docker.com/u/kfdrc/
Additional DRC workflows and documentation: https://github.com/kids-first

4. Which reference genome does the Kids First DRC use for sequence read alignment?

GRCh38/hg38, see: http://genomereference.org

5. What GA4GH standards have been adopted by the Kids First DRC?

Common Workflow Language (CWL; https://www.commonwl.org/) for describing and running tools and workflows in CAVATICA
CRAM (https://www.ga4gh.org/cram/) for generating and storing aligned read data files from whole genome sequencing
Data Repository Service (DRS; https://github.com/ga4gh/data-repository-service-schemas) for providing an interface for tools and workspaces to access data in the cloud
NIH encourages the implementation of the NIH Researcher Auth Service (RAS; https://auth.nih.gov/docs/RAS/) across all data commons within the NIH such that each user can access any data they are authorized to in a seamless manner. RAS will be implemented in a two-phase approach while working towards a larger vision of federated data access utilizing GA4GH AAI.
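As an illustration of how the Data Repository Service listed above lets tools locate data, a hostname-based DRS URI of the form drs://hostname/id corresponds to an HTTPS endpoint under /ga4gh/drs/v1/objects/ on that host. The sketch below performs only that conversion. The hostname and object ID are made up, the function name is hypothetical, and compact-identifier-style DRS URIs, authentication, and the actual request are out of scope.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_https(drs_uri: str) -> str:
    """Convert a hostname-based DRS URI to the URL of its object metadata."""
    parts = urlsplit(drs_uri)
    object_id = parts.path.lstrip("/")
    if parts.scheme != "drs" or not parts.netloc or not object_id:
        raise ValueError(f"Not a hostname-based DRS URI: {drs_uri!r}")
    # Percent-encode the ID so it is safe to place in a URL path segment.
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


if __name__ == "__main__":
    print(drs_uri_to_https("drs://drs.example.org/abc-123"))
```

A client would then request that URL with the appropriate credentials to retrieve the object's metadata and access methods.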

6. What other standards and resources may be relevant to Kids First?

The “NIH Cloud-Based Platforms Interoperability” (NCPI) effort is a collaboration among NHGRI AnVIL, NHLBI BioData Catalyst, NCI Cancer Research Data Commons (CRDC), and the Common Fund-supported Kids First Data Resource Center. The goal is to enable and promote end-user analyses across these platforms through federation and interoperability. To learn more about the technical standards being adopted in NCPI visit: https://datascience.nih.gov/nih-cloud-platform-interoperability-effort

ClinVar, database of genomic variation and its relationship to human health: https://www.ncbi.nlm.nih.gov/clinvar/
uPheno, the unified cross-species phenotype ontology: https://www.ebi.ac.uk/ols/ontologies/upheno
Open mHealth Schemas for Mobile Health Applications: https://www.openmhealth.org/documentation/#/schema-docs/schema-library
The Drosophila RNAi Screening Center’s Integrative Ortholog Prediction Tool (DIOPT) for model organism orthology: http://www.flyrnai.org/diopt
Brain Imaging Data Structure (BIDS) for neuroimaging analysis: https://bids.neuroimaging.io/
The Cancer Imaging Archive (TCIA) standards for image de-identification: https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview
Integrating the Healthcare Enterprise (IHE) Technical Framework: https://www.ihe.net/resources/

FAQs for Using the Kids First Data Resource

1. What is the Kids First Data Resource?

The Gabriella Miller Kids First Pediatric Data Resource (Kids First Data Resource) is a cloud-based platform designed to empower and accelerate collaborative research in pediatric cancer and structural birth defects. The Resource is built of multiple components, including a portal for querying and workspaces for analysis. The Resource is still growing and new functionalities and integrations are in development. To view past webinars including demonstrations and presentations of the Data Resource, visit the Kids First DRC YouTube channel.

2. How are controlled-access Kids First data distributed in the cloud?

The Kids First DRC collaborates with the University of Chicago’s Bionimbus, which is a National Institutes of Health (NIH) Trusted Partner (NCI's Protected Data Cloud, Leidos Biomedical Research Contract 16X063), to implement Framework Services that meet all required core NIH standards & established data quality, security, and service protocols for distributing controlled-access data.

Users need to use their eRA Commons account to log in to the Kids First DRC Framework Services (powered by the Bionimbus instance of Gen3) to work with controlled access data, after obtaining dbGaP approval.

Users can then use CAVATICA or other cloud-based resources to analyze Kids First data in alignment with NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy and the institution’s own IT security requirements and policies, per the dbGaP data access request process.

3. What is CAVATICA?

Seven Bridges CAVATICA is a cloud-based data analysis ecosystem that integrates directly with the Bionimbus/Gen3 Trusted Partner and other data access frameworks. Researchers can use CAVATICA to analyze controlled-access datasets from Kids First and other data efforts, and can also upload their own data for analysis in CAVATICA.

CAVATICA also integrates directly with the Kids First Data Resource Portal, where researchers can search and select Kids First and other pediatric data files they are interested in analyzing and “push” them to a CAVATICA workspace, after obtaining the appropriate approvals (e.g., dbGaP). More details about how to search for files using the Kids First Portal and push them to CAVATICA are found in the Kids First DRC Help Center.

4. Can I collaborate on and share data in CAVATICA?

CAVATICA includes collaborative workspaces called “projects”. Projects serve as access-controlled containers for data files, analysis workflows, and results. Researchers can work independently in projects (keeping all data, tools, and analysis private) or work alongside collaborators, controlling both project access and individual permissions within the project. These collaborative capabilities are especially empowering when groups of individuals with diverse expertise are working together, or when large working groups are formed to collaboratively analyze a dataset.

Users can typically share their own data if they have the authority to do so by adding collaborators and setting permissions, as needed. However, access to NIH controlled-access datasets requires approved dbGaP data access requests that cover all users involved in the project. Note that according to the terms of the NIH Genomic Data Sharing Data Use Certification (DUC), end-users do not have the authority to share or re-distribute NIH controlled-access data. Each individual is responsible for complying with the appropriate terms of data access and use (e.g., DUC terms).

5. How is data managed on CAVATICA?

Researchers can push Kids First data from the Portal to CAVATICA for further exploration and analysis. Importantly, this data remains managed and stored on Bionimbus. Fully functional aliases are created in CAVATICA, but the data is not copied. Multi-cloud functionality brings analysis to the data such that it can stream from Bionimbus or other Data Management Centers into analysis without delay or egress.

CAVATICA also includes private, secure storage of user data and multiple tools to copy private data to the system if needed. These tools range from command line interfaces, to FTP/HTTP integration, to web or desktop GUIs. Private data on AWS or Google Cloud storage can also be connected to CAVATICA, again creating aliases without duplicating data. These external buckets can also serve as an integration point to further increase interoperability with other systems. Relevant phenotype, clinical, and demographic information can also be imported to CAVATICA independently in a format that is more amenable to further analysis (e.g., .csv, dataframe). This flexibility is crucial to enable researchers to analyze Kids First data and their own data, and to harmonize across data sets where appropriate.

6. What analysis is possible on CAVATICA?

The CAVATICA platform features scalable and optimizable compute that serves both development and production analysis needs. Development includes prototyping tools or algorithms, statistical analysis, and exploring data using Interactive Analysis, including RStudio and JupyterLab. Production includes optimized scaling and execution of containerized tools and workflows (pipelines) for one or thousands of samples. The Kids First Data Resource Center harmonized over 10,000 WGS and 1,000 RNA-Seq samples across 12 study cohorts, at a median cost of $15.61 for a single WGS workflow based on GATK Best Practices. This optimization was a 3x speed improvement and reduced cost by 50%. The total size of the dataset processed was 1.2 PB.
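As a rough illustration of what the figures above imply, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 50% reduction applies to the quoted $15.61 median, and it treats the median as a typical per-sample cost, which the text does not state. The function names are hypothetical.

```python
# Figures quoted in the text above.
WGS_SAMPLES = 10_000              # "over 10,000 WGS" samples, used as a lower bound
MEDIAN_COST_PER_WGS_USD = 15.61   # median cost of a single WGS workflow
COST_REDUCTION = 0.50             # "reduced cost by 50%"


def implied_pre_optimization_cost(current_cost: float, reduction: float) -> float:
    """Cost before a fractional reduction was applied."""
    return current_cost / (1.0 - reduction)


def rough_total(samples: int, cost_per_sample: float) -> float:
    """Rough total, treating the median as a typical per-sample cost (an assumption)."""
    return samples * cost_per_sample


before = implied_pre_optimization_cost(MEDIAN_COST_PER_WGS_USD, COST_REDUCTION)
total = rough_total(WGS_SAMPLES, MEDIAN_COST_PER_WGS_USD)

print(f"Implied per-sample cost before optimization: ${before:.2f}")
print(f"Rough total for {WGS_SAMPLES:,} WGS samples: ${total:,.2f}")
```

Because a median is not a mean, the total should be read as an order-of-magnitude estimate rather than an actual spend figure.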

Interactive Analysis is fully integrated within each project in the Data Cruncher feature. Data Cruncher was launched in 2017 and has been continuously developed to meet user needs, including cost capping and prediction, collaboration within projects, auditing, and pre-configured packages. JupyterLab and RStudio are supported. Data Cruncher enables data scientists and analysts to bring their favorite analysis and visualization tools from Python, R, or Julia to further explore imaging, single-cell, and tabular data (e.g., VCF, RNA-Seq count files), as well as any other file types. The system stores Jupyter notebooks and RStudio while providing task tracking and version control so it is easy to reproduce results. Users can choose from popular data analysis libraries or install any libraries via the console. Users can also launch an environment optimized for machine learning and execution on GPU instances. Finally, users can create notebooks, upload existing notebooks, or integrate with a version control system.

Tools and workflows used in production analysis are all described in Common Workflow Language (CWL), an open-source, community-driven specification and standard for describing how to run computational analysis with command line tools in short, human- and machine-readable text files. Seven Bridges has contributed to developing this standard alongside academic and industry leaders since 2014. CWL has multiple advantages that are particularly impactful for analyzing diverse data types or across omics. CWL provides reproducibility in analysis by tracking the code version, inputs, and outputs. This verbose specificity makes CWL inherently modular, making it straightforward to adapt workflows to incorporate optimized tools. Derived outputs inherit information from the workflow, and this additional information is critical for co-analysis and data provenance.
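To give a sense of what a CWL description looks like, the sketch below builds a minimal CommandLineTool as a Python dictionary and prints it as JSON (CWL documents are YAML, and JSON is valid YAML). The tool itself, a line counter wrapping `wc -l`, and the names `input_file` and `line_count` are hypothetical examples, not apps from the CAVATICA Public Apps gallery.

```python
import json

# Hypothetical example tool: counts the lines of a text file with `wc -l`.
line_count_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    # Declared inputs: one file, passed as the first positional argument.
    "inputs": {
        "input_file": {
            "type": "File",
            "inputBinding": {"position": 1},
        }
    },
    # Capture standard output into a named file and expose it as an output.
    "stdout": "line_count.txt",
    "outputs": {
        "line_count": {"type": "stdout"}
    },
}

cwl_text = json.dumps(line_count_tool, indent=2)
print(cwl_text)
```

The explicit declaration of the command, its inputs, and its outputs is what makes such a description portable and easy to recombine into larger workflows.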

Researchers can configure and run production analysis via the GUI or API. The interface guides users through a simplified point-and-click setup, which lowers the barrier for researchers with less computational experience; the API provides functionality for more experienced users. Analysis is tracked as a Task. Users see Task progression as the instance is queued up, initialized, and running, and receive email notifications upon Task completion. If the analysis fails, the user can easily modify inputs and parameters then re-run. Users can also monitor the job activity in real time while the instance is active, seeing information like instance metrics and standard output and error streams. This aids the researcher in optimizing their code and troubleshooting. 

For analyzing many samples at once, CAVATICA offers Batch processing. Users can batch by File and have a separate Task created for each file input, or batch by File metadata and have a separate Task created for each specified metadata value. Analyses can also be configured to parallelize on an instance or ‘Scatter’, such that the same tool runs multiple times on one instance to take full advantage of the computational capacity. CWL tools/workflows can also be run using the API, allowing users to simultaneously launch thousands of analyses. All Task configurations and results files are stored indefinitely on CAVATICA. Users can also see the cost of each analysis, which can be used to anticipate future costs for similar types of analyses.
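The difference between the two batching modes can be illustrated with a small, self-contained Python sketch. It is a conceptual model only and does not use the CAVATICA API; the `InputFile` class, the function names, and the `sample_id` metadata field are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InputFile:
    """Hypothetical stand-in for a project file with attached metadata."""
    name: str
    metadata: dict = field(default_factory=dict)


def batch_by_file(files):
    """Batch by File: one task per input file."""
    return [[f] for f in files]


def batch_by_metadata(files, key):
    """Batch by File metadata: one task per distinct value of `key`."""
    groups = defaultdict(list)
    for f in files:
        groups[f.metadata.get(key)].append(f)
    return dict(groups)


files = [
    InputFile("s1_R1.fastq", {"sample_id": "S1"}),
    InputFile("s1_R2.fastq", {"sample_id": "S1"}),
    InputFile("s2_R1.fastq", {"sample_id": "S2"}),
]

per_file_tasks = batch_by_file(files)
per_sample_tasks = batch_by_metadata(files, "sample_id")

print(len(per_file_tasks))     # one task per file
print(len(per_sample_tasks))   # one task per sample_id value
```

Batching by metadata is the natural choice when several files belong together, such as paired-end reads from the same sample.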

7. How do I bring tools/pipelines/workflows onto CAVATICA?

The Seven Bridges bioinformatics team has optimized and deeply documented over 500 commonly used bioinformatics tools available as Public Apps. These tools and workflows are deeply optimized, frequently leading to over 50% cost reduction. Representative tools include utility tools like BCFtools and popular alignment and variant calling tools like the Broad’s GATK. There are multiple end-to-end workflows suited for a variety of multi-omics applications:

Proteomics: tools such as TransProteomicsPipeline, X!Tandem, Comet, MSGF+.
Transcriptomics: multiple gold standard expression and pathway analysis tools such as Salmon, Kallisto, DESeq2, Seurat, and GSEA.
Population genomics: PLINK, the GENESIS package, EPACTS, and popular PheWAS tools.
Single cell and spatial transcriptomics
Epigenomics: methylation analyses using tools such as Bismark and methylKit, among others.
Microbiome: HUMAnN2, QIIME2, and Centrifuge.

The Kids First DRC bioinformatics teams can help researchers create new optimized CWL tools and workflows and/or share their own tools.

Researchers also have complete flexibility to create their own CWL tools or modify existing tools for customized analysis. The Software Development Kit consists of a Tool Editor for describing command-line tools in CWL, and a Workflow Editor for enabling rapid assembly of tools into a workflow. The Tool and Workflow Editors are part of Rabix Web Composer - a cloud-optimized CWL integrated development environment. The Rabix Web Composer is easy to use and offers a graphical visual editor that reduces the learning curve for less computationally savvy users. There is also a text-based code editor for users who prefer writing in raw JSON. The Rabix Web Composer features workflow sharing and version tracking, for enhanced reproducibility and portability when sharing with collaborators. The Rabix Web Composer has been widely adopted; for example, since 2015, users on the Cancer Genomics Cloud have created 66,871 tools.

8. Is CAVATICA secure?

Yes. CAVATICA is powered by the Seven Bridges Core Infrastructure, which meets and exceeds the NIH Security Best Practices for Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy on both Amazon Web Services (AWS) and the Google Cloud Platform (GCP). Please see the Seven Bridges Compliance White Paper and Keeping Genomic Data Safe on the Cloud for a full description of CAVATICA's security and compliance features.

9. Do I need a personal AWS or GCP account to use CAVATICA?

No. CAVATICA includes a comprehensive billing system that covers all aspects of data storage and use regardless of cloud provider, region, or type of analysis. The billing system tracks the costs of individual analyses and cloud storage through billing groups, which can be set up for individual users or groups of collaborators. Seven Bridges can invoice the owner of the billing group using either a purchase order number or credit card. Members of the billing group can view their cumulative compute and storage costs as well as the costs of individual analyses. To help with cost control, billing groups have a customizable maximum spending cap, and CAVATICA will prevent further analyses from being run while the balance exceeds the cap.

The billing system can also be integrated with cloud credit programs. For example, the NIH Kids First program is supporting a cloud credit distribution pilot in collaboration with STRIDES.
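The spending-cap behavior described above can be pictured with a short Python sketch. This is a conceptual model under stated assumptions, not the platform's billing implementation: the `BillingGroup` class, its fields, and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BillingGroup:
    """Hypothetical model of a billing group with a maximum spending cap."""
    name: str
    spending_cap_usd: float
    spent_usd: float = 0.0

    def record_cost(self, cost_usd: float) -> None:
        """Add the cost of a finished analysis or storage charge."""
        self.spent_usd += cost_usd

    def can_start_analysis(self) -> bool:
        """New analyses are blocked while the balance exceeds the cap."""
        return self.spent_usd <= self.spending_cap_usd


group = BillingGroup(name="example-lab", spending_cap_usd=100.0)
group.record_cost(60.0)
print(group.can_start_analysis())   # balance is still under the cap

group.record_cost(50.0)
print(group.can_start_analysis())   # balance now exceeds the cap
```

In this model, raising the cap or reducing the balance would allow analyses to start again, mirroring the "while the balance exceeds the cap" wording.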

10. Can I bring a personal AWS or GCP account to pay for compute or usage on CAVATICA?

No. The CAVATICA architecture does not permit billing integration of user or institutional cloud service provider accounts.

11. How much does it cost to analyze Kids First data in CAVATICA?

Kids First data are stored in an AWS environment and the costs of this storage are covered by the NIH. However, analyzing or computing on Kids First data in CAVATICA requires establishing a billing account and contributing funds to it. Additionally, storing output files or uploaded data in CAVATICA workspaces also comes with costs.

Estimating, optimizing, and bounding compute costs resonates with the Kids First and larger research community and is important for creating a more sustainable system that ultimately leads to more scientific outcomes. All tools, workflows, and pipelines in the Public Apps already provide basic benchmarking information to estimate costs of common configurations. Seven Bridges is also working on more robust modeling, but the solution space is extremely large (high dimensionality, diverse data types/sizes) and constantly changing. As more useful predictive models become available, Seven Bridges will share them.

Seven Bridges sets Spot/preemptible instances by default on AWS and Google, which can lead to up to 80% in cost savings. To ensure usability is not degraded on these cheaper preemptible instances, the platform provides robust restart and memoization functionalities.
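The general idea behind restart with memoization can be sketched as follows. This is a simplified illustration of the technique, not a description of how the platform implements it: the `Preempted` exception, the `run_workflow` function, and the three step names are all hypothetical.

```python
class Preempted(Exception):
    """Raised to simulate a Spot/preemptible instance being reclaimed."""


def run_workflow(steps, cache, preempt_at=None):
    """Run named steps in order, reusing results cached by earlier attempts.

    Returns the final result and the list of steps actually executed.
    """
    executed = []
    result = None
    for name, func in steps:
        if name in cache:
            result = cache[name]      # memoized: skip recomputation
            continue
        if name == preempt_at:
            raise Preempted(name)     # instance reclaimed before this step ran
        result = func(result)
        cache[name] = result          # persist the result for later restarts
        executed.append(name)
    return result, executed


steps = [
    ("align", lambda _: "aligned"),
    ("call_variants", lambda prev: prev + "+variants"),
    ("annotate", lambda prev: prev + "+annotated"),
]

cache = {}
try:
    run_workflow(steps, cache, preempt_at="call_variants")
except Preempted:
    pass  # the first attempt was interrupted after "align" completed

final, executed_on_restart = run_workflow(steps, cache)
print(final)
print(executed_on_restart)
```

On the restart, the completed "align" step is read from the cache and only the remaining steps run, so an interruption costs the in-progress step rather than the whole workflow.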

For Interactive Analysis (Jupyter, Julia, RStudio), the user can select instance size and the cost per hour is clearly labeled. Cost containment is also accomplished via automatic inactivity shutdown on Data Cruncher and Billing Groups. 

Cloud credits are available for Kids First X01 investigators to begin their analysis and may become available to other researchers in the future. For questions about cloud credits, email: KidsFirst@od.nih.gov.

To learn more, visit: What costs are there for using Kids First data and CAVATICA?

12. What does interoperability mean to Kids First?

Kids First seeks to enable the pediatric research community to find and access childhood cancer and structural birth defects datasets not only from Kids First, but also from other pediatric datasets and platforms.

As a key member of the NIH Cloud Platforms Interoperability (NCPI) effort, the Kids First Data Resource Center is collaborating with NCI Cancer Research Data Commons, NHGRI AnVIL, NHLBI BioData Catalyst and NCBI to enable and promote end-user analyses across these platforms through federation and interoperability. There are multiple challenges being addressed in NCPI, including operational barriers, the implementation of technical standards, and the creation and dissemination of training resources to transition researchers to using the cloud. To learn more visit: https://datascience.nih.gov/nih-cloud-platform-interoperability.

13. Where can I learn more about the Data Resource?

Visit the Kids First DRC website: https://kidsfirstdrc.org/

Contact their help desk: support@kidsfirstdrc.org

View their documentation: https://github.com/kids-first/

View the CAVATICA documentation: https://docs.CAVATICA.org/docs

Gabriella Miller Kids First Pediatric Research (Kids First)
