
Dataset Data Model

A Dataset is a structured collection of data organized for analysis, sharing, and research. Datasets can include many types of information, such as demographic statistics, experimental results, or survey responses. In MC2 Center-supported projects, dataset entries help ensure that data is properly categorized, easily retrievable, and compliant with sharing and storage requirements.

This model outlines key attributes that describe and manage datasets, including metadata about the data type, format, number of samples, and related grant information. By maintaining these attributes, datasets can be efficiently tracked and referenced within data repositories.

Why You Should Contribute Dataset Entries

Contributing dataset entries ensures that your data is easily discoverable, accurately documented, and compliant with research standards. Well-structured metadata enhances collaboration opportunities, increases data citation potential, and reduces administrative overhead for reporting and compliance. Additionally, dataset entries are required for sharing through the Cancer Complexity Knowledge Portal (CCKP).

You can submit dataset entries for data stored within or outside of Synapse, including data in repositories such as the Gene Expression Omnibus (GEO), the Database of Genotypes and Phenotypes (dbGaP), and Zenodo, so that they can be listed on the CCKP. Well-documented datasets help other researchers and stakeholders use your data effectively, maximizing its long-term impact.

Who Should Be Contributing Dataset Entries?

  1. Principal Investigators (PIs) – Increase the visibility and impact of your research by contributing properly cataloged datasets, making it easier for others to cite and use your work.
  2. Data Managers – Improve data organization and retrieval, reducing time spent on requests for data clarification and documentation during audits.
  3. Research Staff – Simplify project reporting by ensuring that datasets are complete with accurate descriptions, grant associations, and metadata.
  4. Collaborators and Partners – Enhance data interoperability across multiple institutions by contributing standardized entries that support shared research initiatives.

Download Template

You can download the dataset entry template, which includes all required fields, to streamline the data entry process.

Example Data Entry

The table below includes sample values to demonstrate proper attribute usage.

| Attribute | Example Value |
| --- | --- |
| Dataset Name | RNA Sequencing of Lung Cancer Samples 2021 |
| Dataset Alias | GSE56789 |
| Dataset Description | This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages. |
| Dataset Url | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789 |
| Dataset Assay | RNA Sequencing |
| Dataset Species | Homo sapiens |
| Dataset Tumor Type | Glioblastoma |
| Dataset Tissue | Lung |
| Dataset File Formats | CSV, PDF |
| Dataset Grant Number | CA209971 |
| Dataset Pubmed Id | Not applicable |
| Dataset View | Table |
| DatasetView_id | DatasetView_12345 |
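
As an illustration, the example entry above can be written out as a single row of a CSV manifest. This is a minimal sketch using only Python's standard library. The column headers are copied from the Attribute column, but whether the downloadable template uses exactly these headers, or CSV at all, is an assumption. The names `EXAMPLE_ENTRY` and `entry_to_csv` are hypothetical.

```python
import csv
import io

# Values copied from the example table above. The header spelling is assumed
# to match the attribute names; check the downloaded template before use.
EXAMPLE_ENTRY = {
    "Dataset Name": "RNA Sequencing of Lung Cancer Samples 2021",
    "Dataset Alias": "GSE56789",
    "Dataset Description": (
        "This dataset contains RNA sequencing data from 200 lung cancer "
        "samples, including gene expression profiles and patient clinical "
        "data. It is designed to study differential gene expression and "
        "mutation burden across tumor stages."
    ),
    "Dataset Url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789",
    "Dataset Assay": "RNA Sequencing",
    "Dataset Species": "Homo sapiens",
    "Dataset Tumor Type": "Glioblastoma",
    "Dataset Tissue": "Lung",
    "Dataset File Formats": "CSV, PDF",
    "Dataset Grant Number": "CA209971",
    "Dataset Pubmed Id": "Not applicable",
    "Dataset View": "Table",
    "DatasetView_id": "DatasetView_12345",
}


def entry_to_csv(entry):
    """Serialize one dataset entry as a header line plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(entry))
    writer.writeheader()
    writer.writerow(entry)
    return buffer.getvalue()


csv_text = entry_to_csv(EXAMPLE_ENTRY)
print(csv_text)
```

The `csv` module quotes any value that contains a comma, so multi-value cells such as `CSV, PDF` stay in a single column when the file is read back.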

Full Field Reference

Below is the full field reference table with attributes and their descriptions.


| Attribute | Description | Required | Validation Rules | Examples |
| --- | --- | --- | --- | --- |
| Dataset Name | Name of the dataset. | True | None | RNA Sequencing of Lung Cancer Samples 2021 |
| Dataset Alias | Alias of the dataset. Must be unique. Can be the GEO identifier, such as GSE12345, or a DOI. No Greek letters. | True | unique | GSE56789 |
| Dataset Description | Description of the dataset. | False | None | This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages. |
| Dataset Design | The overall design of the dataset. | False | None | 'Cross-sectional' to compare gene expression in healthy vs. tumor tissues, or 'Time-series' to observe changes during treatment. |
| Dataset Url | The URL where the dataset is stored. | True | url | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789 |
| Dataset Doi | The Digital Object Identifier (DOI) associated with the dataset. | True | url | |
| Dataset Assay | The assay the dataset is representative of. Multiple values permitted, comma separated. | True | list like | RNA Sequencing |
| Dataset Species | The species the data was collected from. Multiple values permitted, comma separated. | True | list like | Mouse |
| Dataset Tumor Type | The tumor type(s), if applicable, that the data was collected on. Multiple values permitted, comma separated. | False | list like | Glioblastoma |
| Dataset Tissue | Tissue type(s) associated with the dataset. Multiple values permitted, comma separated. | False | list like | Lung |
| Dataset File Formats | A list of file formats associated with the dataset. Multiple values permitted, comma separated. | False | list like | AVI, BAI, BAM, BED, CDS, CHP, COOL, CSV, DAE, DB, DS_Store, FASTA, FASTQ, FCS, FIG, FREQ, GCG, GCT, GCTx, GFF3, GTF, GZIP Format, HDF, HDF5, HTML, IDAT, JPG, JSON, LIF, MAP, MAT, MATLAB script, MSF, MTX, PDF, PNG, PZFX, Python Script, R File Format, RAW, RDS, ROUT, RPROJ, RTF, SGI, SRA, STAT, TAR Format, TDF, TIFF, TSV, TXT, VCF, WIG, XML, ZIP, bed12, bedgraph, cel, cloupe, docx, mzIdentML, mzXML, pptx, rcc, xls, xlsx, MGF, BIGWIG, H5AD, H5, SF, PKL, BPM, Unspecified, Pending Annotation, maf, CLS, SCN, SVS |
| Dataset View | The denormalized manifest for dataset submission. | False | None | Table, Spreadsheet |
| Dataset Grant Number | Grant number(s) associated with the dataset's development. Multiple values permitted, comma separated. | True | list like | CA209971 |
| Dataset Pubmed Id | The PubMed identifier(s) associated with the development of the dataset. Multiple values permitted, comma separated. | False | list like | 31245678 |
| DatasetView_id | A unique primary key that enables record updates using schematic. | True | unique | |
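
The Required and Validation Rules columns above can be approximated with a small pre-submission check. The sketch below is not how schematic validates manifests. It is a hand-written approximation in plain Python, and every function and constant name in it is hypothetical. The required, URL, and unique field lists are taken from the table. The URL test (an http or https scheme plus a host) and the Greek-letter test (Unicode character names containing "GREEK") are assumptions about what those rules mean.

```python
import unicodedata
from urllib.parse import urlparse

# Field groups taken from the reference table above.
REQUIRED_FIELDS = (
    "Dataset Name", "Dataset Alias", "Dataset Url", "Dataset Doi",
    "Dataset Assay", "Dataset Species", "Dataset Grant Number",
    "DatasetView_id",
)
URL_FIELDS = ("Dataset Url", "Dataset Doi")
UNIQUE_FIELDS = ("Dataset Alias", "DatasetView_id")


def split_list(value):
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def looks_like_url(value):
    """Assumed reading of the 'url' rule: http(s) scheme and a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def has_greek_letter(value):
    """True if any character's Unicode name mentions GREEK (assumed rule)."""
    return any("GREEK" in unicodedata.name(ch, "") for ch in value)


def validate_entries(entries):
    """Return a list of human-readable problems found in the entries."""
    errors = []
    seen = {field: set() for field in UNIQUE_FIELDS}
    for index, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            if not str(entry.get(field, "")).strip():
                errors.append(f"row {index}: missing required field '{field}'")
        for field in URL_FIELDS:
            value = str(entry.get(field, "")).strip()
            if value and not looks_like_url(value):
                errors.append(f"row {index}: '{field}' is not a valid URL")
        for field in UNIQUE_FIELDS:
            value = str(entry.get(field, "")).strip()
            if not value:
                continue
            if value in seen[field]:
                errors.append(f"row {index}: duplicate value in '{field}'")
            seen[field].add(value)
        if has_greek_letter(str(entry.get("Dataset Alias", ""))):
            errors.append(f"row {index}: 'Dataset Alias' contains a Greek letter")
    return errors


# Usage: two partial entries that share an alias and omit other required fields.
demo = [
    {"Dataset Name": "A", "Dataset Alias": "GSE56789"},
    {"Dataset Name": "B", "Dataset Alias": "GSE56789"},
]
for problem in validate_entries(demo):
    print(problem)
```

Because the table marks Dataset Doi as required, a check like this would also flag the example entry shown earlier, which has no DOI value. Multi-value cells can be normalized with `split_list` before comparing them against a controlled vocabulary.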