Dataset Data Model

A Dataset is a structured collection of data organized for analysis, sharing, and research. Datasets can include various types of information, such as demographic statistics, experimental results, or survey responses. In MC² Center-supported projects, dataset entries help ensure that data is properly categorized, easily retrievable, and compliant with sharing and storage requirements.

This model outlines key attributes that describe and manage datasets, including metadata about the data type, format, number of samples, and related grant information. By maintaining these attributes, datasets can be efficiently tracked and referenced within data repositories.

Why You Should Contribute Dataset Entries

Contributing dataset entries ensures that your data is easily discoverable, accurately documented, and compliant with research standards. Well-structured metadata enhances collaboration opportunities, increases data citation potential, and reduces administrative overhead for reporting and compliance. Additionally, dataset entries are required for sharing through the Cancer Complexity Knowledge Portal (CCKP).

You can submit dataset entries for data stored both within and outside of Synapse, including repositories like the Gene Expression Omnibus (GEO), Database of Genotypes and Phenotypes (dbGaP), and Zenodo, allowing them to be listed on the CCKP. Well-documented datasets help other researchers and stakeholders effectively use your data, maximizing its long-term impact.

Who Should Be Contributing Dataset Entries?

Principal Investigators (PIs) – Increase the visibility and impact of your research by contributing properly cataloged datasets, making it easier for others to cite and use your work.
Data Managers – Improve data organization and retrieval, reducing time spent on requests for data clarification and documentation during audits.
Research Staff – Simplify project reporting by ensuring that datasets are complete with accurate descriptions, grant associations, and metadata.
Collaborators and Partners – Enhance data interoperability across multiple institutions by contributing standardized entries that support shared research initiatives.

Download Template

You can download the dataset entry template, which includes all required fields, to streamline the data entry process.

Example Data Entry

The table below includes sample values to demonstrate proper attribute usage.

Attribute	Example Value
Dataset Name	RNA Sequencing of Lung Cancer Samples 2021
Dataset Alias	GSE56789
Dataset Description	This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages.
Dataset Url	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789
Dataset Assay	RNA Sequencing
Dataset Species	Homo sapiens
Dataset Tumor Type	Glioblastoma
Dataset Tissue	Lung
Dataset File Formats	CSV, PDF
Dataset Grant Number	CA209971
Dataset Pubmed Id	Not applicable
Dataset View	Table
DatasetView_id	DatasetView_12345
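
To make the example concrete, the entry above could be serialized as a one-row CSV file. This is only a sketch: it assumes the template is a CSV file whose column headers are the attribute names shown in the table, which this page does not confirm, and `entry_to_csv` is a hypothetical helper name.

```python
import csv
import io

# Sample values copied from the example table above. Using the attribute
# names as CSV column headers is an assumption about the template layout.
entry = {
    "Dataset Name": "RNA Sequencing of Lung Cancer Samples 2021",
    "Dataset Alias": "GSE56789",
    "Dataset Description": (
        "This dataset contains RNA sequencing data from 200 lung cancer "
        "samples, including gene expression profiles and patient clinical data."
    ),
    "Dataset Url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789",
    "Dataset Assay": "RNA Sequencing",
    "Dataset Species": "Homo sapiens",
    "Dataset Tumor Type": "Glioblastoma",
    "Dataset Tissue": "Lung",
    "Dataset File Formats": "CSV, PDF",
    "Dataset Grant Number": "CA209971",
    "Dataset Pubmed Id": "Not applicable",
    "Dataset View": "Table",
    "DatasetView_id": "DatasetView_12345",
}


def entry_to_csv(entry):
    """Return a header line plus one data row for a single dataset entry."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(entry))
    writer.writeheader()
    writer.writerow(entry)  # values containing commas are quoted automatically
    return buffer.getvalue()


csv_text = entry_to_csv(entry)
print(csv_text)
```

Multi-valued attributes such as Dataset File Formats stay in a single cell as a comma-separated string; the `csv` module quotes that cell so the embedded commas are not read as column separators.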

Full Field Reference

Below is the full field reference table with attributes and their descriptions.

Attribute	Description	Required	Validation Rules	Examples
Dataset Name	Name of the dataset	True	None	RNA Sequencing of Lung Cancer Samples 2021
Dataset Alias	Alias of the dataset. Must be unique. Can be a GEO identifier, such as GSE12345, or a DOI. Must not contain Greek letters.	True	unique	GSE56789
Dataset Description	Description of the dataset.	False	None	This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages.
Dataset Design	The overall design of the dataset.	False	None	'Cross-sectional' to compare gene expression in healthy vs. tumor tissues, or 'Time-series' to observe changes during treatment.
Dataset Url	The URL where the dataset is stored.	True	url	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789
Dataset Doi	The Digital Object Identifier (DOI) associated with the dataset.	True	url	(none provided)
Dataset Assay	The assay(s) represented in the dataset. Multiple values permitted, comma separated.	True	list like	RNA Sequencing
Dataset Species	The species the data was collected from. Multiple values permitted, comma separated.	True	list like	Mouse
Dataset Tumor Type	The tumor type(s) of the collected data, if applicable. Multiple values permitted, comma separated.	False	list like	Glioblastoma
Dataset Tissue	Tissue type(s) associated with the dataset. Multiple values permitted, comma separated.	False	list like	Lung
Dataset File Formats	A list of file formats associated with the dataset. Multiple values permitted, comma separated.	False	list like	"AVI, BAI, BAM, BED, CDS, CHP, COOL, CSV, DAE, DB, DS_Store, FASTA, FASTQ, FCS, FIG, FREQ, GCG, GCT, GCTx, GFF3, GTF, GZIP Format, HDF, HDF5, HTML, IDAT, JPG, JSON, LIF, MAP, MAT, MATLAB script, MSF, MTX, PDF, PNG, PZFX, Python Script, R File Format, RAW, RDS, ROUT, RPROJ, RTF, SGI, SRA, STAT, TAR Format, TDF, TIFF, TSV, TXT, VCF, WIG, XML, ZIP, bed12, bedgraph, cel, cloupe, docx, mzIdentML, mzXML, pptx, rcc, xls, xlsx, MGF, BIGWIG, H5AD, H5, SF, PKL, BPM, Unspecified, Pending Annotation, maf, CLS, SCN, SVS."
Dataset View	The denormalized manifest for dataset submission.	False	None	Table, Spreadsheet
Dataset Grant Number	Grant number(s) associated with the dataset's development. Multiple values permitted, comma separated.	True	list like	CA209971
Dataset Pubmed Id	The PubMed identifier(s) associated with the development of the dataset. Multiple values permitted, comma separated.	False	list like	31245678
DatasetView_id	A unique primary key that enables record updates using schematic.	True	unique	DatasetView_12345
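
The Required and Validation Rules columns above can be approximated with a small pre-submission check. The sketch below is an interpretation rather than the validation that schematic or the CCKP actually runs: it assumes `unique` means no repeated value across the entries in one submission, `url` means an http(s) URL with a host, and `list like` means a comma-separated string with at least one non-empty item. `FIELD_RULES`, `validate_entries`, and the helper names are hypothetical.

```python
from urllib.parse import urlparse

# (required, rule) per attribute, transcribed from the field reference table.
FIELD_RULES = {
    "Dataset Name": (True, None),
    "Dataset Alias": (True, "unique"),
    "Dataset Description": (False, None),
    "Dataset Design": (False, None),
    "Dataset Url": (True, "url"),
    "Dataset Doi": (True, "url"),
    "Dataset Assay": (True, "list like"),
    "Dataset Species": (True, "list like"),
    "Dataset Tumor Type": (False, "list like"),
    "Dataset Tissue": (False, "list like"),
    "Dataset File Formats": (False, "list like"),
    "Dataset View": (False, None),
    "Dataset Grant Number": (True, "list like"),
    "Dataset Pubmed Id": (False, "list like"),
    "DatasetView_id": (True, "unique"),
}


def split_list(value):
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_url(value):
    """Assumed meaning of the 'url' rule: http(s) scheme plus a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_entries(entries):
    """Return a list of human-readable problems found in the entries."""
    problems = []
    seen = {}  # attribute name -> values already used by earlier rows
    for row, entry in enumerate(entries, start=1):
        for field, (required, rule) in FIELD_RULES.items():
            value = (entry.get(field) or "").strip()
            if not value:
                if required:
                    problems.append(f"row {row}: {field} is required")
                continue
            if rule == "url" and not is_url(value):
                problems.append(f"row {row}: {field} is not a valid URL")
            elif rule == "list like" and not split_list(value):
                problems.append(f"row {row}: {field} has no list items")
            elif rule == "unique":
                used = seen.setdefault(field, set())
                if value in used:
                    problems.append(f"row {row}: {field} must be unique")
                used.add(value)
    return problems


# Required sample values from the example table; no DOI is given there.
example = {
    "Dataset Name": "RNA Sequencing of Lung Cancer Samples 2021",
    "Dataset Alias": "GSE56789",
    "Dataset Url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789",
    "Dataset Assay": "RNA Sequencing",
    "Dataset Species": "Homo sapiens",
    "Dataset Grant Number": "CA209971",
    "DatasetView_id": "DatasetView_12345",
}
problems = validate_entries([example])
print(problems)
```

Run against the required sample values from the example table, which include no DOI, this check reports only the missing `Dataset Doi`, since the reference table marks that attribute as required.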