Dataset Data Model
A Dataset is a structured collection of data organized for analysis, sharing, and research. Datasets can include various types of information, such as demographic statistics, experimental results, or survey responses. In MC2 Center-supported projects, dataset entries help ensure that data is properly categorized, easily retrievable, and compliant with sharing and storage requirements.
This model outlines key attributes that describe and manage datasets, including metadata about the data type, format, number of samples, and related grant information. By maintaining these attributes, datasets can be efficiently tracked and referenced within data repositories.
Why You Should Contribute Dataset Entries
Contributing dataset entries ensures that your data is easily discoverable, accurately documented, and compliant with research standards. Well-structured metadata enhances collaboration opportunities, increases data citation potential, and reduces administrative overhead for reporting and compliance. Additionally, dataset entries are required for sharing through the Cancer Complexity Knowledge Portal (CCKP).
You can submit dataset entries for data stored both within and outside of Synapse, including repositories like the Gene Expression Omnibus (GEO), Database of Genotypes and Phenotypes (dbGaP), and Zenodo, allowing them to be listed on the CCKP. Well-documented datasets help other researchers and stakeholders effectively use your data, maximizing its long-term impact.
Who Should Be Contributing Dataset Entries?
- Principal Investigators (PIs) – Increase the visibility and impact of your research by contributing properly cataloged datasets, making it easier for others to cite and use your work.
- Data Managers – Improve data organization and retrieval, reducing time spent on requests for data clarification and documentation during audits.
- Research Staff – Simplify project reporting by ensuring that datasets are complete with accurate descriptions, grant associations, and metadata.
- Collaborators and Partners – Enhance data interoperability across multiple institutions by contributing standardized entries that support shared research initiatives.
Download Template
You can download the dataset entry template, which includes all required fields, to streamline the data entry process.
Example Data Entry
The table below includes sample values to demonstrate proper attribute usage.
Attribute | Example Value |
---|---|
Dataset Name | RNA Sequencing of Lung Cancer Samples 2021 |
Dataset Alias | GSE56789 |
Dataset Description | This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages. |
Dataset Url | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789 |
Dataset Assay | RNA Sequencing |
Dataset Species | Homo sapiens |
Dataset Tumor Type | Glioblastoma |
Dataset Tissue | Lung |
Dataset File Formats | CSV, PDF |
Dataset Grant Number | CA209971 |
Dataset Pubmed Id | Not applicable |
Dataset View | Table |
DatasetView_id | DatasetView_12345 |
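
The example entry above can be sketched as a single manifest row. This is a minimal illustration only: it assumes the template is a CSV file whose column headers match the attribute names, which the template itself should confirm. The helper name `entry_to_csv` is hypothetical. The point it shows is that multi-value fields such as `Dataset File Formats` stay in one cell, and the CSV writer quotes them because they contain commas.

```python
import csv
import io

# Values copied from the example table; column names are assumed to match
# the attribute names used in the template.
entry = {
    "Dataset Name": "RNA Sequencing of Lung Cancer Samples 2021",
    "Dataset Alias": "GSE56789",
    "Dataset Url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789",
    "Dataset Assay": "RNA Sequencing",
    "Dataset Species": "Homo sapiens",
    "Dataset Tumor Type": "Glioblastoma",
    "Dataset Tissue": "Lung",
    "Dataset File Formats": "CSV, PDF",
    "Dataset Grant Number": "CA209971",
    "Dataset Pubmed Id": "Not applicable",
    "Dataset View": "Table",
    "DatasetView_id": "DatasetView_12345",
}


def entry_to_csv(row):
    """Serialize one entry as CSV text with a header line (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


manifest_text = entry_to_csv(entry)
print(manifest_text)
```

Reading the text back with `csv.DictReader` returns the comma-separated values as a single string per field, so any splitting into individual values happens after parsing.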
Full Field Reference
Below is the full field reference table with attributes and their descriptions.
Attribute | Description | Required | Validation Rules | Examples |
---|---|---|---|---|
Dataset Name | Name of the dataset | True | None | RNA Sequencing of Lung Cancer Samples 2021 |
Dataset Alias | Alias of the dataset. Must be unique. Can be a GEO identifier such as GSE12345, or a DOI. Greek letters are not allowed. | True | unique | GSE56789 |
Dataset Description | Description of the dataset. | False | None | This dataset contains RNA sequencing data from 200 lung cancer samples, including gene expression profiles and patient clinical data. It is designed to study differential gene expression and mutation burden across tumor stages. |
Dataset Design | The overall design of the dataset. | False | None | 'Cross-sectional' to compare gene expression in healthy vs. tumor tissues, or 'Time-series' to observe changes during treatment. |
Dataset Url | The URL where the dataset is stored. | True | url | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56789 |
Dataset Doi | The Digital Object Identifier (DOI) associated with the dataset. | True | url | (none provided) |
Dataset Assay | The assay the dataset is representative of. Multiple values permitted, comma separated. | True | list like | RNA Sequencing |
Dataset Species | The species the data was collected from. Multiple values permitted, comma separated. | True | list like | Mouse |
Dataset Tumor Type | The tumor type(s), if applicable, that the data was collected from. Multiple values permitted, comma separated. | False | list like | Glioblastoma |
Dataset Tissue | Tissue type(s) associated with the dataset. Multiple values permitted, comma separated. | False | list like | Lung |
Dataset File Formats | A list of file formats associated with the dataset. Multiple values permitted, comma separated. | False | list like | "AVI, BAI, BAM, BED, CDS, CHP, COOL, CSV, DAE, DB, DS_Store, FASTA, FASTQ, FCS, FIG, FREQ, GCG, GCT, GCTx, GFF3, GTF, GZIP Format, HDF, HDF5, HTML, IDAT, JPG, JSON, LIF, MAP, MAT, MATLAB script, MSF, MTX, PDF, PNG, PZFX, Python Script, R File Format, RAW, RDS, ROUT, RPROJ, RTF, SGI, SRA, STAT, TAR Format, TDF, TIFF, TSV, TXT, VCF, WIG, XML, ZIP, bed12, bedgraph, cel, cloupe, docx, mzIdentML, mzXML, pptx, rcc, xls, xlsx, MGF, BIGWIG, H5AD, H5, SF, PKL, BPM, Unspecified, Pending Annotation, maf, CLS, SCN, SVS." |
Dataset View | The denormalized manifest for dataset submission. | False | None | Table, Spreadsheet |
Dataset Grant Number | Grant number(s) associated with the dataset's development. Multiple values permitted, comma separated. | True | list like | CA209971 |
Dataset Pubmed Id | The PubMed identifier(s) associated with the development of the dataset. Multiple values permitted, comma separated. | False | list like | 31245678 |
DatasetView_id | A unique primary key that enables record updates using schematic. | True | unique | (none provided) |
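
The Required and Validation Rules columns above can be checked locally before submission. The following is a rough sketch under stated assumptions, not the validation that schematic or the CCKP actually performs: the function names (`split_list`, `is_url`, `validate_entries`) are hypothetical, "url" is interpreted as an http(s) address with a host, "unique" as no repeated value across the submitted entries, and "list like" as a comma-separated string.

```python
from urllib.parse import urlparse

# Field groups transcribed from the reference table above.
REQUIRED_FIELDS = [
    "Dataset Name", "Dataset Alias", "Dataset Url", "Dataset Doi",
    "Dataset Assay", "Dataset Species", "Dataset Grant Number",
    "DatasetView_id",
]
URL_FIELDS = ["Dataset Url", "Dataset Doi"]
UNIQUE_FIELDS = ["Dataset Alias", "DatasetView_id"]


def split_list(value):
    """Split a comma-separated ("list like") value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_url(value):
    """Assumed interpretation of the 'url' rule: http(s) scheme plus a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_entries(entries):
    """Return (row index, field, message) tuples for every rule violation."""
    problems = []
    seen = {field: set() for field in UNIQUE_FIELDS}
    for index, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            if not str(entry.get(field, "")).strip():
                problems.append((index, field, "missing required value"))
        for field in URL_FIELDS:
            value = str(entry.get(field, "")).strip()
            if value and not is_url(value):
                problems.append((index, field, "not a URL"))
        for field in UNIQUE_FIELDS:
            value = str(entry.get(field, "")).strip()
            if value:
                if value in seen[field]:
                    problems.append((index, field, "duplicate value"))
                seen[field].add(value)
    return problems
```

Calling `validate_entries` on a list of dictionaries keyed by attribute name returns an empty list when every rule passes. A check like this only catches structural problems early; the portal's own validation remains authoritative.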