Diabetes Research Datasets
This page is the IINTS-AF dataset map for local AI diabetes research. It lists the external datasets that are useful for simulation realism, glucose forecasting, and bench-only controller research.
Research-only boundary
These datasets can support pre-clinical simulation, model development, and documentation. They do not make IINTS-AF a medical device and must not be used for real insulin dosing.
One-Command Plan
Generate the local acquisition plan:
iints data research-plan --output-dir data_packs/research_dataset_plan
This writes:
| Artifact | Purpose |
|---|---|
| `DATASET_ACQUISITION_PLAN.md` | Human-readable acquisition checklist |
| `research_dataset_matrix.csv` | Spreadsheet-style source matrix |
| `dataset_registry_snapshot.json` | Exact registry metadata used for the plan |
| `SOURCE_CITATIONS.bib` | BibTeX citations for reports and papers |
| `research_dataset_plan_manifest.json` | Machine-readable manifest |
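A quick way to confirm that the plan was written completely is to check the output directory for the five artifacts listed above. The sketch below is a minimal illustration: `missing_plan_artifacts` is a hypothetical helper, not an SDK function, and it assumes the files are written flat into the `--output-dir` folder, which this page does not state explicitly.

```python
from pathlib import Path

# File names taken from the artifact table above.
EXPECTED_ARTIFACTS = (
    "DATASET_ACQUISITION_PLAN.md",
    "research_dataset_matrix.csv",
    "dataset_registry_snapshot.json",
    "SOURCE_CITATIONS.bib",
    "research_dataset_plan_manifest.json",
)


def missing_plan_artifacts(output_dir):
    """Return the expected artifact names that are absent from output_dir.

    Hypothetical helper, not part of the SDK. Assumes the artifacts sit
    directly inside the output directory (an assumption).
    """
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_plan_artifacts("data_packs/research_dataset_plan")
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All plan artifacts are present.")
```

An empty return value means all five files were found; any names returned point to an incomplete or interrupted run.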
Recommended Order
| Priority | Dataset | Best role in IINTS |
|---|---|---|
| 1 | `hupa_ucm` | Multimodal predictor training with glucose, insulin, carbs, steps, heart rate, calories, and sleep |
| 2 | `azt1d` | AID-oriented predictor training with detailed bolus and device-mode context |
| 3 | `t1d_uom` | Newer multimodal longitudinal validation set with nutrition, activity, and sleep |
| 4 | `ohio_t1dm` | Classic external benchmark for glucose prediction |
| 5 | `dclp3_idcl` | Closed-loop clinical-trial benchmark and external validation |
| 6 | `jaeb_loop` | Real-world AID/Loop validation source |
| 7 | `t1dexi` / `t1dexip` | Exercise-aware glucose and hypoglycemia-risk research |
| 8 | `d1namo` | Large multimodal/wearable research archive |
| 9 | `openaps_data_commons` | Community AID data, useful after access approval and careful provenance review |
| 10 | `metabonet` / `glucose_ml` | Dataset-selection and cross-dataset benchmark references |
Access Classes
| Access | Meaning | SDK behavior |
|---|---|---|
| `bundled` | Included with the SDK | `iints data fetch sample` works offline |
| `public-download` | Public URL is known | `iints data fetch <id>` requires pinned hashes or `--no-verify` |
| `manual` | User must download through source page or approved form | SDK writes instructions and expects local files |
| `request` | Requires approval or data-use agreement | SDK records source metadata but does not bypass access rules |
| `mixed` / `collection` | Meta-dataset or mixed access | Use as a benchmark map, not one homogeneous raw dataset |
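The "pinned hashes" requirement for `public-download` sources amounts to comparing a digest of the downloaded file with a value recorded in advance. The sketch below illustrates that idea with SHA-256 from the Python standard library. The hash algorithm and both helper names are assumptions made for illustration; this page does not say how the SDK computes or stores its hashes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest.

    Reading in chunks keeps memory use flat for large archives.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path, pinned_hex):
    """Return True when the file digest equals the pinned value.

    Hypothetical helper; SHA-256 is an assumed algorithm, not
    necessarily the one the SDK uses.
    """
    return sha256_of_file(path) == pinned_hex.strip().lower()
```

A mismatch means the file differs from the version that was pinned, so downstream results would no longer be tied to a known copy of the data.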
Current SDK Support
The SDK has dedicated preparation commands for the first three practical public pipelines:
iints research prepare-hupa
iints research prepare-azt1d
iints research prepare-ohio
For sources without a dedicated converter yet, use the generic importer after extracting the source data:
iints import-data \
--input-csv data_packs/public/<dataset_id>/raw/<file>.csv \
--output-dir data_packs/public/<dataset_id>/standard \
--data-format generic
Then blend only prepared, leakage-safe datasets:
iints research blend-datasets \
--source hupa=data_packs/public/hupa_ucm/processed/hupa_ucm_merged.csv \
--source azt1d=data_packs/public/azt1d/processed/azt1d_merged.csv \
--source ohio=data_packs/public/ohio_t1dm/processed/ohio_t1dm_merged.csv \
--output data_packs/processed/iints_research_blend.csv \
--manifest data_packs/processed/iints_research_blend_manifest.json
AI Training Rules
Use this split:
| Model type | Data source | Why |
|---|---|---|
| Glucose predictor | Real-world datasets | Learns glucose dynamics from measured data |
| Controller policy | Safety-supervised simulator/Jetson runs | Learns auditable research actions under known safety constraints |
| Local LLM assistant | Reports, model cards, MDMP payloads | Explains and reviews evidence; does not dose insulin |
Never train an autonomous insulin controller directly from mixed public data without a safety contract, subject-level split, MDMP review, and simulator-only validation.
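A subject-level split means every row from a given subject lands in exactly one partition, so a model is never evaluated on a person it saw during training. The sketch below shows one deterministic way to do this by hashing the subject key. The function names, the default fractions, and the choice to key on `source_dataset` plus `subject_id` are illustrative assumptions, not SDK behavior.

```python
import hashlib


def assign_split(source_dataset, subject_id, val_frac=0.15, test_frac=0.15):
    """Map one subject to 'train', 'validation', or 'test'.

    The result depends only on the subject key, so all rows of a subject
    share a partition and the assignment is stable across runs.
    """
    key = f"{source_dataset}:{subject_id}".encode("utf-8")
    raw = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Reduce the hash to a bucket in [0, 1) with 1/10000 resolution.
    bucket = (raw % 10_000) / 10_000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_rows(rows):
    """Group row dicts by partition using their subject key."""
    parts = {"train": [], "validation": [], "test": []}
    for row in rows:
        name = assign_split(row["source_dataset"], row["subject_id"])
        parts[name].append(row)
    return parts
```

Because the assignment is hash-based, the realized fractions are only approximate for small cohorts. In exchange, adding new subjects later never moves an existing subject between partitions.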
Provenance Checklist
Before using any dataset in a model card or EUCYS report:
- Record source URL, DOI, access date, version, and license/access terms.
- Keep raw data read-only under `data_packs/public/<dataset_id>/raw`.
- Save converted data under `data_packs/public/<dataset_id>/processed`.
- Keep `source_dataset` and `subject_id` in every row.
- Split train/validation/test by subject.
- Run `iints data realism-check` and `iints data certify`.
- Put the `research_dataset_plan_manifest.json` next to the model output.
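The row-level item in this checklist can be checked mechanically before training. The sketch below scans a prepared CSV and reports data rows where `source_dataset` or `subject_id` is blank or absent. It uses only the standard library, the helper name is hypothetical, and it is independent of the `iints` commands named above, whose internal checks are not described on this page.

```python
import csv

# Column names taken from the checklist above.
REQUIRED_COLUMNS = ("source_dataset", "subject_id")


def rows_missing_provenance(csv_path):
    """Return 1-based data-row numbers with a blank or absent provenance field.

    Hypothetical helper. If a required column is missing from the header,
    every data row is reported.
    """
    bad_rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for number, row in enumerate(reader, start=1):
            if any(not (row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
                bad_rows.append(number)
    return bad_rows
```

An empty list indicates that every row carries both provenance fields, which is the precondition for splitting by subject.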
Sources
- D1NAMO on Zenodo publishes the open multimodal D1NAMO archive and its license metadata.
- Jaeb Public Diabetes Datasets lists T1DEXI, DCLP3/iDCL, Loop, PEDAP, AIDE T1D, and other CGM-containing studies.
- NIDDK DCLP3/iDCL study page describes the iDCL/DCLP3 trial and points to public dataset access.
- T1D-UOM Scientific Data paper describes the longitudinal multimodal dataset and Zenodo repository.
- AZT1D arXiv paper describes the AID-oriented AZT1D dataset and its Mendeley DOI.
- MetaboNet arXiv paper describes the consolidated T1D dataset and mixed-access public/DUA model.
- Glucose-ML arXiv paper summarizes a benchmark collection of longitudinal diabetes datasets for robust AI work.