# Data Dictionary / Codebook — `mh_gap_youth_state_v1.csv`
**Dataset**: `analysis_outputs.mh_gap_youth_v1_state_v1` → MinIO `s3://data-stories/youth_mental_health_gap/dataset_v1/mh_gap_youth_state_v1.csv`
**Rows**: 35 states (after population threshold)
**Fields**: 12 analytic fields, plus 2 metadata columns (`study_id`, `run_ts`)
**License**: CC-BY-4.0 (planned)
**Citation**: Trellison Institute. *Youth Mental Health Access Gap V1 State-Level Dataset* (May 2026). DOI: [pending Zenodo registration]
## Field reference
| # | Field | Type | Unit | Source | Description | Null semantics |
|---|---|---|---|---|---|---|
| 1 | `state_abbr` | string(2) | — | postal abbreviation | 2-letter state code. Primary key. | never null |
| 2 | `state_name` | string | — | Census/ACS | Full state name. | never null |
| 3 | `geography_id` | string(2) | — | framework | Same as `state_abbr` for this analysis. | never null |
| 4 | `total_pop` | int | under-18 persons | ACS B09001_001E 2023 | Under-18 population, total. The `p_s` weight in `M_national`. | never null |
| 5 | `need_value` | float | percent (0-100) | CDC YRBSS 2023 | Percent of high-school students reporting persistent sadness or hopelessness for 2+ weeks in the past 12 months. Uses the "Total" demographic group. | null if the state did not participate in YRBSS |
| 6 | `covariate_value` | float | percent (0-100) | ACS S2701_C05_002E 2023 | Percent of the state's under-19 population that is uninsured. | never null in published rows |
| 7 | `access_value` | float | providers per 100K under-18 | derived (NPPES + ACS) | State count of youth-serving NPPES providers ÷ under-18 population × 100,000. | never null |
| 8 | `gap_ratio` | float | (need×1000)/access | derived | `(need_value × 1000) / access_value`. Higher = more need per unit of state supply. | null if `need_value` is null or `access_value` is zero |
| 9 | `log_gap_ratio` | float | natural log | derived | `ln(gap_ratio)`. Regression target. | null if gap_ratio null/non-positive |
| 10 | `residual_raw` | float | (unitless) | national OLS | `log_gap_ratio - (α + β × covariate_value)`. Single national regression residual. | null if regression undefined |
| 11 | `residual_z` | float | σ | derived | `residual_raw / σ_national_residual`. | null if regression undefined |
| 12 | `residual_class` | enum | — | derived | `positive_outlier` (z > 1.5), `negative_outlier` (z < -1.5), `expected` (-1.5 ≤ z ≤ 1.5), `insufficient_data` (`residual_z` is null). | always assigned |
| (meta) | `study_id` | string | — | framework | "mh_gap_youth_v1" — study identifier. | never null |
| (meta) | `run_ts` | datetime | UTC | framework | Pipeline execution timestamp. | never null |
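
The derived fields (8 to 12) can be reproduced from `need_value`, `covariate_value`, and `access_value`. The sketch below follows the formulas in the table using only the Python standard library. It is not the framework's implementation. Three choices are assumptions rather than documented behavior: the sample standard deviation is used for `σ_national_residual`, the regression is treated as undefined with fewer than three usable rows or a constant covariate, and the helper names (`derive`, `fit_ols`, `classify`) are hypothetical.

```python
import math
from statistics import mean, stdev

Z_THRESHOLD = 1.5  # outlier cut-off from the residual_class definition


def gap_ratio(need_value, access_value):
    """(need_value * 1000) / access_value, or None if it cannot be computed."""
    if need_value is None or access_value is None or access_value == 0:
        return None
    return need_value * 1000 / access_value


def log_gap_ratio(ratio):
    """Natural log of gap_ratio; None when the ratio is null or non-positive."""
    if ratio is None or ratio <= 0:
        return None
    return math.log(ratio)


def fit_ols(xs, ys):
    """Closed-form simple OLS: returns (alpha, beta) for y = alpha + beta * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


def classify(z):
    """Map residual_z to the residual_class vocabulary."""
    if z is None:
        return "insufficient_data"
    if z > Z_THRESHOLD:
        return "positive_outlier"
    if z < -Z_THRESHOLD:
        return "negative_outlier"
    return "expected"


def derive(rows):
    """Fill fields 8-12 in place for a list of row dicts and return the list."""
    for r in rows:
        r["gap_ratio"] = gap_ratio(r["need_value"], r["access_value"])
        r["log_gap_ratio"] = log_gap_ratio(r["gap_ratio"])
        r["residual_raw"] = r["residual_z"] = None

    usable = [r for r in rows
              if r["log_gap_ratio"] is not None and r["covariate_value"] is not None]
    xs = [r["covariate_value"] for r in usable]
    ys = [r["log_gap_ratio"] for r in usable]

    # Assumption: fewer than 3 usable rows or a constant covariate means
    # the national regression is undefined, so residuals stay None.
    if len(usable) >= 3 and len(set(xs)) > 1:
        alpha, beta = fit_ols(xs, ys)
        for r in usable:
            r["residual_raw"] = r["log_gap_ratio"] - (alpha + beta * r["covariate_value"])
        sigma = stdev([r["residual_raw"] for r in usable])  # assumption: sample std
        if sigma > 0:
            for r in usable:
                r["residual_z"] = r["residual_raw"] / sigma

    for r in rows:
        r["residual_class"] = classify(r["residual_z"])
    return rows
```

Calling `derive` on a list of dicts that carry `need_value`, `covariate_value`, and `access_value` adds the remaining derived keys. A row whose `need_value` is null ends up with `residual_class == "insufficient_data"`.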
## Type conventions
- `string`: UTF-8 text.
- `int`: 32-bit unsigned.
- `float`: IEEE 754 double precision, CSV-formatted to ~6 decimal places.
- `enum`: fixed vocabulary listed inline.
- `datetime`: ISO-8601 with `+00:00` UTC.
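
The conventions above can be applied when loading the CSV. The following is a minimal sketch using the standard library, assuming nulls are serialized as empty cells (the codebook does not state the null encoding) and using hypothetical helper names (`parse_cell`, `read_codebook_csv`). The sample row contains made-up values for illustration only.

```python
import csv
import io
from datetime import datetime, timedelta

INT_FIELDS = {"total_pop"}
FLOAT_FIELDS = {"need_value", "covariate_value", "access_value", "gap_ratio",
                "log_gap_ratio", "residual_raw", "residual_z"}
DATETIME_FIELDS = {"run_ts"}
UINT32_MAX = 2**32 - 1


def parse_cell(field, text):
    """Convert one CSV cell to its codebook type; empty text becomes None."""
    if text == "":
        return None  # assumption: nulls are written as empty cells
    if field in INT_FIELDS:
        value = int(text)
        if not 0 <= value <= UINT32_MAX:
            raise ValueError(f"{field}={value} is outside the 32-bit unsigned range")
        return value
    if field in FLOAT_FIELDS:
        return float(text)
    if field in DATETIME_FIELDS:
        ts = datetime.fromisoformat(text)
        if ts.utcoffset() != timedelta(0):
            raise ValueError(f"{field}={text!r} is not a UTC timestamp")
        return ts
    return text  # string and enum fields stay as text


def read_codebook_csv(handle):
    """Read every row from an open CSV handle, typing each cell by field name."""
    return [{k: parse_cell(k, v) for k, v in row.items()}
            for row in csv.DictReader(handle)]


# Illustrative input only; these values are not taken from the dataset.
sample = io.StringIO(
    "state_abbr,total_pop,need_value,gap_ratio,run_ts\n"
    "XX,123456,38.5,,2026-05-01T12:00:00+00:00\n"
)
sample_rows = read_codebook_csv(sample)
```

Enum values are kept as plain strings here; validating `residual_class` against its four-value vocabulary would be a natural extension.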
## Population coverage
- 35 states remain after applying the population threshold (at least 50,000 under-18 residents); the threshold excludes only the smallest territories.
- These states cover 41,812,441 under-18 residents, approximately 57% of the U.S. under-18 population.
- 17 U.S. states are absent because they did not participate in YRBSS 2023 or did not have releasable state-level data.
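
The coverage figures above can be re-checked from the `total_pop` column. This is a small sketch, assuming rows are dicts keyed by field name; `coverage_summary` is a hypothetical helper, and the national under-18 total is a caller-supplied value because the dataset does not include it.

```python
MIN_UNDER18 = 50_000  # population threshold stated in this section


def coverage_summary(rows, national_under18):
    """Return (n_kept, covered_population, share_of_national).

    rows: iterable of dicts with an integer "total_pop" value.
    national_under18: U.S. under-18 total, supplied by the caller.
    """
    kept = [r for r in rows if r["total_pop"] >= MIN_UNDER18]
    covered = sum(r["total_pop"] for r in kept)
    return len(kept), covered, covered / national_under18
```

For the published file, the first two values should come out as 35 and 41,812,441. A share of about 57% implies a national denominator of roughly 73 million under-18 residents.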
## Known issues
- **YRBSS state coverage gaps**: 17 U.S. states missing from 2023 data; this is a known limitation of YRBSS state-level participation, not a quality issue with the present analysis.
- **State-level NPPES supply** does not reflect provider-level capacity (hours, wait times, network status, or whether a provider accepts under-18 patients).
- **Single covariate** (uninsured rate): multivariate residual analysis is planned for v1.2.
- **Negative-outlier class is small** (2 states: VT, AK). The framework handles small classes, but statistical power to characterize this one is limited.
## Reproducibility
This dataset is the output of `atlas.need_vs_access_framework_v1` v1.1.0 (DaedArch tool registry). See `mh_gap_youth_v1_methodology_supplement.md` §7 for the end-to-end reproduction recipe.