# Data Dictionary / Codebook — `mh_gap_tract_v1.csv`
**Dataset**: `analysis_outputs.mh_gap_tract_v1` → MinIO `s3://data-stories/mental_health_gap/dataset_v1/mh_gap_tract_v1.csv`
**Rows**: 78,815 census tracts
**Fields**: 27
**sha256**: `90a7baee7793ad89d3c70017…` (full hash in `manifest.json`)
**License**: CC-BY-4.0 (planned)
**Citation**: Trellison Institute. *Mental Health Access Gap V1 Tract-Level Dataset* (May 2026). DOI: [pending Zenodo registration]
## Field reference
| # | Field | Type | Unit | Source | Description | Null semantics |
|---|---|---|---|---|---|---|
| 1 | `tract_fips` | string(11) | — | Census 2020 | 11-digit GEOID (state(2)+county(3)+tract(6)). Primary key. | never null |
| 2 | `state_abbr` | string(2) | — | Census | 2-letter postal state abbreviation. | never null |
| 3 | `state_name` | string | — | Census | Full state name. | never null |
| 4 | `county_fips` | string(5) | — | Census 2020 | 5-digit county FIPS (state(2)+county(3)). | never null |
| 5 | `county_name` | string | — | Census | County name without "County" suffix. | never null |
| 6 | `total_population` | int | persons | ACS 5-year (PLACES base) | Total tract population. | null if PLACES base missing |
| 7 | `total_pop_18plus` | int | adults ≥18 | ACS 5-year (PLACES base) | Adult (≥18) tract population. Used as `p_t` in pop-weighted aggregation. | null if PLACES base missing |
| 8 | `mhlth_crudeprev` | float | percent (0-100) | CDC PLACES (BRFSS small-area) | Prevalence of frequent mental distress (mental health reported as not good on ≥14 of the past 30 days). The `need_metric`. | null if PLACES suppressed for tract |
| 9 | `access2_crudeprev` | float | percent (0-100) | CDC PLACES (BRFSS small-area) | Prevalence of adults 18-64 without health insurance. The `covariate_metric`. | null if suppressed |
| 10 | `need_adults_in_distress` | int | adults | derived | `mhlth_crudeprev × total_pop_18plus / 100`. Implied count of adults in frequent distress. | null if either input null |
| 11 | `state_supply_per_100k` | float | providers per 100K adults | CMS NPPES + ACS | State-level mental-health provider density (5 taxonomies: psychiatry, psychology, clinical social work, MFT, psychiatric NP). | never null |
| 12 | `gap_ratio` | float | (need × 1000) / supply | derived | `(mhlth_crudeprev × 1000) / state_supply_per_100k`. Larger = more demand per unit of supply. | null if state_supply zero |
| 13 | `log_gap_ratio` | float | natural log | derived | `ln(gap_ratio)`. Used as regression target for residual analysis. | null if gap_ratio null or non-positive |
| 14 | `regression_intercept` | float | (unitless) | derived per-state | `α_s` in `log_gap_ratio = α_s + β_s × access2`. State-fixed intercept. | null for states with <5 tracts |
| 15 | `regression_slope_access2` | float | (unitless) | derived per-state | `β_s` slope on covariate. | null for states with <5 tracts |
| 16 | `residual_raw` | float | (unitless) | derived | `ε_t = log_gap_ratio - (α_s + β_s × access2)`. | null if regression null |
| 17 | `residual_zscore` | float | σ | derived | `(ε_t - μ_ε_s) / σ_ε_s` within state's residual distribution. | null if regression null |
| 18 | `residual_class` | enum | — | derived | `positive_outlier` (z>1.5), `negative_outlier` (z<-1.5), `expected` (-1.5≤z≤1.5), `insufficient_data`. | `insufficient_data` for states <5 tracts |
| 19 | `nearest_provider_miles` | float | miles | derived (haversine × 1.4) | Haversine distance × 1.4 road-multiplier to nearest provider ZIP centroid. | null if no candidate ZIP within neighbor-state catchment |
| 20 | `nearest_provider_minutes` | float | minutes | derived | `miles × 60 / speed_mph`. Urban (pop>4000) 35mph; rural 55mph. | null if miles null |
| 21 | `nearest_provider_zip` | string(5) | — | CMS NPPES | 5-digit ZIP of nearest provider's practice address. | null if minutes null |
| 22 | `nearest_provider_state` | string(2) | — | CMS NPPES | State of nearest provider (may differ from `state_abbr` — inter-state catchment). | null if minutes null |
| 23 | `drive_time_class` | enum | — | derived | `in_30min`, `in_60min`, `over_60min`, `no_provider_found`. | `no_provider_found` if minutes null |
| 24 | `drive_time_30min_flag` | bool | 0/1 | derived | True iff drive_time_class == "in_30min". | never null |
| 25 | `drive_time_60min_flag` | bool | 0/1 | derived | True iff drive_time_class in {"in_30min","in_60min"}. | never null |
| 26 | `computed_at` | datetime | ISO-8601 UTC | derived | Timestamp the row was computed. | never null |
| 27 | `drive_time_computed_at` | datetime | ISO-8601 UTC | derived | Timestamp the drive-time fields were computed (separate from gap fields). | never null |
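
The derived fields above can be sketched in plain Python. This is a minimal illustration of the formulas in the table, not the pipeline's actual code. The function names, the Earth radius constant, the inclusive 30/60-minute boundaries, the use of tract population for the urban/rural test, and the treatment of a missing `mhlth_crudeprev` or z-score are all assumptions that the table does not state.

```python
import math
from typing import Optional

EARTH_RADIUS_MILES = 3958.8  # assumption: mean Earth radius, not given in the codebook
ROAD_MULTIPLIER = 1.4        # field 19: haversine distance x 1.4


def gap_ratio(mhlth_crudeprev: Optional[float], state_supply_per_100k: float) -> Optional[float]:
    """Field 12: (need x 1000) / supply. None when supply is zero.
    Returning None for a missing need value is an assumption."""
    if mhlth_crudeprev is None or state_supply_per_100k == 0:
        return None
    return mhlth_crudeprev * 1000 / state_supply_per_100k


def log_gap_ratio(gap: Optional[float]) -> Optional[float]:
    """Field 13: natural log; None when gap_ratio is null or non-positive."""
    if gap is None or gap <= 0:
        return None
    return math.log(gap)


def residual_class(z: Optional[float]) -> str:
    """Field 18: classify a within-state z-score.
    Mapping a missing z-score to insufficient_data is an assumption."""
    if z is None:
        return "insufficient_data"
    if z > 1.5:
        return "positive_outlier"
    if z < -1.5:
        return "negative_outlier"
    return "expected"


def road_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Field 19: great-circle (haversine) miles scaled by the road multiplier."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a)) * ROAD_MULTIPLIER


def drive_minutes(miles: Optional[float], tract_population: int) -> Optional[float]:
    """Field 20: miles x 60 / speed_mph. Urban (pop > 4000) 35 mph, rural 55 mph.
    Which population field "pop" refers to is an assumption; a missing
    population is not handled in this sketch."""
    if miles is None:
        return None
    speed_mph = 35 if tract_population > 4000 else 55
    return miles * 60 / speed_mph


def drive_time_class(minutes: Optional[float]) -> str:
    """Field 23. Inclusive boundaries (<= 30, <= 60) are an assumption."""
    if minutes is None:
        return "no_provider_found"
    if minutes <= 30:
        return "in_30min"
    if minutes <= 60:
        return "in_60min"
    return "over_60min"
```

With these definitions, fields 24 and 25 follow directly from the class: `drive_time_30min_flag` is `drive_time_class(m) == "in_30min"`, and `drive_time_60min_flag` is `drive_time_class(m) in {"in_30min", "in_60min"}`.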
## Type conventions
- `string`: UTF-8 text, no normalization beyond CSV escaping.
- `int`: 32-bit unsigned integer (all values are ≥0).
- `float`: IEEE 754 double precision (CSV-formatted to ~6 decimal places).
- `bool`: serialized as `True`/`False` (Python convention).
- `enum`: fixed vocabulary listed inline.
- `datetime`: ISO-8601 with explicit `+00:00` UTC suffix.
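
A reader loading the CSV may want to apply these conventions explicitly. The sketch below uses only the standard library and parses a handful of the 27 fields. It assumes that nulls are serialized as empty cells, which this codebook does not state. The helper names and the sample row are hypothetical, and the sample values are synthetic.

```python
import csv
import io
from datetime import datetime
from typing import Optional


def parse_int(raw: str) -> Optional[int]:
    # Assumption: an empty cell means null.
    return int(raw) if raw != "" else None


def parse_float(raw: str) -> Optional[float]:
    # Assumption: an empty cell means null.
    return float(raw) if raw != "" else None


def parse_bool(raw: str) -> bool:
    # bool fields are serialized as the literal strings True / False.
    if raw == "True":
        return True
    if raw == "False":
        return False
    raise ValueError(f"unexpected bool literal: {raw!r}")


def parse_row(row: dict) -> dict:
    return {
        # Keep FIPS codes as strings so leading zeros survive.
        "tract_fips": row["tract_fips"],
        "total_pop_18plus": parse_int(row["total_pop_18plus"]),
        "mhlth_crudeprev": parse_float(row["mhlth_crudeprev"]),
        "drive_time_30min_flag": parse_bool(row["drive_time_30min_flag"]),
        # fromisoformat accepts the explicit +00:00 suffix and returns an aware datetime.
        "computed_at": datetime.fromisoformat(row["computed_at"]),
    }


# Synthetic one-row sample with a subset of columns, for illustration only.
SAMPLE = (
    "tract_fips,total_pop_18plus,mhlth_crudeprev,drive_time_30min_flag,computed_at\n"
    "01001020100,1500,,True,2026-05-01T12:00:00+00:00\n"
)

parsed = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Reading `tract_fips` and `county_fips` as integers would silently drop the leading zero for states with FIPS codes below 10, which is why the sketch leaves them untouched.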
## Population coverage
- 78,815 of ~84,400 U.S. census tracts have both `mhlth_crudeprev` and `access2_crudeprev` present (PLACES coverage).
- Of those, 78,814 have lat/lon centroids (1 tract dropped from spatial joins).
- Regional split of the 78,814 tracts with centroids: CONUS 78,213 · Alaska 175 · Hawaii 426.
## Known issues
- ~370 PLACES tracts (<0.5%) are suppressed for one measure but not the other; we retain the row with null in the suppressed field.
- Inter-state catchment can place `nearest_provider_state` ≠ `state_abbr` in border areas (~3% of tracts).
- For states with very few tracts (DC, the 5 inhabited U.S. territories before PLACES applicability), no within-state regression is fitted (the cutoff is fewer than 5 tracts), so the regression and residual fields are null and `residual_class` is `"insufficient_data"`.
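
Consumers who want to measure how much these issues affect their own extract could start from a check like the one below. It is a sketch only: it assumes rows have already been parsed into dictionaries keyed by field name with nulls represented as `None`, and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def one_sided_suppression(row: Dict[str, Optional[object]]) -> bool:
    """True when exactly one of the two PLACES measures is null."""
    return (row["mhlth_crudeprev"] is None) != (row["access2_crudeprev"] is None)


def crosses_state_line(row: Dict[str, Optional[object]]) -> bool:
    """True when the nearest provider sits in a different state than the tract."""
    provider_state = row["nearest_provider_state"]
    return provider_state is not None and provider_state != row["state_abbr"]


def summarize_known_issues(rows: List[Dict[str, Optional[object]]]) -> Dict[str, float]:
    n = len(rows)
    cross_state = sum(crosses_state_line(r) for r in rows)
    return {
        "one_sided_suppression": sum(one_sided_suppression(r) for r in rows),
        "cross_state_share": cross_state / n if n else 0.0,
        "insufficient_data": sum(r["residual_class"] == "insufficient_data" for r in rows),
    }
```

Comparing `cross_state_share` against the roughly 3% figure quoted above is a quick sanity check that a filtered subset still resembles the full dataset.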
## Reproducibility
This dataset is the output of the `atlas.need_vs_access_framework_v1` v1.0.0 pipeline (DaedArch tool registry, DB-native). See `mh_gap_v1_methodology_supplement.md` §6 for the end-to-end reproduction recipe.