DataHub Python Builds

These prebuilt wheel files can be used to install our Python packages as of a specific commit.

Build context

Built at 2025-10-30T04:29:38.403818+00:00.

Branch: feat-make-pyspark-optional
Commit: 80f9806b0d7bb288885082bc6226c46c15c66935
PR #15123: feat(ingestion): Make PySpark optional for S3, ABS, and Unity Catalog sources
https://github.com/datahub-project/datahub/pull/15123

Commit message:

feat(docker): Add slim and locked variants for PySpark-optional deployment

Introduces slim and locked Docker image variants for both datahub-ingestion and datahub-actions, for environments with different PySpark requirements and security constraints.

**Image Variants**:

1. **Full (default)**: With PySpark, network enabled
   - Includes PySpark for data profiling
   - Can install packages from PyPI at runtime
   - Backward compatible with existing deployments

2. **Slim**: Without PySpark, network enabled
   - Excludes PySpark (~500MB smaller)
   - Uses s3-slim, gcs-slim, abs-slim for data lake sources
   - Can still install packages from PyPI if needed

3. **Locked** (NEW): Without PySpark, network BLOCKED
   - Excludes PySpark
   - Blocks ALL network access to PyPI/UV indexes
   - datahub-actions: ONLY bundled venvs, no main ingestion install
   - Most secure/restrictive variant for production

**Additional Changes**:

**1. pyspark_utils.py**: Fixed module-level exports
   - Added SparkSession, DataFrame, AnalysisRunBuilder, PandasDataFrame as None
   - These can now be imported even when PySpark is unavailable
   - Prevents ImportError in s3-slim installations

**2. setup.py**: Moved cachetools to s3_base
   - operation_config.py uses cachetools unconditionally
   - Now available in s3-slim without requiring data_lake_profiling

**3. build_bundled_venvs_unified.py**: Added slim_mode support
   - BUNDLED_VENV_SLIM_MODE flag controls package extras
   - When true: installs s3-slim, gcs-slim, abs-slim (no PySpark)
   - When false: installs s3, gcs, abs (with PySpark)
   - Venv named {plugin}-bundled (e.g., s3-bundled) for executor compatibility

**4. datahub-actions/Dockerfile**: Three-variant structure
   - bundled-venvs-full: s3 with PySpark
   - bundled-venvs-slim: s3-slim without PySpark
   - bundled-venvs-locked: s3-slim without PySpark
   - final-full: Has PySpark, network enabled, full install
   - final-slim: No PySpark, network enabled, slim install
   - final-locked: No PySpark, network BLOCKED, NO main install (bundled venvs only)

**5. datahub-ingestion/Dockerfile**: Added locked stage
   - install-full: All sources with PySpark
   - install-slim: Selected sources with s3-slim (no PySpark)
   - install-locked: Minimal sources with s3-slim, network BLOCKED

**6. build.gradle**: Updated variants and defaults
   - defaultVariant: "full" (restored to original)
   - Variants: full (no suffix), slim (-slim), locked (-locked)
   - Build args properly set for all variants

**Network Blocking in Locked Variant**:
```dockerfile
ENV UV_INDEX_URL=http://127.0.0.1:1/simple
ENV PIP_INDEX_URL=http://127.0.0.1:1/simple
```
This prevents all PyPI downloads while still allowing packages cached during the build.

**Bundled Venv Naming**:
- Venv is named `s3-bundled` (not `s3-slim-bundled`)
- Recipe uses `type: s3` (the standard plugin name)
- Executor finds the `s3-bundled` venv automatically
- Slim/locked: venv uses the s3-slim package internally (no PySpark)
- Full: venv uses the s3 package (with PySpark)

**Testing**:
✅ Full variant: PySpark installed, network enabled
✅ Slim variant: PySpark NOT installed, network enabled, s3-bundled venv works
✅ Integration tests: 12 tests validate s3-slim functionality

**Build Commands**:
```bash
./gradlew :datahub-actions:docker
./gradlew :docker:datahub-ingestion:docker

./gradlew :datahub-actions:docker -PdockerTarget=slim
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=slim

./gradlew :datahub-actions:docker -PdockerTarget=locked
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=locked

./gradlew :datahub-actions:docker -PmatrixBuild=true
./gradlew :docker:datahub-ingestion:docker -PmatrixBuild=true
```

**Recipe Format** (works with all variants):
```yaml
source:
  type: s3  # Uses the existing "s3" source type
  config:
    path_specs:
      - include: "s3://bucket/*.csv"
    profiling:
      enabled: false  # Required for slim/locked
```
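
To inspect the exact source these wheels were built from, the commit above can be checked out directly; this is a minimal sketch using the repository from the PR link:

```bash
# Fetch the source tree at the commit listed in the build context above.
git clone https://github.com/datahub-project/datahub.git
cd datahub
git checkout 80f9806b0d7bb288885082bc6226c46c15c66935
```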

Usage

Current base URL: unknown. Replace `<base-url>` in the commands below with the host actually serving these artifacts.

| Package | Size | Install command |
|---------|------|-----------------|
| acryl-datahub | 2.414 MB | `uv pip install 'acryl-datahub @ <base-url>/artifacts/wheels/acryl_datahub-0.0.0.dev1-py3-none-any.whl'` |
| acryl-datahub-actions | 0.101 MB | `uv pip install 'acryl-datahub-actions @ <base-url>/artifacts/wheels/acryl_datahub_actions-0.0.0.dev1-py3-none-any.whl'` |
| acryl-datahub-airflow-plugin | 0.039 MB | `uv pip install 'acryl-datahub-airflow-plugin @ <base-url>/artifacts/wheels/acryl_datahub_airflow_plugin-0.0.0.dev1-py3-none-any.whl'` |
| acryl-datahub-dagster-plugin | 0.019 MB | `uv pip install 'acryl-datahub-dagster-plugin @ <base-url>/artifacts/wheels/acryl_datahub_dagster_plugin-0.0.0.dev1-py3-none-any.whl'` |
| acryl-datahub-gx-plugin | 0.010 MB | `uv pip install 'acryl-datahub-gx-plugin @ <base-url>/artifacts/wheels/acryl_datahub_gx_plugin-0.0.0.dev1-py3-none-any.whl'` |
| prefect-datahub | 0.011 MB | `uv pip install 'prefect-datahub @ <base-url>/artifacts/wheels/prefect_datahub-0.0.0.dev1-py3-none-any.whl'` |
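
As a usage sketch: the `BASE_URL` value below is a hypothetical placeholder for your artifact host, and the `s3-slim` extra is the one introduced by this PR's setup.py changes; whether an extra applies depends on the package you install.

```bash
# BASE_URL is a hypothetical artifact host; substitute your own.
BASE_URL="https://artifacts.example.com"
uv pip install "acryl-datahub[s3-slim] @ ${BASE_URL}/artifacts/wheels/acryl_datahub-0.0.0.dev1-py3-none-any.whl"

# On a slim (PySpark-free) install, this should print True:
python -c 'import importlib.util; print(importlib.util.find_spec("pyspark") is None)'
```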