---
inclusion: always
---

# SageMaker Unified Studio Space Context

This workspace is running on an Amazon SageMaker Unified Studio Space.

## Environment
- Operating system: Ubuntu-based SageMaker Distribution
- User: `sagemaker-user`
- Home directory: `/home/sagemaker-user`
- AWS credentials are available via the container credentials provider (`AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`)
- Do NOT hardcode AWS credentials; use the default credential chain (e.g., `boto3.Session()`)
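
A quick way to confirm the credential setup described above: the container credentials provider announces itself through the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable, which boto3's default chain consumes automatically. This sketch only inspects the variable with the standard library; it makes no AWS calls.

```python
import os

# The Space injects this variable; a plain boto3.Session() (no arguments)
# picks credentials up from it automatically -- no keys belong in code.
uri = os.environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
if uri:
    print("Container credentials provider available")
else:
    print("Variable not set; boto3 falls back to the rest of the default chain")
```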

## Project Info
- `~/README.md` contains project-specific configuration such as connection names and available compute resources.
- `~/shared/README.md` contains the shared project data catalog and storage information.

Refer to these files when you need details about the project's connections, databases, or S3 paths.

## Project Library (`sagemaker_studio`)
The `sagemaker_studio` package is pre-installed and provides access to project resources.

### Project
```python
from sagemaker_studio import Project

project = Project()

project.id                          # project ID
project.name                        # project name
project.iam_role                    # project IAM role ARN
project.kms_key_arn                 # project KMS key ARN (if configured)
project.mlflow_tracking_server_arn  # MLflow tracking server ARN (if configured)
project.s3.root                     # project S3 root path
```

### Connections
```python
project.connections                    # list all connections
conn = project.connection()            # default IAM connection
conn = project.connection("redshift")  # named connection

conn.name, conn.id, conn.iam_role  # basic connection metadata
conn.physical_endpoints[0].host    # endpoint host
conn.data                          # all connection properties
conn.secret                        # credentials (dict or string)
conn.create_client()               # boto3 client with connection credentials
conn.create_client("glue")         # boto3 client for a specific service
```

### Catalogs, Databases, and Tables
```python
catalog = project.connection().catalog()              # default catalog
catalog = project.connection().catalog("catalog_id")  # specific catalog by ID
catalog.databases             # list databases
db = catalog.database("my_db")
db.tables                     # list tables
table = db.table("my_table")
table.columns                 # list columns (name, type)
```

### SQL Utilities
```python
from sagemaker_studio import sqlutils

# DuckDB (local, no connection needed)
result = sqlutils.sql("SELECT * FROM my_df WHERE id > 1")

# Athena
result = sqlutils.sql("SELECT * FROM orders", connection_name="project.athena")

# Redshift
result = sqlutils.sql("SELECT * FROM products", connection_name="project.redshift")

# Parameterized queries
result = sqlutils.sql(
    "SELECT * FROM orders WHERE status = :status",
    parameters={"status": "completed"},
    connection_name="project.redshift",
)

# Get a SQLAlchemy engine
engine = sqlutils.get_engine(connection_name="project.redshift")
```

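The `:name` placeholder style used by `sqlutils.sql` above is the common named-parameter convention in Python SQL layers. Its binding behavior can be illustrated with the standard library's `sqlite3` — a local analogue for demonstration only, not one of the project connections:

```python
import sqlite3

# In-memory table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "completed"), (2, "pending"), (3, "completed")],
)

# Named :status placeholder bound from a dict, mirroring the
# parameters={"status": "completed"} call shown above.
rows = conn.execute(
    "SELECT id FROM orders WHERE status = :status",
    {"status": "completed"},
).fetchall()
print(rows)  # [(1,), (3,)]
```

Binding parameters this way (rather than formatting them into the SQL string) avoids quoting bugs and SQL injection.
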
### DataFrame Utilities
```python
from sagemaker_studio import dataframeutils  # import makes the catalog I/O helpers below available on pandas
import pandas as pd

# Read from a catalog table
df = pd.read_catalog_table(database="my_db", table="my_table")

# Write to a catalog table
df.to_catalog_table(database="my_db", table="my_table")

# S3 Tables catalog
df = pd.read_catalog_table(
    database="my_db", table="my_table",
    catalog="s3tablescatalog/my_catalog",
)
```

### Spark Utilities
```python
from sagemaker_studio import sparkutils

# Initialize a Spark Connect session
spark = sparkutils.init()
spark = sparkutils.init(connection_name="my_spark_connection")

# Get Spark options for JDBC connections
options = sparkutils.get_spark_options("my_redshift_connection")
df = (
    spark.read.format("jdbc")
    .options(**options)
    .option("dbtable", "my_table")
    .load()
)
```

## Compute Options
- **Local Python**: Runs directly on the Space instance. Use for single-machine Python, ML, and AI workloads.
- **Apache Spark (AWS Glue / Amazon EMR)**: Use `%%pyspark`, `%%scalaspark`, or `%%sql` cell magics in notebooks. The default Spark connection is `project.spark.compatibility`.
- **SQL (Athena)**: Use `%%sql project.athena` for Trino SQL queries via Amazon Athena.
- **SQL (Redshift)**: Use `%%sql project.redshift` if a Redshift connection is available.

## Code Patterns
- Use `sagemaker_studio.Project()` for project-aware sessions and resource discovery.
- Reference data using S3 URIs in the `s3://bucket/prefix` format.
- Write Spark DataFrames to the project catalog: `df.write.saveAsTable(f"{database}.table_name", format="parquet", mode="overwrite")`.
- SQL query results are available as DataFrames in subsequent cells via the `_` variable.
- Use `sqlutils.sql()` for programmatic SQL execution against any connection.
- Use `pd.read_catalog_table()` / `df.to_catalog_table()` for pandas catalog I/O.

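S3 URIs in the `s3://bucket/prefix` format are easy to get subtly wrong (double slashes, missing scheme). A small helper — hypothetical and stdlib-only, not part of `sagemaker_studio` — makes the pattern concrete:

```python
def s3_uri(bucket: str, *prefix_parts: str) -> str:
    """Join a bucket and prefix segments into an s3://bucket/prefix URI.

    Hypothetical helper; strips stray slashes so segments join cleanly.
    """
    key = "/".join(p.strip("/") for p in prefix_parts if p)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"

print(s3_uri("my-bucket", "raw/", "/orders", "2024"))
# s3://my-bucket/raw/orders/2024
```

In practice the bucket would come from `project.s3.root` rather than a literal.
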
## MCP Server Configuration
- When configuring MCP servers, pass AWS credentials via environment variable expansion:
  `"AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"`
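
In context, that line belongs inside a server's `env` block. A minimal sketch of such a configuration — the server name and command are placeholders, not part of this workspace:

```json
{
  "mcpServers": {
    "my-aws-server": {
      "command": "my-mcp-server-command",
      "env": {
        "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"
      }
    }
  }
}
```

The `${...}` expansion lets the server process inherit the Space's container credentials without any keys appearing in the config file.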