A website for browsing and downloading KGX (Knowledge Graph Exchange) files from S3. No AWS login needed.
Live site: https://kgx-storage.rtx.ai
KGX Storage is a small web app. It lets anyone browse and download KGX files from the S3 bucket called translator-ingests. The data is built by a different repo (translator-ingests). This repo is only the website.
What the site does:
- Browse folders and files like a file manager. URLs look like `/releases/alliance/latest/`.
- Download files. The server creates temporary download links that last 1 hour.
- View JSON files in the browser with formatting.
- A docs page with copy-paste commands for curl, wget, and AWS CLI.
How it works: Your browser talks to Nginx over HTTPS. Nginx forwards to the Flask app. The Flask app talks to S3 with boto3. Folder sizes and file counts are precomputed into a file called metrics.json so pages load fast. You can run a cron job to refresh that file when new data is added.
Important paths:
- App lives at `/home/ubuntu/kgx-storage-webserver/`.
- Systemd service name is `kgx-storage-webserver`. Its config file is in the repo and gets copied to `/etc/systemd/system/`.
- Logs: `/var/log/kgx-storage/` (access.log and error.log).
- Nginx config: copy `nginx-config` from the repo to `/etc/nginx/sites-available/kgx-storage`.
- The metrics cache is `metrics.json` in the app folder. The script `compute_metrics.py` creates it. The script `update_metrics.sh` can run in cron to refresh it and tell Gunicorn to reload.
Setup in short: Clone the repo. Make a Python venv and install from requirements.txt. Run sudo ./setup-webserver-service.sh. Set up Nginx using the repo’s nginx-config. Run certbot for kgx-storage.rtx.ai. Run compute_metrics.py once so the site has folder stats. Optionally add a cron job for update_metrics.sh. The EC2 instance needs an IAM role that can read from S3, and the domain must point to the instance’s IP.
This server gives people HTTP access to KGX files from the NCATS Biomedical Data Translator project. Anyone can browse and download the knowledge graph data stored in Amazon S3.
The pipeline that creates the data is in the translator-ingests repo. This repo is just the web interface. That way the data pipeline and the website can be updated separately.
Data pipeline code: https://github.com/NCATSTranslator/translator-ingests/tree/kgx_storage/src/translator_ingest/util/storage
- Browse the bucket. You see folders and files in a list. You click to go deeper. URLs use paths like `/releases/alliance/latest/`. Old-style URLs with `?path=...` still work; they redirect to the new path-style URLs.
- Download files. The server creates temporary S3 links so you can download without AWS credentials. Links expire after 1 hour.
- View JSON in the browser. Open a JSON file and it shows formatted with syntax highlighting. You can download it from there too.
- Docs page. Lists commands for downloading with curl, wget, and AWS CLI. Includes common paths and how to extract .tar.zst archives.
- HTTPS. SSL is handled by Let's Encrypt. Certificates renew automatically.
- No login. The site is read-only and public. Anyone can use it.
Traffic flow:
- User hits the site over HTTPS (port 443).
- Nginx receives it, does SSL, and forwards to the Flask app on localhost port 5000.
- The Flask app (run by Gunicorn) handles the request and calls S3 with boto3 when it needs to list or get files.
- S3 holds the actual KGX files in the bucket `translator-ingests`.
Nginx: Handles HTTPS and passes requests to Flask. Good at connections and SSL.
Flask and Gunicorn: Flask has the routes and S3 logic. Gunicorn runs multiple workers so the app can handle several requests at once. Fixed routes are /docs/ and /public/. A catch-all route handles everything else: paths like /releases/alliance/latest/ list a folder; paths like /releases/alliance/latest/graph-metadata.json return the file (JSON as body or download for other types). Adding ?view to a JSON file URL shows the HTML viewer. Legacy /view/ and /download/ URLs are no longer routed and return 404.
S3: The bucket translator-ingests stores the files. The app uses presigned URLs for downloads so users never need AWS keys.
IAM: The EC2 instance has an IAM role. The app gets credentials from the instance metadata service. No keys are stored in the code.
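The app generates presigned links with boto3 under the instance role. If you have your own read credentials, the AWS CLI can produce an equivalent temporary link; this is just an illustration of the mechanism (the object key comes from the example paths later in this README), not something the site requires:

```
# Create a presigned URL valid for 1 hour (3600 seconds).
# Requires credentials that can read the bucket.
aws s3 presign s3://translator-ingests/releases/alliance/latest/alliance.tar.zst --expires-in 3600
```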
You need:
- An EC2 instance (Ubuntu or Debian). Something like t3.medium (2 vCPU, 4 GB RAM) is enough. It needs systemd.
- An IAM role on that instance with permission to read from the bucket: `s3:GetObject` and `s3:ListBucket` on `translator-ingests`.
- An Elastic IP so the instance has a fixed public IP.
- The domain kgx-storage.rtx.ai pointing at that IP (DNS A record).
- Security group open for: 22 (SSH), 80 (HTTP, for certbot and redirect), 443 (HTTPS).
Software on the server:
- Python 3.12.3 (see .python-version). Same version everywhere keeps things predictable.
- Nginx (reverse proxy and SSL).
- Certbot (gets and renews Let's Encrypt certificates).
- Python packages from requirements.txt (versions are pinned).
cd /home/ubuntu
git clone https://github.com/RTXteam/kgx-storage.git kgx-storage-webserver
cd kgx-storage-webserver
This puts the code in /home/ubuntu/kgx-storage-webserver. The systemd service expects that path.
sudo apt update
sudo apt install -y nginx certbot python3-certbot-nginx python3.12 python3.12-venv python3-pip
You get Nginx, Certbot, Python 3.12, venv, and pip.
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
This keeps the app’s dependencies separate from the system. All installs go into .venv.
cd /home/ubuntu/kgx-storage-webserver
sudo ./setup-webserver-service.sh
sudo systemctl status kgx-storage-webserver
The script copies the systemd unit file and starts the service. Check status to make sure it’s running.
sudo cp nginx-config /etc/nginx/sites-available/kgx-storage
sudo ln -sf /etc/nginx/sites-available/kgx-storage /etc/nginx/sites-enabled/
sudo rm -f /etc/nginx/sites-enabled/default
sudo nginx -t
sudo systemctl restart nginx
sudo systemctl enable nginx
Nginx now proxies to the Flask app. The default site is removed so port 80 is free. nginx -t checks the config before you restart.
sudo certbot --nginx -d kgx-storage.rtx.ai
sudo certbot renew --dry-run
Certbot gets a certificate and configures Nginx. The dry run checks that renewal will work later.
In the AWS console, open these ports:
| Type | Port | Source | Why |
|---|---|---|---|
| SSH | 22 | Your IP | So you can log in |
| HTTP | 80 | 0.0.0.0/0 | Certbot and redirect to HTTPS |
| HTTPS | 443 | 0.0.0.0/0 | The actual site |
Restrict SSH to your IP if you can.
cd /home/ubuntu/kgx-storage-webserver
source .venv/bin/activate
python compute_metrics.py
This builds metrics.json with folder sizes and file counts. The web app reads this so it doesn’t have to ask S3 for every folder. Run it once after setup. You can also run update_metrics.sh from cron (for example every hour) to refresh it.
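One way to schedule that refresh, assuming cron is available; the hourly schedule and the log destination are examples, so adjust both (and make sure the user running cron can write to the log path):

```
# Append an hourly cron entry for update_metrics.sh (schedule and log path are examples).
( crontab -l 2>/dev/null; echo '0 * * * * /home/ubuntu/kgx-storage-webserver/update_metrics.sh >> /var/log/kgx-storage/update_metrics.log 2>&1' ) | crontab -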
curl -I https://kgx-storage.rtx.ai
You want a 200 or 302. That means Nginx, SSL, and the app are all working.
What’s in the repo:
- `web_server.py` – Flask app. Routes, S3 calls, and the HTML for the browser, JSON viewer, and docs live here.
- `compute_metrics.py` – Script that scans the bucket and writes folder stats to `metrics.json`.
- `update_metrics.sh` – Runs compute_metrics.py and sends HUP to Gunicorn so workers reload. Use in cron.
- `metrics.json` – Created by compute_metrics.py. Not in git. Makes folder listing fast.
- `requirements.txt` – Python dependencies (pinned versions).
- `.python-version` – Says to use Python 3.12.3.
- `kgx-storage-webserver.service` – Systemd unit. Installed by setup-webserver-service.sh.
- `setup-webserver-service.sh` – Installs the service and starts it.
- `nginx-config` – Copy this to Nginx’s sites-available.
- `.gitignore` – Tells git what not to track.
- `public/` – Static files (e.g. ncats-banner.png, favicon.png). Served by the app.
- `README.md` – This file.
The app runs as a systemd service named kgx-storage-webserver.
Check status:
sudo systemctl status kgx-storage-webserver
Stop, start, or restart:
sudo systemctl stop kgx-storage-webserver
sudo systemctl start kgx-storage-webserver
sudo systemctl restart kgx-storage-webserver
Watch logs live:
sudo journalctl -u kgx-storage-webserver -f
Last 100 lines:
sudo journalctl -u kgx-storage-webserver -n 100
Enable or disable start at boot:
sudo systemctl enable kgx-storage-webserver
sudo systemctl disable kgx-storage-webserver
Nginx: After editing config, test then reload:
sudo nginx -t
sudo systemctl reload nginx
Certificates: They renew automatically. To check or force renewal:
sudo certbot certificates
sudo certbot renew
sudo certbot renew --dry-run
Logs:
- App: `/var/log/kgx-storage/access.log` and `error.log`
- Service: `sudo journalctl -u kgx-storage-webserver`
- Nginx: `/var/log/nginx/access.log` and `error.log`
cd /home/ubuntu/kgx-storage-webserver
git pull
sudo systemctl restart kgx-storage-webserver
sudo systemctl status kgx-storage-webserver
Python changes need a restart. Nginx config changes need sudo nginx -t then sudo systemctl reload nginx.
Service won’t start
Look at the logs:
sudo journalctl -u kgx-storage-webserver -n 50
See if something else is on port 5000:
sudo ss -tulpn | grep 5000
Check that Python can import the app’s dependencies:
cd /home/ubuntu/kgx-storage-webserver
source .venv/bin/activate
python -c "import flask, boto3, gunicorn"502 Bad Gateway
Nginx can’t reach the app. Usually the app isn’t running. Run sudo systemctl status kgx-storage-webserver and sudo journalctl -u kgx-storage-webserver -n 50 to see why.
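To tell whether the problem is Nginx or the app, you can also query the app directly from a shell on the instance; per the architecture above it listens on localhost port 5000:

```
# If this responds, the Flask app is up and the problem is on the Nginx side;
# if it fails, go back to the service logs above.
curl -sI http://127.0.0.1:5000/ | head -n 1
```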
SSL problems
Check certs: sudo certbot certificates. Test Nginx: sudo nginx -t. Look at the Nginx site config if something is wrong.
JSON opens as download instead of viewer
Restart the app so it picks up the routes: sudo systemctl restart kgx-storage-webserver.
S3 Access Denied
The instance probably doesn’t have an IAM role or the role can’t read the bucket. Check:
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
If that returns a role name, the role is attached. Then in AWS make sure that role has s3:GetObject and s3:ListBucket on the translator-ingests bucket.
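On instances that require IMDSv2 (token-based metadata access), the plain curl above returns an error. In that case request a session token first; these are the standard IMDSv2 calls, nothing specific to this app:

```
# IMDSv2: fetch a short-lived token, then use it to read the attached role name.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/
```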
The EC2 instance role needs this policy (or equivalent):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::translator-ingests",
"arn:aws:s3:::translator-ingests/*"
]
}
]
}
- `s3:GetObject` on the bucket’s objects: read files (for downloads and JSON viewer).
- `s3:ListBucket` on the bucket: list prefixes so we can show folders.
No write actions. Credentials come from the instance metadata, not from the code.
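A quick way to confirm the role works is to exercise both permissions from the instance with the AWS CLI (this assumes the AWS CLI is installed there; the app itself does not need it):

```
# ListBucket: listing a prefix should succeed.
aws s3 ls s3://translator-ingests/releases/
# GetObject: reading a small object should succeed (path from the examples below).
aws s3 cp s3://translator-ingests/releases/alliance/latest/graph-metadata.json /tmp/graph-metadata.json
```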
To run the app locally without Nginx or systemd:
cd /home/ubuntu/kgx-storage-webserver
source .venv/bin/activate
python web_server.py
Then open http://localhost:5000. This is single-threaded and not for production. Use it to test changes.
Site: https://kgx-storage.rtx.ai
- Home: https://kgx-storage.rtx.ai
- Folders: https://kgx-storage.rtx.ai/releases/alliance/latest/ (and similar paths; trailing slash lists the folder)
- File (canonical): https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json — returns the file (JSON as response body, other types trigger download)
- JSON viewer: same path with `?view`, e.g. https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json?view — shows the HTML viewer
- Docs: https://kgx-storage.rtx.ai/docs
Old links with ?path=... redirect to the path-style URL. Legacy /view/ and /download/ URLs are no longer supported (404).
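You can check the different URL forms with curl; the comments restate the behavior described above, and the last path is just an arbitrary legacy-style example:

```
curl -sI "https://kgx-storage.rtx.ai/releases/alliance/latest/"                          # folder listing
curl -sI "https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json"       # file (JSON body)
curl -sI "https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json?view"  # HTML viewer
curl -sI "https://kgx-storage.rtx.ai/download/anything"                                  # legacy route: 404
```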
Downloading without AWS
Use the canonical file URL with curl or wget. Always use the -fL flags with curl to ensure reliable downloads:
curl -fL -O "https://kgx-storage.rtx.ai/releases/alliance/latest/alliance.tar.zst"The flags:
-Lfollows HTTP redirects (required for the server's routing)-ffails on HTTP errors (prevents saving error pages as files)-Osaves with the remote filename
Without -L, curl saves redirect responses instead of the actual file. Without -f, curl silently saves HTTP error pages (404, 500, etc.) as if they were valid data, which can corrupt your analysis pipeline.
For wget, redirects are followed by default and errors return non-zero exit codes, so the basic command is sufficient:
wget "https://kgx-storage.rtx.ai/releases/alliance/latest/alliance.tar.zst"Examples and more commands are on the /docs page.
Downloading with AWS CLI
If you have credentials that can read the bucket, you can use aws s3 cp and aws s3 sync on s3://translator-ingests/. See the docs page for paths and examples.
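For example, with read credentials configured (the paths are the example paths listed below):

```
# Copy a single file:
aws s3 cp s3://translator-ingests/releases/alliance/latest/alliance.tar.zst .
# Mirror a whole release folder:
aws s3 sync s3://translator-ingests/releases/alliance/latest/ ./alliance-latest/
```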
Example paths in the bucket
- releases/alliance/latest/alliance.tar.zst
- releases/alliance/latest/graph-metadata.json
- releases/reactome/latest/ (and similar)
Metadata and external consistency
Use the canonical URL format for any reference to kgx-storage files: path only (e.g. https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json), with optional ?view for the JSON viewer. Metadata files (e.g. graph-metadata.json), DAWG, or other systems that publish or consume kgx-storage URLs should use this format so "URL in metadata" matches "URL in app" and links work the same everywhere. For version checks (e.g. "is there a new release?"), using metadata (e.g. release_version in latest-release.json or url/id in graph-metadata.json) is more reliable than relying only on the Last-Modified HTTP header.
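As a concrete illustration of that advice, the commands below compare the Last-Modified header with the contents of graph-metadata.json; which metadata fields you key on depends on what the file actually contains:

```
# Less reliable: the HTTP Last-Modified header (if the server sends one).
curl -sI "https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json" | grep -i '^last-modified'
# Preferred: fetch the metadata itself and inspect fields such as id/url/version.
curl -fsL "https://kgx-storage.rtx.ai/releases/alliance/latest/graph-metadata.json" | python3 -m json.tool | head -n 20
```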
Edge cases
- Path that is neither an S3 object nor a prefix (e.g. typo): 404, same HTML response as reserved paths like `/view` or `/download`.
- Directory path without trailing slash (e.g. `/releases/alliance/latest`): redirect to the same path with trailing slash so the folder listing is shown.
- Query parameters: only `?view` is significant for the JSON viewer; other params (e.g. `?foo=bar`) are ignored. Redirects use the canonical path with no query string.
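You can observe the trailing-slash redirect directly; the Location header shows the canonical target:

```
# Directory path without trailing slash: expect a redirect to the slash-terminated listing.
curl -sI "https://kgx-storage.rtx.ai/releases/alliance/latest" | grep -i '^location'
```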
Translator-ingests (pipeline that writes the data): https://github.com/NCATSTranslator/translator-ingests/tree/kgx_storage
- HTTPS with Let's Encrypt. Traffic is encrypted.
- No AWS keys in the app. The instance role is used via metadata.
- Download links are presigned and expire in 1 hour.
- The Flask app listens only on localhost. Only Nginx talks to it from the outside.
- S3 access is read-only. Nobody can change or delete data through this app.
- There is no rate limiting. If you need it, you can add it in Nginx.
Part of the NCATS Biomedical Data Translator project.