This repository provides a Flask-based server for running Large Language Models with integrated RAG (Retrieval-Augmented Generation) capabilities. The main API endpoint seamlessly integrates ads into LLM outputs. This server combines:
- DeepSeek V3 API for text generation and inference. NOT IMPLEMENTED YET.
- Mistral 7B for advertisement integration.
- Stella-en-1.5B for high-quality embeddings
- SQLite RAG database for semantic product search

Create and activate the conda environment:

```bash
conda env create -f environment.yml
conda activate llm-server
```

The server uses `llama-cpp-python` to serve models in the GGUF format. Building it from source may be necessary to enable CUDA support:

```bash
export CMAKE_ARGS="-DGGML_CUDA=on"
export FORCE_CMAKE=1
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

This will take a significant amount of time, as it builds the library from source.
¹ Note: If you encounter library loading errors during compilation, you may need to set the library path:

```bash
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
```
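Once the models are downloaded (next step), you can verify that the CUDA build actually offloads layers to the GPU with a quick standalone check. This is a sketch, not part of the repository; the GGUF path matches the one used in `config.ini` below:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer; the startup log should mention CUDA/cuBLAS.
llm = Llama(
    model_path="models/mistral-7b-instruct-v0.3/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
)
out = llm("Q: What is a VPN? A:", max_tokens=32)
print(out["choices"][0]["text"])
```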
The fastest way to download the models is to use `huggingface-hub`:

```bash
pip install huggingface-hub
huggingface-cli login
```

Run the following script to download the models:

```bash
cd models
source download.sh
```

Run the following script to download the datasets:

```bash
cd datasets
python download.py
```

Create `config.ini` and save it in the `Ads/` directory:

```ini
[server]
hostname = 0.0.0.0
port = 8888
[model]
path = models/mistral-7b-instruct-v0.3/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf
max_tokens = 8096
gpu_device = 0
[embedding]
path = models/stella-en-1.5B
gpu_device = 0
[data]
csv_path = datasets/products.csv
[prompts]
qa_template_path = prompts/minimal_qa_template.txt
ad_with_url_insertion_template_path = prompts/ad_with_url_template.txt
ad_without_url_insertion_template_path = prompts/ad_without_url_template.txt
```
Create `.env` and save it in the `Ads/` directory:

```
DEEPSEEK_API_KEY=your_actual_api_key_here
```
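`src/server.py` presumably reads these files at startup; a minimal sketch of how they can be loaded with `configparser` and `python-dotenv` (illustrative only, not the repository's actual code):

```python
import configparser
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read config.ini from the Ads/ directory.
config = configparser.ConfigParser()
config.read("config.ini")
model_path = config["model"]["path"]
port = config.getint("server", "port")

# Load DEEPSEEK_API_KEY from .env into the process environment.
load_dotenv()
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")
```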
An example with common YouTube advertisements can be found here. The CSV should be tab-separated with these columns:

| Name | Category | Description |
|---|---|---|
| NordVPN | VPN/Privacy | Established in 2012, NordVPN is renowned for its robust security features... |
| Surfshark | VPN/Privacy | Launched in 2018, Surfshark offers features like CleanWeb... |
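
Despite the `.csv` extension the file is tab-separated, so a quick sanity check needs a tab delimiter (a sketch; column names as in the table above):

```python
import csv

# products.csv is tab-separated; DictReader picks up the header row.
with open("datasets/products.csv", newline="", encoding="utf-8") as f:
    products = list(csv.DictReader(f, delimiter="\t"))

print(len(products), "products loaded")
print(products[0]["Name"], "-", products[0]["Category"])
```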
Start the server:

```bash
python src/server.py
```

The server will automatically:
- Load both AI models
- Initialize the RAG database
- Import products from CSV (skips duplicates)
- Start the web server

Available endpoints:

- `POST /infer_local` - Generate text locally with Mistral 7B.
- `POST /insert_native_ads` - Insert native advertisements with Mistral 7B.
- `GET /products` - List all products
- `POST /products` - Add new product
- `DELETE /products/<id>` - Delete product
- `POST /search` - Semantic search
- `GET /health` - Server health check
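
As a quick smoke test once the server is up, the endpoints can be called from Python. The JSON field names used here (`prompt`, `query`, `top_k`) are illustrative assumptions; `src/server_demo.py` and the Swagger docs show the actual request format:

```python
import requests

BASE = "http://localhost:8888"

# Health check
print(requests.get(f"{BASE}/health").json())

# Local text generation (the "prompt" field name is an assumption)
resp = requests.post(f"{BASE}/infer_local", json={"prompt": "Suggest a way to stay private online."})
print(resp.json())

# Semantic product search (the "query"/"top_k" field names are assumptions)
resp = requests.post(f"{BASE}/search", json={"query": "VPN for streaming", "top_k": 3})
print(resp.json())
```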
Run the demo script:

```bash
python src/server_demo.py
```

This tests all functionality including:
- Text generation
- RAG search with various queries
- Ad integration
Open an SSH tunnel:

```bash
ssh -L 8888:localhost:8888 user@server
```

Then open http://localhost:8888/ in your browser.
Interactive Swagger docs available at: http://localhost:8888/apidocs/
Models used:

- Mistral 7B: Quantized GGUF format for efficient inference
- Stella-en-1.5B: 1536-dimensional embeddings optimized for retrieval
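
If the Stella checkpoint loads with `sentence-transformers` (an assumption; the actual loading code lives in `src/models.py` and `src/rag.py`), a product description can be embedded and serialized for the `embedding` BLOB column in the schema below roughly like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Path from the [embedding] section of config.ini.
embedder = SentenceTransformer("models/stella-en-1.5B", trust_remote_code=True)

# Encode a product description and serialize it for SQLite's BLOB column.
vec = embedder.encode("NordVPN - VPN/Privacy - robust security features").astype(np.float32)
blob = vec.tobytes()

# Round-trip when reading rows back from the database.
restored = np.frombuffer(blob, dtype=np.float32)
```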
The RAG database uses a single `products` table:

```sql
CREATE TABLE products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL UNIQUE,
category TEXT NOT NULL,
description TEXT NOT NULL,
embedding BLOB NOT NULL
);
```
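Semantic search can then rank products by cosine similarity between a query embedding and the stored vectors; a simplified sketch (the actual implementation lives in `src/rag.py`):

```python
import sqlite3

import numpy as np

def search_products(db_path: str, query_vec: np.ndarray, top_k: int = 3):
    """Rank products by cosine similarity between a query embedding and stored embeddings."""
    q = query_vec.astype(np.float32)
    q = q / np.linalg.norm(q)
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT id, name, embedding FROM products").fetchall()
    scored = []
    for pid, name, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)
        scored.append((float(q @ (v / np.linalg.norm(v))), pid, name))
    return sorted(scored, reverse=True)[:top_k]
```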
Project structure:

```
src/
├── server.py # Server setup and initialization
├── server_demo.py # Test all the APIs
├── models.py # Model loading and management
├── rag.py # RAG functionality (embeddings, database, search)
├── endpoints.py # All API endpoints
├── prompts/
├── datasets/ # Download and add advertisements to wildchat dataset
└── config.ini
```
The repository provides two datasets:

- WildChat advertisement preference dataset: An annotated dataset containing prompt/response pairs from the WildChat dataset with advertisements inserted. Each prompt is paired with two response variations and human preference annotations for DPO fine-tuning.
- Corporate advertiser dataset: A dataset of corporations and products commonly appearing in advertisements, especially those appearing in YouTube videos. Includes a description, URL, and category for each.
To measure the impact of advertisement insertion on factual consistency, we use the TRUE benchmark as a proxy. The benchmark covers a range of text generation tasks and associated datasets and measures model consistency on each task. While this measures consistency only indirectly, to our knowledge there are no direct evaluation strategies for this metric.
