PersonaCraft.AI is a comprehensive machine learning platform that crawls, processes, and transforms digital content from various sources into high-quality instruction and preference datasets for training AI models. The system creates personalized AI training data by analyzing an individual's digital footprint across multiple platforms.
- Multi-Platform Data Crawling: Automated extraction from LinkedIn profiles, Medium articles, GitHub repositories, and custom web articles
- Intelligent Data Processing: Advanced text cleaning, chunking, and embedding generation using sentence transformers
- Dataset Generation: Creates both instruction-following and preference datasets using OpenAI GPT models
- Model Fine-tuning: Supports both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) training
- MLOps Pipeline: Built with ZenML for reproducible machine learning workflows
- Vector Database Integration: Qdrant for efficient similarity search and retrieval
- Data Warehouse: MongoDB for structured data storage
- Cloud Training: AWS SageMaker integration for scalable model training
- Automated Publishing: Direct integration with Hugging Face Hub for dataset and model sharing
The system follows a modular, pipeline-based architecture (see the sketch after this list):
- Data Extraction Layer: Selenium-based crawlers for different platforms
- Data Processing Layer: Text cleaning, chunking, and embedding generation
- Feature Engineering: Vector database storage and retrieval augmented generation (RAG)
- Dataset Generation: LLM-powered creation of training datasets
- Model Training: SFT and DPO fine-tuning pipelines with cloud support
- Publishing Layer: Automated dataset and model validation and publishing
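As a loose illustration of this layered, pipeline-based design, here is a minimal ZenML sketch. The step names and bodies are placeholders for the real extraction and feature-engineering steps, not the project's actual pipeline code.

```python
# Toy ZenML pipeline mirroring the layered architecture above.
# Step names and logic are illustrative placeholders.
from zenml import pipeline, step


@step
def extract_documents(profile_url: str) -> list[str]:
    # Stand-in for the Selenium-based data extraction layer.
    return [f"raw document crawled from {profile_url}"]


@step
def embed_documents(documents: list[str]) -> int:
    # Stand-in for cleaning, chunking, and embedding into the vector store.
    return len(documents)


@pipeline
def toy_digital_data_etl(profile_url: str) -> None:
    documents = extract_documents(profile_url)
    embed_documents(documents)


if __name__ == "__main__":
    # Requires an initialized ZenML stack (e.g. after `poetry poe local-infrastructure-up`).
    toy_digital_data_etl(profile_url="https://example.com/profile")
```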
- Backend: Python 3.11, FastAPI
- ML Pipeline: ZenML, LangChain
- Data Storage: MongoDB, Qdrant Vector Database
- ML Models: OpenAI GPT-4, Sentence Transformers, Llama 3.1
- Training: TRL (Transformer Reinforcement Learning), Unsloth
- Web Scraping: Selenium, BeautifulSoup4
- Cloud Platform: AWS SageMaker
- Containerization: Docker, Docker Compose
- Package Management: Poetry
- Monitoring: Comet ML, Opik
Install dependencies:
- Partial installation (excluding AWS-related packages):
poetry install --without aws
- Full installation (all dependencies):
poetry install
Install Poe the Poet plugin (one-time per system):
poetry self add 'poethepoet[poetry_plugin]'
Test Poe with a sample task:
- Run the task:
poetry poe run-sample-hello
- Expected output:
hello poe is working
Poetry 2.0+ no longer provides the poetry shell command by default. You can activate the virtual environment manually:
source $(poetry env info --path)/bin/activate
Once inside the activated environment, you can run Poe tasks directly:
poe run-sample-hello
# Start MongoDB and Qdrant databases
poetry poe local-infrastructure-up
Set up your environment variables in a .env file or export them:
export OPENAI_API_KEY="your_openai_api_key"
export HUGGINGFACE_ACCESS_TOKEN="your_hf_token"
# Run ETL for specific person configurations
poetry poe run-digital-data-etl-person1
poetry poe run-digital-data-etl-person2
# Or run both
poetry poe run-digital-data-etl
# Run feature engineering
poetry poe run-feature-engineering-pipeline
# Generate instruction datasets for SFT
poetry poe run-generate-instruct-datasets-pipeline
# Generate preference datasets for DPO
poetry poe run-generate-preference-datasets-pipeline
# Run all data pipelines in sequence
poetry poe run-end-to-end-data-pipeline
# Train locally (requires GPU)
poetry run python -m tools.run --run-training
# Train on AWS SageMaker
python -m persona_craft_ai.model.finetuning.sagemaker
# Tear down local infrastructure when finished
poetry poe local-infrastructure-down
The project provides a comprehensive CLI through Poe the Poet tasks:
# Start all local infrastructure (MongoDB, Qdrant, ZenML)
poetry poe local-infrastructure-up
# Stop all local infrastructure
poetry poe local-infrastructure-down
# Start only Docker services
poetry poe local-docker-infrastructure-up
# Stop Docker services
poetry poe local-docker-infrastructure-down
# Run ETL for different personas
poetry poe run-digital-data-etl-person1
poetry poe run-digital-data-etl-person2
poetry poe run-digital-data-etl # Runs both
# Feature engineering
poetry poe run-feature-engineering-pipeline
# Dataset generation
poetry poe run-generate-instruct-datasets-pipeline
poetry poe run-generate-preference-datasets-pipeline
# Complete data pipeline
poetry poe run-end-to-end-data-pipeline
# Access the main CLI with all options
poetry run python -m tools.run --help
# Available flags:
# --run-etl # Run ETL pipeline
# --run-feature-engineering # Run feature engineering
# --run-generate-instruct-datasets # Generate instruction datasets
# --run-generate-preference-datasets # Generate preference datasets
# --run-end-to-end-data # Run complete data pipeline
# --run-training # Run model training
# --export-settings # Export settings to ZenML
# --no-cache # Disable pipeline caching
- Input: User profile links (LinkedIn, Medium, GitHub, custom articles)
- Process: Automated crawling and data extraction using Selenium
- Output: Raw documents stored in MongoDB
- Configuration: Configured via YAML files in the configs/ directory
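As a rough sketch of what a Selenium-based extraction step can look like (the browser options, helper name, and returned fields below are assumptions, not the project's actual crawler code):

```python
# Hedged sketch of a headless Selenium + BeautifulSoup crawler step.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def crawl_article(url: str) -> dict:
    """Fetch a page with a headless browser and return its title and plain text."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        return {
            "url": url,
            "title": soup.title.string if soup.title else "",
            "content": soup.get_text(separator="\n", strip=True),
        }
    finally:
        driver.quit()
```

The returned document would then be persisted to MongoDB by the ETL pipeline.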
- Input: Raw documents from data warehouse
- Process: Text cleaning, chunking, and embedding generation using sentence transformers
- Output: Vector embeddings stored in Qdrant for similarity search
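A minimal sketch of the chunk-embed-store flow against the local Qdrant instance; the collection name, embedding model, and payload fields are illustrative assumptions:

```python
# Hedged sketch: embed cleaned chunks and upsert them into Qdrant.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

chunks = ["First cleaned text chunk...", "Second cleaned text chunk..."]

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = encoder.encode(chunks)

client = QdrantClient(host="localhost", port=6333)  # matches QDRANT_DATABASE_HOST/PORT
client.recreate_collection(
    collection_name="persona_chunks",
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="persona_chunks",
    points=[
        PointStruct(id=i, vector=vector.tolist(), payload={"text": chunk})
        for i, (vector, chunk) in enumerate(zip(vectors, chunks))
    ],
)
```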
- Input: Processed documents and embeddings
- Process: LLM-powered generation of instruction/preference pairs using RAG
- Output: Formatted datasets ready for model training
- Types: Instruction datasets (SFT) and preference datasets (DPO)
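To make the generation step concrete, here is a hedged sketch of turning a retrieved chunk into one instruction/output pair with the OpenAI API; the prompt wording and helper name are illustrative, not the project's exact implementation:

```python
# Hedged sketch of LLM-powered instruction-pair generation from RAG context.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_instruction_pair(context_chunk: str) -> dict:
    """Ask the model to turn a retrieved document chunk into a training pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Write one JSON object with 'instruction' and 'output' keys based on the user's text.",
            },
            {"role": "user", "content": context_chunk},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```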
- Input: Generated datasets from Hugging Face
- Process: Fine-tuning using SFT and/or DPO methods
- Models: Based on Llama 3.1-8B architecture
- Output: Trained models published to Hugging Face Hub
- Deployment: Supports local training and AWS SageMaker
PersonaCraft.AI supports two main training paradigms:
- Base Model: Llama 3.1-8B
- Method: Instruction-following fine-tuning using LoRA
- Dataset: Generated instruction-output pairs from digital content
- Output: PersonaCraftAILlama-3.1-8B model
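A hedged SFT sketch with TRL and PEFT; the dataset repository, prompt template, and hyperparameters are placeholders rather than the project's actual training configuration:

```python
# Minimal LoRA SFT sketch (assumes a recent TRL version and a GPU).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction dataset with "instruction"/"output" columns.
dataset = load_dataset("your-hf-username/personacraft-instruct", split="train")
dataset = dataset.map(
    lambda ex: {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"}
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # assumed base checkpoint
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output", per_device_train_batch_size=2),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```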
- Base Model: SFT-trained PersonaCraft model
- Method: Preference-based alignment training
- Dataset: Preference pairs with chosen/rejected responses
- Output: PersonaCraftAILlama-3.1-8B-DPO model
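And a matching DPO sketch, again with placeholder repository names and hyperparameters:

```python
# Minimal DPO sketch on top of the SFT checkpoint (assumes a recent TRL version).
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Hypothetical preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("your-hf-username/personacraft-preference", split="train")

trainer = DPOTrainer(
    model="your-hf-username/PersonaCraftAILlama-3.1-8B",  # assumed SFT output
    args=DPOConfig(output_dir="dpo-output", beta=0.1),
    train_dataset=dataset,
)
trainer.train()
```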
# SFT training
poetry run python -m tools.run --run-training
# Configure training parameters in configs/training.yaml
# Cloud training with GPU instances
python -m persona_craft_ai.model.finetuning.sagemaker
# Supports ml.g5.2xlarge instances with automatic scaling
- Efficient Training: Uses Unsloth for 2x faster training with reduced memory usage
- Monitoring: Integrated with Comet ML for experiment tracking
- Automatic Publishing: Models are automatically pushed to Hugging Face Hub
- Flexible Configuration: Support for different model sizes and training parameters
- Memory Optimization: 4-bit quantization and LoRA for efficient training
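As a rough sketch of how the 4-bit plus LoRA setup looks with Unsloth (the checkpoint name and LoRA hyperparameters are assumptions, not the project's exact settings):

```python
# Hedged Unsloth sketch: 4-bit base model with LoRA adapters (requires a GPU).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # assumed 4-bit-friendly checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# `model` can now be handed to an SFT/DPO trainer as in the sketches above.
```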
- Personal AI Assistants: Train models that understand and mimic specific writing styles and expertise
- Content Generation: Create AI models specialized in particular domains or professional areas
- Educational AI: Develop tutoring systems based on expert knowledge and teaching styles
- Research: Generate synthetic datasets for training domain-specific language models
- Corporate Training: Create company-specific AI assistants based on internal expertise and documentation
- Professional Development: Build AI coaches that understand specific career paths and skills
# OpenAI API (required for dataset generation)
OPENAI_API_KEY=your_openai_api_key
OPENAI_MODEL_ID=gpt-4o-mini
# Hugging Face (for dataset and model publishing)
HUGGINGFACE_ACCESS_TOKEN=your_hf_token
# Database connections
DATABASE_HOST=mongodb://persona_craft_ai:persona_craft_ai@127.0.0.1:27017
QDRANT_DATABASE_HOST=localhost
QDRANT_DATABASE_PORT=6333
# AWS SageMaker (optional, for cloud training)
AWS_ARN_ROLE=your_aws_sagemaker_role
AWS_REGION=us-east-1
# Monitoring (optional)
COMET_API_KEY=your_comet_api_key
COMET_PROJECT=your_project_name
Pipeline configurations are managed through YAML files in the configs/ directory:
- digital_data_person_1.yaml - ETL configuration for the first persona
- feature_engineering.yaml - Feature engineering pipeline settings
- generate_instruct_datasets.yaml - Instruction dataset generation
- generate_preference_datasets.yaml - Preference dataset generation
- training.yaml - Model training configuration
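The exact schema of these files is defined in the repository; as a loose illustration, a config could be loaded like this (the keys printed are whatever the real YAML defines):

```python
# Hedged sketch of reading one pipeline config; consult configs/ for the real schema.
import yaml

with open("configs/digital_data_person_1.yaml") as fh:
    config = yaml.safe_load(fh)

for key, value in config.items():
    print(key, value)
```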
Export your settings to ZenML:
poetry run python -m tools.run --export-settings
persona_craft_ai/
├── application/              # Core application logic
│   ├── crawlers/             # Platform-specific data extractors (LinkedIn, Medium, GitHub)
│   ├── dataset/              # Dataset generation and processing
│   ├── networks/             # ML model interfaces and embeddings
│   └── preprocessing/        # Data cleaning and transformation
├── domain/                   # Business logic and data models
├── infrastructure/           # Database and external service integrations
│   └── db/                   # MongoDB and Qdrant clients
└── model/                    # Model training and fine-tuning
    └── finetuning/           # SFT and DPO training scripts
pipelines/                    # ZenML pipeline definitions
├── digital_data_etl.py       # Data extraction pipeline
├── feature_engineering.py    # Data processing pipeline
├── generate_datasets.py      # Dataset generation pipeline
├── training.py               # Model training pipeline
└── end_to_end_data.py        # Complete data pipeline
steps/                        # Individual pipeline steps
├── etl/                      # Data extraction steps
├── feature_engineering/      # Data processing steps
├── generate_datasets/        # Dataset generation steps
└── training/                 # Model training steps
tools/                        # CLI and utility scripts
configs/                      # Pipeline configuration files
Perfect for supervised fine-tuning of language models:
{
"instruction": "Explain the concept of vector embeddings in machine learning",
"output": "Vector embeddings are dense numerical representations that capture semantic meaning..."
}
Ideal for Direct Preference Optimization and RLHF:
{
"prompt": "Describe best practices for API design",
"chosen": "High-quality extracted response from expert content showcasing detailed API design principles...",
"rejected": "Generated alternative response with less comprehensive or lower-quality information"
}
Both dataset types are automatically published to Hugging Face Hub (a loading sketch follows the list below) and can be used for:
- SFT Training: Teaching models to follow instructions and generate appropriate responses
- DPO Training: Aligning models with human preferences and improving response quality
- Model Evaluation: Benchmarking model performance on domain-specific tasks
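For example, once published, the datasets can be pulled back from the Hub with the datasets library; the repository names below are placeholders:

```python
# Hedged sketch of loading the published datasets for training or evaluation.
from datasets import load_dataset

instruct_ds = load_dataset("your-hf-username/personacraft-instruct", split="train")
preference_ds = load_dataset("your-hf-username/personacraft-preference", split="train")

print(instruct_ds[0]["instruction"])
print(preference_ds[0]["chosen"])
```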
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support, please open an issue in the GitHub repository.
# 1. Setup infrastructure
poetry install --without aws
poetry poe local-infrastructure-up
# 2. Configure environment
export OPENAI_API_KEY="your_key"
export HUGGINGFACE_ACCESS_TOKEN="your_token"
# 3. Run complete data pipeline
poetry poe run-end-to-end-data-pipeline
# 4. Generate both dataset types
poetry poe run-generate-instruct-datasets-pipeline
poetry poe run-generate-preference-datasets-pipeline
# 5. Train your model
poetry run python -m tools.run --run-training
# 1. Create your own config file in configs/
# 2. Configure ETL for your specific persona
poetry run python -m tools.run --run-etl --etl-config-filename your_config.yaml
# 3. Process the data
poetry poe run-feature-engineering-pipeline
# 1. Configure AWS credentials and SageMaker role
export AWS_ARN_ROLE="your_sagemaker_role"
# 2. Run training on SageMaker
python -m persona_craft_ai.model.finetuning.sagemaker