Empowering Recommender Systems based on Large Language Models through Knowledge Injection Techniques
- Abstract
- Datasets Information
- Apriori Algorithm Parameters
- LoRA Hyperparameters
- Repository Structure
- Data Preprocessing
- Large Language Model (LLM) Training and Inference
- Results Parsing
- Metrics Calculation
Recommender systems (RSs) have become increasingly versatile, finding applications across diverse domains. Large Language Models (LLMs) significantly contribute to this advancement, since the vast amount of knowledge embedded in these models can be easily exploited to provide users with high-quality recommendations. However, current RSs based on LLMs still have room for improvement. For example, knowledge injection techniques can be used to fine-tune LLMs by incorporating additional data, thus improving their performance on downstream tasks. In a recommendation setting, these techniques can be exploited to incorporate further knowledge, resulting in a more accurate representation of the items. Accordingly, in this paper, we propose a knowledge injection pipeline specifically designed for RSs. First, we incorporate external knowledge by drawing on three sources: (a) knowledge graphs; (b) textual descriptions; (c) collaborative information about user interactions. Next, we lexicalize the knowledge, and we instruct and fine-tune an LLM, which can then be easily prompted to return a list of recommendations. Extensive experiments on movie, music, and book datasets validate our approach. Moreover, the experiments show that knowledge injection is particularly needed in domains (i.e., music and books) that are likely to be less covered by the data used to pre-train LLMs, thus paving the way for several future research directions.
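As a purely illustrative example of the lexicalization step (the item, triples, and wording below are hypothetical, not the actual templates used in this work), knowledge from the three sources can be turned into plain text for instruction-tuning:

```python
# Hypothetical sketch of lexicalizing knowledge for one item; the real prompt
# templates and data files are produced by the notebooks in DataPreprocessing/.
def lexicalize_item(title, triples, description, co_interacted):
    # Knowledge graph: turn (subject, predicate, object) triples into sentences.
    graph_text = " ".join(f"{title} {predicate} {obj}." for _, predicate, obj in triples)
    # Collaborative information: items frequently co-interacted with this one.
    collab_text = f"Users who liked {title} also liked: " + ", ".join(co_interacted) + "."
    # The textual description is kept as-is.
    return f"{graph_text} {description} {collab_text}"

print(lexicalize_item(
    title="The Matrix",
    triples=[("The Matrix", "has genre", "Science Fiction"),
             ("The Matrix", "was directed by", "The Wachowskis")],
    description="A hacker discovers that reality is a simulation.",
    co_interacted=["Blade Runner", "Inception"],
))
```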
The following datasets are used in this project:
| Dataset | Users | Items | Ratings | Sparsity |
|---|---|---|---|---|
| Last.FM | 1,881 | 2,828 | 71,426 | 98.66% |
| DBbook | 5,660 | 6,698 | 129,513 | 99.66% |
| MovieLens 1M | 6,036 | 3,081 | 946,120 | 94.91% |
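Sparsity is the fraction of the user-item matrix that has no rating; the reported percentages can be reproduced directly from the table:

```python
# Sparsity = 1 - |ratings| / (|users| * |items|)
datasets = {
    "Last.FM":      (1881, 2828, 71426),
    "DBbook":       (5660, 6698, 129513),
    "MovieLens 1M": (6036, 3081, 946120),
}
for name, (users, items, ratings) in datasets.items():
    sparsity = 1 - ratings / (users * items)
    print(f"{name}: {sparsity:.2%}")  # 98.66%, 99.66%, 94.91%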
The Apriori algorithm extracts association rules using the following parameters:
| Dataset | Support | Confidence | Extracted Rules |
|---|---|---|---|
| Last.FM | 0.0015 | 0.002 | 13,391 |
| DBbook | 0.0003 | 0.001 | 13,245 |
| MovieLens 1M | 0.01 | 0.05 | 62,521 |
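As a simplified sketch of how these thresholds act (toy data, pairwise rules only, unlike full Apriori, and not the implementation used in this repository), rules are kept when their support and confidence exceed the minima:

```python
from itertools import combinations

# Toy transactions: each user's set of liked items (hypothetical data).
transactions = [
    {"Radiohead", "Muse", "Portishead"},
    {"Radiohead", "Muse"},
    {"Daft Punk", "Justice"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Last.FM thresholds from the table above.
min_support, min_confidence = 0.0015, 0.002

items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    pair_support = support({a, b})
    if pair_support >= min_support:
        confidence = pair_support / support({a})  # confidence of the rule a -> b
        if confidence >= min_confidence:
            rules.append((a, b, pair_support, confidence))

for a, b, s, c in rules:
    print(f"{a} -> {b}  support={s:.3f}  confidence={c:.3f}")
```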
LoRA (Low-Rank Adaptation) is used to fine-tune the LLM with the following hyperparameters:
| Parameter | Value |
|---|---|
| r | 64 |
| alpha | 128 |
| target | All linear layers |
| sequence length | 2048 |
| learning rate | 0.0001 |
| training epochs | 10 |
| weight decay | 0.0001 |
| max grad norm | 1.0 |
| per device train batch size | 4 |
| optimizer | AdamW (Torch) |
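The fine-tuning code lives in `LLM/`; as a rough sketch (not the repository's exact script), these values map onto Hugging Face `peft`/`transformers` settings as follows, with the model and output paths left as placeholders:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the table above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",  # recent peft versions: adapt every linear layer
    task_type="CAUSAL_LM",
)

# Optimization settings from the table above ("lora-out" is a placeholder path).
training_args = TrainingArguments(
    output_dir="lora-out",
    learning_rate=1e-4,
    num_train_epochs=10,
    weight_decay=1e-4,
    max_grad_norm=1.0,
    per_device_train_batch_size=4,
    optim="adamw_torch",
)
```

The 2048-token sequence length is typically enforced when tokenizing the training examples (e.g., via a trainer's `max_seq_length` setting).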
The repository is organized as follows:
- DataPreprocessing/: Preprocessing scripts and knowledge extraction.
- LLM/: Scripts for fine-tuning and inference.
- MetricsCalculation/: Scripts for evaluating the recommender system.
To preprocess the data and build the training files:

- Create a virtual environment (Python 3.10.12 recommended):

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows: env\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r req.txt
  ```
- **Download Item Descriptions**
  - Obtain the files from this link and place them in the dataset folders.
- **Map DBpedia IDs to Items**
  - Run `dbpedia_quering.py` in the `notebooks/` folder (a minimal SPARQL sketch is shown after this list).
- **Create JSON Training Files**
  - Execute `Process_text_candidate.ipynb`, `Process_graph_candidate.ipynb`, and `Process_collaborative_candidate.ipynb`.
  - Select the dataset using the `domain` variable.
- **Create Training Sets for Ablation Studies**
  - Run `Merge_sources_candidate.ipynb` to merge the data sources.
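As a purely illustrative sketch of the DBpedia mapping step (the title, query, and output below are hypothetical; the actual queries live in `dbpedia_quering.py`), an item title can be resolved to its DBpedia resource through the public SPARQL endpoint:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical example: resolve an item title to its DBpedia resource URI.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?item WHERE {
        ?item rdfs:label "The Matrix"@en .
    } LIMIT 1
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["item"]["value"])  # e.g. http://dbpedia.org/resource/The_Matrix
```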
To train the LLM and run inference inside the Singularity container:

- Build the Singularity image:

  ```bash
  sudo singularity build llm_cuda121.sif LLM/llm_cuda121.def
  ```

- Training (configure parameters in `config_task.yaml`):

  ```bash
  singularity exec --nv llm_cuda121.sif python main_train_task.py
  ```

- Merging (configure settings in `config_merge.yaml`):

  ```bash
  singularity exec --nv llm_cuda121.sif python main_merge.py
  ```

- Inference (adjust `config_inference.yaml`):

  ```bash
  singularity exec --nv llm_cuda121.sif python main_inference_pipe.py
  ```
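For context, `main_merge.py` presumably folds the trained LoRA adapters back into the base model before inference. A minimal sketch of that operation with `peft`/`transformers`, assuming placeholder model and adapter paths (the real ones come from `config_merge.yaml`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder names: substitute the base model and adapter directory from config_merge.yaml.
base = AutoModelForCausalLM.from_pretrained("base-model-name")
tokenizer = AutoTokenizer.from_pretrained("base-model-name")

# Attach the trained LoRA adapters and merge them into the base weights.
model = PeftModel.from_pretrained(base, "path/to/lora-adapters")
merged = model.merge_and_unload()

# Save the standalone merged model for the inference step.
merged.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")
```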
Before calculating metrics, parse the inference results:
- Use the existing `env` from data preprocessing.
- Run `Parse_results.ipynb` in `DataPreprocessing/notebooks`.
- Select the appropriate file and dataset in the first cell.
To evaluate model performance:
- Create a new environment:

  ```bash
  python -m venv metrics_env
  source metrics_env/bin/activate  # On Windows: metrics_env\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r MetricsCalculation/Clayrs/requirements.txt
  ```
- Run the metric calculation script:

  ```bash
  python MetricsCalculation/metric_cal.py
  ```

- Select the dataset in the script.
- Modify `models_name` to evaluate specific configurations.
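For reference, top-k ranking accuracy is typically computed along these lines; this is a generic sketch, not the code in `metric_cal.py`:

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Generic precision@k / recall@k for a single user.

    recommended: ranked list of item ids returned by the model
    relevant:    set of item ids the user actually liked in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy usage with hypothetical item ids.
p, r = precision_recall_at_k(["i3", "i7", "i1"], {"i1", "i9"}, k=3)
print(p, r)  # 0.333..., 0.5
```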
