The ShareGPT processing step in the script prepare_train_data.sh calls a conversation splitter script (split_sharegpt_conversations.py), but no such file exists in the corresponding directory. Where can I find it? The relevant part of the script is shown below.
echo "Downloading ShareGPT dataset..."
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part2_html_cleaned.json
echo "Splitting the ShareGPT dataset with 2048 max tokens per conversation..."
python split_sharegpt_conversations.py \
--in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
--out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json \
--model-name-or-path /data3/MODELS/llama-7b-hf \
--max-length 2048
echo "Splitting the ShareGPT dataset with 4096 max tokens per conversation..."
python split_sharegpt_conversations.py \
--in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
--out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_4096.json \
--model-name-or-path /data3/MODELS/llama-7b-hf \
--max-length 4096
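For context, here is a minimal sketch of what such a splitter typically does, assuming the standard ShareGPT JSON layout (a list of records with an "id" and a "conversations" list of {"from", "value"} turns) and a Hugging Face tokenizer loaded from --model-name-or-path. It breaks long multi-turn conversations into chunks at turn boundaries so each chunk stays under --max-length tokens. This is only an illustration of the expected interface, not the actual missing script, which may differ.

# Hypothetical sketch of a ShareGPT conversation splitter; the real
# split_sharegpt_conversations.py may be implemented differently.
import argparse
import json

from transformers import AutoTokenizer


def split_conversation(conv, tokenizer, max_length):
    """Split one ShareGPT record into chunks whose combined token count
    stays under max_length, breaking only at turn boundaries.
    (Simplification: a single turn longer than max_length is kept whole.)"""
    chunks, current, current_len = [], [], 0
    for turn in conv["conversations"]:
        turn_len = len(tokenizer(turn["value"], add_special_tokens=False).input_ids)
        if current and current_len + turn_len > max_length:
            chunks.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += turn_len
    if current:
        chunks.append(current)
    return [
        {"id": f"{conv['id']}_{i}", "conversations": chunk}
        for i, chunk in enumerate(chunks)
    ]


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-files", nargs="+", required=True)
    parser.add_argument("--out-file", required=True)
    parser.add_argument("--model-name-or-path", required=True)
    parser.add_argument("--max-length", type=int, default=2048)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)

    split_data = []
    for in_file in args.in_files:
        with open(in_file) as f:
            data = json.load(f)
        for conv in data:
            split_data.extend(split_conversation(conv, tokenizer, args.max_length))

    with open(args.out_file, "w") as f:
        json.dump(split_data, f, indent=2)


if __name__ == "__main__":
    main()

Splitting only at turn boundaries keeps each human/assistant turn intact, which is why the script needs the tokenizer of the target model (here the llama-7b-hf path) to measure lengths consistently with training.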