Speculative sampling was proposed by Google and DeepMind independently.

You need to prepare a pair of models that use the same embedding and vocabulary. The approximation model should be smaller than the target model.
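The accept/reject loop at the heart of speculative sampling can be sketched as follows. This is a minimal illustration with toy probability arrays, not the repository's actual implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def speculative_step(target_probs, approx_probs, draft_tokens, rng):
    """One speculative decoding step over a batch of drafted tokens.

    target_probs / approx_probs: arrays of shape (num_drafts, vocab_size)
    holding the target distribution p and approximation distribution q
    at each draft position; draft_tokens were sampled from q.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = target_probs[i][x], approx_probs[i][x]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p / q):
            accepted.append(int(x))
        else:
            # On rejection, resample from the residual distribution
            # norm(max(p - q, 0)) and stop extending this draft.
            residual = np.maximum(target_probs[i] - approx_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

# When p == q, every drafted token is accepted, so the approximation
# model's speed is gained while the target distribution is preserved.
rng = np.random.default_rng(0)
uniform = np.full((3, 4), 0.25)
print(speculative_step(uniform, uniform, [1, 2, 3], rng))  # [1, 2, 3]
```

In the real pipeline, `target_probs` and `approx_probs` would come from the softmax outputs of the target and approximation networks at each drafted position.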
In the sample, we demonstrate [bloomz-7b1](https://huggingface.co/bigscience/bloomz-7b1/tree/main) as the target model and [bloom-560m](https://huggingface.co/bigscience/bloom-560m/tree/main) as the approximation model.
```bash
python main.py \
    --target_model_name bigscience/bloomz-7b1 \
    --approx_model_name bigscience/bloom-560m
```
You can also use the `-v` argument to see which model generated each token.
I recommend using llama2-7B and llama2-70B as the approximation and target models, respectively. I did observe a speedup in this case, as shown in the following.
Note that the choice of approximation and target models is essential for the speedup. No speedup will be observed in the following cases:

- If both models are small, the speed difference between them is not significant.
- If the model size difference is too large, more rejections and resampling will occur.
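A drafted token is accepted with probability min(1, p(x)/q(x)), so the expected per-token acceptance rate is the sum over the vocabulary of min(p(x), q(x)). The toy sketch below (illustrative distributions only, not taken from this repo) shows how a large mismatch between the two models drives that rate down:

```python
import numpy as np

def acceptance_rate(p, q):
    """Expected probability that a token drafted from q is accepted
    against target distribution p: sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

p = np.array([0.5, 0.3, 0.2])

# Identical distributions: every drafted token is accepted.
print(acceptance_rate(p, p))  # 1.0

# A badly mismatched approximation: ~70% of drafts are rejected,
# and every rejection costs a resample from the target model.
q = np.array([0.05, 0.05, 0.9])
print(acceptance_rate(p, q))  # ~0.3
```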
Also, the sampling logic itself is not efficient enough: I noticed substantial overhead in Softmax and LayerNorm. I will try to optimize it in the future.
Do not hesitate to open an issue with ideas on performance improvements.