Description
hey,
can you speed up those Qwen models?
It seems the "usual models" are optimized for embedding generation, while Qwen3-Embedding-0.6B is rather optimized for text generation!?
Or does it just depend on the model itself?
(all the same settings, only the model changed, all fp16 GGUF)
Qwen3-Embedding-0.6B and jinja_v5 (Qwen based), all in GGUF of course:
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 4.21 ms
llama_perf_context_print: prompt eval time = 59.46 ms / 409 tokens ( 0.15 ms per token, 6878.34 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 64.72 ms / 410 tokens
llama_perf_context_print: graphs reused = 0
jinja_v3/bge-m3/snowflake-arctic-embed-l-v2.0 and so on:
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 1.20 ms
llama_perf_context_print: prompt eval time = 11.73 ms / 368 tokens ( 0.03 ms per token, 31383.25 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12.78 ms / 369 tokens
llama_perf_context_print: graphs reused = 0
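To quantify the gap, here is a minimal sketch that recomputes the prompt-eval throughput from the `llama_perf_context_print` numbers pasted above (token counts and milliseconds are taken directly from the two logs; the exact ratio depends on prompt length and hardware, so treat it as illustrative):

```python
# Throughput comparison derived from the two llama_perf_context_print logs above.
qwen = {"tokens": 409, "prompt_ms": 59.46}   # Qwen3-Embedding-0.6B
bge = {"tokens": 368, "prompt_ms": 11.73}    # bge-m3 / snowflake-arctic-embed run

def tokens_per_second(run):
    """Prompt-eval throughput: tokens divided by elapsed seconds."""
    return run["tokens"] / (run["prompt_ms"] / 1000.0)

ratio = tokens_per_second(bge) / tokens_per_second(qwen)
print(f"Qwen3-0.6B: {tokens_per_second(qwen):.0f} tok/s")
print(f"bge-m3:     {tokens_per_second(bge):.0f} tok/s")
print(f"bge-m3 is roughly {ratio:.1f}x faster at prompt eval")
```

This matches the logged figures (~6878 vs ~31383 tokens per second, small differences due to rounding in the printed milliseconds), i.e. the non-Qwen embedding models evaluate the prompt roughly 4-5x faster at the same settings.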