
speed - embedder model #68

@kalle07

Description


hey,
can you speed up those Qwen models?
It seems the "usual" models are optimized for embedding generation, while the Qwen 0.6B is rather optimized for text generation!?
Or does it just depend on the model itself?

(all the same settings, only the model changed; all fp16 GGUF)

Qwen3-Embedding-0.6B and jinja_v5 (Qwen-based), both in GGUF of course:
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 4.21 ms
llama_perf_context_print: prompt eval time = 59.46 ms / 409 tokens ( 0.15 ms per token, 6878.34 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 64.72 ms / 410 tokens
llama_perf_context_print: graphs reused = 0

jinja_v3 / bge-m3 / snowflake-arctic-embed-l-v2.0 and so on:
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 1.20 ms
llama_perf_context_print: prompt eval time = 11.73 ms / 368 tokens ( 0.03 ms per token, 31383.25 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12.78 ms / 369 tokens
llama_perf_context_print: graphs reused = 0
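
For reference, here is a minimal sketch of how such a side-by-side comparison could be reproduced with llama-cpp-python (which wraps the same llama.cpp backend that prints the timings above). The model paths and the test text are placeholders, not the exact files and inputs used in the logs:

```python
import time
from llama_cpp import Llama

# Placeholder paths -- swap in the actual fp16 GGUF files being compared.
MODELS = {
    "Qwen3-Embedding-0.6B": "Qwen3-Embedding-0.6B-f16.gguf",
    "bge-m3": "bge-m3-f16.gguf",
}

TEXT = "some example passage to embed " * 50  # roughly a few hundred tokens

for name, path in MODELS.items():
    llm = Llama(model_path=path, embedding=True, n_ctx=2048, verbose=False)
    n_tokens = len(llm.tokenize(TEXT.encode("utf-8")))

    start = time.perf_counter()
    llm.embed(TEXT)  # one forward pass producing the pooled embedding
    elapsed = time.perf_counter() - start

    print(f"{name}: {n_tokens} tokens in {elapsed * 1000:.2f} ms "
          f"({n_tokens / elapsed:.0f} tokens/s)")
```

The timing here corresponds to the "prompt eval" phase in the logs, since embedding models do a single forward pass over the prompt rather than generating tokens.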
