Excellent work!
I noticed in the paper that the LLM part is fine-tuned from Qwen2.5-VL. Is it possible to separate the weights of the image part from those of the LLM part, so that the LLM can later be run with other mature inference frameworks (such as TensorRT and Ollama)?