Fix Gemini finder coordinate detection bug caused by token limit #15
Vasiliy-b wants to merge 2 commits into instavm:main
Root cause: The max_output_tokens limit was set to 80, which was insufficient for Gemini to return a complete JSON response. This caused the response to be truncated to just "Here", resulting in invalid coordinates (0,0,0,0).

Changes:
1. Increased max_output_tokens from 80 to 4096 for the Gemini finder
   - Allows for reasoning and full JSON responses
   - Prevents response truncation
2. Improved JSON parsing in GeminiFinder
   - Added robust extraction from markdown code blocks
   - Added regex-based JSON object extraction
   - Added detailed debug logging
   - Added full exception tracebacks
3. Enhanced system prompts for clarity
   - More imperative language
   - Clearer JSON format requirements
   - Explicit "no extra text" instruction

Note: This fix uses the deprecated google.generativeai package. Migration to google.genai would be needed for structured output support.
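The symptom above is that an unusable reply became the box (0,0,0,0), and clicking the centre of that box lands in the top-left corner. A minimal sketch of a guard against that, assuming the finder's JSON uses `ymin`/`xmin`/`ymax`/`xmax` keys normalised to a 0 to 1000 grid (the usual Gemini bounding-box convention, not confirmed for this codebase). `box_to_click_point` is a hypothetical helper, not part of clickclickclick:

```python
from typing import Any, Mapping, Optional, Tuple

# Assumption: coordinates are normalised to a 0..1000 grid.
SCALE = 1000


def box_to_click_point(
    box: Mapping[str, Any], width: int, height: int
) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: pixel centre of a normalised box, or None if unusable."""
    try:
        ymin, xmin = float(box["ymin"]), float(box["xmin"])
        ymax, xmax = float(box["ymax"]), float(box["xmax"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or non-numeric coordinates
    # A zero-area box such as (0, 0, 0, 0) is what the truncated reply produced;
    # returning None lets the caller retry instead of clicking the top-left corner.
    if ymax <= ymin or xmax <= xmin:
        return None
    x = round((xmin + xmax) / 2 * width / SCALE)
    y = round((ymin + ymax) / 2 * height / SCALE)
    return x, y


print(box_to_click_point({"ymin": 100, "xmin": 200, "ymax": 300, "xmax": 400}, 1080, 2400))  # (324, 480)
print(box_to_click_point({"ymin": 0, "xmin": 0, "ymax": 0, "xmax": 0}, 1080, 2400))  # None
```

Returning `None` rather than a default box keeps "no answer" distinguishable from a real coordinate, which is exactly the distinction the truncated response erased.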
Summary of Changes

Hello @Vasiliy-b, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the Gemini finder where invalid coordinates were returned due to truncated model responses. The changes focus on ensuring the Gemini model provides complete and correctly formatted JSON output by increasing token limits, refining system prompts for explicit JSON requirements, and implementing resilient parsing logic to extract JSON from varied response formats. This significantly improves the reliability of coordinate detection and prevents misclicks.
Code Review
This pull request effectively addresses a critical bug where the Gemini finder returned invalid coordinates due to a low token limit. The solution, which includes increasing the token limit, improving the system prompts for clearer instructions, and adding robust JSON parsing logic, is well-thought-out. My review includes suggestions to further improve the code, such as using standard logging instead of print statements for easier debugging, moving an import to the top level to adhere to best practices, and a query about a potentially incorrect model name in the configuration. Overall, these are solid changes.

     gemini:
       api_key: !ENV GEMINI_API_KEY
    -  model_name: gemini-1.5-flash
    +  model_name: gemini-3-flash-preview

    from tempfile import NamedTemporaryFile
    from PIL import Image
    import json
    import re
clickclickclick/finder/gemini.py (outdated)

    except Exception as e:
        # Log the exception or handle it as necessary
        print(f"Attempt {attempt + 1} failed with exception: {e}")
        import traceback
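The hunk above imports `traceback` inside the `except` block, which the review asks to move to module level. A minimal sketch of the same retry loop with module-level imports, assuming a hypothetical `call_with_retries` wrapper (the real finder's loop body is not shown in this thread). `logger.exception` logs at ERROR level and attaches the active traceback, so a separate `traceback` call becomes unnecessary:

```python
import logging
from typing import Callable, Optional, TypeVar

# Imports and the logger live at module level, not inside the except block.
logger = logging.getLogger(__name__)

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3) -> Optional[T]:
    """Hypothetical wrapper: return fn()'s result, or None if every attempt raises."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            # Logged at ERROR level together with the caught exception's traceback.
            logger.exception("Attempt %d failed with exception: %s", attempt + 1, e)
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(call_with_retries(lambda: int("42")))  # 42
    print(call_with_retries(lambda: int("not a number"), max_attempts=2))  # None
```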

     temperature: 1.0
     top_p: 0.95
    -top_k: 40
    -max_output_tokens: 200
    +max_output_tokens: 65536
The configuration for the planner has been updated to set max_output_tokens to 65536, which is a very high value and might have cost or performance implications. Additionally, the top_k parameter was removed, which alters the text generation sampling strategy. It would be beneficial to confirm if these changes were intentional and if such a high token limit is necessary.
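The comment above questions two config changes: a very high `max_output_tokens` and a dropped `top_k`. One way to keep that kind of drift visible is a small sanity check over the generation-config dicts loaded from YAML. A sketch under stated assumptions: `review_generation_config` is hypothetical, and the `low`/`high` thresholds are illustrative defaults for this example, not values from the project or the Gemini API. The low bound echoes this PR's bug (80 tokens truncated the reply); the high bound flags values like 65536 for a human to confirm:

```python
from typing import Any, List, Mapping


def review_generation_config(
    cfg: Mapping[str, Any], low: int = 256, high: int = 8192
) -> List[str]:
    """Hypothetical checker: return human-readable warnings for a generation config."""
    warnings: List[str] = []
    limit = cfg.get("max_output_tokens")
    if limit is None:
        warnings.append("max_output_tokens is not set")
    elif limit < low:
        # Too small a budget can cut a JSON reply off mid-object.
        warnings.append(f"max_output_tokens={limit} may truncate JSON replies")
    elif limit > high:
        warnings.append(f"max_output_tokens={limit} is unusually high; confirm cost and latency")
    if "top_k" not in cfg:
        warnings.append("top_k is not set; the model default applies")
    return warnings


planner = {"temperature": 1.0, "top_p": 0.95, "max_output_tokens": 65536}
for warning in review_generation_config(planner):
    print(warning)
```

A check like this only surfaces the question; whether 65536 is intended for the planner is still a call for the maintainers.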
clickclickclick/finder/gemini.py (outdated)

    super().__init__(api_key, model_name, generation_config, system_prompt, executor)
    genai.configure(api_key=api_key)

    print(f"DEBUG - Generation config: {generation_config}")
This print statement appears to be for debugging. It's recommended to use the logging module instead, as it provides more control over log verbosity and output destinations. Please consider replacing this with a logger.debug() call after importing and configuring a logger.
Suggested change:

    -print(f"DEBUG - Generation config: {generation_config}")
    +logger.debug(f"DEBUG - Generation config: {generation_config}")
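A minimal sketch of the logging setup the comment describes, assuming a module-level logger named after the module; `log_generation_config` is a hypothetical helper standing in for the line in the finder's `__init__`. It passes %-style arguments instead of an f-string, so the message is only built when DEBUG output is enabled, and it drops the `DEBUG -` prefix because the record's level already carries that information:

```python
import logging
from typing import Any, Mapping

# One logger per module; __name__ gives a dotted name such as "package.module".
logger = logging.getLogger(__name__)


def log_generation_config(generation_config: Mapping[str, Any]) -> None:
    # The message string is only formatted if DEBUG is enabled and a handler emits it.
    logger.debug("Generation config: %s", generation_config)


if __name__ == "__main__":
    # The application entry point, not the library module, chooses verbosity and format.
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s: %(message)s")
    log_generation_config({"max_output_tokens": 4096})
```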
clickclickclick/finder/gemini.py (outdated)

    print(f"DEBUG - Gemini raw response: '{response_text}'")
    print(f"DEBUG - Response type: {type(response_text)}")

    # Try to extract JSON from response
    # Sometimes Gemini wraps JSON in markdown code blocks
    json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
    if json_match:
        response_text = json_match.group(1)
        print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")

    # Or it might just have extra text before/after
    json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
    if json_match:
        response_text = json_match.group(0)
        print(f"DEBUG - Extracted JSON object: '{response_text}'")

    # Validate it's valid JSON
    try:
        parsed = json.loads(response_text)
        print(f"DEBUG - Parsed JSON successfully: {parsed}")
    except json.JSONDecodeError as je:
        print(f"DEBUG - JSON decode error: {je}")
        print(f"DEBUG - Failed to parse: '{response_text}'")
clickclickclick/finder/gemini.py (outdated)

    json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
    if json_match:
        response_text = json_match.group(1)
        print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")

    # Or it might just have extra text before/after
    json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
    if json_match:
        response_text = json_match.group(0)
        print(f"DEBUG - Extracted JSON object: '{response_text}'")
The current logic for extracting JSON is slightly inefficient as it attempts the second regex search even if the first one (for markdown blocks) was successful. You can make this more efficient and readable by using an else block to ensure the second search only runs if the first one fails.
Suggested change:

     json_match = re.search(r'```json\s*(\{.*?\})\s*```', response_text, re.DOTALL)
     if json_match:
         response_text = json_match.group(1)
         print(f"DEBUG - Extracted JSON from markdown: '{response_text}'")
    -# Or it might just have extra text before/after
    -json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
    -if json_match:
    -    response_text = json_match.group(0)
    -    print(f"DEBUG - Extracted JSON object: '{response_text}'")
    +else:
    +    # Or it might just have extra text before/after
    +    json_match = re.search(r'\{[^{}]*"ymin"[^{}]*\}', response_text)
    +    if json_match:
    +        response_text = json_match.group(0)
    +        print(f"DEBUG - Extracted JSON object: '{response_text}'")
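With the suggested `else`, the looser regex only runs when no fenced block was found. A minimal sketch of the whole extraction step as one function, reusing the two patterns from the hunk and the logger-based reporting requested in the review. `extract_json` and its return-`None`-on-failure contract are assumptions for illustration; the PR's own code keeps rewriting `response_text` in place:

```python
import json
import logging
import re
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# The two patterns from the hunk above, compiled once at import time.
FENCED_JSON = re.compile(r"```json\s*(\{.*?\})\s*```", re.DOTALL)
BARE_BOX = re.compile(r'\{[^{}]*"ymin"[^{}]*\}')


def extract_json(response_text: str) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the JSON object in a model reply, or None."""
    match = FENCED_JSON.search(response_text)
    if match:
        candidate = match.group(1)
        logger.debug("Extracted JSON from markdown: %r", candidate)
    else:
        # Only try the looser pattern when there was no fenced block.
        match = BARE_BOX.search(response_text)
        if match:
            candidate = match.group(0)
            logger.debug("Extracted JSON object: %r", candidate)
        else:
            candidate = response_text
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError as je:
        logger.debug("JSON decode error: %s; failed to parse %r", je, candidate)
        return None
    # The finder expects an object with coordinate keys, not a bare list or number.
    return parsed if isinstance(parsed, dict) else None


print(extract_json('Sure: {"ymin": 10, "xmin": 20, "ymax": 30, "xmax": 40}'))  # {'ymin': 10, 'xmin': 20, 'ymax': 30, 'xmax': 40}
print(extract_json("Here"))  # None
```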
Thanks for the review! A few clarifications:

Re: gemini-3-flash-preview model name
Re: planner max_output_tokens = 65536
Re: code style suggestions

I can address these in a follow-up commit if the maintainers would like, or they can be handled separately to keep this PR focused on the core bug fix (token limit issue).
Thanks for the review! Regarding the feedback:

Re: gemini-3-flash-preview model name
So no issue there.

Re: planner max_output_tokens = 65536

Re: code style suggestions
I can address these in a follow-up commit if needed, or they can be handled separately to keep this PR focused on the core bug fix.
- Move traceback and logging imports to top level
- Replace all print statements with logger.debug/error
- Optimize JSON extraction logic with else block
- Add proper logger initialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Correction to my previous comment: The available Gemini 3 models are:

(I incorrectly mentioned gemini-3-pro-vision-preview; that model doesn't exist. All Gemini 3 models are natively multimodal, and "vision" in a model name is a legacy naming convention from older versions.)
Problem
Gemini finder was returning invalid coordinates `0,0,0,0` instead of actual element positions, causing clicks to happen at the wrong location (top-left corner).

Root Cause
The `max_output_tokens` limit was set to 80 in the Gemini finder configuration. This was insufficient for the model to return a complete JSON response: the response was truncated to just "Here", which couldn't be parsed as valid JSON.

Solution
1. Increased Token Limit
- Raised `max_output_tokens` from 80 to 4096 for the Gemini finder

2. Improved JSON Parsing
Enhanced `clickclickclick/finder/gemini.py` with robust parsing: … json ...)

3. Clearer System Prompts
Updated prompts in `clickclickclick/config/prompts.yaml`:

Testing
Verified that Gemini finder now returns valid JSON coordinates instead of truncated responses.
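That verification could be pinned down as a regression test so the truncated-reply case stays covered. A minimal sketch with `unittest`, assuming a hypothetical strict parser `parse_box` that returns `None` for anything it cannot parse instead of substituting zeros; the PR does not describe how the finder is tested, so treat both names as illustrative:

```python
import json
import unittest
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]
KEYS = ("ymin", "xmin", "ymax", "xmax")


def parse_box(text: str) -> Optional[Box]:
    """Hypothetical strict parser: (ymin, xmin, ymax, xmax), or None on any failure."""
    try:
        data = json.loads(text)
        ymin, xmin, ymax, xmax = (int(data[key]) for key in KEYS)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # never fall back to (0, 0, 0, 0)
    return ymin, xmin, ymax, xmax


class ParseBoxRegressionTest(unittest.TestCase):
    def test_truncated_reply_is_rejected(self) -> None:
        # "Here" is the truncated reply described in the root cause.
        self.assertIsNone(parse_box("Here"))

    def test_complete_reply_is_parsed(self) -> None:
        reply = '{"ymin": 120, "xmin": 40, "ymax": 180, "xmax": 960}'
        self.assertEqual(parse_box(reply), (120, 40, 180, 960))

    def test_missing_key_is_rejected(self) -> None:
        self.assertIsNone(parse_box('{"ymin": 120, "xmin": 40}'))
```

Saved as a test module, this runs with `python -m unittest`.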

Technical Note
The codebase currently uses the deprecated `google.generativeai` package. Future work could migrate to `google.genai` for native structured output support, which would eliminate the need for JSON parsing workarounds.

Files Changed
- `clickclickclick/config/models.yaml`: increased token limit
- `clickclickclick/config/prompts.yaml`: improved prompts
- `clickclickclick/finder/gemini.py`: robust JSON parsing and logging