Fix critical AI learning failure: data corruption and reward signal bugs#3
Draft
Copilot wants to merge 4 commits intolayertaperingfrom
Draft
Fix critical AI learning failure: data corruption and reward signal bugs#3Copilot wants to merge 4 commits intolayertaperingfrom
Copilot wants to merge 4 commits intolayertaperingfrom
Conversation
…alization Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
…wards Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
Copilot
AI
changed the title
Since removing the targetting information from the parameters, the Ai no longer learns any gameplay, and I can't figure out why. Since it could learn when the target was in the parameters, it seems clear the AI CAN learn. But it is not learning from ...
Fix critical AI learning failure: data corruption and reward signal bugs
Sep 29, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The AI stopped learning after removing synthetic targeting information from parameters, remaining largely immobile even after 100,000+ training steps. Previously, the AI could master basic movement when it had access to engineered targeting features, but failed to learn from raw enemy position data alone.
Root Cause Analysis
Investigation revealed two critical bugs that corrupted the learning pipeline:
1. Data Corruption Bug (Lua State Serialization)
Location:
Scripts/main.lualines 259, 315, 337, 345, 349, 353, 357The code referenced an undefined variable
is_open_level_flagthroughout the relative segment normalization process:Impact: All enemy position data, player shot segments, and game state relative positions were incorrectly normalized, sending corrupted data to the neural network and causing training instability.
2. Reward Signal Logic Bug (Enemy Detection)
Location:
Scripts/state.lualines 1062-1092The reward system was using expert targeting results (strategic recommendations) as "nearest enemy" data for reward calculation. When no immediate threats existed, the expert system would return the player's current position, which the reward system interpreted as "no enemy found."
Impact:
Solution
Fix #1: Proper Variable Initialization
Fix #2: Separate Expert Strategy from Enemy Detection
Results
These fixes restore the AI's learning capability by:
The AI should now learn basic movement and targeting from raw enemy position data within the expected 100,000 step timeframe.
Testing
Added diagnostic output for the first 10 frames to validate:
Files Changed
Scripts/main.lua: Fixed undefined variable in state serializationScripts/state.lua: Separated expert targeting from actual enemy detectionOriginal prompt
Since removing the targetting information from the parameters, the Ai no longer learns any gameplay, and I can't figure out why. Since it could learn when the target was in the parameters, it seems clear the AI CAN learn. But it is not learning from the game parameters we are currently passing.
I need you to perform an in-depth review fixing bugs only (not minor improvements) that would impact learning, from the collection of the games state from Tempest RAM all the way through to passing the state to python, placing it in replay buffer with the associated action and reward and next state, and then playing it back in a batch for learning.
Look for anything that would significantly impact this process or stop it from working.
1.50 139984 68.4 1.033 0.0 0.938 1.000 95 [3.42, 26.44] 4000k/814348/47.3/0k
16,791,216 1466.2 0.0030 0.10 20.0% 97.05 75.65 21.40 73.70 -0.09 0.500291 0008 6.2 OFF OFF ON 1.47 141090 68.9 1.029 0.0 1.810 1.000 102 [2.27, 26.96] 4000k/818488/47.4/0k
16,877,784 1485.9 0.0030 0.10 20.0% 90.16 70.48 19.67 73.64 -0.15 0.497933 0008 5.6 OFF OFF ON 1.51 139299 68.0 1.034 0.0 2.997 1.000 98 [3.31, 27.70] 4000k/822575/47.2/0k
16,964,328 1434.4 0.0030 0.10 20.0% 96.49 75.40 21.08 73.59 -0.02 0.499253 0008 5.3 OFF OFF ON 1.51 138799 67.8 1.038 0.0 4.034 1.000 93 [3.28, 28.66] 4000k/826647/47.1/0k
17,050,856 1410.2 0.0030 0.10 20.0% 92.64 72.49 20.15 73.52 0.02 0.499355 0008 5.7 OFF OFF ON 1.52 139054 67.9 1.036 0.0 9.502 1.000 90 [3.10, 27.62] 4000k/830727/47.2/0k
17,137,624 1463.4 0.0030 0.10 20.0% 95.09 74.00 21.09 73.53 0.03 0.496282 0008 5.4 OFF OFF ON 1.48 141076 68.9 1.024 0.0 2.181 1.000 96 [3.15, 27.81] 4000k/834866/47.7/0k
17,224,136 1498.5 0.0030 0.10 20.0% 94.89 74.53 20.36 73.51 0.10 0.497386 0008 5.8 OFF OFF ON 1.51 139707 68.2 1.032 0.0 3.493 1.000 93 [2.81, 28.22] 4000k/838960/47.3/0k
17,311,088 1443.6 0.0030 0.10 20.0% 93.27 72.69 20.58 73.48 0.10 0.495630 0008 5.7 OFF OFF ON 1.48 139901 68.3 1.036 0.0 2.252 1.000 101 [3.39, 28.20] 4000k/843059/47.1/0k
17,397,960 1443.3 0.0030 0.10 20.0% 91.29 71.36 19.93 73.38 0.13 0.498550 0008 5.1 OFF OFF ON 1.51 139827 68.3 1.034 0.0 5.228 1.000 98 [3.32, 27.75] 4000k/847161/47.2/0k
17,484,480 1476.2 0.0030 0.10 20.0% 97.39 76.00 21.39 73.35 0.27 0.496442 0008 5.9 OFF OFF ON 1.52 138224 67.5 1.042 0.0 8.226 1.000 93 [2.55, 27.18] 4000k/851217/46.9/0k
17,571,136 1410.7 0.0030 0.10 20.0% 80.74 63.01 17.73 73.21 0.02 0.496578 0008 5.0 OFF OFF ON 1.52 138528 67.6 1.041 0.0 2.446 1.000 103 [3.37, 28.41] 4000k/855282/46.9/0k
17,657,912 1425.8 0.0030 0.10 20.0% 99.76 77.94 21.83 73.39 -0.00 0.495757 0008 5.5 OFF OFF ON 1.50 139218 68.0 1.037 0.0 10.962 0.912 91 [3.41, 27.53] 4000k/859367/47.1/0k
17,744,864 1479.4 0.0030 0.10 20.0% 95.13 73.89 21.24 73.45 -0.05 0.494338 0008 5.9 OFF OFF ON 1.49 139886 68.3 1.034 0.0 9.928 1.000 94 [2.32, 28.16] 4000k/863472/47.2/0k
17,831,200 1428.6 0.0030 0.10 20.0% 93.81 72.98 20.83 73.39 0.00 0.496948 0008 5.7 OFF OFF ON 1.54 136851 66.8 1.050 0.0 11.177 0.895 97 [3.32, 28.36] 4000k/867487/46.5/0k
17,918,200 1487.1 0.0030 0.10 20.0% 92.73 72.15 20.58 73.29 0.06 0.496294 0008 5.8 OFF OFF ON 1.51 139105 67.9 1.041 0.0 19.121 0.523 95 [3.56, 28.05] 4000k/871568/46.9/0k
18,005,368 1439.2 0.0030 0.10 20.0% 96.83 75.21 21.62 73.30 0.14 0.494837 0008 5.9 OFF OFF ON 1.49 139759 68.2 1.038 0.0 2.580 1.000 97 [3.50, 28.19] 4000k/875669/47.0/0k
18,091,696 1415.1 0.0030 0.10 20.0% 91.20 71.05 20.15 73.27 0.08 0.495673 0008 5.8 OFF OFF ON 1.53 138407 67.6 1.038 0.0 9.783 1.000 96 [3.32, 28.40] 4000k/879730/47.0/0k
18,178,312 1499.2 0.0030 0.10 20.0% 102.73 80.29 22.45 73.47 0.13 0.492954 0008 5.6 OFF OFF ON 1.51 139133 67.9 1.036 0.0 12.200 0.820 89 [1.60, 28.07] 4000k/883812/47.1/0k
18,265,040 1452.4 0.0030 0.10 20.0% 81.57 63.20 18.37 73.38 -0.18 0.494883 0008 5.0 OFF OFF ON 1.52 138533 67.6 1.042 0.0 15.100 0.662 101 [3.66, 27.77] 4000k/887876/46.9/0k
18,352,464 1459.1 0.0030 0.10 20.0% 88.41 68.59 19.82 73.29 -0.29 0.494905 0008 5.4 OFF OFF ON 1.47 140300 68.5 1.037 0.0 30.613 0.327 102 [3.68, 27.78] 4000k/891992/47.1/0k
18,439,144 1430.5 0.0030 0.10 20.0% 94.45 73.59 20.85 73.18 -0....
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.