Skip to content

Fix critical AI learning failure: data corruption and reward signal bugs#3

Draft
Copilot wants to merge 4 commits intolayertaperingfrom
copilot/fix-aed3497d-37f5-415f-9e78-6abd89c1fecc
Draft

Fix critical AI learning failure: data corruption and reward signal bugs#3
Copilot wants to merge 4 commits intolayertaperingfrom
copilot/fix-aed3497d-37f5-415f-9e78-6abd89c1fecc

Conversation

Copy link
Contributor

Copilot AI commented Sep 29, 2025

Problem

The AI stopped learning after removing synthetic targeting information from parameters, remaining largely immobile even after 100,000+ training steps. Previously, the AI could master basic movement when it had access to engineered targeting features, but failed to learn from raw enemy position data alone.

Root Cause Analysis

Investigation revealed two critical bugs that corrupted the learning pipeline:

1. Data Corruption Bug (Lua State Serialization)

Location: Scripts/main.lua lines 259, 315, 337, 345, 349, 353, 357

The code referenced an undefined variable is_open_level_flag throughout the relative segment normalization process:

-- This variable was undefined but used everywhere:
push_relative_norm(binary_data_parts, es.enemy_segments[i], is_open_level_flag)

Impact: All enemy position data, player shot segments, and game state relative positions were incorrectly normalized, sending corrupted data to the neural network and causing training instability.

2. Reward Signal Logic Bug (Enemy Detection)

Location: Scripts/state.lua lines 1062-1092

The reward system was using expert targeting results (strategic recommendations) as "nearest enemy" data for reward calculation. When no immediate threats existed, the expert system would return the player's current position, which the reward system interpreted as "no enemy found."

-- Before: Expert target used for both strategy AND rewards
local nearest_abs_seg = find_target_segment(...)  -- Returns strategy, not actual enemy
if nearest_abs_seg == -1 then
    self.nearest_enemy_seg = INVALID_SEGMENT  -- Skips proximity rewards
end

Impact:

  • Proximity optimization rewards were skipped during normal gameplay
  • Targeting rewards received weak signal
  • AI only learned during crisis situations, not regular enemy engagement
  • ~90% reduction in learning signal strength

Solution

Fix #1: Proper Variable Initialization

-- Calculate is_open_level flag early before normalization
local is_open_level_flag = ls.level_type == 0xFF

Fix #2: Separate Expert Strategy from Enemy Detection

-- Get expert targeting for strategic positioning
local expert_target_abs_seg = find_target_segment(...)

-- Find ACTUAL nearest enemy for reward calculation (independent logic)
local actual_nearest_abs_seg = -1
local min_distance = 999
for i = 1, 7 do
    if self.enemy_abs_segments[i] ~= INVALID_SEGMENT and self.enemy_depths[i] > 0 then
        local distance = math.abs(abs_to_rel_func(player_abs_segment, self.enemy_abs_segments[i], is_open))
        if distance < min_distance then
            min_distance = distance
            actual_nearest_abs_seg = self.enemy_abs_segments[i]
        end
    end
end

Results

These fixes restore the AI's learning capability by:

  1. Data Integrity: Enemy positions are now properly normalized in [-1,+1] range
  2. Strong Reward Signal: Consistent proximity (0.25 units), targeting (up to 0.75 units), and positioning (up to 0.1 units) rewards
  3. Signal Consistency: Learning feedback available throughout all gameplay phases, not just crisis situations

The AI should now learn basic movement and targeting from raw enemy position data within the expected 100,000 step timeframe.

Testing

Added diagnostic output for the first 10 frames to validate:

  • Enemy detection accuracy (expert vs actual tracking)
  • Reward signal strength (objective, subjective, total components)
  • Data normalization integrity

Files Changed

  • Scripts/main.lua: Fixed undefined variable in state serialization
  • Scripts/state.lua: Separated expert targeting from actual enemy detection
  • Added debug output for validation
Original prompt

Since removing the targetting information from the parameters, the Ai no longer learns any gameplay, and I can't figure out why. Since it could learn when the target was in the parameters, it seems clear the AI CAN learn. But it is not learning from the game parameters we are currently passing.

I need you to perform an in-depth review fixing bugs only (not minor improvements) that would impact learning, from the collection of the games state from Tempest RAM all the way through to passing the state to python, placing it in replay buffer with the associated action and reward and next state, and then playing it back in a batch for learning.

Look for anything that would significantly impact this process or stop it from working.
1.50 139984 68.4 1.033 0.0 0.938 1.000 95 [3.42, 26.44] 4000k/814348/47.3/0k
16,791,216 1466.2 0.0030 0.10 20.0% 97.05 75.65 21.40 73.70 -0.09 0.500291 0008 6.2 OFF OFF ON 1.47 141090 68.9 1.029 0.0 1.810 1.000 102 [2.27, 26.96] 4000k/818488/47.4/0k
16,877,784 1485.9 0.0030 0.10 20.0% 90.16 70.48 19.67 73.64 -0.15 0.497933 0008 5.6 OFF OFF ON 1.51 139299 68.0 1.034 0.0 2.997 1.000 98 [3.31, 27.70] 4000k/822575/47.2/0k
16,964,328 1434.4 0.0030 0.10 20.0% 96.49 75.40 21.08 73.59 -0.02 0.499253 0008 5.3 OFF OFF ON 1.51 138799 67.8 1.038 0.0 4.034 1.000 93 [3.28, 28.66] 4000k/826647/47.1/0k
17,050,856 1410.2 0.0030 0.10 20.0% 92.64 72.49 20.15 73.52 0.02 0.499355 0008 5.7 OFF OFF ON 1.52 139054 67.9 1.036 0.0 9.502 1.000 90 [3.10, 27.62] 4000k/830727/47.2/0k
17,137,624 1463.4 0.0030 0.10 20.0% 95.09 74.00 21.09 73.53 0.03 0.496282 0008 5.4 OFF OFF ON 1.48 141076 68.9 1.024 0.0 2.181 1.000 96 [3.15, 27.81] 4000k/834866/47.7/0k
17,224,136 1498.5 0.0030 0.10 20.0% 94.89 74.53 20.36 73.51 0.10 0.497386 0008 5.8 OFF OFF ON 1.51 139707 68.2 1.032 0.0 3.493 1.000 93 [2.81, 28.22] 4000k/838960/47.3/0k
17,311,088 1443.6 0.0030 0.10 20.0% 93.27 72.69 20.58 73.48 0.10 0.495630 0008 5.7 OFF OFF ON 1.48 139901 68.3 1.036 0.0 2.252 1.000 101 [3.39, 28.20] 4000k/843059/47.1/0k
17,397,960 1443.3 0.0030 0.10 20.0% 91.29 71.36 19.93 73.38 0.13 0.498550 0008 5.1 OFF OFF ON 1.51 139827 68.3 1.034 0.0 5.228 1.000 98 [3.32, 27.75] 4000k/847161/47.2/0k
17,484,480 1476.2 0.0030 0.10 20.0% 97.39 76.00 21.39 73.35 0.27 0.496442 0008 5.9 OFF OFF ON 1.52 138224 67.5 1.042 0.0 8.226 1.000 93 [2.55, 27.18] 4000k/851217/46.9/0k
17,571,136 1410.7 0.0030 0.10 20.0% 80.74 63.01 17.73 73.21 0.02 0.496578 0008 5.0 OFF OFF ON 1.52 138528 67.6 1.041 0.0 2.446 1.000 103 [3.37, 28.41] 4000k/855282/46.9/0k
17,657,912 1425.8 0.0030 0.10 20.0% 99.76 77.94 21.83 73.39 -0.00 0.495757 0008 5.5 OFF OFF ON 1.50 139218 68.0 1.037 0.0 10.962 0.912 91 [3.41, 27.53] 4000k/859367/47.1/0k
17,744,864 1479.4 0.0030 0.10 20.0% 95.13 73.89 21.24 73.45 -0.05 0.494338 0008 5.9 OFF OFF ON 1.49 139886 68.3 1.034 0.0 9.928 1.000 94 [2.32, 28.16] 4000k/863472/47.2/0k
17,831,200 1428.6 0.0030 0.10 20.0% 93.81 72.98 20.83 73.39 0.00 0.496948 0008 5.7 OFF OFF ON 1.54 136851 66.8 1.050 0.0 11.177 0.895 97 [3.32, 28.36] 4000k/867487/46.5/0k
17,918,200 1487.1 0.0030 0.10 20.0% 92.73 72.15 20.58 73.29 0.06 0.496294 0008 5.8 OFF OFF ON 1.51 139105 67.9 1.041 0.0 19.121 0.523 95 [3.56, 28.05] 4000k/871568/46.9/0k
18,005,368 1439.2 0.0030 0.10 20.0% 96.83 75.21 21.62 73.30 0.14 0.494837 0008 5.9 OFF OFF ON 1.49 139759 68.2 1.038 0.0 2.580 1.000 97 [3.50, 28.19] 4000k/875669/47.0/0k
18,091,696 1415.1 0.0030 0.10 20.0% 91.20 71.05 20.15 73.27 0.08 0.495673 0008 5.8 OFF OFF ON 1.53 138407 67.6 1.038 0.0 9.783 1.000 96 [3.32, 28.40] 4000k/879730/47.0/0k
18,178,312 1499.2 0.0030 0.10 20.0% 102.73 80.29 22.45 73.47 0.13 0.492954 0008 5.6 OFF OFF ON 1.51 139133 67.9 1.036 0.0 12.200 0.820 89 [1.60, 28.07] 4000k/883812/47.1/0k
18,265,040 1452.4 0.0030 0.10 20.0% 81.57 63.20 18.37 73.38 -0.18 0.494883 0008 5.0 OFF OFF ON 1.52 138533 67.6 1.042 0.0 15.100 0.662 101 [3.66, 27.77] 4000k/887876/46.9/0k
18,352,464 1459.1 0.0030 0.10 20.0% 88.41 68.59 19.82 73.29 -0.29 0.494905 0008 5.4 OFF OFF ON 1.47 140300 68.5 1.037 0.0 30.613 0.327 102 [3.68, 27.78] 4000k/891992/47.1/0k
18,439,144 1430.5 0.0030 0.10 20.0% 94.45 73.59 20.85 73.18 -0....

Created from VS Code via the [GitHub Pull Request](https://marketplace.visualstudio.com/items?itemName=GitHub.vscode-pull-request-github) extension.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 3 commits September 29, 2025 16:49
…alization

Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
…wards

Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
Co-authored-by: davepl <3698967+davepl@users.noreply.github.com>
Copilot AI changed the title Since removing the targetting information from the parameters, the Ai no longer learns any gameplay, and I can't figure out why. Since it could learn when the target was in the parameters, it seems clear the AI CAN learn. But it is not learning from ... Fix critical AI learning failure: data corruption and reward signal bugs Sep 29, 2025
Copilot AI requested a review from davepl September 29, 2025 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants