AgentWorkBench is an OpenEnv-compatible evaluation environment designed to test how well AI agents can manage real-world workplace tasks such as task classification, priority assignment, and workflow scheduling. Unlike game environments, this project focuses on realistic engineering workflow evaluation with deterministic grading.
The environment evaluates whether AI agents can:
• Classify engineering tasks
• Assign correct priorities
• Schedule workflows efficiently
• Complete tasks correctly
• Minimize unnecessary actions
The environment follows the OpenEnv interaction model:
reset() → Initialize environment
step(action) → Execute agent decision
state() → Return evaluation results
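The reset()/step()/state() contract above can be sketched as a minimal loop. Note this is an illustrative stand-in, not the real environment: the class name, task fields, and scoring below are assumptions, while the actual implementation lives in env/environment.py.

```python
# Toy stand-in mirroring the OpenEnv reset()/step()/state() contract.
# Class name, task fields, and scoring are illustrative assumptions.

class AgentWorkBenchEnv:
    """Minimal environment: classify each task, score at the end."""

    def __init__(self, tasks):
        self.tasks = tasks
        self.reset()

    def reset(self):
        # Initialize the episode: no tasks handled yet, score cleared.
        self.index = 0
        self.correct = 0
        return {"task": self.tasks[0]["description"]}

    def step(self, action):
        # Execute one agent decision and advance to the next task.
        task = self.tasks[self.index]
        if action == task["label"]:
            self.correct += 1
        self.index += 1
        return {"done": self.index >= len(self.tasks)}

    def state(self):
        # Return evaluation results, normalized to 0.0..1.0.
        return {"score": self.correct / len(self.tasks)}


tasks = [
    {"description": "Fix failing CI build", "label": "bug"},
    {"description": "Add dark mode toggle", "label": "feature"},
]

env = AgentWorkBenchEnv(tasks)
env.reset()
for task in tasks:
    env.step(task["label"])  # a perfect agent returns every label correctly
print(env.state())  # {'score': 1.0}
```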
The environment supports three difficulty levels:
Easy → Task classification only
Medium → Classification + priority assignment
Hard → Classification + priority + scheduling optimization
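One way to picture the difficulty ladder is as a mapping from level to the checks the grader applies. The level names come from this document; the dictionary itself is a hypothetical sketch, not the project's actual configuration.

```python
# Hypothetical mapping of difficulty level to graded checks.
# Level names are from the README; the structure is an assumption.

DIFFICULTY_CHECKS = {
    "easy": ["classification"],
    "medium": ["classification", "priority"],
    "hard": ["classification", "priority", "scheduling"],
}

def checks_for(level):
    """Return the list of checks graded at a given difficulty."""
    return DIFFICULTY_CHECKS[level]

print(checks_for("hard"))  # ['classification', 'priority', 'scheduling']
```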
Rewards include:
• Correct classification
• Correct priority assignment
• Correct scheduling
• Efficient task completion
Penalties include:
• Incorrect decisions
• Duplicate actions
• Unnecessary steps
Final scores are normalized between 0.0 and 1.0.
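The reward/penalty scheme above can be sketched as rewards summed, penalties subtracted, and the result clamped to [0.0, 1.0]. The weights and category names below are illustrative assumptions, not the environment's actual reward table.

```python
# Hypothetical grading sketch: rewards minus penalties, clamped to [0.0, 1.0].
# Weights and category names are assumptions for illustration only.

REWARDS = {
    "classification": 0.4,
    "priority": 0.3,
    "scheduling": 0.2,
    "completion": 0.1,
}
PENALTY_PER_MISTAKE = 0.05  # incorrect decisions, duplicates, extra steps

def grade(earned, mistakes):
    """Sum earned rewards, subtract penalties, normalize to 0.0..1.0."""
    raw = sum(REWARDS[k] for k in earned) - mistakes * PENALTY_PER_MISTAKE
    return max(0.0, min(1.0, raw))

print(round(grade({"classification", "priority"}, mistakes=2), 2))  # 0.6
```

Clamping guarantees the normalized range promised above: a run with many penalties bottoms out at 0.0 rather than going negative.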
The project follows a modular environment design:
Environment Layer → Task simulation and state transitions
Reward Layer → Decision evaluation logic
Grader Layer → Score normalization
Baseline Layer → Deterministic benchmark agent
API Layer → Environment interaction endpoints
This modular design ensures deterministic evaluation and maintainability.
env/
    environment.py
    tasks.py
    grader.py
    reward.py
    models.py
baseline/
    baseline_agent.py
api/
    server.py
openenv.yaml
Dockerfile
requirements.txt
README.md
Install dependencies:
pip install -r requirements.txt
Run the baseline agent:
python -m baseline.baseline_agent
Start the API server:
uvicorn api.server:app --reload
Open in browser:
/tasks → List available tasks
/grader → Return evaluation state
/baseline → Run baseline benchmark
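The endpoints can also be queried programmatically. A standard-library sketch follows; the base URL assumes uvicorn's default host and port, so adjust it if you started the server elsewhere.

```python
# Sketch of querying the evaluation endpoints with the standard library.
# BASE_URL assumes uvicorn's default bind address (an assumption here).

import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"

def fetch(endpoint):
    """GET an endpoint and decode its JSON response."""
    with urllib.request.urlopen(BASE_URL + endpoint) as resp:
        return json.load(resp)

# Example usage (requires the server to be running):
#   for endpoint in ("/tasks", "/grader", "/baseline"):
#       print(endpoint, fetch(endpoint))
```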
The environment prioritizes:
• Decision correctness
• Workflow efficiency
• Task prioritization quality
rather than simple task completion to better reflect real-world AI workplace evaluation.
This environment can be used for:
• AI agent evaluation
• Workflow optimization testing
• Task management benchmarking
• Research experimentation
If dependencies are missing, reinstall them:
pip install -r requirements.txt
If the default port is busy, start the server on another one:
uvicorn api.server:app --port 8001
The environment was built with three priorities:
Reliability → deterministic grading
Realism → workplace simulation
Simplicity → clean environment interface
Shivam Modanwal
B.Tech – AI Environment Evaluation Project