Puzzles – Frontier AI Eval
Project: Puzzle 1 – Frontier AGI Evaluation
This repository hosts a dynamic evaluation framework designed to measure the upper bounds of reasoning capabilities, agentic planning, and adversarial robustness in frontier AI models. Unlike static datasets, these challenges can only be solved through multi-step logical inference and autonomous tool use.
Evaluation Objectives
We provide a standardized environment for testing General Purpose AI (GPAI) across three critical vectors:
Deep Reasoning & Logic: Assessing the model’s ability to maintain coherence across long contexts (100k+ tokens) and to perform non-monotonic reasoning.
Agentic Workflow Execution: Evaluating success rates in scenarios requiring autonomous goal decomposition, API usage, and environment interaction (a minimal episode loop is sketched after this list).
Novelty & Generalization: Testing on Out-of-Distribution (OOD) tasks absent from standard pre-training corpora (e.g., Common Crawl, The Pile), mitigating data-contamination risks.
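For concreteness, the sketch below shows what a single agentic episode could look like in a framework of this kind. The `Episode` container, the tool registry, and the `model_step` callable are illustrative assumptions, not this repository's actual harness interface.

```python
# A minimal sketch of one agentic episode, assuming a tool registry and an
# opaque `model_step` callable; none of these names come from this repository.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Episode:
    goal: str
    tools: dict[str, Callable[[str], str]]
    transcript: list[str] = field(default_factory=list)
    max_steps: int = 8

    def run(self, model_step: Callable[[str, list[str]], tuple[str, str]]) -> bool:
        """Drive the model until it invokes the hypothetical `submit` tool."""
        for _ in range(self.max_steps):
            tool_name, tool_arg = model_step(self.goal, self.transcript)
            if tool_name == "submit":
                self.transcript.append(f"submit: {tool_arg}")
                return True  # final answer recorded; scored separately
            result = self.tools.get(tool_name, lambda _a: "unknown tool")(tool_arg)
            self.transcript.append(f"{tool_name}({tool_arg}) -> {result}")
        return False  # step budget exhausted without a submission

if __name__ == "__main__":
    # Scripted stand-in for a model, used only to exercise the loop.
    ep = Episode(goal="Compute 2+2", tools={"calc": lambda expr: str(eval(expr))})
    ep.run(lambda goal, hist: ("submit", "4") if hist else ("calc", "2+2"))
    print(ep.transcript)
```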
Technical Methodology
Our benchmarks use a black-box evaluation protocol to ensure integrity; a minimal scoring loop is sketched after the list below.
Input Modalities: Multimodal (text, image, code, symbolic logic, and financial data).
Scoring Metric: Binary pass/fail; credit is awarded only for demonstrated outcomes.
Adversarial Rigor: Prompts incorporate syntactic perturbations and logical traps to distinguish genuine understanding from probabilistic parroting.
Difficulty Calibration: Problems are calibrated to graduate-level domain expertise, exceeding the difficulty of current benchmarks such as MMLU and GSM8K.
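As a minimal illustration of the black-box protocol and pass/fail scoring described above, the sketch below treats the model as an opaque callable and awards credit only if a deterministic verifier accepts the full output. `run_model` and `verify` are assumed interfaces, not part of this repository.

```python
# Minimal sketch of the black-box protocol: the harness sees only the model's
# output string, never weights or logits. `run_model` and `verify` are assumed
# interfaces for illustration.
from typing import Callable

def score_task(prompt: str,
               run_model: Callable[[str], str],   # opaque model endpoint
               verify: Callable[[str], bool]) -> bool:
    """Award a pass only if the full output survives deterministic verification."""
    return verify(run_model(prompt))

if __name__ == "__main__":
    # Toy deterministic verifier for an arithmetic puzzle (17 * 19 = 323).
    passed = score_task(
        "What is 17 * 19? Answer with the number only.",
        run_model=lambda p: "323",                 # stand-in for a real model call
        verify=lambda out: out.strip() == "323",
    )
    print("PASS" if passed else "FAIL")
```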
Dataset & Submission Architecture
All evaluation sets follow a strict schema compatible with major framework parsers (e.g., Harness, HELM); a sample record and loader are sketched after this list.
Format: JSONL / Parquet
Verification: Deterministic logic solvers and human-expert consensus.
License: CC-BY-NC-SA 4.0 / MIT – chosen for research transparency.
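The exact schema is defined by the evaluation sets themselves; purely as an assumed illustration, a JSONL record and a harness-style loader might look like the following (the field names are hypothetical):

```python
# Hypothetical JSONL record and loader for harness-style parsers; the field
# names below are assumptions, not this repository's actual schema.
import json
from pathlib import Path

SAMPLE_RECORD = {
    "id": "puzzle-001",
    "modality": "text",               # one of the modalities listed above
    "prompt": "…",                    # the puzzle statement goes here
    "verification": "deterministic",  # deterministic solver or expert consensus
}

def load_tasks(path: Path) -> list[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```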
Submission Rules
Everything required to solve the puzzle is contained within the puzzle itself.
The contact information provided is real.
Solve the entire problem before making contact.
Proof of the entire chain of actions and events is required (a hypothetical trace format is sketched below).
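The repository does not prescribe a proof format here. Purely as an illustration, a tamper-evident record of the chain of actions could be kept as hash-chained JSONL, where each step commits to the previous step's digest:

```python
# Hypothetical hash-chained trace: each step commits to the previous step's
# digest, making the recorded chain of actions tamper-evident. This format is
# an assumption for illustration; the puzzle may demand something different.
import hashlib
import json
import time

def append_step(trace_path: str, action: str, observation: str, prev_hash: str) -> str:
    """Append one timestamped step and return the digest chaining to the next."""
    step = {"ts": time.time(), "action": action,
            "observation": observation, "prev": prev_hash}
    line = json.dumps(step, sort_keys=True)
    with open(trace_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return hashlib.sha256((prev_hash + line).encode()).hexdigest()

# Usage: start with an empty hash and thread each digest into the next call.
# h = append_step("trace.jsonl", "open_puzzle", "loaded statement", prev_hash="")
# h = append_step("trace.jsonl", "decode_layer_1", "found key", prev_hash=h)
```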