AI Model Evaluations Overview and Comparisons
Purpose
This wiki centralizes evaluations of AI models against CORTEX needs: personalization, NEXUS methodology support (e.g., design-before-code, event modeling), LOGOS compatibility (e.g., metadata-rich exports), and general abilities/limitations. Link to model-specific topics for details.
Evaluation Criteria
- Knowledge Preservation: Export formats and metadata completeness for LOGOS import.
- Methodology Fit: Handling Event Models, specs, and paths without errors.
- Context & Scalability: Context window size and compaction behavior for long threads.
- Refusals/Censorship: Frequency of refusals; lower is better for iterative development.
- Integration Potential: API, real-time tools for CORTEX.
- Other: Speed, cost, strengths in coding/reasoning.
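The export-related criteria above can be made concrete with a small validation sketch. Assuming a LOGOS import expects each exported message to carry fields like `id`, `timestamp`, `role`, and `content` (a hypothetical schema, not confirmed by this wiki), a completeness check might look like:

```python
import json

# Hypothetical metadata fields a LOGOS import might require per message.
REQUIRED_FIELDS = {"id", "timestamp", "role", "content"}

def metadata_completeness(export_json: str) -> float:
    """Return the fraction of exported messages carrying all required fields."""
    messages = json.loads(export_json).get("messages", [])
    if not messages:
        return 0.0
    complete = sum(1 for m in messages if REQUIRED_FIELDS <= m.keys())
    return complete / len(messages)

# Example: one complete message, one missing its timestamp.
sample = json.dumps({
    "messages": [
        {"id": "m1", "timestamp": "2025-01-01T00:00:00Z",
         "role": "user", "content": "hi"},
        {"id": "m2", "role": "assistant", "content": "hello"},
    ]
})
print(metadata_completeness(sample))  # 0.5
```

A check like this could flag "metadata gaps" (per the table below) before an export reaches the Chat Import Pipeline.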
Comparison Table
(Update as tests and refinements come in. Scores are 1-10 based on CORTEX fit.)
| Model | Context Window | Exportability | NEXUS Fit | Limitations | Strengths | Score | Links |
|---|---|---|---|---|---|---|---|
| Claude (4.x) | 200k-1M (with compaction) | JSON (metadata gaps) | Strong reasoning for designs | High refusal rate | Polished output | 8/10 | Evaluating Claude |
| Grok (4.x) | 256k-2M | Third-party MD/JSON | Fast iterations, low refusals | No native compaction | Real-time tools | 9/10 | Evaluating Grok |
| GPT-4o/o1 | 128k-1M | Native JSON/HTML | Versatile coding | Hallucinations, cost | Ecosystem | 7/10 | Evaluating OpenAI GPT |
| [Add e.g., Gemini] | TBD | TBD | TBD | TBD | TBD | TBD | TBD |
Next Steps
- Run benchmarks from Test Benchmarks on each model.
- Post discoveries in model topics, then refine this wiki.
- Tie to Chat Import Pipeline for export handling.
- Progress
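The benchmark step above can be sketched as a simple harness. The prompts, model callables, and scorer here are placeholders (the actual Test Benchmarks suite and API clients are not specified in this wiki):

```python
from typing import Callable

# Placeholder prompts; the real set lives in Test Benchmarks.
BENCHMARKS = [
    "Draft an Event Model for a checkout flow",
    "Export this thread as metadata-rich JSON",
]

def run_benchmarks(models: dict[str, Callable[[str], str]],
                   score: Callable[[str, str], int]) -> dict[str, float]:
    """Run every benchmark prompt against every model; average the 1-10 scores."""
    results = {}
    for name, ask in models.items():
        scores = [score(prompt, ask(prompt)) for prompt in BENCHMARKS]
        results[name] = sum(scores) / len(scores)
    return results

# Stub model and scorer for illustration only; a real scorer would
# rate each reply against the CORTEX evaluation criteria.
stub_models = {"stub": lambda prompt: f"response to: {prompt}"}
stub_score = lambda prompt, reply: 5
print(run_benchmarks(stub_models, stub_score))  # {'stub': 5.0}
```

Averaged results from a harness like this could feed directly into the Score column of the comparison table.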