Model and prompt evaluation comparisons

Side-by-side comparisons of outputs and output quality across models and prompt variants.

Placeholder content area. Add the full experiment write-up here.