Another tool worth a look is HoneyHive, an all-purpose AI evaluation, testing and observability platform. It includes automated CI testing, production pipeline monitoring and debugging, plus features such as dataset curation, human feedback collection and distributed tracing. HoneyHive can test and evaluate more than 100 models and integrates with common GPU clouds.
For more specialized AI quality assessment, Spellforge simulates, tests and evaluates LLMs and Custom GPTs within existing release pipelines. It uses synthetic user personas to exercise AI responses, offers automated quality assessment that can be added with a few lines of code, and works with multiple LLM providers.
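To make the synthetic-persona idea concrete, here is a minimal, hypothetical sketch of the pattern: each persona carries a characteristic prompt and a simple quality check applied to the model's reply. All names here (`Persona`, `evaluate`, `stub_model`) are illustrative assumptions, not Spellforge's actual API; a real setup would replace the stub with a call to an LLM provider.

```python
# Hypothetical sketch of synthetic-persona testing (not Spellforge's API):
# each persona issues a characteristic prompt, and a checker scores whether
# the model's reply meets that persona's expectations.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Persona:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # quality check on the model's reply

def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g. an HTTP request to a provider).
    return f"Here is an answer about: {prompt.lower()}"

def evaluate(model: Callable[[str], str],
             personas: list[Persona]) -> dict[str, bool]:
    # Run every persona's prompt through the model and record pass/fail.
    return {p.name: p.passes(model(p.prompt)) for p in personas}

personas = [
    Persona("novice", "What is an API?", lambda r: "api" in r.lower()),
    Persona("expert", "Explain rate limiting", lambda r: len(r) > 10),
]

results = evaluate(stub_model, personas)
print(results)
```

In a release pipeline, a run like this would gate a deploy: if any persona's check fails, the pipeline flags the new model or prompt version before it reaches production.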