If you're looking for a solid foundation for testing and comparing different versions of your LLM app, Deepchecks is a strong option. It automates evaluation, detects issues such as hallucinations and bias, and offers a "Golden Set" approach for building a rich ground truth. Deepchecks also supports version comparison, debugging, and custom properties for more advanced testing, making it well suited for ensuring high-quality LLM apps from development through deployment.
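To make the "Golden Set" idea concrete, here is a minimal, framework-free sketch of the underlying workflow: score two versions of an app against a shared ground-truth set. Everything here (the `golden_set` data, the `exact_match` metric, the stand-in app versions) is illustrative, not part of the Deepchecks API.

```python
# Illustrative sketch of golden-set evaluation, NOT the Deepchecks API:
# two app versions are scored against the same ground-truth examples.

def exact_match(expected: str, actual: str) -> float:
    """Toy metric: 1.0 if the answer matches the ground truth, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def evaluate(app, golden_set, metric=exact_match):
    """Run an app (a callable: question -> answer) over the golden set
    and return its average metric score."""
    scores = [metric(expected, app(question)) for question, expected in golden_set]
    return sum(scores) / len(scores)

# Hypothetical golden set of question/answer pairs.
golden_set = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

# Two stand-in "versions" of an LLM app (a real one would call a model).
v1 = lambda q: "Paris" if "France" in q else "5"
v2 = lambda q: "Paris" if "France" in q else "4"

print(evaluate(v1, golden_set))  # 0.5
print(evaluate(v2, golden_set))  # 1.0
```

In practice the metric would be richer (semantic similarity, hallucination checks), but the comparison loop is the same: a fixed golden set makes version-over-version scores directly comparable.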
Another option is Langfuse, an open-source platform for debugging, analyzing, and iterating on LLM applications. It offers tracing, prompt management, evaluation, and analytics. Langfuse captures the full context of each LLM execution and integrates with popular SDKs and services, so you can monitor and compare different versions of your app and gain the insights needed to optimize its performance.
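The core tracing idea is simple enough to sketch without the SDK: wrap each model call so its input, output, and timing are recorded for later comparison. The `Tracer` and `Span` classes below are illustrative stand-ins, not the Langfuse client.

```python
# Minimal sketch of LLM-call tracing, NOT the Langfuse SDK:
# wrap a callable so every invocation is captured as a span.

import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    input: str
    output: str
    duration_s: float

@dataclass
class Tracer:
    spans: list = field(default_factory=list)

    def traced(self, name, fn):
        """Return a wrapped version of fn that records each call."""
        def wrapper(prompt):
            start = time.perf_counter()
            result = fn(prompt)
            self.spans.append(
                Span(name, prompt, result, time.perf_counter() - start)
            )
            return result
        return wrapper

tracer = Tracer()

# Stand-in for a real LLM call.
fake_llm = tracer.traced("app-v1", lambda prompt: f"echo: {prompt}")
fake_llm("hello")

print(len(tracer.spans))       # 1
print(tracer.spans[0].output)  # echo: hello
```

A real tracing backend adds nesting, metadata, and a UI on top, but the captured record per call (name, input, output, latency) is what makes cross-version analysis possible.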
For a no-code option, Langtail offers tools for debugging, testing, and deploying LLM prompts. Its features include prompt fine-tuning, tests that guard against unexpected app behavior, and production performance monitoring. Langtail's no-code playground and adjustable parameters make it easy to test and compare different versions of your app without deep technical expertise.