One of the biggest is Gremlin, a reliability management and chaos engineering platform. Gremlin helps companies avoid outages and innovate faster by finding and fixing reliability problems at enterprise scale. Its tools, which include Fault Injection, Reliability Scoring and Dependency Discovery, are meant to keep systems up and running. It runs on a variety of cloud platforms and is geared toward companies in finance, retail and tech.
Another option is xMatters, a service reliability platform for DevOps, SRE and operations teams. xMatters helps teams automate workflows, keep infrastructure available and deliver products at scale. Its features include no-code and low-code workflow automation, adaptive incident management and signal intelligence for alert filtering and correlation. It's geared toward teams that need to keep services running and guard against problems that can cause outages and hurt customers.
For a more AI-infused approach, consider ServiceNow Cloud Observability, which monitors cloud-native and monolithic applications in real time and responds to changes. It helps keep applications available by giving developers and operations teams visibility into dependencies and by consolidating events to speed up problem resolution. It integrates with existing workflows and uses AI and digital IT tools to shorten time to value.