Is your AI product reliable?

At Discern Labs, we provide comprehensive AI evaluation and alignment services—giving you the confidence to ship AI products that work reliably in production.

Talk to us

Our Services

We partner with product and engineering teams to understand the pain points in your AI systems, then build the tools to measure success and systematically improve your product.

Bespoke evaluations

We build evals that measure whether your product is doing its intended job—at scale.

Error analysis

We build human-in-the-loop tools to identify recurrent failures and fix them before they reach your users.

Model alignment

We align your models to user intent with supervised and reinforcement learning, so they behave the way your users expect.

Observability

You can't fix what you don't understand. We build cost-efficient observability tools to track model behavior in production.

Synthetic data

Don't have data to fine-tune your system? We generate training-quality, domain-specific synthetic data to fill in the gaps.

Trust & Safety

We test your product for safety, trustworthiness, and compliance with the latest Responsible AI standards.

Customer Wins

Real results from partnering with smart teams to solve their AI challenges.

AI Assistant Reliability

Problem

An e-commerce company's AI assistant was giving inconsistent and incorrect responses, leading to customer dissatisfaction and difficulty going to market.

Our Solution

We instrumented the assistant's code to log all inputs, tool calls, and outputs to enable observability
We partnered with the customer and conducted human-in-the-loop error analysis to find recurrent failure modes
We built LLM-as-judge evals to systematically catch failures in a test set (sketched below)
➤ We drove a 70% reduction in error rate, and a significant increase in customer satisfaction in follow-up focus groups.
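
For illustration, here is a minimal sketch of what such an LLM-as-judge eval can look like. It assumes a JSONL test set of question/reply/expectation records and uses the OpenAI Python SDK as the grader; the model name, judge prompt, and file path are illustrative placeholders, not this customer's actual setup.

```python
import json

from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """\
You are grading an AI assistant's reply.
Question: {question}
Reply: {reply}
Expected behaviour: {expectation}

Answer with a single word: PASS if the reply satisfies the expected behaviour, FAIL otherwise."""


def judge(question: str, reply: str, expectation: str) -> bool:
    """Ask a grader model whether the assistant's reply meets the expectation."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reply=reply, expectation=expectation
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")


def run_eval(path: str = "test_set.jsonl") -> float:
    """Score a JSONL test set of {question, reply, expectation} records; return the pass rate."""
    records = [json.loads(line) for line in open(path) if line.strip()]
    passed = sum(judge(r["question"], r["reply"], r["expectation"]) for r in records)
    return passed / len(records)


if __name__ == "__main__":
    print(f"pass rate: {run_eval():.1%}")
```

In practice, the judge prompt is calibrated against the human labels produced during error analysis, so its PASS/FAIL calls track the failure modes that actually matter.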

Search Safety Testing

Problem

A SaaS provider's search feature needed a robust way to reject unsafe queries, such as hate speech, mentions of suicide, and CSAM. Their existing classifier had a high false-negative rate.

Our Solution

We built a domain-specific taxonomy of harms based on state-of-the-art Responsible AI research
We generated synthetic training and validation datasets to supplement the customer's existing datasets
We set up high-quality test-set evaluations to guide the team's experimentation and drive systematic improvement (sketched below)
➤ We improved recall from 55% to 95% for unsafe queries, and delivered reusable training data for the customer's ML pipelines.
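
For illustration, a minimal sketch of the kind of test-set evaluation that guides this work, assuming a labeled JSONL file of query/unsafe records and an existing safety classifier; the `classify` stub and the file name are illustrative placeholders, not the customer's actual pipeline.

```python
import json


def classify(query: str) -> bool:
    """Placeholder for the safety classifier under test: True means 'unsafe'.
    In practice this wraps the customer's model or moderation endpoint."""
    raise NotImplementedError


def evaluate(path: str = "safety_test_set.jsonl") -> dict:
    """Compute recall and false-negative rate over the ground-truth unsafe queries
    in a JSONL test set of {query, unsafe} records."""
    records = [json.loads(line) for line in open(path) if line.strip()]
    unsafe = [r for r in records if r["unsafe"]]
    caught = sum(classify(r["query"]) for r in unsafe)
    recall = caught / len(unsafe)
    return {"recall": recall, "false_negative_rate": 1 - recall}


if __name__ == "__main__":
    print(evaluate())
```

Tracking recall and false-negative rate on a fixed, labeled test set is what lets each round of synthetic data and retraining be compared against the last.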