Data Talent Hiring,
Rebuilt for the AI era.

Watch candidates solve real business cases in a notebook with an AI assistant.
See every action and every prompt — learn how they think, and how they work with AI.

01 — The problem

Your interview process
wasn't built for AI.

Take-homes get gamed. Live coding doesn't reflect how the job is actually done.
Phone screens never predicted performance — and now every shortcut is a liability.

Interview before AI
The old data interview is broken
  • Live SQL coding test
    Pasted into ChatGPT in a second tab.
  • Python & algorithm screen
    LeetCode is solved instantly. The job isn't for-loops anymore.
  • Verbal / whiteboard case
    Rehearsed nightly with AI. You hear a script, not reasoning.
  • Resume & experience dig
    Every bullet was rewritten by an LLM.
Interview after AI
Test the actual job
  • Frame the right question
    Open-ended brief, real dataset. Can they scope it and push back?
  • Focus on judgment & domain expertise
    Do they catch the leaky join and the wrong baseline, or ship it anyway?
  • Collaborate with the AI assistant
    Same notebook and copilots they'll use on day one.
02 — How it works

The same notebook
they'll use on the job.

SQL, Python, real data, and an AI assistant in the panel.
Candidates work the way they actually work — and we capture every move.


Getting Started

Your data is in a database. Use SQL or Python to explore it:

SQL: SELECT * FROM case_data LIMIT 10

Python: df = pd.read_csv('case_data.csv')

See data_dictionary.md for column descriptions.

03 — What we measure

We measure what other
interviews can't see.

Anyone can get the right answer with AI now. The signal is in how they got there
— what they questioned, what they trusted, where they pushed back.

A data scientist thinking through charts and questions
What they bring beyond AI

Judgment, framing, and taste.

How they frame an ambiguous problem, what tradeoffs they prioritize, and whether they can tell when an answer is actually useful — not just technically correct.

A data scientist collaborating with an LLM across SQL, notebook, and insights
How they work with AI

Verification, pushback, restraint.

Did they verify the AI's output? Did they catch its mistakes? Did they over-trust it? We watch the full session — every prompt, every accepted suggestion, every override.

A complete evaluation report.

We analyze every prompt, every cell, and every decision, and we show you all the evidence behind every score.

Sample Report Template

Review: Sample Candidate

Sample interview · Assessment submitted Apr 30, 2026, 6:08 PM

Duration 25 min
AI prompts 1
Status Submitted
Overall score
9.0 / 10
Summary

Strong fit for the analyst role's core ask: candidate consistently led with a hypothesis before reaching for tooling, mirroring how the team scopes ambiguous business questions day-to-day. Worth probing in interview — recommendations stayed at the diagnostic level and didn't push into the operational tradeoffs this role owns once a quarter.

Your decision
Good Fit
Hold
No Fit
Notes Auto-saved · just now
Scores & evidence
Edge Beyond AI
4.5 / 5

Candidate brought the right analytical lens to the problem before touching the data. Quiz answers already named the core issue; the IDE work mostly confirmed it.

Problem framing & hypothesis 5 / 5
Domain expertise 4 / 5

Strong fluency with the relevant industry context: cited domain-specific risk factors and operational tradeoffs that would not be visible from the data alone. Recommendation reflects awareness of stakeholder constraints, not just statistics.

  • + Quiz Q1 invokes industry-relevant factors AI couldn't infer from the dataset (Quiz Q1)
  • + Section 2 mentions operational load and downstream impact: stakeholder-aware framing (Section 2 Q1)
  • + Suggested a practitioner-grade follow-up metric for sizing the next step (Section 2 Q1)
Insight depth 4 / 5
Recommendation quality 5 / 5
AI Collaboration
4.5 / 5

Candidate used AI as a focused executor after framing the problem on their own. AI was an accelerator for one specific task, not a thinking partner — exactly the pattern we want to see.

Delegation 5 / 5
Prompt description 5 / 5
Critical evaluation 4 / 5

Candidate verified the AI's output against their own prior analysis (both pointed the same direction), and read the result table critically — citing specific numbers as the basis for their conclusion rather than paraphrasing the assistant.

  • + Section 2 cites specific output values as the basis for the conclusion: read the table, didn't just paraphrase AI (Section 2 Q1)
  • + Cross-checked AI output against own prior cell: convergent evidence (Cell #3 at min 4)
  • - Did not push back on or flag a suspicious value in the AI output; accepted the model as-is (Cell #4 at min 6)
Iteration & verification 4 / 5
Action Timeline
Every AI prompt, cell run, and edit — in order.
00:00
Session started
Candidate opened the IDE and read the prompt.
— · 0/30 min
02:14
Cell #1 run · explored data shape + lift
~3,000 rows × 7 cols. First-pass lift of the focal segment: ~1.04x — barely meaningful.
Cell #1
04:02
Cell #2 run · segmented metric by quartile
Looked at how the target rate varied across four buckets of the candidate variable.
Cell #2
04:47
Prompted AI · msg #1
Hypothesis-driven request for a regression with explicit controls.
"Fit a model of the target on the candidate variable plus three controls. Show coefficients with 95% CIs and p-values…"
msg #1 · 1 / ∞
05:18
Cell #3 added · AI-generated cell
Candidate kept the AI-suggested cell after reviewing the diff.
[AI-generated]
06:11
Cell #4 run · reviewed model output
Read the coefficient table; flagged the dominant variable and dismissed two non-significant ones.
Cell #4
06:42
Moved to Section 2 · Findings
Investigate phase complete. Notebook locked for reference.
phase 2 → 3
07:53
Submitted answer
~1,500-character recommendation with confidence note and proposed next-step validation.
1 prompt used

Case

[Case title goes here]

A short business framing for the case appears here — who the team is, what they're trying to decide, and the constraints they're operating under. Two or three sentences set the stage without prescribing the answer.

Pre-assessment quiz · 3 framing questions
Quiz Q1

[Framing question 1 goes here — sets up the business hypothesis the candidate needs to push back on.]

[Sample candidate response — typically 2–4 sentences naming the candidate's mental model, the variables they'd reach for, and the assumption they want to test. Long-form free text, no character cap.]

~600 chars
Quiz Q2

[Framing question 2 goes here — asks the candidate where they'd start the analysis and why.]

[Sample candidate response — describes the first cut of the data they'd run, what they'd compare it against, and which secondary check would either confirm or kill the hypothesis.]

~570 chars
Quiz Q3

[Framing question 3 goes here — drops a partial statistic on the candidate and asks what they'd interpret and check next.]

[Sample candidate response — names the missing baseline, describes the lift calculation they'd want to do, and adds a sanity-check on a likely confounder.]

~590 chars
Notebook · final state
Markdown Cell 1

Getting Started

Your data is in a database. Use SQL or Python to explore it:

SQL: SELECT * FROM case_data LIMIT 10
Python: df = pd.read_csv('case_data.csv')

See data_dictionary.md for column descriptions.
Python In [1]:
import pandas as pd
df = pd.read_csv('case_data.csv')

# Headline numbers
outcome_rate = df['target'].mean()
segment_share = (df['segment_flag'] == 1).mean()
print(f'overall outcome rate: {outcome_rate:.4f}')
print(f'segment share overall: {segment_share:.4f}')

# Lift: P(segment | event) / P(segment)
events = df[df['target'] == 1]
seg_among = (events['segment_flag'] == 1).mean()
print(f'lift: {seg_among / segment_share:.3f}x')
overall outcome rate: 0.0xxx
segment share overall: 0.xxxx
lift: 1.0xx   (>1 means over-represented)
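The lift in the cell above is P(segment | event) / P(segment). As a minimal sketch of the same ratio outside pandas, the snippet below computes it on a handful of made-up rows. The `lift` helper and the toy data are illustrative assumptions, not part of the sample case.

```python
def lift(rows):
    """Lift = P(segment | event) / P(segment) for rows of (segment_flag, target)."""
    events = [row for row in rows if row[1] == 1]
    if not rows or not events:
        raise ValueError("need at least one row and at least one event")

    # Share of all rows that belong to the segment
    segment_share = sum(1 for seg, _ in rows if seg == 1) / len(rows)
    if segment_share == 0:
        raise ValueError("segment has no rows")

    # Share of event rows that belong to the segment
    seg_among_events = sum(1 for seg, _ in events if seg == 1) / len(events)
    return seg_among_events / segment_share


# Hypothetical toy data: (segment_flag, target)
toy = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (0, 1)]
print(f"lift: {lift(toy):.3f}x")
```

A lift near 1 means the segment shows up among events about as often as it does overall, which is why a reading close to 1.0 gives the original hypothesis little support.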
Python In [2]:
# Outcome rate stratified by quartile of candidate variable
df['var_q'] = pd.qcut(df['candidate_var'], 4,
                     labels=['Q1 low','Q2','Q3','Q4 high'])
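# observed=True keeps only buckets that actually occur and avoids the pandas categorical-groupby warning
print(df.groupby('var_q', observed=True)['target'].mean().round(4))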
var_q
Q1 low     0.0xxx
Q2         0.0xxx
Q3         0.0xxx
Q4 high    0.xxxx
Name: target, dtype: float64
Python In [3]: [AI-GENERATED]
import statsmodels.api as sm

features = ['segment_flag', 'candidate_var', 'control_a',
            'control_b']
X = sm.add_constant(df[features])
y = df['target']
model = sm.Logit(y, X).fit(disp=False)
print(model.summary())
                  coef     OR      ci_low   ci_high   p_value
const            -x.xxxx  0.0xx   0.0xx    0.0xx     0.000
segment_flag      0.0xxx  1.0xx   0.8xx    1.4xx     0.6xx
candidate_var     x.xxxx  xx.xx   xx.xx    xxx.xx    0.000
control_a        -0.0000  1.000   1.000    1.000     0.3xx
control_b         0.0000  1.000   1.000    1.000     0.2xx
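The table above lists an odds ratio (OR) and interval bounds next to each coefficient. Assuming those columns follow the usual logistic-regression convention, they come from exponentiating the coefficient and its confidence limits. The sketch below shows that arithmetic with a hypothetical coefficient and standard error; `odds_ratio_row` is an illustrative helper, and the 1.96 multiplier assumes a 95% Wald interval.

```python
import math


def odds_ratio_row(coef, std_err, z=1.96):
    """Turn a log-odds coefficient and its standard error into an odds ratio with a Wald interval."""
    return {
        "coef": coef,
        "OR": math.exp(coef),                     # odds ratio
        "ci_low": math.exp(coef - z * std_err),   # lower bound on the OR scale
        "ci_high": math.exp(coef + z * std_err),  # upper bound on the OR scale
    }


# Hypothetical numbers for illustration only; not taken from the report
row = odds_ratio_row(coef=0.05, std_err=0.12)
print({name: round(value, 3) for name, value in row.items()})
```

An interval that straddles 1.0 on the odds-ratio scale corresponds to a coefficient that is not distinguishable from zero, which matches how a non-significant variable reads in the table.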
Section 2 · Share your findings
Q1

[Final-answer prompt goes here — asks the candidate to summarize findings and make a recommendation to a named stakeholder.]

[Sample candidate final answer: leads with the recommendation, then 2–3 supporting bullets, then a confidence note and a proposed validation step.]

What I found:
  • First supporting bullet: the original hypothesis didn't survive the basic lift check.
  • Second bullet: the real signal sits on a different variable, with a clean monotonic gradient across quartiles.
  • Third bullet: a multivariate model confirms the direction; the focal variable is non-significant once controls are added, while the alternative is strongly significant.

Recommendation: ship the alternative routing rule; hold off on the original.

Confidence: high on direction, medium on cutoff. Next step is a holdout validation on a recent vintage.

~1,500 chars
AI Chat Timeline
Every prompt the candidate sent and the AI's response, in order.
04:47 · msg #1
You
[Sample user prompt — candidate names the hypothesis they're testing, the early signal they've already seen, and asks the assistant for a specific multivariate model with named variables. They specify what output they want — coefficient table, p-values, confidence intervals — and what they'll be looking for to confirm or reject the hypothesis.]
AI Assistant
[Sample assistant reply — confirms the cell was added, names the columns of the output table the candidate should expect, and points out what pattern in the output would support or contradict the candidate's hypothesis. Stays neutral; doesn't volunteer a conclusion.]
Resume on file — generated 5 resume-based questions and 5 case-based questions to probe in the live interview.
Resume-based questions · 5 questions
Claim verification

[Resume-claim verification question — references a specific result the candidate cited and asks them to walk through how they validated it.]

Why: Resume cites a headline outcome — verify the analytical chops behind it.

Strong answer looks like: Specific lift number, holdout-validated, names the counterfactual (control group, propensity match, or A/B).

Resume · prior role
Process probe

[Process-probe question — references a project on the resume and asks how the candidate handled the operational constraints around it.]

Why: Candidate showed operational sensitivity in Section 2; resume implies prior experience with this kind of constraint.

Strong answer looks like: Names a specific capacity constraint, throughput target, or SLA. Bonus: how they tuned the policy threshold to honor it.

Resume · prior project
Case-based questions · 5 questions
Probe critical eval

[Critical-eval question — surfaces a specific output the candidate accepted at face value and asks them to interpret it more carefully.]

Why: Candidate accepted a headline statistic without flagging a unit or scaling caveat. Critical-eval gap.

Strong answer looks like: Names the unit on the variable, restates the result on an operationally readable scale, and notes the caveat for stakeholders.

Cell #4 at min 6
Depth check

[Depth-check question — acknowledges the candidate stopped at a coarse cut and asks how they'd land a specific operational threshold with more time.]

Why: Strong recommendation but no operational threshold — probe whether they know how to land it.

Strong answer looks like: Names a method (precision-recall curve at varying cutoffs, target-risk inversion, capacity-constrained selection).

Section 2 Q1
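One of the methods named above, capacity-constrained selection, can be sketched in a few lines. This is an illustrative assumption about what such an answer might look like, not a reference solution for the case; `cutoff_for_capacity`, the scores, and the labels are all hypothetical, and ties in score are ignored for simplicity.

```python
def cutoff_for_capacity(scores, labels, capacity):
    """Flag the `capacity` highest-scoring cases; return (cutoff, precision, recall)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < capacity <= len(scores):
        raise ValueError("capacity must be between 1 and the number of cases")

    # Rank cases from highest to lowest score and keep what the team can handle
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:capacity]

    cutoff = flagged[-1][0]                      # lowest score still flagged
    true_pos = sum(label for _, label in flagged)
    total_pos = sum(labels)

    precision = true_pos / capacity
    recall = true_pos / total_pos if total_pos else 0.0
    return cutoff, precision, recall


# Hypothetical model scores and true outcomes
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_labels = [1, 0, 1, 1, 0]
print(cutoff_for_capacity(example_scores, example_labels, capacity=3))
```

Sweeping `capacity` over a range of values traces out the precision and recall trade-off at each cutoff, which is the curve the rubric refers to.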
Hold out concern

[Hold-out-concern question — surfaces a model-quality metric the candidate didn't flag and asks whether it would change their ship/no-ship call.]

Why: Did not flag a low model-fit metric as a concern. Probe whether they know how to weigh "directionally right but low explanatory power" findings.

Strong answer looks like: Discusses the trade-off between sign + significance (which we have) and absolute predictive power (which we don't); proposes adding interaction terms or non-linear features.

Cell #4 at min 6
04 — The case bank

Real cases, written by
industry experts.

Not LeetCode. Not toy datasets. Every case starts from a real business
question — and tests judgment, framing, and AI collaboration in one sitting.

[Quadrant chart. Vertical axis: Less Judgment & Framing to Judgment & Framing. Horizontal axis: Without AI to AI-Native.]

Talk-based · Verbal case interviews
Talk through a problem. No hands on the data, no AI.
Subjective · No Data · No AI

Real-case + AI notebook
Not just talk. Not another coding test. A real data project.
Measures judgment & AI collaboration
Real-World Case · AI Collaborator · Evidence Based

Algorithmic · Coding tests
LeetCode-style algorithm tasks without AI.
Code-first · No Judgment · No AI

Code-first · Coding tests + AI
AI is allowed, but the goal is still to get the code right.
AI-allowed · Output focused · Narrow

Skills tested

Metric definition · A/B testing · Segmentation · Causal inference · Forecasting · Cohort analysis · Funnel diagnostics · Attribution · Experiment design · and more

Industries

AdTech & Marketing · E-commerce · Finance · Healthcare · Marketplace · SaaS & B2B · Consumer apps · Fintech · Media & streaming · and more
05 — Compare

The only platform built for
how analysts actually work.

LitMetrics is the only platform that measures what AI can't replace
— analytical framing, AI collaboration, and defensible judgment.

Platforms compared: HackerRank · CodeSignal · CoderPad · LitMetrics

Case content
  • HackerRank: SQL + Python algorithm tasks
  • CodeSignal: Standardized DS task batteries
  • CoderPad: Interviewer-brought notebooks or take-homes
  • LitMetrics: Real-world DS cases with messy data and business framing

AI policy
  • HackerRank: Banned or flagged
  • CodeSignal: Limited, discouraged
  • CoderPad: Up to the interviewer
  • LitMetrics: AI required, with full notebook + assistant, same as the actual job

What's measured
  • HackerRank: Code correctness + speed
  • CodeSignal: Benchmarked task performance
  • CoderPad: Code quality + communication (interviewer-scored)
  • LitMetrics: AI Collaboration + Edge Beyond AI (8 sub-metrics based on research)

Workflow realism
  • HackerRank: Single coding window
  • CodeSignal: Guided single-task workspace
  • CoderPad: Live notebook + chat
  • LitMetrics: Framing quiz → AI-native IDE → written findings (full loop)

Report evidence
  • HackerRank: Score + a few snippets
  • CodeSignal: Rubric score, percentile
  • CoderPad: Interviewer notes
  • LitMetrics: Every sub-score cites a specific cell, prompt, or quiz answer

Hiring-fit read
  • HackerRank: Generic percentile
  • CodeSignal: Standardized benchmark
  • CoderPad: Interviewer judgment
  • LitMetrics: Summary anchored against your JD + hiring priorities

Interview prep
  • HackerRank: None
  • CodeSignal: None
  • CoderPad: Live-session notes
  • LitMetrics: 5 case + 5 resume follow-up questions, evidence-tagged
"
The data job has been quietly rebuilt around AI. Writing code isn't the work anymore — it's framing the right question, judging what the model gives back, and knowing when to push back. That's not on a résumé. You only see it in the work.
Jules Malin · Co-founder & CEO, LitMetrics
Ex-Director, Data Science & ML/AI, GoPro
Adjunct Professor, University of San Diego

Try it free on a real hire. Apply for early access.

We're working closely with hiring managers in our early phase. You can use our cases, use your own case, or we can build customized cases just for you. Free.
