TL;DR
We built a multi-model AI system that applies harsh methodology critique to research designs. While the tool proved useful for our own work, proper validation would require blinded benchmarks and expert adjudication - work we haven't done yet. The tool's real validation came when it caught the flaws in our own validation claims.
The Problem We Faced
After spending time on research built on flawed methodologies (consciousness detection that actually measured text style, tool-selection studies resting on circular reasoning), we realized we needed external validation to catch methodology errors before investing significant effort.
Our Solution: A Multi-Model Critique System
What We Built
- Python tool using GPT-5 (primary), Gemini, and Claude
- Structured prompts targeting methodology flaws, circular reasoning, gameability, and falsifiability
- Configurable "reasoning effort" for GPT-5's analysis depth
- Saves critique results for comparison across iterations
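For concreteness, here is a minimal sketch of how the tool's configuration hangs together. The field names, model identifiers, and defaults are illustrative assumptions, not the actual research-critic.py source.

```python
from dataclasses import dataclass

# Illustrative configuration sketch; names and defaults are assumptions,
# not the actual research-critic.py implementation.
@dataclass(frozen=True)
class CriticConfig:
    primary_model: str = "gpt-5"                              # primary critic
    secondary_models: tuple[str, ...] = ("gemini", "claude")  # additional perspectives
    reasoning_effort: str = "medium"                          # GPT-5 analysis depth: low | medium | high
    critique_targets: tuple[str, ...] = (                     # what the structured prompts probe for
        "methodology flaws",
        "circular reasoning",
        "gameability",
        "falsifiability",
    )
    output_dir: str = "critiques"                             # results saved for comparison across iterations
```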
How It Works
- Takes research methodology documents as input
- Applies specialized critique prompts to multiple AI models
- Returns detailed analysis of potential flaws and improvements
- Aggregates results (currently informal consensus)
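Roughly, the core loop looks like the sketch below. `critique_with_model` is a placeholder for the per-provider API calls and the prompt text is paraphrased; neither is the exact research-critic.py code.

```python
from pathlib import Path

def critique_with_model(model: str, prompt: str) -> str:
    """Placeholder for the per-provider API call (OpenAI, Gemini, Anthropic clients)."""
    raise NotImplementedError(f"wire up the {model} client here")

def critique_document(doc_path: str, models: list[str]) -> dict[str, str]:
    """Send one methodology document to several models and collect their critiques."""
    document = Path(doc_path).read_text()
    prompt = (
        "Critique this research methodology. Focus on methodology flaws, "
        "circular reasoning, gameability, and falsifiability.\n\n" + document
    )
    return {model: critique_with_model(model, prompt) for model in models}

def informal_consensus(results: dict[str, str]) -> str:
    """Current 'aggregation': concatenate per-model critiques for a human to compare."""
    return "\n\n".join(f"=== {model} ===\n{text}" for model, text in results.items())
```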
What We Learned From Building It
Tool Development Process
- Started with GPT-4, then upgraded to GPT-5 and its `reasoning_effort` parameter
- Fixed API issues: `max_completion_tokens` instead of `max_tokens` (see the call sketch after this list)
- Added multi-model support for diverse perspectives
- Discovered GPT-5 provides the most reliable critique; the other models are less consistent
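For reference, a GPT-5 critique call with the current openai Python SDK might look like the sketch below. The exact model string and token limit are assumptions; the points that bit us were the `reasoning_effort` parameter and using `max_completion_tokens` rather than the legacy `max_tokens`.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_critique(prompt: str, reasoning_effort: str = "medium") -> str:
    """Single critique call; a sketch only, the real script's wrapper differs."""
    response = client.chat.completions.create(
        model="gpt-5",                      # model name as referenced in this note
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort=reasoning_effort,  # "low" | "medium" | "high" analysis depth
        max_completion_tokens=4000,         # reasoning models reject the older max_tokens parameter
    )
    return response.choices[0].message.content
```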
Real Usage Results
Applied to our own methodologies:
- AI tool selection research: Caught single-session sampling, unfalsifiable preferences
- Improved study design: Found arbitrary sample sizes, undefined metrics
- Success pattern analysis: Revealed confounding variables and inadequate statistical power
- "Honest" pilot study: Identified that acknowledging limitations doesn't fix design flaws
Technical Insights
- External critique catches issues internal review consistently misses
- Iterative methodology improvement is harder than expected - each revision surfaces new flaws
- AI can provide sophisticated technical criticism with specific suggestions
- Multi-model ensemble adds perspective but isn't true independence
The Tool's Ultimate Validation
Plot twist: we ran research-critic on a draft of this very lab note, which claimed the tool was "validated." The GPT-5 critique was brutal and accurate: it labeled our claims "anecdotal self-validation" resting on unfalsifiable assertions, which forced us to rewrite the note with more intellectual honesty.
This is the tool's real validation - it caught methodology flaws in our own methodology claims.
Current Limitations and Validation Gaps
What We Haven't Done (But Should)
- No ground truth: Never verified whether flagged issues are real problems
- No expert comparison: Haven't compared against human methodology reviewers
- No baseline testing: Didn't test against simple checklists or single-model alternatives
- No blinded evaluation: All testing on our own work by our own team
- No outcome validation: Don't know if critique actually improves research quality
Known Technical Issues
- Overreliance on GPT-5 (proprietary, may drift over time)
- Potential style bias - may reward verbose, checklist-style writing
- No formal aggregation rules for multi-model disagreements (a sketch of what such a rule could look like follows this list)
- Vulnerable to prompt injection and rhetorical gaming
- Categories may miss important methodology issues outside our five focus areas
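To make the aggregation gap concrete, one shape a formal rule could take is a majority vote: keep only issues that at least two of the three models flag independently. This is a hypothetical sketch, not something the tool currently does, and the flag data are invented.

```python
from collections import Counter

def majority_flagged(per_model_flags: dict[str, set[str]], threshold: int = 2) -> set[str]:
    """Keep an issue only if at least `threshold` models independently flag it."""
    counts = Counter(issue for flags in per_model_flags.values() for issue in flags)
    return {issue for issue, n in counts.items() if n >= threshold}

# Hypothetical flags from three models on one document
flags = {
    "gpt-5":  {"circular reasoning", "unfalsifiable claims", "small sample"},
    "gemini": {"circular reasoning", "small sample"},
    "claude": {"circular reasoning"},
}
print(majority_flagged(flags))  # {'circular reasoning', 'small sample'}
```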
Honest Assessment: What We Actually Validated
What We Proved
- The tool can be built and runs reliably
- It generates plausible-sounding methodology critique
- Using it felt helpful for our team's decision-making
- Multiple iterations show consistently harsh evaluations
- It catches flaws even in our own validation claims
What We Didn't Prove
- That its critiques are more accurate than alternatives
- That it actually saves time or improves research quality
- That it works outside our specific domain/team
- That the multi-model approach adds value over single models
Proper Validation Would Require
Minimum Viable Validation
- Blinded benchmark of methodologies with expert-labeled flaws
- Comparison against human experts and standard checklists
- Inter-model agreement analysis and calibration metrics (a minimal agreement sketch follows this list)
- Adversarial testing with deliberately flawed but well-written documents
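As an illustration of what inter-model agreement analysis could involve, the sketch below computes Cohen's kappa over per-document binary flags (did a model flag a given issue category or not?). The flag data are invented for the example.

```python
def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters (1 = flagged the issue, 0 = did not)."""
    n = len(a)
    assert n == len(b) and n > 0
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement for binary labels
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

# Hypothetical data: did each model flag "circular reasoning" in eight documents?
gpt5_flags   = [1, 1, 0, 1, 0, 1, 1, 0]
gemini_flags = [1, 0, 0, 1, 0, 1, 0, 0]
print(f"GPT-5 vs Gemini kappa: {cohen_kappa(gpt5_flags, gemini_flags):.2f}")  # ~0.53
```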
Comprehensive Validation
- Prospective randomized trial across research teams
- Outcome measures: design changes, acceptance rates, replication success
- Cost-benefit analysis including false positive/negative rates
- Cross-domain testing beyond AI research methodologies
Why Share This Incomplete Work?
Tool Development Value
- Demonstrates feasibility of automated methodology critique
- Shows how to integrate multiple AI models for research feedback
- Documents the technical implementation (GPT-5 `reasoning_effort`, API patterns)
- Provides working code others can extend
Meta-Research Learning
- External validation exposed our own validation failures
- Self-application revealed the tool's potential and limitations
- Honest negative results contribute to methodology literature
- Shows how sophisticated tools can still measure nothing meaningful
Try It Yourself (With Appropriate Skepticism)
Installation
pip install openai google-generativeai anthropic
# Set API keys as environment variables
Usage
python research-critic.py methodology.md --type methodology --reasoning-effort medium
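The command above implies a CLI roughly like the argparse sketch below; the defaults and help text are assumptions rather than the actual research-critic.py interface.

```python
import argparse

def parse_args() -> argparse.Namespace:
    """CLI sketch matching the usage line above."""
    parser = argparse.ArgumentParser(description="Multi-model critique of research documents")
    parser.add_argument("document", help="path to the document to critique, e.g. methodology.md")
    parser.add_argument("--type", default="methodology",
                        help="kind of critique prompts to apply")
    parser.add_argument("--reasoning-effort", default="medium",
                        choices=["low", "medium", "high"],
                        help="GPT-5 analysis depth")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Critiquing {args.document} (type={args.type}, effort={args.reasoning_effort})")
```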
What To Expect
- Harsh but potentially useful critique of research designs
- Technical suggestions for methodology improvements
- Identification of common research design pitfalls
- Unvalidated effectiveness - use it as one input among many
Conclusion: Useful Tool, Unvalidated Claims
This tool development succeeded: we now have an external methodology critique that proved useful for our team. However, our claims about its effectiveness are based on limited self-use rather than rigorous evaluation.
Bottom Line: We built something that works and seems helpful, but proper validation remains future work. Use with appropriate skepticism as one methodology review input, not a research quality oracle.
The Most Honest Thing We Can Say
Building better research tools is valuable even when the tools themselves need better validation. Sometimes the process teaches as much as the product. The research-critic system prevented us from continuing down flawed research paths, but whether it actually improved our research quality remains an open question requiring proper evaluation.
Resources and Next Steps
- Code: `research-critic.py` - GPT-5 enabled, full implementation
- Future Work: Blinded benchmark development, expert comparison studies
- Validation Roadmap: External team testing, outcome measurement, baseline comparisons