Building Research-Critic: A Tool Development Story

TL;DR

We built a multi-model AI system that applies harsh methodology critique to research designs. While the tool proved useful for our own work, proper validation would require blinded benchmarks and expert adjudication - work we haven't done yet. The tool's real validation came when it caught the flaws in our own validation claims.

The Problem We Faced

After spending time on flawed research methodologies (consciousness detection that measured text style, tool selection studies with circular reasoning), we realized we needed external validation to catch methodology errors before investing significant effort.

Our Solution: A Multi-Model Critique System

What We Built

How It Works

  1. Takes research methodology documents as input
  2. Applies specialized critique prompts to multiple AI models
  3. Returns detailed analysis of potential flaws and improvements
  4. Aggregates results (currently an informal consensus across models rather than formal scoring)
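
To make these four steps concrete, here is a minimal sketch of such a pipeline. The model names, the critique prompt wording, and the naive side-by-side "aggregation" are illustrative assumptions, not the actual research-critic.py implementation.

# Minimal sketch of the pipeline above; model names and prompt are placeholders.
import os
import sys

import anthropic
import google.generativeai as genai
from openai import OpenAI

CRITIQUE_PROMPT = (
    "You are a harsh but fair methodology reviewer. Identify design flaws, "
    "confounds, circular reasoning, and unfalsifiable claims in the following "
    "research methodology, then suggest concrete improvements.\n\n"
)

def critique_with_openai(document: str) -> str:
    # OpenAI Chat Completions call; the client reads OPENAI_API_KEY from the environment.
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": CRITIQUE_PROMPT + document}],
    )
    return resp.choices[0].message.content

def critique_with_anthropic(document: str) -> str:
    # Anthropic Messages API call; the client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2048,
        messages=[{"role": "user", "content": CRITIQUE_PROMPT + document}],
    )
    return msg.content[0].text

def critique_with_gemini(document: str) -> str:
    # google-generativeai call; GOOGLE_API_KEY is an assumed variable name.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")
    resp = model.generate_content(CRITIQUE_PROMPT + document)
    return resp.text

def critique(path: str) -> str:
    # Step 1: read the methodology document.
    with open(path, encoding="utf-8") as f:
        document = f.read()
    # Steps 2-3: apply the same critique prompt to each model.
    critiques = {
        "openai": critique_with_openai(document),
        "anthropic": critique_with_anthropic(document),
        "gemini": critique_with_gemini(document),
    }
    # Step 4: "aggregation" here is just side-by-side output;
    # the informal consensus is left to the human reader.
    return "\n\n".join(f"=== {name} ===\n{text}" for name, text in critiques.items())

if __name__ == "__main__":
    print(critique(sys.argv[1]))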

What We Learned From Building It

Tool Development Process

Real Usage Results

Applied to our own methodologies:

Technical Insights

The Tool's Ultimate Validation

Plot twist: we applied the research-critic to our own lab note draft, which claimed the tool was "validated." The GPT-5 critique was brutal and accurate - it identified our claims as "anecdotal self-validation" built on unfalsifiable assertions. This forced us to rewrite the note with intellectual honesty.

This is the tool's real validation - it caught methodology flaws in our own validation claims.

Current Limitations and Validation Gaps

What We Haven't Done (But Should)

Known Technical Issues

Honest Assessment: What We Actually Validated

What We Proved

  • The tool can be built and runs reliably
  • It generates plausible-sounding methodology critique
  • Using it felt helpful for our team's decision-making
  • Repeated runs produce consistently harsh evaluations
  • It catches flaws even in our own validation claims

What We Didn't Prove

  • That its critiques are more accurate than alternatives
  • That it actually saves time or improves research quality
  • That it works outside our specific domain/team
  • That the multi-model approach adds value over single models

Proper Validation Would Require

Minimum Viable Validation

Comprehensive Validation

Why Share This Incomplete Work?

Tool Development Value

Meta-Research Learning

Try It Yourself (With Appropriate Skepticism)

Installation

pip install openai google-generativeai anthropic
# Set API keys as environment variables
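
A quick preflight check avoids a run failing halfway through when one key is missing. The variable names below follow each SDK's usual conventions and are assumptions about this particular tool; adjust them if your setup differs.

import os

# Environment variable names assumed from each SDK's conventions.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit("Missing API keys: " + ", ".join(missing))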

Usage

python research-critic.py methodology.md --type methodology --reasoning-effort medium
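
The invocation above suggests a small command-line front end. The sketch below shows one way those flags could be wired with argparse; the accepted values for --type and the idea that --reasoning-effort passes through to models exposing such a setting are assumptions, since only this one invocation appears here.

import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run a multi-model methodology critique on a document."
    )
    parser.add_argument("document", help="path to the document, e.g. methodology.md")
    parser.add_argument(
        "--type", default="methodology",
        help="critique prompt template to apply (only 'methodology' is shown above; "
             "other values are hypothetical)",
    )
    parser.add_argument(
        "--reasoning-effort", choices=["low", "medium", "high"], default="medium",
        help="assumed to pass through to models that expose a reasoning-effort setting",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Critiquing {args.document} as '{args.type}' "
          f"with {args.reasoning_effort} reasoning effort")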

What To Expect

Conclusion: Useful Tool, Unvalidated Claims

This tool-development effort succeeded in creating an external source of methodology critique that proved useful for our team. However, our claims about its effectiveness rest on limited self-use rather than rigorous evaluation.

Bottom Line: We built something that works and seems helpful, but proper validation remains future work. Use it with appropriate skepticism, as one input to methodology review rather than a research-quality oracle.

The Most Honest Thing We Can Say

Building better research tools is valuable even when the tools themselves need better validation. Sometimes the process teaches as much as the product. The research-critic system prevented us from continuing down flawed research paths, but whether it actually improved our research quality remains an open question requiring proper evaluation.

Resources and Next Steps

Tags: tool development, methodology validation, external critique, research tools, intellectual honesty