TL;DR
We built a multi-model AI system that applies harsh methodology critique to research designs. While the tool proved useful for our own work, proper validation would require blinded benchmarks and expert adjudication - work we haven't done yet. The tool's real validation came when it caught the flaws in our own validation claims.
The Problem We Faced
After spending time on research built on flawed methodologies (consciousness detection that actually measured text style, tool-selection studies resting on circular reasoning), we realized we needed external validation to catch methodology errors before investing significant effort.
Our Solution: A Multi-Model Critique System
What We Built
- Python tool using GPT-5 (primary), Gemini, and Claude
- Structured prompts targeting methodology flaws, circular reasoning, gameability, and falsifiability
- Configurable "reasoning effort" for GPT-5's analysis depth
- Saves critique results for comparison across iterations
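For concreteness, here is a minimal sketch of how the tool's configuration hangs together. The field names, model identifiers, and defaults are illustrative assumptions, not the actual research-critic.py source.

```python
from dataclasses import dataclass

# Illustrative configuration sketch; names and defaults are assumptions,
# not the actual research-critic.py implementation.
@dataclass(frozen=True)
class CriticConfig:
    primary_model: str = "gpt-5"                              # primary critic
    secondary_models: tuple[str, ...] = ("gemini", "claude")  # additional perspectives
    reasoning_effort: str = "medium"                          # GPT-5 analysis depth: low | medium | high
    critique_targets: tuple[str, ...] = (                     # what the structured prompts probe for
        "methodology flaws",
        "circular reasoning",
        "gameability",
        "falsifiability",
    )
    output_dir: str = "critiques"                             # results saved for comparison across iterations
```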
How It Works
- Takes research methodology documents as input
- Applies specialized critique prompts to multiple AI models
- Returns detailed analysis of potential flaws and improvements
- Aggregates results (currently informal consensus)
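Roughly, the core loop looks like the sketch below. `critique_with_model` is a placeholder for the per-provider API calls and the prompt text is paraphrased; neither is the exact research-critic.py code.

```python
from pathlib import Path

def critique_with_model(model: str, prompt: str) -> str:
    """Placeholder for the per-provider API call (OpenAI, Gemini, Anthropic clients)."""
    raise NotImplementedError(f"wire up the {model} client here")

def critique_document(doc_path: str, models: list[str]) -> dict[str, str]:
    """Send one methodology document to several models and collect their critiques."""
    document = Path(doc_path).read_text()
    prompt = (
        "Critique this research methodology. Focus on methodology flaws, "
        "circular reasoning, gameability, and falsifiability.\n\n" + document
    )
    return {model: critique_with_model(model, prompt) for model in models}

def informal_consensus(results: dict[str, str]) -> str:
    """Current 'aggregation': concatenate per-model critiques for a human to compare."""
    return "\n\n".join(f"=== {model} ===\n{text}" for model, text in results.items())
```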
What We Learned From Building It
Tool Development Process
- Started with GPT-4, then upgraded to GPT-5 and its `reasoning_effort` parameter
- Fixed API issues: `max_completion_tokens` instead of `max_tokens` (see the call sketch after this list)
- Added multi-model support for diverse perspectives
- Discovered GPT-5 provides the most reliable critique; the other models are less consistent
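For reference, a GPT-5 critique call with the current openai Python SDK might look like the sketch below. The exact model string and token limit are assumptions; the points that bit us were the `reasoning_effort` parameter and using `max_completion_tokens` rather than the legacy `max_tokens`.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_critique(prompt: str, reasoning_effort: str = "medium") -> str:
    """Single critique call; a sketch only, the real script's wrapper differs."""
    response = client.chat.completions.create(
        model="gpt-5",                      # model name as referenced in this note
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort=reasoning_effort,  # "low" | "medium" | "high" analysis depth
        max_completion_tokens=4000,         # reasoning models reject the older max_tokens parameter
    )
    return response.choices[0].message.content
```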
Real Usage Results
Applied to our own methodologies:
- AI tool selection research: Caught single-session sampling, unfalsifiable preferences
- Improved study design: Found arbitrary sample sizes, undefined metrics
- Success pattern analysis: Revealed confounding variables and inadequate statistical power
- "Honest" pilot study: Identified that acknowledging limitations doesn't fix design flaws
Technical Insights
- External critique catches issues internal review consistently misses
- Iterative methodology improvement is harder than expected - each revision surfaces new flaws
- AI can provide sophisticated technical criticism with specific suggestions
- Multi-model ensemble adds perspective but isn't true independence
The Tool's Ultimate Validation
Plot twist: we ran research-critic on a draft of this very lab note, which claimed the tool was "validated." The GPT-5 critique was brutal and accurate: it labeled our claims "anecdotal self-validation" resting on unfalsifiable assertions, which forced us to rewrite the note with more intellectual honesty.
This is the tool's real validation - it caught methodology flaws in our own methodology claims.
Current Limitations and Validation Gaps
What We Haven't Done (But Should)
- No ground truth: Never verified whether flagged issues are real problems
- No expert comparison: Haven't compared against human methodology reviewers
- No baseline testing: Didn't test against simple checklists or single-model alternatives
- No blinded evaluation: All testing on our own work by our own team
- No outcome validation: Don't know if critique actually improves research quality
Known Technical Issues
- Overreliance on GPT-5 (proprietary, may drift over time)
- Potential style bias - may reward verbose, checklist-style writing
- No formal aggregation rules for multi-model disagreements (a sketch of what such a rule could look like follows this list)
- Vulnerable to prompt injection and rhetorical gaming
- Categories may miss important methodology issues outside our five focus areas
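To make the aggregation gap concrete, one shape a formal rule could take is a majority vote: keep only issues that at least two of the three models flag independently. This is a hypothetical sketch, not something the tool currently does, and the flag data are invented.

```python
from collections import Counter

def majority_flagged(per_model_flags: dict[str, set[str]], threshold: int = 2) -> set[str]:
    """Keep an issue only if at least `threshold` models independently flag it."""
    counts = Counter(issue for flags in per_model_flags.values() for issue in flags)
    return {issue for issue, n in counts.items() if n >= threshold}

# Hypothetical flags from three models on one document
flags = {
    "gpt-5":  {"circular reasoning", "unfalsifiable claims", "small sample"},
    "gemini": {"circular reasoning", "small sample"},
    "claude": {"circular reasoning"},
}
print(majority_flagged(flags))  # {'circular reasoning', 'small sample'}
```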
Honest Assessment: What We Actually Validated
What We Proved
- The tool can be built and runs reliably
- It generates plausible-sounding methodology critique
- Using it felt helpful for our team's decision-making
- Multiple iterations show consistently harsh evaluations
- It catches flaws even in our own validation claims
What We Didn't Prove
- That its critiques are more accurate than alternatives
- That it actually saves time or improves research quality
- That it works outside our specific domain/team
- That the multi-model approach adds value over single models
Proper Validation Would Require
Minimum Viable Validation
- Blinded benchmark of methodologies with expert-labeled flaws
- Comparison against human experts and standard checklists
- Inter-model agreement analysis and calibration metrics (a minimal agreement sketch follows this list)
- Adversarial testing with deliberately flawed but well-written documents
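As an illustration of what inter-model agreement analysis could involve, the sketch below computes Cohen's kappa over per-document binary flags (did a model flag a given issue category or not?). The flag data are invented for the example.

```python
def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters (1 = flagged the issue, 0 = did not)."""
    n = len(a)
    assert n == len(b) and n > 0
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement for binary labels
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

# Hypothetical data: did each model flag "circular reasoning" in eight documents?
gpt5_flags   = [1, 1, 0, 1, 0, 1, 1, 0]
gemini_flags = [1, 0, 0, 1, 0, 1, 0, 0]
print(f"GPT-5 vs Gemini kappa: {cohen_kappa(gpt5_flags, gemini_flags):.2f}")  # ~0.53
```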
Comprehensive Validation
- Prospective randomized trial across research teams
- Outcome measures: design changes, acceptance rates, replication success
- Cost-benefit analysis including false positive/negative rates
- Cross-domain testing beyond AI research methodologies
Why Share This Incomplete Work?
Tool Development Value
- Demonstrates feasibility of automated methodology critique
- Shows how to integrate multiple AI models for research feedback
- Documents the technical implementation (GPT-5 `reasoning_effort`, API patterns)
- Provides working code others can extend
Meta-Research Learning
- External validation exposed our own validation failures
- Self-application revealed the tool's potential and limitations
- Honest negative results contribute to methodology literature
- Shows how sophisticated tools can still measure nothing meaningful
Try It Yourself (With Appropriate Skepticism)
Installation
pip install openai google-generativeai anthropic
# Set API keys as environment variables
Usage
python research-critic.py methodology.md --type methodology --reasoning-effort medium
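The command above implies a CLI roughly like the argparse sketch below; the defaults and help text are assumptions rather than the actual research-critic.py interface.

```python
import argparse

def parse_args() -> argparse.Namespace:
    """CLI sketch matching the usage line above."""
    parser = argparse.ArgumentParser(description="Multi-model critique of research documents")
    parser.add_argument("document", help="path to the document to critique, e.g. methodology.md")
    parser.add_argument("--type", default="methodology",
                        help="kind of critique prompts to apply")
    parser.add_argument("--reasoning-effort", default="medium",
                        choices=["low", "medium", "high"],
                        help="GPT-5 analysis depth")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Critiquing {args.document} (type={args.type}, effort={args.reasoning_effort})")
```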
What To Expect
- Harsh but potentially useful critique of research designs
- Technical suggestions for methodology improvements
- Identification of common research design pitfalls
- Unvalidated effectiveness - use it as one input among many
Conclusion: Useful Tool, Unvalidated Claims
This tool development succeeded: we now have an external methodology critique that proved useful for our team. However, our claims about its effectiveness are based on limited self-use rather than rigorous evaluation.
Bottom Line: We built something that works and seems helpful, but proper validation remains future work. Use with appropriate skepticism as one methodology review input, not a research quality oracle.
The Most Honest Thing We Can Say
Building better research tools is valuable even when the tools themselves need better validation. Sometimes the process teaches as much as the product. The research-critic system prevented us from continuing down flawed research paths, but whether it actually improved our research quality remains an open question requiring proper evaluation.
Resources and Next Steps
- Code: `research-critic.py` - GPT-5 enabled, full implementation
- Future Work: Blinded benchmark development, expert comparison studies
- Validation Roadmap: External team testing, outcome measurement, baseline comparisons