Your Editing Pipeline Has a Blind Spot. It's Bigger Than You Think.

I was scrolling through a Facebook thread the other day. Indie authors sharing their AI editing setups. Custom skills, multi-pass pipelines, track changes, agents that run your manuscript through three or four rounds of analysis. Everyone contributing their workflow. Everyone confident.

And nobody... not one person in a thread with dozens of replies... was asking about accuracy.

I get it. I really do. I spent months building my own pipeline. The output reads well. It sounds smart. It catches things you missed on your fifth read-through. When the feedback comes back and it mentions your POV shift in chapter seven or your pacing drag in act two, it FEELS right. You nod. You fix it. You move on.

But "feels right" and "is accurate" are two very different things. And the gap between them is where I want to spend some time today, because I went looking for it in my own tool and what I found was... not great.

The setup that feels like it's working

If you're building your own editing pipeline (and a LOT of writers are right now), it probably looks something like this: you've got a system prompt with craft instructions, maybe a custom skill or two, maybe an agent that runs multiple passes. You feed it the whole manuscript at once or a chapter at a time. It comes back with feedback, maybe tracked changes, maybe a report.
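Stripped to its bones, that setup is a loop and a prompt. Here's a rough sketch of the shape of it... every name in it (run_model, CRAFT_PROMPT, the list of passes) is a placeholder for whatever API and instructions you're actually using, not anyone's real code:

```python
# A hypothetical skeleton of the typical DIY pipeline. run_model() stands in
# for whatever LLM API you call; CRAFT_PROMPT is your craft-instruction
# system prompt. Neither is a real library... they're placeholders.

CRAFT_PROMPT = "You are a developmental editor. Analyze this chapter for POV, pacing, show vs. tell..."
PASSES = ["line-level prose", "scene structure", "pacing and tension"]

def run_model(system_prompt: str, chapter: str, focus: str) -> str:
    """Placeholder for your actual model call (Claude, GPT, a local model, whatever)."""
    raise NotImplementedError

def analyze(chapters: list[str]) -> list[dict]:
    reports = []
    for i, chapter in enumerate(chapters, start=1):
        feedback = [run_model(CRAFT_PROMPT, chapter, focus) for focus in PASSES]
        reports.append({"chapter": i, "feedback": feedback})
    return reports
```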

And the output is genuinely impressive. It references show vs. tell. It talks about scene structure. It uses the right vocabulary. If you've read any craft books, the feedback sounds like someone who's read them too.

So you trust it. Why wouldn't you?

Because the failure modes aren't in the parts you can see. They're in the parts you'd have to go looking for. And almost nobody goes looking.

It's being nice to you

LLMs are trained to be helpful. Helpful, in practice, means encouraging. There's a name for this in the research... sycophancy. The model defaults to telling you what you want to hear.

How bad is it? A 2024 study (CoBBLer, ACL Findings) found that 40% of pairwise comparisons showed bias indicators, and the models' agreement with actual human quality judgments landed at 44%. Basically a coin flip.

But the really ugly number comes from a separate study on LLM evaluation consistency: the True Positive Rate (correctly confirming quality) was above 96%. The True Negative Rate (correctly identifying PROBLEMS) was below 25%.

Read that again. Your pipeline is great at telling you when something works. It's terrible at telling you when it doesn't.

So when your feedback comes back mostly positive with a few gentle suggestions... that's not necessarily because your manuscript is strong. It might be because the model is constitutionally incapable of delivering bad news with the same conviction it delivers good news.

Your middle chapters are getting worse analysis

There's a well-documented phenomenon called "lost in the middle." Researchers at Stanford (Liu et al., 2023) tested how well models handle information at different positions in a long input. The finding: accuracy drops by MORE THAN 30% when the relevant information sits in the middle of the context rather than at the beginning or end.

It's a U-shaped curve. The model pays the most attention to what it sees first and what it sees last. Everything in between gets compressed.

Think about what that means for manuscript analysis. Your opening chapters and your closing chapters get the sharpest read. Chapters twelve through twenty? The model is squinting.

And you'll never notice, because the feedback for those middle chapters still SOUNDS confident. The model doesn't know its own attention is degraded. It just... quietly gets less accurate.

When it quotes your prose, it's not quoting your prose

This one made me angry when I found it, because it's invisible unless you go line by line.

When an LLM "quotes" your text back to you in its feedback, it's not copying and pasting. It's RECONSTRUCTING the quote from a compressed internal representation. And the reconstruction isn't exact.

I tested this in my own system. The paraphrase rate was around 30%. Three out of every ten "verbatim" quotes had been subtly altered. Pronoun swaps. Synonym substitutions. Punctuation changes. Sometimes the meaning shifted. Sometimes entire phrases were fabricated and presented in quotation marks as if they were yours.

The research confirms this is architectural (a 2024 study on verbatim memorization found that over 50% of memorized tokens are produced using general computation, not specialized copy mechanisms). The model doesn't have a "copy" function. It has a "reconstruct something close enough" function.

So when your feedback says "In the line 'She turned and walked away,' you're telling rather than showing"... that might not be your line. You might have written "She turned. Walked out." The feedback is evaluating a sentence you never wrote.

And you're nodding along because it's close enough that you don't catch it.

Self-checking doesn't work

This is the one that killed my confidence in the common "multi-pass" approach.

A lot of writers (and a lot of pipeline builders) add a verification step. The model analyzes your chapter, then you ask it to review its own findings. "Check if these are accurate." "Verify these citations." Seems reasonable.

Huang et al. published a paper at ICLR 2024 titled "Large Language Models Cannot Self-Correct Reasoning Yet." The core finding: without external feedback, self-correction actually DEGRADES performance. The model talks itself out of correct answers and talks itself INTO incorrect ones. With the same confidence either way.

Your validation pass isn't catching errors. It's laundering them.

(I tested this in my own tool. Asked the model to verify its own citations against the source text. It confirmed fabricated quotes as accurate. Repeatedly.)

"Just use a better prompt"

I hear this constantly. Someone points out an accuracy issue with LLM feedback and the response is always the same. Tune your prompt. Add more constraints. Be more specific. Tell it to be critical. Tell it not to paraphrase.

I've written hundreds of prompts at this point. Spent months refining them. And yes, prompting matters. A good prompt is better than a bad one. Obviously.

But prompting doesn't fix architectural problems. You can't prompt away the attention curve. You can't prompt away the reconstruction behavior that causes paraphrasing. You can't prompt away sycophancy that's baked into the training process itself. You can't prompt the model into successfully self-checking when the research says self-checking degrades accuracy.

The architecture AROUND the prompt matters more than the prompt itself. How you chunk the input. How you verify the output. How you handle citations. Whether your validation is adversarial or confirmatory. These are engineering decisions, not prompting decisions. And they're the decisions almost nobody in that Facebook thread was thinking about.
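To make just one of those concrete, here's roughly what I mean by deterministic chunking code. The chapter-heading pattern and the word cap are illustrative assumptions about a plain-text manuscript, not FirstReader's actual values:

```python
import re

# A sketch of deterministic input chunking: split on chapter headings so each
# model call sees one chapter, and cap the size so nothing important sits deep
# in the middle of a huge context. The heading regex and the 8,000-word cap
# are illustrative assumptions.

CHAPTER_HEADING = re.compile(r"^\s*(chapter\s+\S+|\d+)\s*$", re.IGNORECASE | re.MULTILINE)
MAX_WORDS = 8_000

def split_into_chapters(manuscript: str) -> list[str]:
    starts = [m.start() for m in CHAPTER_HEADING.finditer(manuscript)]
    if not starts:
        return [manuscript]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any front matter before the first heading
    starts.append(len(manuscript))
    return [manuscript[a:b].strip() for a, b in zip(starts, starts[1:]) if manuscript[a:b].strip()]

def cap_chunk_size(chapters: list[str], max_words: int = MAX_WORDS) -> list[str]:
    chunks = []
    for chapter in chapters:
        words = chapter.split()
        if len(words) <= max_words:
            chunks.append(chapter)
        else:
            # Oversized chapter: split it rather than let its middle get the
            # degraded attention described above.
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```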

I haven't solved all of it

Listen, I'm not writing this from some perch where I've got it all figured out. I built FirstReader specifically to do manuscript craft analysis, and I've been fighting these problems for months.

Some of them I've mapped and built around. The paraphrasing problem? I use a citation system that pulls verbatim text directly from your manuscript instead of letting the model reconstruct quotes from memory. That dropped the fabrication rate to near zero. The self-validation failure? I replaced self-review with adversarial verification... a separate analysis pass that tries to DISPROVE findings rather than confirm them, because the research shows adversarial framing doesn't degrade accuracy the way confirmatory framing does. The lost-in-the-middle problem? Chapter-level analysis with hierarchical aggregation, so no single pass is trying to hold the whole manuscript in view.
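If "pulls verbatim text directly from your manuscript" sounds abstract, here's the shape of the idea in a simplified sketch... not FirstReader's actual implementation. The model returns character offsets for each finding, and deterministic code does the quoting:

```python
from dataclasses import dataclass

# A simplified sketch of verbatim citation (not FirstReader's actual code).
# The model is asked to return character offsets for each finding; the quote
# that reaches the writer is always sliced from the manuscript itself, never
# taken from the model's output. A real system also has to validate that the
# offsets are sane; that part is omitted here.

@dataclass
class Finding:
    note: str           # the craft observation
    start: int          # character offsets the model was asked to return
    end: int
    claimed_quote: str  # what the model *says* the passage reads

def cite_verbatim(manuscript: str, findings: list[Finding]) -> list[dict]:
    cited = []
    for f in findings:
        actual = manuscript[f.start:f.end]
        cited.append({
            "note": f.note,
            "quote": actual,                           # always the source text
            "paraphrased": actual != f.claimed_quote,  # how you catch that ~30% drift
        })
    return cited
```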

Some of them I'm still working on. Sycophancy is a beast. I use binary detection rubrics (research shows binary "present or not" hits 76% accuracy vs. 57% for severity scales) and self-consistency checks (running the same analysis multiple times and only keeping findings that show up consistently). It helps. It doesn't fully solve it.
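The self-consistency piece is the easiest to sketch: run the same analysis several times, reduce each finding to a comparable key, keep only what keeps showing up. This is a toy version... the normalize() step is deliberately crude, and deciding when two differently worded findings are "the same finding" is the genuinely hard part:

```python
from collections import Counter

# A rough sketch of self-consistency filtering: run the same analysis N times,
# then keep only findings that appear in at least min_runs of those runs.
# normalize() is deliberately crude; matching findings that are worded
# differently across runs is the hard part, and this doesn't solve it.

def normalize(finding: str) -> str:
    return " ".join(finding.lower().split())

def consistent_findings(runs: list[list[str]], min_runs: int = 3) -> list[str]:
    counts: Counter[str] = Counter()
    original: dict[str, str] = {}
    for run in runs:
        seen_this_run = set()
        for finding in run:
            key = normalize(finding)
            original.setdefault(key, finding)
            if key not in seen_this_run:
                counts[key] += 1       # count each finding once per run
                seen_this_run.add(key)
    return [original[key] for key, n in counts.items() if n >= min_runs]
```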

And that honesty is the point. If someone tells you they've got a prompt that fixes all of this... they haven't checked.

Why I built a tool instead of sharing prompts

That Facebook thread reminded me why I stopped trying to package this as a workflow other writers could copy.

A prompt can't carry the architecture. You can share a really good system prompt for manuscript analysis, and someone will use it, and the output will SOUND great, and they'll trust it. And 30% of the quotes will be paraphrased, and the middle chapters will get less attention, and the self-check pass will confirm the errors, and nobody will ever know.

Because nobody checks.

And the LLM can't do it alone. That's the part that gets lost in these conversations. The model is one piece. Around it, you need deterministic code... validation logic, citation verification, input chunking, output filtering, scoring calibration. Code that doesn't hallucinate, doesn't get sycophantic, doesn't lose focus in the middle. I've written somewhere in the neighborhood of 30,000 lines of code around the LLM calls, and I am STILL iterating. Still refining. Still finding edge cases. That's not a prompt you can paste into Claude and get the same result. That's a product.

And that product is good. Very good, if you believe the beta testers who've had their manuscripts analyzed. But it's not perfect. It's still at about 95%. Which means one in twenty findings might miss the mark. I know that because I check. Constantly. And that 5% is what keeps me up at night, not the 95% that works.

If you're building your own pipeline and it's working for you, genuinely, keep going. But check the receipts. Pull five quotes from your last feedback session and verify them against your actual manuscript word for word. You might be surprised.
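If you want to make that check less tedious, a few lines of Python will do it. The file path and the quotes are placeholders for your own; the normalization is only there so curly quotes and line breaks don't get counted as paraphrases:

```python
from pathlib import Path

# Checking the receipts: paste the "verbatim" quotes from your last feedback
# session into QUOTES and point MANUSCRIPT_PATH at your manuscript as plain
# text. Both values are placeholders for your own files.

MANUSCRIPT_PATH = Path("my_manuscript.txt")
QUOTES = [
    "She turned and walked away.",
    # ...the rest of the quotes your pipeline presented as yours
]

def normalize(text: str) -> str:
    # Smooth over curly quotes and whitespace so pure formatting differences
    # don't get flagged as paraphrases.
    table = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    return " ".join(text.translate(table).split())

manuscript = normalize(MANUSCRIPT_PATH.read_text(encoding="utf-8"))

for quote in QUOTES:
    found = normalize(quote) in manuscript
    print(("VERBATIM   " if found else "NOT FOUND  ") + quote)
```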

And if you'd rather have someone else worry about the plumbing... that's what I built FirstReader for.

If you enjoyed this, please leave a comment below to let me know. If you DIDN'T enjoy it, well, I'd like to hear from you too.

Get chapter-by-chapter craft feedback on your manuscript

319 principles from Browne & King, McKee, Swain, and more. Every finding traced to its source. Launching soon.

Questions? Comments? hello@firstreader.app