206 changes: 206 additions & 0 deletions apps/web/content/articles/testing.mdx
@@ -0,0 +1,206 @@
---
meta_title: "Untitled"
author: "John Jeong"
featured: false
date: "2026-02-13"
---

# The $5,000 AI Coding Experiment: What 1,000 Devin Tasks Taught Us

Display title: Is Devin AI Worth It? We Spent $5,000 to Find Out

Meta description: Real results from spending $5,000 on Devin AI: how we automated migrations, enabled non-technical teams to ship code, and cut maintenance work in half.

---

In the last two months, my two-person team spent over $5,000 running roughly 1,000 tasks in [Devin](https://devin.ai/), the AI software engineer. This isn't vibe coding hype or spinning up 10 Claude Code instances to burn tokens as fast as possible. This is what actually happened when we integrated AI agents into our real-world workflow while building Hyprnote.

Here's what we learned.

## Not a Reader? Watch the Video Instead

<>[[a]](#cmnt1)

Malformed JSX syntax will cause MDX parsing to fail. The opening fragment `<>` has no closing tag `</>`, which will break page rendering in production.

Fix:

<>[[a]](#cmnt1)</>

Or remove the fragment entirely if not needed:

[[a]](#cmnt1)

Spotted by Graphite Agent


*Timestamps throughout this post link to specific examples in the video.*

## Running AI Agents Where Your Team Already Works

The single most powerful decision we made was running Devin inside Slack, not our IDE. Here's why this matters:

Slack is where discussions already happen. We're already getting alerts from Zendesk, Sentry, and Discord. Being able to launch an agent directly inside a thread where the problem is being discussed is incredibly valuable.

Real example: [[b]](#cmnt2)John, my co-founder, identified an issue with our AI prompts and tagged Devin to fix it. Devin fixed it, but the approach was suboptimal. Since AI prompting is something I work on, I jumped into the same thread with more context. Based on my additional input, Devin found a better approach and finished the PR, and we merged it.

This is collaborative debugging without context switching. No copying issues into a separate tool. No explaining the same problem twice.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=52s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=52s)

Another example: [[c]](#cmnt3)I tagged Devin about our 404 page not rendering properly. John, who works primarily on our website, pointed out reference files to look at in the same thread. Based on his input, we got a PR and merged it.

The agent isn't replacing us—it's joining the conversation where it's already happening.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=74s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=74s)

## AI Agents Enable Non-Technical Teams to Ship Code

Having an agent accessible from Slack opened up tasks that don't necessarily require technical skills. Examples include understanding what we're tracking in analytics and making small adjustments to better understand user behavior.

John attached some PostHog docs and asked questions about what we're tracking and what we should be tracking long-term. Devin made the changes. Now both John and I know we have analytics updates—super helpful for staying aligned.[[d]](#cmnt4)

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=108s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=108s)

Since we use GitHub as a CMS for Hyprnote, we can even update landing pages or blog content directly from Slack. John attached a PDF from an internal discussion, and Devin updated our docs based on the actual conversation we had.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=132s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=132s)

## Three Types of Tasks to Delegate to Devin AI

As a small early-stage startup, there's always a lot going on. We're handling day-to-day work while thinking about what's next—new features, product direction, how the codebase should evolve. That's why it's extremely helpful to dump all of this into an async coding agent and let it figure things out.

Here are three types of tasks that represent different degrees of relevance and urgency:

### Degree 1: Exploration (Not Shipping Anytime Soon)

This is work that isn't planned for the immediate future. We're not going to ship it or even merge it right now, but it's still valuable to explore so we can understand what the work would look like, how complex it is, and roughly how long it might take.

Example: Even though we're focusing on our macOS desktop app, we had ideas around building a Chrome extension. I asked Devin to research how to make a Chrome extension that works with a desktop app. We learned how 1Password does it and got a rough plan.[[e]](#cmnt5)

Then we cloned the repository of a popular Chrome extension framework. Based on the docs and actual code examples, we implemented it to see how it would look. We didn't even merge it, but it's still useful to see how it'll look in the future.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=177s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=177s)

### Degree 2: Preparation (Relevant, But Not Right Now)

This is work we'll likely merge, but I won't pull it into my IDE yet.

Example: Someone asked whether Hyprnote could import data from Apple Notes. That feels like a feature we could support in the future, but it's not a core focus at the moment.[[f]](#cmnt6)

We did research to see if there was any existing work on that. There was, so we cloned it, ported the test cases, and let Devin implement it. Tests passed, so we safely merged it for a future feature.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=227s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=227s)

### Degree 3: Production (Very Relevant Right Now)

This is work I'll definitely look at, but I'm spawning the agent right now because I'm in the middle of something and want to avoid context switching. Or maybe I'm traveling or about to go to sleep.

Example: We needed to update test cases around our Tinybase utils—very relevant and important work. We asked Devin to clone the repo, inspect the codebase, and write the test cases.[[g]](#cmnt7)

One interesting trick: we asked Devin to use the Claude CLI that we already installed on Devin's machine. This way we can offload some AI inference to our Anthropic account and use some credits.

Pro tip: We encoded this as an "offload agent" piece of knowledge that describes how to use Claude CLI. Mentioning "consult smart friend" (something Devin uses as a prompt internally) helps Claude CLI get called at the right time.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=263s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=263s)

## Good Documentation Enables AI Agents to Ship Code Faster

In Hyprnote, we focus on supporting multiple providers for language and speech model inference as part of our open-source effort. Early on, we spent time designing and documenting flexible, clean interfaces. This worked well for future contributions and community involvement.

It turns out these same choices are incredibly helpful when working with coding agents.
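To make "clean interfaces" concrete, here is a minimal TypeScript sketch of a provider registry. Every name in it (`TranscriptionProvider`, `registerProvider`, `getProvider`, the `echo` stub) is hypothetical and not taken from the Hyprnote codebase. It only illustrates the kind of narrow contract that lets an agent add a provider without touching anything else.

```typescript
// Hypothetical sketch: none of these names come from the Hyprnote codebase.

/** Contract every batch transcription provider must satisfy. */
interface TranscriptionProvider {
  /** Unique provider id, e.g. "elevenlabs". */
  readonly id: string;
  /** Language codes the provider accepts. */
  readonly languages: readonly string[];
  /** Transcribe a complete audio file and return the text. */
  transcribe(audio: Uint8Array, language: string): Promise<string>;
}

const providers = new Map<string, TranscriptionProvider>();

/** Register a provider, rejecting duplicate ids so mistakes surface early. */
function registerProvider(provider: TranscriptionProvider): void {
  if (providers.has(provider.id)) {
    throw new Error(`Provider already registered: ${provider.id}`);
  }
  providers.set(provider.id, provider);
}

/** Look up a provider and check that it supports the requested language. */
function getProvider(id: string, language: string): TranscriptionProvider {
  const provider = providers.get(id);
  if (!provider) throw new Error(`Unknown provider: ${id}`);
  if (!provider.languages.includes(language)) {
    throw new Error(`${id} does not support language: ${language}`);
  }
  return provider;
}

// Adding a provider is then one self-contained object (a stub for illustration).
registerProvider({
  id: "echo",
  languages: ["en"],
  async transcribe(audio) {
    return `received ${audio.length} bytes`;
  },
});
```

With a contract like this, "add provider X" becomes a task with a clear definition of done: implement the interface, register it, and make the existing tests pass.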

Example: ElevenLabs Support[[h]](#cmnt8)

We support both WebSocket-based real-time transcription and file upload-based batch transcription. We had a very detailed prompt on how models should be handled, how language should be handled, and other API references in the docs.

Since we have end-to-end testing support in place, we sent the ElevenLabs API key as credentials (this can be passed in the prompt or through Infisical CLI). With all the documentation, test cases, and API key in place, Devin implemented it in almost one shot, and we safely merged it.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=349s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=349s)

Example: Mistral Support[[i]](#cmnt9)

Same story for language models—even easier because there's no WebSocket involved. Since we have infrastructure to support any language provider, Mistral was supported in less than 10 minutes.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=392s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=392s)

Example: OpenAI Support[[j]](#cmnt10)

This one was a little harder. We had errors in the client, so we passed the error message and credentials. After a few minutes, since we had API keys and test cases in place, Devin figured out that OpenAI only supports a 24 kHz sample rate, which is why it was failing. We fixed it without investing any engineering time.
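A sample-rate mismatch like this is typically resolved by resampling audio before sending it. The sketch below uses plain linear interpolation. The function name `resampleLinear` and the 16 kHz capture rate are assumptions for illustration, not details from our codebase, and a production resampler would usually add a low-pass filter.

```typescript
// Hypothetical helper: the 16 kHz input rate is an assumption for illustration.

/**
 * Resample mono PCM samples with linear interpolation.
 * Simple and dependency-free, but without the anti-aliasing filter
 * a production resampler would apply.
 */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new Error("Sample rates must be positive");
  }
  if (input.length === 0 || fromRate === toRate) {
    return input.slice();
  }

  const outputLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outputLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outputLength; i++) {
    const position = i * step;
    const left = Math.floor(position);
    const right = Math.min(left + 1, input.length - 1);
    const fraction = position - left;
    // Blend the two neighbouring input samples.
    output[i] = input[left] * (1 - fraction) + input[right] * fraction;
  }
  return output;
}

// One second of (silent) 16 kHz audio converted to 24 kHz.
const captured = new Float32Array(16000);
const converted = resampleLinear(captured, 16000, 24000);
console.log(converted.length); // 24000
```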

The pattern is clear: good docs + clean interfaces + test infrastructure = AI agents that actually ship code.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=406s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=406s)

## Automating Code Maintenance with AI Agents

If you're a developer, you know that once a codebase reaches a certain size and age, maintenance work alone can take a lot of engineering time and slow the team down. With coding agents, we can offload a lot of that work.

### Single-Prompt Migrations[[k]](#cmnt11)

One common example is doing migrations that have clear documentation. In Hyprnote, we recently completed:

- AI SDK version 6 migration in a single prompt
- Tailwind v3 to v4 migration in a single prompt

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=447s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=447s)

### Concurrent Multi-PR Migrations[[l]](#cmnt12)

Things can get more complicated and may require multiple PRs or concurrent work.

A good example is applying Vercel's recent React best practices agent skills. We attached Vercel's React best practices document, and Devin figured out what changes should be made. Since much of the work could be isolated, we prompted Devin to split it across concurrent Devin sessions.

One way to do this is to ask Devin to make actual API calls. But there's a better way: use the analyze-session task. This lets you spawn multiple Devin sessions that run in parallel and generate a separate PR per task.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=463s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=463s)
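The fan-out idea can be sketched in TypeScript as follows. This is an illustration under assumptions, not how Devin works internally: `startSession` is a hypothetical callback standing in for however a session actually gets created, and the `Task` shape and prompt wording are invented for the example. The key point is checking that tasks are isolated (no shared files) before running them in parallel.

```typescript
// Hypothetical sketch: `startSession` stands in for however a session is
// actually created; it is not a real Devin API.

interface Task {
  /** Short label, reused as the PR scope. */
  title: string;
  /** Files this task may touch; tasks should not overlap. */
  files: string[];
}

type StartSession = (prompt: string) => Promise<string>;

/** Return pairs of task titles that touch the same file and so are not isolated. */
function findOverlaps(tasks: Task[]): Array<[string, string]> {
  const owner = new Map<string, string>();
  const overlaps: Array<[string, string]> = [];
  for (const task of tasks) {
    for (const file of task.files) {
      const previous = owner.get(file);
      if (previous !== undefined && previous !== task.title) {
        overlaps.push([previous, task.title]);
      } else {
        owner.set(file, task.title);
      }
    }
  }
  return overlaps;
}

/** Build one self-contained prompt per task so each session opens its own PR. */
function buildPrompt(task: Task, guideline: string): string {
  return [
    `Apply the following guideline: ${guideline}`,
    `Scope: ${task.title}`,
    `Only modify these files: ${task.files.join(", ")}`,
    "Open a separate pull request for this task.",
  ].join("\n");
}

/** Start every session at once and report which ones failed to start. */
async function runConcurrently(
  tasks: Task[],
  guideline: string,
  startSession: StartSession,
): Promise<{ started: string[]; failed: string[] }> {
  const overlaps = findOverlaps(tasks);
  if (overlaps.length > 0) {
    throw new Error(`Tasks are not isolated: ${JSON.stringify(overlaps)}`);
  }
  const results = await Promise.allSettled(
    tasks.map((task) => startSession(buildPrompt(task, guideline))),
  );
  const started: string[] = [];
  const failed: string[] = [];
  results.forEach((result, index) => {
    (result.status === "fulfilled" ? started : failed).push(tasks[index].title);
  });
  return { started, failed };
}
```

Using `Promise.allSettled` rather than `Promise.all` means one session failing to start does not hide the results of the others.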

### Daily Automated Linting[[m]](#cmnt13)

This is all very useful, but we're not doing migrations or receiving new guidelines every day. However, if you pair an agent with an automated linting tool, this approach can be applied daily.

In Hyprnote, we have a large Rust codebase, and since Cargo Clippy is pretty good, we set up a GitHub Action to run Cargo Clippy daily and spawn Devin to apply any fixes based on the output.

Because Clippy and `cargo check` take a long time to run, having an agent apply these guidelines and Clippy fixes asynchronously saves us a lot of time.

&rarr; Watch in video: [https://www.youtube.com/watch?v=UojsNSbhm6o&t=528s](https://www.youtube.com/watch?v=UojsNSbhm6o&t=528s)
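One way to turn linter output into an agent task is to parse the JSON lines that `cargo clippy --message-format=json` prints and render them as a prompt. The TypeScript sketch below is an illustration, not our actual GitHub Action: the function names and prompt wording are assumptions, and how the prompt is handed to Devin is left out.

```typescript
// Sketch under assumptions: function names and prompt wording are illustrative.
// The input is the JSON-lines output of `cargo clippy --message-format=json`.

interface ClippyWarning {
  lint: string;
  file: string;
  line: number;
  message: string;
}

/** Extract lint warnings from cargo's JSON-lines output, skipping everything else. */
function parseClippyOutput(output: string): ClippyWarning[] {
  const warnings: ClippyWarning[] = [];
  for (const line of output.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // ignore non-JSON lines

    let entry: any;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // tolerate truncated lines
    }
    if (entry.reason !== "compiler-message") continue;

    // Summary messages such as "1 warning emitted" carry no lint code.
    const diagnostic = entry.message;
    if (diagnostic?.level !== "warning" || !diagnostic.code) continue;

    // Prefer the primary span: it points at the code the lint is about.
    const spans: any[] = diagnostic.spans ?? [];
    const span = spans.find((s) => s.is_primary) ?? spans[0];
    if (!span) continue;

    warnings.push({
      lint: diagnostic.code.code,
      file: span.file_name,
      line: span.line_start,
      message: diagnostic.message,
    });
  }
  return warnings;
}

/** Group warnings by file and render them as a prompt for the agent. */
function buildFixPrompt(warnings: ClippyWarning[]): string {
  const byFile = new Map<string, ClippyWarning[]>();
  for (const warning of warnings) {
    const group = byFile.get(warning.file) ?? [];
    group.push(warning);
    byFile.set(warning.file, group);
  }

  const sections = Array.from(byFile.entries()).map(([file, items]) => {
    const lines = items.map((w) => `- line ${w.line}: ${w.message} (${w.lint})`);
    return `${file}\n${lines.join("\n")}`;
  });
  return `Fix these Clippy warnings and open a PR:\n\n${sections.join("\n\n")}`;
}
```

Grouping by file keeps the prompt compact and makes it easy to skip the run entirely when `parseClippyOutput` returns an empty list.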

## Final Verdict: Is Devin AI Worth It?

If you're expecting AI agents to replace developers, you'll be disappointed. We're not there yet.

But if you're looking to meaningfully extend what a small team can accomplish? Absolutely worth it.

Devin AI is worth the investment when you:

- Have well-documented code with clean interfaces and test coverage
- Need to explore features before committing engineering time
- Want to offload maintenance work (migrations, linting, updates)
- Have non-technical team members who need to ship small changes
- Run concurrent work that would otherwise bottleneck your team

Devin AI is NOT worth it if you:

- Have poorly documented, tightly coupled code
- Expect it to understand context without clear instructions
- Want it to make architectural decisions
- Need it to work in complete isolation without human oversight

The key insight after 1,000 tasks: You're not buying code generation. You're buying collaboration at scale.

The best ROI came from tasks we could delegate async (exploration work at 2 AM, maintenance work during travel, migrations while focusing on core features). The agent didn't replace our judgment; it multiplied our capacity to act on it.

Our recommendation: Start with one well-defined use case (like automated linting or simple migrations), measure the time saved, then expand. Don't try to use it for everything on day one.

[[a]](#cmnt_ref1)embed this video here: [https://www.youtube.com/watch?v=UojsNSbhm6o](https://www.youtube.com/watch?v=UojsNSbhm6o)

[[b]](#cmnt_ref2)screenshot 1

[[c]](#cmnt_ref3)screenshot 2

[[d]](#cmnt_ref4)screenshot 3

[[e]](#cmnt_ref5)screenshot 4

[[f]](#cmnt_ref6)screenshot 5

[[g]](#cmnt_ref7)screenshot 6

[[h]](#cmnt_ref8)screenshot 7

[[i]](#cmnt_ref9)screenshot 8

[[j]](#cmnt_ref10)screenshot 9

[[k]](#cmnt_ref11)screenshot 10

[[l]](#cmnt_ref12)screenshot 11

[[m]](#cmnt_ref13)screenshot 12