We Need to Have a Talk about Software Troubleshooting

So today, my Internet connection went out. The router got stuck overnight, and I rebooted it, and no big deal. Windows, however, still shows this, even though I have a perfectly fine network connection:

If you Google it, you’ll find lots of people have the problem, across multiple versions of Windows, going back years. The solutions vary from “just reboot” to complicated registry hacks to “reinstall Windows.” 🤦‍♂️

I can’t even.

I hear anecdotal stories about “weird problems” like the one above all the time: My friend’s father can’t get the printer to work without reinstalling the drivers every time he uses it. Your cousin’s word processor crashes every time she clicks the “Paste” button and there’s an image on the clipboard. My colleague’s video glitches, but only in a video call with more than three people. And invariably, the solutions are the same: Reinstall something. This one weird registry hack. Try my company’s cleaning software!

So I’d like to let all of the ordinary, average, nontechnical people in the room in on a little secret:

This is bullshit.

All of it is bullshit. Start to finish. Nearly every answer you hear about how to “fix” your bizarre issues is lies and garbage.

How do I know that? I’ve spent a lot of years designing and building software, and supporting it, and maintaining it. I’ve worked with product teams that usher new features out the door, and I’ve been the “engineer” that your customer-support person talks about who’s actually fixing “something” behind the scenes. I know the process for building software, and I know what it takes to fix it, too. Real fixes, in a debugger, with the source code and tracing and diagnostic and inspection tools at your disposal.

We’re conditioned to hear fixes in software like medicine: You have this unusual disease, but there’s a rare, experimental drug that targets it. This herbalist says you just aren’t taking enough vitamin supplements. Your humors are unbalanced. Or you need surgery to remove the tumor. And when it doesn’t work, medicine, they say, isn’t an exact science, as if that somehow makes the failure better.

But computer science is an exact science.

It’s exact the way math is exact. It’s not even exact the way a hard science like physics is exact: Computer science is math that moves. If you’re getting the wrong results out, then the wrong information went in, period! X plus two is somehow coming out as “5” when it should be “4”? Then X is three, not two. There’s no magical pill or one weird trick: It’s just right or wrong.

So let me share some rules in plain English for software troubleshooting:

It’s not broken because of something you did. It’s not your fault. Software companies have done a brilliant job at convincing the general public that it’s usually the customer’s fault, but it’s almost never the customer’s fault.
There’s always a precise explanation for why it’s broken. Always. The company that made the software might not be willing to share that explanation, or they might not know that explanation, but it exists, and it’s always definite and solid and precise: something like “this line of code should say ‘x’ instead of ‘y.’” The real explanation is never “the temp directory is full” or “the registry needs to be cleaned.” Computers are deterministic: The same inputs produce the same outputs, always, so if you’re getting the wrong outputs, you need to investigate the inputs, because one of them is wrong.
It’s always caused by something the programmer did. It’s not caused by “cosmic rays” or “some weird thing on your computer.” It’s not happening because you didn’t defragment your hard drive often enough. A real human being, often a single human being, who worked at the company that made the software, is responsible for it.
Real troubleshooting — debugging — involves finding the real explanation. If there’s no electricity in a room in your house because the wire got cut in half, you don’t feed your house “electricity pills.” You find the cut in the wire, and then you patch it or you replace it. Real software debugging is all about tracing through the code to find the one exact place where it’s wrong — about finding the cut in the wire and fixing it.
Real fixing involves changing the code. It’s the code, not the data or settings, that’s wrong. (If you’re doing “defensive programming” properly, your code should be able to handle any data or settings!) No “registry cleaning” or “one weird trick” or “just tweak this setting” actually fixes the real issue, any more than cough syrup or herbal supplements can cure cancer.
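
To make that last rule a little more concrete, here is a minimal Python sketch of what “defensive programming” might look like. Everything in it is hypothetical: the function names `parse_timeout_buggy` and `parse_timeout_defensive`, the `timeout` setting, and the default of 30 are invented for illustration and are not taken from any real product.

```python
def parse_timeout_buggy(settings: dict) -> int:
    # Hypothetical bug: assumes "timeout" is always present and always numeric.
    return int(settings["timeout"])


def parse_timeout_defensive(settings: dict, default: int = 30) -> int:
    """Return a usable timeout no matter what the settings contain."""
    raw = settings.get("timeout", default)
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Unreadable setting: fall back instead of crashing.
        return default
    # Reject nonsensical values (zero or negative) as well.
    return value if value > 0 else default


if __name__ == "__main__":
    bad_settings = {"timeout": "thirty"}

    # Deterministic: the same bad input fails the same way on every attempt.
    for attempt in range(3):
        try:
            parse_timeout_buggy(bad_settings)
        except ValueError as error:
            print(f"attempt {attempt}: buggy version failed: {error}")

    # The code changed, not the settings, and the problem is gone.
    print("defensive version:", parse_timeout_defensive(bad_settings))
```

The bad setting is the same in both calls. What changed is the code, which is the point of the rule: the buggy version fails identically every time you feed it that input, and it keeps failing until somebody edits the function.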

If you read through those rules, you might be tempted to ask, “Well, why is my big expensive piece of software broken then, if it’s always the code? Didn’t they fix the bugs in the code when they made it?”

No.

No they didn’t.

Every piece of software you use has bugs in its code. Some just has more bugs than others.

There are good reasons for why software has these bugs. Let’s go over some of the most common ones:

Programming is hard. It’s one of the most difficult things a human brain can do, like playing four-dimensional chess while being hung upside-down off the edge of a cliff. Most programmers are actually pretty bad at it. (And some are much, much worse than others.) And to top it off, there’s actually math that says it’s impossible to find all the bugs.
Fixing bugs is expensive. It takes time and money to find and fix each problem in the software, sometimes a lot of time and money. This means that some bugs will take a long time to be fixed, and some might never get fixed. If companies waited to sell the software until the last bug was fixed, they’d never be able to sell it at all.
Software companies prioritize. It’s not unusual at a large company to find a backlog with thousands of bugs in it, and if your problem is at the bottom of the list, well, sorry about that. Good companies (like Mozilla in the link I just gave) will at least admit how many bugs there are, and even share their priorities with you. (Bad companies, on the other hand, will insist that “It passed QA!” and “It’s bug-free!”)
Bug fixes don’t sell. This, really, is on all of us. We buy a piece of software because it has “27 new features!” — and not because it has 200 bugs fixed since the last version. If all of us insisted on higher-quality software, companies would put their time and money and energy there. Instead, we all ooh and aah over new and shiny features, so that’s what the companies spend their time making. And every new feature adds more bugs.
Often, nobody cares. Seriously. In a lot of companies, the employees are just playing leapfrog, trying to get to the next job up the chain. You climb the ladder by impressing people; in many corporate environments, people are only incentivized to add new “initiatives” and “features.” You don’t get promoted or get a raise for fixing bugs. In some companies, fixing bugs will even lose you credibility, because nobody there wants to believe the software is bad.

So given all that, when I see this, what do you think I see?

[Image: screenshot of the Windows network status] *Oh, Windows.*

I see a bug that Microsoft didn’t really care that much about: A corner case that probably came up in testing, but that was too hard to demonstrate and reproduce to be worth the effort to find and fix. It’s not really hurting anything — after all, the computer still has an Internet connection. There’s no doubt a bug ticket deep in a Microsoft database that says, “Investigate cause of network status mismatch,” but there are 10,000 higher-priority tickets in front of it.

There are very likely just three or four lines of code somewhere that aren’t quite right, but among the rest of the 17,000,000 lines of code in Windows, nobody’s going to find them without deliberately hunting for them.

Everything else people propose just addresses the symptoms, not the cause. Certain configurations or settings, or combinations of settings, may be more likely to trigger that broken piece of code than others, and answers like “Hack your registry!” aren’t really attacking the root cause — they’re just making the conditions for triggering the bug less likely.
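
Here is a small Python sketch of that difference between a workaround and a fix. It is purely hypothetical and is not how Windows actually decides what to show: the functions `shown_status_buggy` and `shown_status_fixed` and the `trust_last_status` setting are made up for illustration.

```python
from typing import Optional


def shown_status_buggy(probe_ok: Optional[bool], last_status: str,
                       trust_last_status: bool) -> str:
    # Hypothetical bug: a stale "no internet" wins over a fresh, successful probe.
    if trust_last_status and last_status == "no internet":
        return last_status
    return "online" if probe_ok else "no internet"


def shown_status_fixed(probe_ok: Optional[bool], last_status: str,
                       trust_last_status: bool) -> str:
    # Fix: only fall back on the old status when there is no fresh probe result.
    if probe_ok is None:
        return last_status if trust_last_status else "unknown"
    return "online" if probe_ok else "no internet"


if __name__ == "__main__":
    # The router is back and the probe succeeds, but the old status lingers.
    print("buggy, setting on: ", shown_status_buggy(True, "no internet", True))
    # The "tweak this setting" workaround: the bad branch is never reached.
    print("buggy, setting off:", shown_status_buggy(True, "no internet", False))
    # The real fix: correct regardless of the setting.
    print("fixed, setting on: ", shown_status_fixed(True, "no internet", True))
```

Turning the setting off makes the symptom go away, but the broken branch is still sitting there for the next person whose settings happen to reach it. Only the second function removes it.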

So now, if you’ve been with me this far, I want to change how you think — just a little tiny bit:

When you see your computer misbehaving, or a piece of software acting glitchy, I want you to shift the blame where it belongs: It’s not something you did. It’s not something mysteriously wrong with your computer. It’s a bug that the software company either didn’t find or didn’t fix. And ultimately, they’re the responsible party, not you. The only proper way to react to a bug is to ignore it, find a way to avoid it, or insist that the company fix it.

But whatever you do, don’t listen to the snake-oil salesmen who will sell you “one weird trick,” and don’t blame yourself — blame the programmer for making the bug, and blame his company for not spending the money to find and fix it.

One Response to We Need to Have a Talk about Software Troubleshooting

Thierry JAMET

May 12, 2022 at 3:17 am

hello
First : sorry for my bad English 🙂
I agree with you but it is still necessary to propose a solution to this bug of which I was also a victim.
To solve this:
1 – Go to “network card properties”
2 – Disable all Checksum Offload options:
IPv4 Checksum Offload
TCP Checksum Offload (IPv4)
TCP Checksum Offload (IPv6)
UDP Checksum Offload (IPv4)
UDP Checksum Offload (IPv6)
good continuation
Thierry
