You Have to Break It
Over time, many applications begin to accrue seemingly dead cruft and detritus that people are afraid to touch for fear of breaking the program. I’d suggest that this cruft is among the most insidious forms of technical debt because it’s held in place by a lack of understanding and a fear of modifying the program, both rooted in superstition rather than reason. You may break the program in the process, but you have to remove it, for your own sanity and for the good of the application.
A few weeks ago, a service in our infrastructure for which I’m largely responsible began to exhibit unusual behavior. CPU usage in one of the threads would spike to 100% seemingly at random, occasionally subside, but most often remain, leaving behind a troubling Munin graph that taunted me with each spike. Some searching led me to explore two red herrings which masked the ultimate cause for a few days.
While on vacation, I spent a significant amount of my time with a node in our production cluster attempting to understand this problem (which only presented in production above a certain level of concurrency, and was not replicable with our test client even at loads 10x as high). Racking my brain at 2am and 6am, I littered the code with debug logging, senseless exception handling (even though no exception was ever thrown), and a few other behavioral patterns consistent with confused, worried thrashing.
Eventually, I discovered a race condition that allowed an OP_WRITE interest to be set on a socket but never cleared. As a result, a normally blocking Selector.select() call inside a while(true) loop returned immediately rather than waiting politely for a network IO event to occur, which ultimately manifested as a spinning thread causing 100% CPU usage. Thankfully, it did not affect the service’s ability to operate. After making the change to resolve this (it was literally just cutting a line and pasting it two lines below its original location), the service performed admirably and I enjoyed the rest of my Labor Day weekend in San Francisco.
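For anyone who hasn’t run into this failure mode, here is a minimal sketch of it in Java NIO. It is not the service’s code: the class name WriteInterestSpin and the helper readySelects are made up for illustration, and a Pipe stands in for the real socket, on the assumption that an idle, writable channel behaves the same way for this purpose. The point is only that a channel with room in its send buffer is almost always writable, so a write interest that is never cleared turns a blocking select() into one that returns right away.

```java
import java.io.IOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical demo class; not taken from the service described in this post.
class WriteInterestSpin {

    // Runs `iterations` select() calls against a single idle, writable channel
    // and counts how many of them reported a ready key.
    static int readySelects(boolean clearWriteInterest, int iterations) throws IOException {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open(); // stand-in for a connected socket (assumption)
            try (Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                sink.configureBlocking(false);
                SelectionKey key = sink.register(selector, SelectionKey.OP_WRITE);

                if (clearWriteInterest) {
                    // What should happen once there is nothing left to write.
                    key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
                }

                int ready = 0;
                for (int i = 0; i < iterations; i++) {
                    // With OP_WRITE still set this returns at once; otherwise it
                    // waits out the 50 ms timeout because nothing is of interest.
                    if (selector.select(50) > 0) {
                        ready++;
                    }
                    selector.selectedKeys().clear();
                }
                return ready;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("interest left set: " + readySelects(false, 5));
        System.out.println("interest cleared:  " + readySelects(true, 5));
    }
}
```

With the interest left set, every pass through the loop should find the key ready without waiting, which is the spin. With the interest cleared, each select() should sit out its timeout and report nothing. Swap the timed select(50) for a plain select() inside while(true) and the first case is a thread pinned at 100% CPU.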
You Have to Break It.
Upon returning to the office and offering up one heck of a mea culpa to my very patient, helpful, and supportive coworkers, I confronted a surprisingly strong post-traumatic temptation to leave the unnecessarily verbose logging, senseless exception handling, and other thrashing-induced machinations in the program for fear of touching any of it. Reason departed, and I was left with fear of changing a program that worked, despite being filled with superstitious cruft.
Programming is something of a science in that our systems are built on top of principles, contracts, and APIs (mathematics, POSIX, a language like Java). By trading what I knew each of these layers guaranteed for baseless speculation and fear, I allowed myself to become paralyzed. Superstition is not terribly useful in programming – ask anyone who’s spent more than a day attempting to track down a memory leak in a garbage-collected language.
The task was not to “figure out why it’s acting strangely,” so much as it was to discover which rule of the language was triggering the undesirable behavior. Departing from a mindset of fear, misunderstanding, and compiler-blaming empowers one to remember that languages and computer programs have rules, and that they are very good at following them. And of course in the end, my program was behaving exactly as I’d asked it to. As reason and my understanding of its flow returned, I gained control over it once again.
If you are afraid of your program, you have to break it until you are not afraid of it anymore.
[ If you’re interested in working with a strong, supportive team, hop into #urbanairship on Freenode. We’d love to chat. ]