Why Your Computer Crashes
By Amir Majidimehr
Ever wonder why your computer “crashes?” What is a crash anyway? To
understand that, we need to first step back and understand the
architecture of our PC.
When you turn on your machine, the hardware automatically executes
one program: we call this the operating system or, among the people who
write the code for it, the “kernel.” As the name implies, the kernel
is the core of your machine. It sits between the hardware and the
programs that run on top of it. Examples are Windows, MacOS, Linux, and iOS
(Apple’s mobile operating system).
The kernel’s job is to provide an environment in which application
programs can run and, along with that, to hide the complexity and
differences of the hardware below it. For example, when you ask
your word processor to open a document, the exact same code in it
opens the file whether it is stored on the hard disk or a flash thumb
drive. These are very different hardware devices, yet from an
application’s point of view, or in how you might use them to browse
the files stored on them, they appear identical. This sharply reduces the work
for application developers and your own effort to manage your files.
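To make the idea concrete, here is a minimal Python sketch. The names read_document and demo, and the file name letter.txt, are made up for illustration. The only point is that the application calls the same open routine every time and leaves it to the operating system to work out which device holds the file.

```python
import os
import tempfile


def read_document(path):
    """Read a text file. Nothing here depends on the kind of device behind the path."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def demo():
    # A temporary folder stands in for a real disk so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "letter.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("Dear reader")
        return read_document(path)


if __name__ == "__main__":
    print(demo())
```

Passing read_document a path on a thumb drive instead of the hard disk would require no change to the function itself.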
So at a high level you have three pieces stacked on top of each other.
The hardware is at the bottom. The kernel sits on top of that. And
all the applications run above the kernel. There is one piece of
hardware and one kernel but many applications. By the way, your
desktop, the thing that shows all your files and such, is also an
application, albeit one that ships with the operating system and
always runs.
The next important concept is to realize that there is no way to
write perfect software that is of any complexity. The permutations
in any computer program are infinite in scope so there is no way
that all the possible paths can be verified to be correct before
software is released. Further, software may access other components
in the operating system or elsewhere which may have flaws or “bugs”
as we call them. This is an aptly named problem as anyone who has
tried to chase bugs to kill them knows that you can get most of
them, but invariably a few get away. :)
To make you feel even better, modern audio/video electronics
have also gotten so complex that many devices such as TVs, Blu-ray
players, and cable and satellite set-top boxes run an operating system
(usually a variant of Linux). So don’t be surprised if those devices
also crash like your computer can!
On top of the software bugs, we also have to deal with hardware that can
have a faulty design or faulty software embedded in it (called “firmware”).
Hardware can also flat out break; something that thankfully our software
doesn’t do.
A hard disk that fails may stop responding all of a sudden in which
case your program which is trying to save its file to it hangs
indefinitely. Or it may corrupt the data it is told to write to its media and
keep going as if nothing has happened. This doesn’t happen often but
can. And when it does, figuring out that it occurred can be
incredibly tough if not impossible. But again, this is not a common
occurrence so don’t lose sleep over it.
Failures then can occur up and down the “stack” of hardware, kernel
and applications. The failure manifests itself very differently
however depending on where it exists.
Let’s start with the easy part and look at what happens when the
problem is in the applications. As an example, assume we have a
program that expects a number from 1 to 9 to be input to it and you
instead put in a name. The program attempts to use that string of
characters as a number and things go bad from there on. One of two
situations manifests itself at this point:
- Your program keeps going but does the wrong thing (including
hanging, which means chasing its tail forever and not responding to you).
It is your job then to realize something has gone wrong and not
trust the output of the application.
The important thing here is that nothing crashes and the system keeps going.
- The program crashes (is removed from the system) with the
operating system putting up a notice. We call this an “exception” or
“fault.”
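The two outcomes can be imitated in a few lines of Python. This is only a loose analogy under stated assumptions: the function names lenient_parse, strict_parse and run are invented for this sketch, and the error here is raised by the Python language rather than by the CPU and operating system as in a real crash.

```python
def lenient_parse(text):
    """Buggy on purpose: assumes the first character is a digit and never checks."""
    return ord(text[0]) - ord("0")


def strict_parse(text):
    """Refuses anything that is not a whole number from 1 to 9."""
    value = int(text)  # raises ValueError for input such as a name
    if not 1 <= value <= 9:
        raise ValueError(f"expected 1 to 9, got {value}")
    return value


def run(parser, text):
    """Report how the program ended instead of letting the error escape."""
    try:
        return ("kept going", parser(text))
    except ValueError as problem:
        return ("stopped", str(problem))


if __name__ == "__main__":
    print(run(lenient_parse, "Amir"))  # keeps going with a meaningless number
    print(run(strict_parse, "Amir"))   # stops and says why
```

The lenient version is the first failure mode: no crash, but an answer nobody should trust. The strict version is closer to the second: the program is stopped at the point where things went wrong.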
Understanding the second failure mode requires a deeper dive into the
system architecture. Your application runs in a box created for it
by the hardware (CPU) and the operating system. It is given a
private space to execute its code. Access to anything outside of
this area is forbidden so as to provide protection against one program
snooping on another or on the operating system, or corrupting them
due to its own bugs.
The engine that does any work in your computer is the Central
Processing Unit or CPU. The CPU runs both application code and that
of the operating system. In the case of the errant program above, the
CPU happily executes what it is told in the form of code in that
program. During this operation, however, it is always checking to see
if the application is doing something it should not be doing, such as
going outside its bounds. Should it attempt to do so, the CPU halts
execution of your application at that precise moment, and calls
special code in the operating system to complain. That code then
verifies what has occurred, and pops up the crash message saying the
application has done something wrong and it is being terminated.
See some examples for MacOS and Windows to the right.
So let’s review again. Your program is running at full speed at
potentially billions of instructions per second. But on every
instruction a check is made to make sure it is not attempting, on
purpose or accidentally, to access anything that is not its own. The
latter is the key here: when a program has bugs, sooner or later it
starts to execute random or incorrect instructions. That code
invariably generates requests to data that is outside of its bounds
(or “illegal” such as attempting to write on top of its own code).
The CPU stops on that precise instruction and reports to the kernel
that something has gone wrong, resulting in the crash message
displayed by the operating system with the program in question
named.
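The check-on-every-access idea can be modeled with a toy simulation in Python. Everything in it is an assumption made for illustration: ToyCPU, ProtectionFault and run_program are invented names, and a real CPU does this checking in hardware, not in code like this.

```python
class ProtectionFault(Exception):
    """Raised by the toy CPU when an access falls outside the program's box."""


class ToyCPU:
    def __init__(self, memory_size):
        self.memory = [0] * memory_size

    def write(self, program, address, value):
        # The check happens on every single access, before any data is touched.
        if not program["start"] <= address < program["end"]:
            raise ProtectionFault(f"{program['name']} touched address {address}")
        self.memory[address] = value


def run_program(cpu, program, writes):
    """Toy kernel: run the writes; on a fault, terminate the program and report it."""
    try:
        for address, value in writes:
            cpu.write(program, address, value)
    except ProtectionFault as fault:
        return f"terminated: {fault}"
    return "finished normally"


if __name__ == "__main__":
    cpu = ToyCPU(100)
    editor = {"name": "editor", "start": 10, "end": 20}
    print(run_program(cpu, editor, [(11, 5), (42, 9), (12, 7)]))
```

In the example run, the write to address 11 succeeds, the write to address 42 is refused, and the write to address 12 never happens because the program was stopped on the offending step.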
Now here is the good news. Application programs are partitioned
enough that they cannot take the computer down with them when they
crash (there are some notable exceptions to this but for now, let’s
go with this simplification). So in essence then, your computer
cannot crash because a program has done something wrong. So don’t go
reinstalling your program hoping it will fix a system crash. Likely it
will not.
Now let’s take what we just learned and apply it to a situation
where the system does actually crash. Even though the kernel is
“king” so to speak and has lots of power in your system, it also
lets the CPU monitor its behavior just as it does for applications.
As with user applications, the kernel has its own boundaries of
where its code and data exist and it allows the CPU to warn it if
its own code attempts to access what it should not.
Now imagine an errant piece of code in the operating system that
gets triggered because you did something unusual. Let’s say it is
plugging a device into the USB jack of your computer, and that device
has a faulty “driver” (a piece of kernel code that interfaces with that
piece of hardware). As soon as you plug in the cable, the bug gets
triggered. Let’s say that causes an incorrect access to occur to a
location outside of the kernel code. The CPU dutifully catches that
event and reports it to the same piece of code it used when an
application crashed.
The behavior is radically different now. The operating system
examines the nature of this “fault” and realizes it is its own code
that was the source of the problem. Fearing that continuing to run
may lead to more drastic failures such as corrupting user data, and
importantly, losing track of what has gone wrong, it attempts to
commit suicide by popping up the message that every user in the
world hates: the system has crashed. In Windows, this is the Blue
Screen of Death which is often abbreviated to BSOD. A sample is on the right.
MacOS also has a crash message, contrary to the popular belief that
none exists, as seen below.

What happens next is that the kernel will attempt to take a snapshot
of critical memory data so that it can be analyzed later to
potentially find the cause of the crash. I say potentially because
while the failure endpoint is known, what got us there may be
totally obscured. An operating system bug may corrupt some data that
is not used hours or even days later leading to the visible crash.
The snapshot of the system at crash point then has little useful
information as to why we got there as so much has happened since.
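The different treatment of application faults and kernel faults can be summed up in a short Python sketch. It is a simplification under stated assumptions, not how any real operating system is written: handle_fault and the dictionary keys source, owner and crash_dump are hypothetical names chosen for this example.

```python
def handle_fault(fault, running_programs, memory_snapshot):
    """Toy fault handler. `fault` is a dict with 'source' and 'owner' keys (hypothetical)."""
    if fault["source"] == "application":
        # Only the offending program is removed; everything else keeps running.
        running_programs.remove(fault["owner"])
        return {"action": "terminate application", "system_running": True}
    # The fault came from kernel code (for example a driver): save evidence, then stop.
    return {
        "action": "halt system",
        "system_running": False,
        "crash_dump": dict(memory_snapshot),  # copy taken at the moment of the crash
    }


if __name__ == "__main__":
    programs = ["editor", "browser"]
    print(handle_fault({"source": "application", "owner": "editor"}, programs, {}))
    print(handle_fault({"source": "kernel", "owner": "usb driver"}, programs, {"last_address": 42}))
```

The crash dump in the sketch only records the state at the moment of failure, which is exactly why it may say little about a corruption that happened hours earlier.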
Operating system companies like Microsoft collect crash data (for
both applications and the kernel) and work on resolving them based
on frequency of occurrence. So be sure to give consent to have the
computer upload such information to them after you have restarted
your computer. Additional crash “dumps” also help the engineers
triangulate the problem better, resulting in higher odds that a
solution is found.
Having spent years tracing through crash dumps to find and fix
operating system bugs, I can speak firsthand to the difficulty of
the detective work required to trace the problem back to
its root cause. Some bugs literally took months of intense code
review and crash analysis to unravel. So don’t be surprised if there
is no quick resolution to your problem from the system provider for
these crashes.
As an end user, you can also attempt to troubleshoot what may have
caused the system to crash. That goes beyond the scope of this
introductory article but know that there is a bit of self-help
available. Suffice it to say, you may be able to find out if it was
indeed a broken device or its driver, say for a printer, which caused it.
There is a common myth that your computer crashes because it runs
out of memory. That just doesn’t happen! It almost doesn’t matter
how much memory your computer has; you cannot exhaust it. No, you
read that right. There is no relationship between the two. I can
have a computer with two Gigabytes of memory and run eight Gigabytes
worth of programs and nothing will crash!
The reason is that the operating system uses the hard disk as
an extension of system memory. So as long as you have hard disk
space, you can keep running programs. And since the hard disk is much
larger than your computer’s memory, you essentially have unlimited
ability to use more memory by running as many applications as you
like. Now, if you reach the limit of free hard disk space, the
operating system will complain, but usually in the form of not
wanting to run more programs or existing programs crashing as they
fail to get space to store their data. But the operating system will
almost invariably stay operational. You can stop some programs,
recover space and keep going. The technical term for this feature is
“virtual memory.”
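A toy model in Python shows how a small amount of memory plus disk space can hold more than the memory alone. All names here (ToyVirtualMemory, ram_pages, disk_pages) are invented for this sketch, and the policy of pushing out the least recently used page is just one simple choice; real operating systems are far more elaborate.

```python
from collections import OrderedDict


class ToyVirtualMemory:
    def __init__(self, ram_pages, disk_pages):
        self.ram = OrderedDict()   # page number -> data, least recently used first
        self.disk = {}             # pages pushed out of RAM
        self.ram_pages = ram_pages
        self.disk_pages = disk_pages

    def _make_room(self):
        if len(self.ram) < self.ram_pages:
            return
        if len(self.disk) >= self.disk_pages:
            # RAM and disk are both full: this request fails, the system carries on.
            raise MemoryError("no RAM or disk space left for another page")
        old_page, old_data = self.ram.popitem(last=False)
        self.disk[old_page] = old_data

    def write(self, page, data):
        if page in self.ram:
            self.ram[page] = data
            self.ram.move_to_end(page)
            return
        self.disk.pop(page, None)  # drop any stale copy on disk
        self._make_room()
        self.ram[page] = data

    def read(self, page):
        if page not in self.ram:
            data = self.disk.pop(page)  # bring the page back from disk
            self._make_room()
            self.ram[page] = data
        self.ram.move_to_end(page)
        return self.ram[page]


if __name__ == "__main__":
    vm = ToyVirtualMemory(ram_pages=2, disk_pages=6)
    for number in range(8):
        vm.write(number, f"data {number}")
    print([vm.read(number) for number in range(8)])
```

With room for two pages in memory and six on disk, all eight pages can be written and read back. Only a ninth page is refused, and even then the refusal is an ordinary error for the requesting program, not a crash of the whole system.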
Likewise, running out of disk space should just result in error
messages and not outright system crash. So don’t go adding memory or
disk space to your computer to stop system crashes. It will not help
(although it sometimes changes the system behavior enough to make it
act differently).
So there. You may not know how your operating system runs things,
but now you know a bit about what makes it not do that! :)