If something’s true for an integer $n_0$, and if it being true for the number $k$ (for any integer $k \geq n_0$) implies it is true for the number $k+1$, then it is true for all integers $n \geq n_0$.

This is a more-or-less “obvious” statement. It has two components: the **base case** and the **inductive step**. The base case establishes truth for one number. The inductive step does not actually establish truth of the statement for any particular value. Instead, it establishes the truth of a logical implication. In formal terms, the above box reads:

If $P(n)$ is a proposition for the integer $n$, then if $P(n_0)$ is true and $P(k) \Rightarrow P(k+1)$ for every integer $k \geq n_0$, then $P(n)$ is true for all integers $n \geq n_0$.

Think of induction as a string of dominos. Establishing the base case is like knocking down the first domino. Establishing the inductive step is like setting up the dominos so that if the $k$-th domino falls, it knocks over the $(k+1)$-th domino. Thinking of it this way, it should be obvious that once these two things are established, then all the dominos will fall. This is one of the concepts that is best learned through examples.

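Let’s prove a basic result using induction. We will prove that $1 + 2 + \cdots + n = \frac{n(n+1)}{2}$. We begin with the base case. Is it true for $n = 1$? Yes, because the left-hand side is $1$ and the right-hand side is $\frac{1 \cdot 2}{2} = 1$. Now we establish the induction step. If it’s true for $n = k$, is it then true for $n = k+1$? Let’s see. We begin with $1 + 2 + \cdots + k + (k+1)$. Since we are assuming that it’s true for $n = k$, we can rewrite the first $k$ terms as $\frac{k(k+1)}{2}$. Then our sum becomes

$$\frac{k(k+1)}{2} + (k+1) = \frac{k(k+1) + 2(k+1)}{2} = \frac{(k+1)(k+2)}{2}$$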

which is the formula we wanted for $n = k+1$, so the formula is true for all positive integers $n$. Let’s go over why this works again. We showed the formula is correct when $n = 1$. We then showed that if the formula is correct when $n = k$, then it is also correct when $n = k+1$. Why does this mean the formula is correct for any $n$? Well, it’s correct for $n = 1$. By the inductive step, since it’s correct for $n = 1$, it’s also correct for $n = 2$. By the inductive step again, since it’s correct for $n = 2$, it’s also correct for $n = 3$, and so on.

Now we’ll prove a more interesting result using induction. Consider the question: how many subsets (including the empty set) of $\{1, 2, \ldots, n\}$ are there that contain no consecutive integers? For problems like these, it’s always nice to look for a pattern first, in case the formula for the answer is nice. For $n = 1$, both the empty subset and the entire set work, so the answer is 2. For $n = 2$, the subsets $\emptyset, \{1\}, \{2\}$ all work, so the answer is 3. For $n = 3$, the subsets $\emptyset, \{1\}, \{2\}, \{3\}, \{1, 3\}$ work, so the answer is 5. Continuing in this way, we’ll see that the answers for the first few values of $n$ are $2, 3, 5, 8, 13, \ldots$. These look like the Fibonacci numbers! We know that the Fibonacci numbers are defined by a recurrence relation $F_n = F_{n-1} + F_{n-2}$, so maybe the answer we’re looking for also satisfies this recurrence. That is, we **conjecture** (or guess) that $a_n = F_{n+2}$, where $a_n$ is the answer we’re looking for and $F_0 = 0$, $F_1 = 1$.

To prove our conjecture, we can use induction, so we just need to show that $a_1 = F_3$, $a_2 = F_4$, and $a_n = a_{n-1} + a_{n-2}$ for $n \geq 3$. As an exercise, convince yourself why this is enough to answer our original question. The base cases we just did: $a_1 = 2 = F_3$ and $a_2 = 3 = F_4$. Now, let’s establish the relation $a_n = a_{n-1} + a_{n-2}$. Let’s call a set “happy” if it contains no consecutive integers (so our original question is to find the number of happy subsets of $\{1, 2, \ldots, n\}$). Suppose we have a happy subset $S$ of $\{1, 2, \ldots, n\}$. What do we know about $S$? Well, it either contains the number $n$ or it doesn’t. If it doesn’t, then it’s just a happy subset of $\{1, 2, \ldots, n-1\}$, and we know there are $a_{n-1}$ of those. If $S$ does contain $n$, then it can’t contain $n-1$ or else it wouldn’t be happy. Then, if we ignore $n$, the rest of $S$ is a happy subset of $\{1, 2, \ldots, n-2\}$, of which we know there are $a_{n-2}$. Therefore, $a_n = a_{n-1} + a_{n-2}$ and we are done!
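Before trusting a conjecture like this one, it can help to check it by brute force for small cases. The sketch below is only an illustration and not part of the proof: the helper names `is_happy`, `count_happy_subsets`, and `fib` are made up for this example, and `fib` assumes the indexing $F_0 = 0$, $F_1 = 1$.

```python
from itertools import combinations


def is_happy(subset):
    """True if the sorted tuple `subset` contains no two consecutive integers."""
    return all(b - a > 1 for a, b in zip(subset, subset[1:]))


def count_happy_subsets(n):
    """Brute-force count of the happy subsets of {1, ..., n}."""
    elements = range(1, n + 1)
    return sum(
        1
        for size in range(n + 1)
        for subset in combinations(elements, size)  # tuples come out sorted
        if is_happy(subset)
    )


def fib(k):
    """k-th Fibonacci number, assuming F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


# Counts for n = 1, ..., 10, to compare against F_{n+2}.
counts = [count_happy_subsets(n) for n in range(1, 11)]
matches = all(count == fib(n + 2) for n, count in enumerate(counts, start=1))
```

The brute force checks every one of the $2^n$ subsets, so it is only practical for small $n$; the recurrence is what makes larger cases easy.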

To recap, you should have learned the following from this post:

- What the principle of mathematical induction is and how it works
- How to use induction to prove mathematical statements

I look forward to contributing programming articles to this blog, as well as occasional math related topics.


Let’s begin by showing that the probability that you guess my number right is zero. Let $p$ be the probability in question. The idea is to show that $p \leq \epsilon$ for any positive real number $\epsilon$. We know that $p \geq 0$, and if it’s smaller than ANY positive number, then it has to be zero! The argument is as follows. Let’s call the number I randomly picked $x$. Imagine that the interval $[0, 1]$ is painted white. Pick any positive real number $\epsilon \leq 1$. Then there is a sub-interval of length $\epsilon$ within the interval $[0, 1]$ containing $x$. Imagine that this sub-interval is painted black, so now we have a black strip of length $\epsilon$ on the original white strip, and the number I chose is in the black strip. What’s the probability that your guess lands on the black strip? It has to be $\epsilon$, since that’s the proportion of the white strip that is covered. But in order for your guess to equal my number $x$, it has to land in the black strip, so your probability of guessing $x$ can’t be larger than the probability of guessing a number on the black strip! Therefore $p \leq \epsilon$.

You should now be convinced that this event indeed has zero probability of happening, yet it can still happen. This phenomenon comes from the following geometric fact: **it’s possible to have a non-empty set with zero “volume”.** The term “volume” depends on the context; in the case of the point on the interval $[0, 1]$, “volume” is length. The probability of an event measured on the interval $[0, 1]$ is equal to its length, and a single point on the interval has zero length, yet it’s still a non-empty subset of the interval! Probability is basically a measure of “volume” where the entire space has “volume” equal to 1. By defining probability in this way, we can prove all kinds of neat facts using something called **measure theory**.

To recap, you should have learned the following from this post:

- The probability of randomly choosing a specific number in the interval $[0, 1]$ is equal to zero
- Events that have zero probability are still possible

There is no way to avoid having our digital messages altered on the way to their recipients. Instead, we use an idea which we use all the time when we want to make sure our vocal messages are heard properly. The idea is to add redundancy to the message so that the receiver can check if the received message was altered or not. This is called **error detecting**. If, in addition, the receiver can deduce what the intended message was supposed to be, then this is called **error correcting**.

For example, let’s say I was shouting to you from far away during a windy day. What if I fell into a river and was shouting “Help” but you heard this as “Hello”? That would be quite an unfortunate end for me. In order to ensure that you recognize when my message was altered, we could agree beforehand that all messages should be shouted twice in a row. Therefore, if you hear two consecutive messages that are different, you can conclude that my message was altered. Therefore, if I got in trouble, I would shout “Help! Help!” If there was wind interference, you might hear something like “Hello! Help!” You would know right away that my original message was altered, so you have **detected** an error. However, you still cannot correct it, because you don’t know which one was my intended message. In fact, both of them could have been altered! Maybe my original message was “Hell! Hell!” What if I wanted to shout “Help! Help!” but you heard it as “Hello! Hello!”? Well, in that case I would be doomed once again, so this protocol doesn’t work perfectly. However, note that the probability that both messages are altered in the exact same way is quite small. I can make that probability however small I want by making the number of repetitions large enough.

Now, let’s talk about the idea behind error correction. The idea behind error correction is actually used all the time by our brains subconsciously whenever we listen to people speak or read poorly typed comments on YouTube. The idea is so important it deserves a special box:

Only a subset of all possible messages we see or hear make sense (or are valid), and so when we see or hear a message that doesn’t make sense, in our minds we replace it with the closest valid message.

Now we give a digital example. Let’s say I want to send messages to you one bit at a time, so a sample message could be 0 or 1. An error correcting protocol could be to triple all my messages, so that my sample message would be sent as 000 or 111. Among the 8 possible binary strings of length 3 that you could possibly receive, only two (000 and 111) are valid. Our protocol can therefore **detect up to 2 errors** in any given message. If there are three errors, then 000 would become 111 and vice versa, so we wouldn’t know if there was an error. However, if we receive a message like 011, we’ll know something went wrong. Furthermore, our protocol can **correct up to 1 error**. If there was only 1 error, then we know the message is whatever the majority bit is. If we get the message 001, then the original message had to have been 000 if there was only 1 error. To summarize, if there are 2 or fewer errors, our system can detect them. If there is 1 error, then our system can correct it.
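The tripling protocol is small enough to write out directly. The following is a minimal sketch of it, with function names (`encode`, `decode`) chosen just for this illustration: encoding repeats the bit three times, and decoding takes a majority vote while also reporting whether the received string was one of the two valid ones.

```python
def encode(bit):
    """Send a single bit (0 or 1) as three copies, e.g. 1 -> '111'."""
    return str(bit) * 3


def decode(received):
    """Decode a 3-character string of 0s and 1s.

    Returns (bit, error_detected): the majority bit, and whether the
    received string differs from both valid messages 000 and 111.
    """
    ones = received.count("1")
    bit = 1 if ones >= 2 else 0
    error_detected = received not in ("000", "111")
    return bit, error_detected
```

If exactly one bit was flipped, the majority vote recovers the original bit. If two bits were flipped, the error is still detected, but the majority vote picks the wrong bit, which matches the detect-2, correct-1 behavior described above.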

It is enlightening to think about error correcting geometrically. We measure distance between strings by the number of bits in which they differ, so for example 010 and 001 have a distance of 2 because they differ in the 2nd and 3rd bits. This can be easily seen by viewing the 8 binary strings of length 3 as vertices of a cube, where the strings dictate the coordinates, so 000 corresponds to the point (0,0,0) and 011 corresponds to (0,1,1):

The distance measure is called the **Hamming distance**, named after Richard Hamming. The distance can be easily seen as the smallest number of steps it takes to travel between two points by traveling along the edges of the cube. Note that our two valid messages are opposite vertices of the cube and that they are a distance of 3 apart. Therefore, as long as there is 1 error (meaning we’ve strayed from a valid message by a distance of 1), then it is clear what the original message is by choosing the **nearest valid codeword**. In this case, since the two valid messages are a distance of 3 apart, an invalid message can only be a distance of 1 away from exactly one of the valid messages. In technical lingo, the set {000, 111} of valid messages forms the **code**, the individual valid messages are **codewords**, and the **distance** of the code is the smallest distance between two points in the code. In this case, there are only two points in the code, and the distance of the code is 3. In general, if the distance of the code is $d$, then the code can

- detect up to $d - 1$ errors; this is because the shortest distance between two valid codewords is $d$, and so traveling up to $d - 1$ away from a valid codeword will land you on an invalid codeword, but traveling a distance $d$ from a valid codeword *could* land you on another valid codeword
- correct up to $\lceil d/2 \rceil - 1$ errors (where $\lceil \cdot \rceil$ is the ceiling function which rounds up to the nearest integer); this is because if you travel up to $\lceil d/2 \rceil - 1$ away from a valid codeword, that codeword is still the nearest valid codeword, since the shortest distance between two valid codewords is $d$, but traveling a distance $\lceil d/2 \rceil$ from a valid codeword *could* land you on a midpoint between two valid codewords, in which case it is ambiguous which one is the nearest neighbor
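The geometric picture translates into a few lines of code. This is a rough sketch under the assumption that codewords are equal-length strings of 0s and 1s; the function names are invented for this example. It computes the Hamming distance, the distance of a code, the two bounds from the list above, and the nearest valid codeword.

```python
from itertools import combinations


def hamming_distance(u, v):
    """Number of positions in which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(u, v))


def code_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming_distance(u, v) for u, v in combinations(code, 2))


def capabilities(d):
    """(errors detectable, errors correctable) for a code of distance d.

    Uses d - 1 and ceil(d / 2) - 1; the ceiling is written as
    (d + 1) // 2 to stay in integer arithmetic.
    """
    return d - 1, (d + 1) // 2 - 1


def nearest_codeword(word, code):
    """Valid codeword closest to `word` (ties go to the first one found)."""
    return min(code, key=lambda c: hamming_distance(word, c))
```

For the code {000, 111}, `code_distance` gives 3 and `capabilities(3)` gives `(2, 1)`, in line with the tripling example.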

This also explains why I chose to triple instead of double my messages. Had I chosen to double my messages instead, then my code would be {00, 11} out of the set of possible words {00, 01, 10, 11}, and the distance of my code would be 2. By the conclusion above, this code can detect up to 1 error but correct up to 0 errors; that is, it’s not even an error correcting code!

In future posts, we’ll look at more complicated error correcting codes. For now, let’s take a look at ISBN, which is an error-detecting code used in practice. The ISBN-13 is a 13-digit code for books which specifies information about them, such as publisher, language, etc. When scanning the barcode, it’s possible that the scanner reads the code erroneously. Or if a person is entering an ISBN by hand, they might make a mistake. However, ISBN is designed so that the last digit is a **check digit**, that is, it’s determined by the previous 12 digits. In fact, given the first 12 digits $x_1, x_2, \ldots, x_{12}$, the 13th digit is

$$x_{13} = \bigl(10 - (x_1 + 3x_2 + x_3 + 3x_4 + \cdots + x_{11} + 3x_{12}) \bmod 10\bigr) \bmod 10.$$

The mod 10 just means we take only the rightmost digit of the sum. This method is useful for two reasons. First, to detect an error, one can check if the sum of all 13 digits, with alternate weights of 1 and 3, has rightmost digit 0. If not, then there is an error. Furthermore, the weights are designed so that if we transpose two adjacent digits, the error can usually be detected: swapping adjacent digits $a$ and $b$ changes the weighted sum by $\pm 2(a - b)$, which changes the rightmost digit unless $a$ and $b$ are equal or differ by exactly 5. Had we weighted all the digits evenly, then transposing two adjacent digits would never change the sum. This is also an example of an error-detecting code which is used outside of digital message transmission.
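The check-digit rule is easy to try out. Below is a small sketch assuming the ISBN is given as a plain string of digits with no hyphens; the function names and the sample digits `978030640615` are used only for illustration.

```python
def weighted_sum(digits):
    """Sum of the digits with alternating weights 1, 3, 1, 3, ..."""
    return sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))


def isbn13_check_digit(first12):
    """Check digit determined by the first 12 digits of an ISBN-13."""
    digits = [int(ch) for ch in first12]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    return (10 - weighted_sum(digits) % 10) % 10


def isbn13_is_valid(isbn):
    """True if the weighted sum of all 13 digits ends in 0."""
    digits = [int(ch) for ch in isbn]
    return len(digits) == 13 and weighted_sum(digits) % 10 == 0


check = isbn13_check_digit("978030640615")   # weighted sum is 93
ok = isbn13_is_valid("9780306406157")        # 93 + 7 = 100
swapped = isbn13_is_valid("9870306406157")   # 7 and 8 transposed
```

Here the weighted sum of the first 12 digits is 93, so the check digit is 7, and transposing the adjacent digits 7 and 8 makes the validity check fail.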

To recap, you should have learned the following from this post:

- Why we need error detection and correction
- The basic idea of error detection and correction
- Geometric intuition behind error detection and correction
- An example of error detection in a common setting

Since all we care about is what happens in the long run, the definition of asymptotic notation involves limits. There are five primary notations: big-O, little-o, big-Omega, little-omega, and big-Theta. Don’t be intimidated by the large number of definitions. These are all defined very similarly, so once you understand one of them, you can understand them all. For that reason, we will focus on understanding big-O, and merely describe the differences from the others.

Big-O notation is used to describe upper bounds. Intuitively, the function $f(n)$ is $O(g(n))$, pronounced “big-O of $g(n)$,” if in the long run $f(n)$ grows no faster than $g(n)$. We’ll briefly take a look at the technical definition, and then explain it in English:

A function $f(n) = O(g(n))$ if there is a constant $c > 0$ such that $f(n) \leq c \cdot g(n)$ whenever $n \geq n_0$, for some $n_0$.

The phrase “whenever $n \geq n_0$, for some $n_0$” can be interpreted as “when $n$ gets large enough.” The $n_0$ can be seen as the “cutoff point,” where any larger $n$ will cause $f(n) \leq c \cdot g(n)$. Why do we want the constant $c$ to be there? Intuitively, constant multipliers don’t affect asymptotic growth. That is, we want to say that $a n^2$ grows faster than $b n$, no matter how small $a > 0$ is or how large $b$ is. These details only matter if you want to know the precise definition of big-O. If you just want to understand intuitively what big-O means, it means “grows no faster than.” Some properties and examples:

- $f(n) = O(f(n))$; obviously, no function grows faster than itself
- $c \cdot f(n) = O(f(n))$ for any constant $c > 0$; this is a consequence of how we designed big-O so that constant multipliers don’t matter
- If $f(n) = O(g(n))$ and $g(n) = O(h(n))$, then $f(n) = O(h(n))$; this can be checked by the definitions, but it also makes sense intuitively: if $f$ grows no faster than $g$, which grows no faster than $h$, then of course $f$ grows no faster than $h$!
- If $f(n) = O(g(n))$, then $f(n) + g(n) = O(g(n))$

The last property is extremely important for polynomials. It basically means that when we look at polynomials, in big-O only the highest degree term matters, and we don’t care about the constants. This makes finding big-O of polynomials super easy. For example, $3n^2 + 5n + 7 = O(n^2)$.

This comes in especially handy when estimating running times for algorithms. Let’s say we have an algorithm which, given $n$ points in the plane, finds the closest pair of points, and it does so by brute-force, comparing every pair of points. There are $\binom{n}{2} = \frac{n(n-1)}{2}$ possible pairs, and each comparison takes some constant amount of time (this is why we don’t care about constants, because they change with advances in hardware), so in total the algorithm’s running time is $O(n^2)$.
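A brute-force closest-pair search along these lines might look like the sketch below. It assumes points are given as coordinate tuples and uses `math.dist` (available in Python 3.8 and later) for the Euclidean distance; the function names are made up for this example.

```python
from itertools import combinations
from math import dist


def closest_pair(points):
    """Return the pair of points with the smallest Euclidean distance.

    Compares every pair once, so the number of distance computations
    is n(n-1)/2 for n points.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    return min(combinations(points, 2), key=lambda pair: dist(*pair))


def number_of_pairs(n):
    """How many pairs the brute-force search examines for n points."""
    return n * (n - 1) // 2


sample = [(0, 0), (5, 5), (1, 1), (10, 0)]
best = closest_pair(sample)
```

Doubling the number of points roughly quadruples `number_of_pairs`, which is the quadratic growth that the big-O bound describes.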

Now, the other asymptotic notations are defined similarly. Intuitively, all you need to know is the following:

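- A function $f(n)$ is little-o of $g(n)$, written $f(n) = o(g(n))$, if $f(n)$ grows **strictly slower** than $g(n)$. In other words, little-o is to big-O as $<$ is to $\leq$.
- A function $f(n)$ is big-Omega of $g(n)$, written $f(n) = \Omega(g(n))$, if $f(n)$ grows **at least as fast** as $g(n)$. In other words, big-Omega is to big-O as $\geq$ is to $\leq$.
- A function $f(n)$ is little-omega of $g(n)$, written $f(n) = \omega(g(n))$, if $f(n)$ grows **strictly faster** than $g(n)$. In other words, little-omega is to big-O as $>$ is to $\leq$.
- Finally, a function $f(n)$ is big-Theta of $g(n)$, written $f(n) = \Theta(g(n))$, if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. This means that $f(n)$ grows **asymptotically at the same rate** as $g(n)$.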

To recap, you should have learned the following from this post:

- What asymptotic notation is
- How to compute big-O of polynomials
- The use of asymptotic notation

Modern electronic computers actually follow a theoretical model invented by Alan Turing in the 1930s. This theoretical model, now called a **Turing machine**, can be informally described quite simply. A Turing machine is composed of three parts: states, a tape (consisting of blocks or cells), and a tape head, which points to a cell. The modern computer counterpart to the tape is memory, and the counterpart to the tape head is the program counter. In addition, the Turing machine has a transition function, which tells it which state to go to and what to do on the tape, given its current state and tape contents. Think of it this way: you have a set of states such as “hungry”, “sad”, “happy”, etc. and you have a row of infinitely many sheets of paper laid side by side on the ground, but they’re so big that you can only see one of them at a time by walking over it, and each sheet can only contain one character (you have to write them so large so that aliens from space can see them). A transition function would be like a look-up table you carry in your pocket, which says stuff like “if you’re hungry and the paper you’re on contains a 1, then you become angry, change the 1 to a 0 on your sheet of paper, and move to the right by one sheet of paper.” Moreover, one of the states of the Turing machine is designated the initial state (the state which it starts in every time it’s run), some of the states are designated **accepting** states, and some of them are designated **rejecting** states. A Turing machine is completely specified by its states along with the above designations and its transition function. When you think Turing machine, think “theoretical computer.” Turing machines don’t have to be simulated by electronic computers. Anything that can do what a Turing machine does can act like a computer. Instead of electronic circuits acting as logic gates, you can use billiard-balls bouncing inside a box instead. This model has actually been proposed.
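To make the description concrete, here is a toy simulator. It is a sketch under several assumptions of my own: the transition function is stored as a Python dictionary, the tape is a dictionary from cell index to symbol (so it is unbounded in both directions), and the example machine `EVEN_ONES`, which accepts binary strings containing an even number of 1s, is invented for illustration.

```python
def run_turing_machine(transitions, start, accepting, rejecting, tape_input,
                       blank="_", max_steps=10_000):
    """Simulate a Turing machine and report whether it accepts.

    `transitions` maps (state, symbol) to (new_state, symbol_to_write, move),
    where move is "L" or "R".
    """
    tape = dict(enumerate(tape_input))  # cell index -> symbol
    state, head = start, 0
    for _ in range(max_steps):
        if state in accepting:
            return True
        if state in rejecting:
            return False
        symbol = tape.get(head, blank)  # unwritten cells read as blank
        state, write, move = transitions[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1
    raise RuntimeError("no halting state reached within max_steps")


# Hypothetical example machine: track the parity of the 1s seen so far.
EVEN_ONES = {
    ("even", "0"): ("even", "0", "R"),
    ("even", "1"): ("odd", "1", "R"),
    ("even", "_"): ("accept", "_", "R"),
    ("odd", "0"): ("odd", "0", "R"),
    ("odd", "1"): ("even", "1", "R"),
    ("odd", "_"): ("reject", "_", "R"),
}


def has_even_number_of_ones(bits):
    return run_turing_machine(EVEN_ONES, "even", {"accept"}, {"reject"}, bits)
```

The `max_steps` cutoff is a practical safeguard for the simulation only; a real Turing machine has no such limit and may run forever.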

Now, every program in your computer acts like a Turing machine, except for one difference: the tape is not infinitely long, i.e., you have a finite amount of work space. For now, let’s only consider programs which answer Yes/No questions. For example, consider a program which decides if a given positive integer is prime. Many graphing calculators, such as the TI-89 graphing calculator, have such a function; on the TI-89, the function is called isPrime(). The calculator doesn’t magically know if a number is prime or not. When you tell it to do isPrime(123), it follows a specific set of instructions, or algorithm, which was programmed into it. The algorithm corresponds to the transition function of the Turing machine. When you tell the calculator or Turing machine to do isPrime(123), it first converts the input, 123, into the language with which it operates, for example, binary numbers. In this case, the Turing machine reads as its input the binary of 123, which is 1111011 and writes that down on the first 7 cells of its tape. Then it faithfully follows its transition function and performs steps and finally halts in either an accepting state or rejecting state. If it ends on an accepting state, it spits out the answer True, and if it ends on a rejecting state, it spits out the answer False. In our case, the machine will end up on a rejecting state, since 123 = 3 x 41.

However, programmers nowadays don’t have to worry about the nitty-gritty details of coming up with an appropriate transition function for a Turing machine that will correctly decide if its input is prime and then putting together an electronic circuit that will simulate the Turing machine. Thankfully, all that hard work has been done decades ago, and now everything’s been abstracted. We can write programs in programming languages like C++ or Java, and the compiler does all the work of translating our code into machine code, which the machine then reads in and performs the necessary steps to simulate the appropriate Turing machine.

Now, how does a computer actually check if a number is prime? That is, what is a sequence of instructions such that, when given a positive integer n, will always output the correct answer as to whether n is prime or not? Often one can write an algorithm by thinking about how one would do it by hand. If you were asked to determine if a number is prime or not, how would you do it? Well, one can directly use the definition of primality. The number n is prime if and only if n is at least 2 and no integer d with 1 < d < n evenly divides n. So, one way would be to start from 2 and check whether any number up to n-1 divides n. If none do, then n is prime, otherwise n is not. This may seem like a stupid way to do it, and in some sense it is, but it works and you have to keep in mind that computers crunch numbers a lot faster than humans do. The algorithm just described can be written in Python as follows:

```python
def isPrime(n):
    if n < 2:
        # there are no primes less than 2
        return False
    else:
        # try every candidate divisor d = 2, 3, ..., n-1
        for d in range(2, n):
            if n % d == 0:
                return False
        return True
```

In case you’re not familiar with Python syntax, this is what the code does: it defines a function isPrime(n) which takes n as its input. Then, it first checks if n < 2. If so, then it’ll spit out False, since there are no primes less than 2. If not, then the code goes to the else branch. The for loop basically iteratively sets d to be 2, 3, 4, etc. all the way until n-1. For each value, it’ll test if d divides n. The n % d means taking the remainder when dividing n by d, and == 0 is checking if the result is equal to 0. If this condition is true, then d divides n, so n is not prime, so we return False. Finally, after running through all these values of d, if no divisors were found, then return True.

You may notice that this algorithm takes a linear amount of time in n. That is, if you double n, then the number of divisors the algorithm checks roughly doubles as well. We say that this is an $O(n)$-time algorithm, where $n$ is the input. Note that we can modify the algorithm so that it only checks divisors d up to $\sqrt{n}$. We can do this because if n had a divisor $d > \sqrt{n}$, then it would have to have a corresponding divisor $n/d < \sqrt{n}$. Therefore we could make the algorithm an $O(\sqrt{n})$-time algorithm. Making algorithms more efficient is a huge area of study in computer science, and deserves at least a post on its own.
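The square-root improvement can be sketched as follows. The name `is_prime_sqrt` is chosen just for this example, and `math.isqrt` (Python 3.8 and later) is used to get the integer square root without floating-point rounding issues.

```python
from math import isqrt


def is_prime_sqrt(n):
    """Trial division that only tries divisors d with d * d <= n."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True
```

For n = 123 the loop stops at d = 3, and for a prime such as 97 it only tries d = 2 through 9 instead of 2 through 96.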

To recap, you should have learned the following from this post:

- An intuitive idea of what a Turing machine is, and how modern electronic computers relate to them
- What an algorithm is
- The relationship between algorithms and Turing machines
- An algorithm for determining the primality of positive integers

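Probably the most famous sequence defined by a linear recurrence is the **Fibonacci sequence** of numbers $F_0, F_1, F_2, \ldots$, defined by

$$F_0 = 0, \qquad F_1 = 1, \qquad F_n = F_{n-1} + F_{n-2} \text{ for } n \geq 2,$$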

that is, each term is the sum of the two previous terms. The first few terms are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. These numbers arise in nature. For example, consider modeling the population of some sort of asexually reproducing single-celled organism. Let’s suppose this species takes 1 day to reach “maturity,” after which it begins to asexually reproduce, and it takes 1 day to actually divide. Then, if the population on day $n$ is $F_n$, then $F_n = F_{n-1} + F_{n-2}$, since on the previous day there are $F_{n-1}$ cells but only $F_{n-2}$ of them are mature enough to start reproducing. For example, on day 1 there is 1 baby cell. On day 2, the baby cell becomes an adult. On day 3, the 1 adult has given birth, so there are 2 cells, 1 adult and 1 baby. On day 4, the baby has just matured and the adult has given birth again, yielding 3 cells, 2 adults and 1 baby. On day 5, the new baby matures and the 2 adults give birth, yielding 3 adults and 2 babies, and so on.

Now, it would be nice to be able to find a closed formula for $F_n$, so that if we wanted to find $F_n$ for some large $n$, we don’t have to start from $F_0$ and build our way up. For example, if we had a sequence defined by $a_0 = 1$ and $a_n = 2a_{n-1}$, then the numbers would be 1, 2, 4, 8, 16, etc. and it would be easy to see that a closed formula for this sequence is $a_n = 2^n$.

Intuitively, things whose growth rate is related to their current size grow exponentially. For example, the growth rate for our $a_n$ is $a_{n+1} - a_n = a_n$, that is, the amount $a_n$ increases by each time is equal to how many there are right now. This is a model of cells dividing which do not require time to mature. The rate of growth for Fibonacci numbers is $F_{n+1} - F_n = F_{n-1}$, so the growth rate is equal to the population one generation ago, so it’s “delayed” in some sense, but our intuition still tells us that this should be close to exponential. Of course, it isn’t actually exponential, otherwise the ratio $F_{n+1}/F_n$ would be constant. Nevertheless, the ratio approaches $\frac{1 + \sqrt{5}}{2} \approx 1.618$, commonly known as the **golden ratio**. That means that in the long run, $F_n$ behaves more or less like the function $\left(\frac{1 + \sqrt{5}}{2}\right)^n$. However, we want an **exact** formula for $F_n$, so let’s see how we go about finding that.

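One way is by using generating functions. A **generating function** for a sequence is just a power series whose coefficients record information about that sequence, usually the actual numbers. For example, let’s take a look at our beloved sequence $a_n$. The terms are 1, 2, 4, 8, 16, 32, and so on, so the generating function for the sequence is

$$G(x) = 1 + 2x + 4x^2 + 8x^3 + 16x^4 + 32x^5 + \cdots$$

where the coefficient of $x^n$ is simply $a_n$. Notice that we can write $G(x)$ more compactly. Since we know that $a_n = 2^n$, we have

$$G(x) = \sum_{n=0}^{\infty} 2^n x^n = \sum_{n=0}^{\infty} (2x)^n.$$

Now, the function $1 + x + x^2 + x^3 + \cdots$ is pretty nice, and if we remember anything about infinite geometric series, then we know that if $|x| < 1$, then this function converges and we can write it as $\frac{1}{1-x}$. But when we’re talking about generating functions, we don’t care about converging, so we can just say $1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x}$ since

$$(1 - x)(1 + x + x^2 + x^3 + \cdots) = 1.$$

Now we notice that $G(x)$ is just $1 + x + x^2 + x^3 + \cdots$ with $x$ replaced with $2x$, so we can write

$$G(x) = \frac{1}{1 - 2x}.$$

Since we have $G(x)$ in this nice form, we can expand it to get $G(x) = \sum_{n=0}^{\infty} 2^n x^n$ and find out that $a_n = 2^n$. Wait a minute. Didn’t we *use* the fact that $a_n = 2^n$ to get the nice form? This is true, but there is another way to get the nice form of $G(x)$ without knowing the explicit formula for $a_n$, and it just involves a little algebraic trickery.

We know that $a_0 = 1$ and $a_n = 2a_{n-1}$. Therefore,

$$G(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots = 1 + 2a_0 x + 2a_1 x^2 + 2a_2 x^3 + \cdots$$

Furthermore, if we multiply $G(x)$ by $2x$, we get a similarly looking thing:

$$2xG(x) = 2a_0 x + 2a_1 x^2 + 2a_2 x^3 + \cdots$$

In fact, what we have is

$$G(x) = 1 + 2xG(x).$$

If we move all the terms involving $G(x)$ to the left-hand side and factor out by $G(x)$, we get

$$G(x)(1 - 2x) = 1$$

from which it is easy to see that $G(x) = \frac{1}{1 - 2x}$.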

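Now let’s try doing this with the generating function

$$F(x) = F_0 + F_1 x + F_2 x^2 + F_3 x^3 + \cdots = \sum_{n=0}^{\infty} F_n x^n$$

and see what happens. Our recurrence relation $F_n = F_{n-1} + F_{n-2}$ means

$$F(x) = F_0 + F_1 x + \sum_{n=2}^{\infty} (F_{n-1} + F_{n-2}) x^n = x + \sum_{n=2}^{\infty} F_{n-1} x^n + \sum_{n=2}^{\infty} F_{n-2} x^n,$$

meanwhile

$$xF(x) = \sum_{n=1}^{\infty} F_{n-1} x^n = \sum_{n=2}^{\infty} F_{n-1} x^n \quad \text{(since } F_0 = 0\text{)} \qquad \text{and} \qquad x^2 F(x) = \sum_{n=2}^{\infty} F_{n-2} x^n,$$

and so

$$F(x) = x + xF(x) + x^2 F(x),$$

which we can rearrange to get

$$F(x)(1 - x - x^2) = x,$$

and so

$$F(x) = \frac{x}{1 - x - x^2}.$$

We can use the quadratic formula to find the roots of $1 - x - x^2$, which happen to be $r_1 = \frac{-1 + \sqrt{5}}{2}$ and $r_2 = \frac{-1 - \sqrt{5}}{2}$. Therefore, we can write

$$F(x) = \frac{x}{1 - x - x^2} = \frac{-x}{(x - r_1)(x - r_2)}.$$

Moreover, we can use partial fraction decomposition to write

$$\frac{-x}{(x - r_1)(x - r_2)} = \frac{A}{x - r_1} + \frac{B}{x - r_2}.$$

To solve for $A$ and $B$, we note that

$$-x = A(x - r_2) + B(x - r_1),$$

which gives us the system of linear equations

$$A + B = -1, \qquad A r_2 + B r_1 = 0.$$

Solving for $A$ and $B$ (and using $r_1 - r_2 = \sqrt{5}$) yields

$$A = -\frac{r_1}{\sqrt{5}}, \qquad B = \frac{r_2}{\sqrt{5}}.$$

However, we want our denominators to be of the form $1 - cx$, so that we can expand it into $\sum_{n=0}^{\infty} c^n x^n$, so we rewrite

$$F(x) = \frac{A}{x - r_1} + \frac{B}{x - r_2} = \frac{1}{\sqrt{5}} \left( \frac{1}{1 - x/r_1} - \frac{1}{1 - x/r_2} \right)$$

and finally we note that

$$\frac{1}{1 - x/r_1} = \sum_{n=0}^{\infty} \left( \frac{1}{r_1} \right)^n x^n, \qquad \frac{1}{1 - x/r_2} = \sum_{n=0}^{\infty} \left( \frac{1}{r_2} \right)^n x^n.$$

Moreover, $\frac{1}{r_1} = \frac{1 + \sqrt{5}}{2}$ and $\frac{1}{r_2} = \frac{1 - \sqrt{5}}{2}$. Let’s call these numbers $\varphi$ and $\psi$, respectively. Then what we have is

$$F(x) = \frac{1}{\sqrt{5}} \sum_{n=0}^{\infty} (\varphi^n - \psi^n) x^n.$$

Therefore, we have just concluded that

$$F_n = \frac{\varphi^n - \psi^n}{\sqrt{5}}.$$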

Now that we have worked so hard to get this formula, let’s take a step back and appreciate it. This formula is useful in several ways. For starters, it allows us to compute $F_n$ much faster. For example, using the method of building up from all the previous terms, we have to do a linear number of operations, or $O(n)$ in asymptotic notation. However, using the formula, we can compute $\varphi^n$ and $\psi^n$ by repeatedly squaring $\varphi$ and $\psi$, which is logarithmic asymptotically, or $O(\log n)$. Furthermore, this formula shows us the exponential growth in $F_n$. Not only that, it is crystal clear from this formula where the golden ratio comes from! Note that $|\psi| < 1$, so as $n$ gets large, the $\psi^n$ term becomes negligible, which is why for large $n$, $F_n$ is approximately $\frac{\varphi^n}{\sqrt{5}}$.
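Here is one way the repeated-squaring idea could be carried out exactly, avoiding floating-point error. This is a sketch resting on a representation of my own choosing: a number $a + b\sqrt{5}$ is stored as the pair `(a, b)` of fractions, so the golden ratio $\frac{1 + \sqrt{5}}{2}$ is the pair `(1/2, 1/2)`. Since the other root is its conjugate, the difference of the two $n$-th powers is $2b\sqrt{5}$, and dividing by $\sqrt{5}$ leaves $2b$. All function names are made up for this example.

```python
from fractions import Fraction


def mul(p, q):
    """Multiply a + b*sqrt(5) by c + d*sqrt(5), both stored as pairs."""
    a, b = p
    c, d = q
    return (a * c + 5 * b * d, a * d + b * c)


def power(p, n):
    """Compute p**n by repeated squaring, using O(log n) multiplications."""
    result = (Fraction(1), Fraction(0))  # the number 1
    while n > 0:
        if n & 1:
            result = mul(result, p)
        p = mul(p, p)
        n >>= 1
    return result


def fib_closed_form(n):
    """F_n from the closed formula, computed exactly."""
    half = Fraction(1, 2)
    _, b = power((half, half), n)  # golden ratio to the n-th power
    return int(2 * b)


def fib_iterative(n):
    """F_n by building up from F_0 = 0 and F_1 = 1, using O(n) additions."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Counting multiplications rather than digit operations, `fib_closed_form` needs on the order of $\log n$ steps, compared to $n$ additions for `fib_iterative`.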

To recap, from this post you should have learned the following:

- What generating functions are and how they can be used to find closed formulae for sequences given by linear recurrences
- What the Fibonacci numbers are and how they arise in nature
- How to find a closed formula for the Fibonacci numbers using generating functions
- Why knowing the closed formula allows us to compute terms faster
- How the closed formula for the Fibonacci numbers sheds light on its growth rate and where the golden ratio comes about

Most likely, if you’ve taken algebra in high school, you’ve seen something like the following:

Your high school algebra teacher probably told you this thing was a “matrix.” You then learned how to do things with matrices. For example, you can add two matrices, and the operation is fairly intuitive:

You can also subtract matrices, which works similarly. You can multiply a matrix by a number:

Then, when you were taught how to multiply matrices, everything seemed wrong:

That is, to find the entry in the $i$-th row, $j$-th column of the product, you look at the $i$-th row of the first matrix, the $j$-th column of the second matrix, you multiply together their corresponding numbers, and then you add up the results to get the entry in that position. In the above example, the 1st row, 2nd column entry comes from the 1st row of the first matrix and the 2nd column of the second matrix: multiply their corresponding numbers and add up the results. Moreover, this implies that matrix multiplication isn’t even commutative! If we switch the order of multiplication above, we get

How come matrix multiplication doesn’t work like addition and subtraction? And if multiplication works this way, how the heck does division work? The goal of this post is to answer these questions.
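The row-times-column rule can be spelled out in a few lines. In the sketch below, matrices are assumed to be lists of rows, and the two matrices `A` and `B` are hypothetical examples picked for illustration, not the ones from the displays above.

```python
def matmul(A, B):
    """Product of matrices given as lists of rows.

    Entry (i, j) is the sum over k of A[i][k] * B[k][j]: row i of A
    paired with column j of B.
    """
    if len(A[0]) != len(B):
        raise ValueError("columns of A must match rows of B")
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical example matrices.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

AB = matmul(A, B)  # [[19, 22], [43, 50]]
BA = matmul(B, A)  # [[23, 34], [31, 46]]
```

For instance, the 1st row, 2nd column entry of `AB` is $1 \cdot 6 + 2 \cdot 8 = 22$, and since `AB` and `BA` differ, these two matrices do not commute.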

To understand why matrix multiplication works this way, it’s necessary to understand what matrices actually are. But before we get to that, let’s briefly take a look at why we care about matrices in the first place. The most basic application of matrices is solving systems of linear equations. A linear equation is one in which all the variables appear by themselves with no powers; they don’t get multiplied with each other or themselves, and no funny functions either. An example of a system of linear equations is

$$\begin{aligned} 2x + y &= 3 \\ 4x + 3y &= 7 \end{aligned}$$

The solution to this system is $x = 1, y = 1$. Such equations seem simple, but they easily arise in life. For example, let’s say I have two friends Alice and Bob who went shopping for candy. Alice bought 2 chocolate bars and 1 bag of skittles and spent $3, whereas Bob bought 4 chocolate bars and 3 bags of skittles and spent $7. If we want to figure out how much chocolate bars and skittles cost, we can let $x$ be the price of a chocolate bar and $y$ be the price of a bag of skittles and the variables would satisfy the above system of linear equations. Therefore we can deduce that a chocolate bar costs $1 and so does a bag of skittles. This system was particularly easy to solve because one can guess and check the solution, but in general, with $n$ variables and $n$ equations instead of 2, it’s much harder. That’s where matrices come in! Note that, by matrix multiplication, the above system of linear equations can be re-written as

$$\begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 3 \\ 7 \end{pmatrix}.$$

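If only we could find a matrix $A^{-1}$, which is the inverse of the matrix $A = \begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix}$, so that if we multiplied both sides of the equation (on the left) by $A^{-1}$ we’d get

$$\begin{pmatrix} x \\ y \end{pmatrix} = A^{-1} \begin{pmatrix} 3 \\ 7 \end{pmatrix}.$$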
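For a 2-by-2 matrix the inverse can be written down directly, so the candy problem can be solved this way in code. The sketch below assumes the standard formula for the inverse of a 2-by-2 matrix with nonzero determinant and takes its numbers from the Alice and Bob story (2 bars and 1 bag for $3, 4 bars and 3 bags for $7); the function names are made up for this example.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Inverse of [[a, b], [c, d]] with integer entries, as exact fractions."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [
        [Fraction(d, det), Fraction(-b, det)],
        [Fraction(-c, det), Fraction(a, det)],
    ]


def apply(M, v):
    """Multiply the matrix M by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


A = [[2, 1], [4, 3]]          # purchases: rows are Alice and Bob
totals = [3, 7]               # dollars spent
prices = apply(inverse_2x2(A), totals)  # [price of a bar, price of a bag]
```

Multiplying the inverse by the totals recovers a price of 1 dollar for each item, agreeing with the solution found by guess and check.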

The applications of matrices reach far beyond this simple problem, but for now we’ll use this as our motivation. Let’s get back to understanding what matrices are. To understand matrices, we have to know what vectors are. A **vector space** is a set with a specific structure, and a **vector** is simply an element of the vector space. For now, for technical simplicity, we’ll stick with vector spaces over the real numbers, also known as **real vector spaces**. A real vector space is basically what you think of when you think of space. The number line is a 1-dimensional real vector space, the x-y plane is a 2-dimensional real vector space, 3-dimensional space is a 3-dimensional real vector space, and so on. If you learned about vectors in school, then you are probably familiar with thinking about them as arrows which you can add together, multiply by a real number, and so on, but multiplying vectors together works differently. Does this sound familiar? It should. That’s how matrices work, and it’s no coincidence.

The most important fact about vector spaces is that they always have a basis. A **basis** of a vector space is a set of vectors such that any vector in the space can be written, in exactly one way, as a linear combination of those basis vectors. If $v_1, \dots, v_n$ are your basis vectors, then $a_1 v_1 + \dots + a_n v_n$ is a linear combination whenever $a_1, \dots, a_n$ are real numbers. A concrete example is the following: a basis for the x-y plane is the pair of vectors $(1, 0)$ and $(0, 1)$. Any vector is of the form $(a, b)$, which can be written as

$$(a, b) = a\,(1, 0) + b\,(0, 1),$$

so we indeed have a basis! This is not the only possible basis. In fact, the vectors in our basis don’t even have to be perpendicular! For example, the vectors form a basis since we can write

.
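
As an illustration with vectors of our own choosing (an assumption, not necessarily the pair intended above), take $(1, 0)$ and $(1, 1)$, which are not perpendicular. The sketch below finds the coefficients $c_1, c_2$ with $c_1 (1, 0) + c_2 (1, 1) = v$ using Cramer's rule. The function name `coordinates` is hypothetical.

```python
from fractions import Fraction


def coordinates(b1, b2, v):
    """Return (c1, c2) with c1*b1 + c2*b2 == v, for 2D vectors given as tuples."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are parallel, so they do not form a basis")
    # Cramer's rule for the 2x2 system c1*b1 + c2*b2 = v.
    c1 = Fraction(v[0] * b2[1] - b2[0] * v[1], det)
    c2 = Fraction(b1[0] * v[1] - v[0] * b1[1], det)
    return c1, c2


b1, b2 = (1, 0), (1, 1)  # hypothetical non-perpendicular basis of the plane
c1, c2 = coordinates(b1, b2, (5, 3))
print(c1, c2)  # 2 3
```

Indeed $2\,(1, 0) + 3\,(1, 1) = (5, 3)$. The coefficients exist and are unique exactly when the two vectors are not parallel, which is what the determinant check guards against.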

Now, a **linear transformation** is simply a function between two vector spaces that happens to be **linear**. Being linear is an extremely nice property. A function $f$ is linear if the following two properties hold for all vectors $u, v$ and all real numbers $c$:

$$f(u + v) = f(u) + f(v), \qquad f(cv) = c\,f(v).$$
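
The two properties, additivity $f(u + v) = f(u) + f(v)$ and homogeneity $f(cv) = c\,f(v)$, can be spot-checked numerically. The sketch below is our own illustration on the real line: the three sample functions and the name `looks_linear` are hypothetical. Passing the check on finitely many inputs does not prove linearity, but a single failure does disprove it.

```python
def looks_linear(f, samples, scalars):
    """Spot-check additivity and homogeneity of f on the given sample inputs."""
    additive = all(f(u + v) == f(u) + f(v) for u in samples for v in samples)
    homogeneous = all(f(c * v) == c * f(v) for c in scalars for v in samples)
    return additive and homogeneous


def triple(x):
    return 3 * x      # linear


def square(x):
    return x * x      # not linear: square(1 + 1) is 4, square(1) + square(1) is 2


def shift(x):
    return x + 1      # not linear: shift(0 + 0) is 1, shift(0) + shift(0) is 2


samples = [-2, -1, 0, 1, 3]
scalars = [-1, 0, 2, 5]

print(looks_linear(triple, samples, scalars))  # True
print(looks_linear(square, samples, scalars))  # False
print(looks_linear(shift, samples, scalars))   # False
```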

For example, the function defined on the real line is not linear, since whereas . Now, we connect together all the ideas we’ve talked about so far: matrices, bases, and linear transformations. The connection is that **matrices are representations of linear transformations**, and you can figure out how to write the matrix down by seeing how it acts on a basis. To understand the first statement, we need to see why the second is true. The idea is that any vector is a linear combination of basis vectors, so you only need to know how the linear transformation affects each basis vector. This is because, since the function $f$ is linear, if we have an arbitrary vector $v$ which can be written as a linear combination $v = a_1 v_1 + \dots + a_n v_n$, then

$$f(v) = f(a_1 v_1 + \dots + a_n v_n) = a_1 f(v_1) + \dots + a_n f(v_n).$$

Notice that the value of $f(v)$ is completely determined by the values $f(v_1), \dots, f(v_n)$, and so that’s all the information we need to completely define the linear transformation. Where does the matrix come in? Well, once we choose a basis for both the domain and the target of the linear transformation, the columns of the matrix will represent the images of the basis vectors under the function. For example, suppose we have a linear transformation $f$ which maps $\mathbb{R}^3$ to $\mathbb{R}^2$, meaning it takes in 3-dimensional vectors and spits out 2-dimensional vectors. Right now $f$ is just some abstract function which we have no way of writing down on paper. Let’s pick a basis for both our domain (3-space) and our target (2-space, or the plane). A nice choice would be $v_1 = (1, 0, 0)$, $v_2 = (0, 1, 0)$, $v_3 = (0, 0, 1)$ for the former and $w_1 = (1, 0)$, $w_2 = (0, 1)$ for the latter. All we need to know is how $f$ affects $v_1, v_2, v_3$, and the basis for the target is for writing down the values concretely. The matrix for our function will be a 2-by-3 matrix, where the 3 columns are indexed by $v_1, v_2, v_3$ and the 2 rows are indexed by $w_1, w_2$. All we need to write down are the values $f(v_1), f(v_2), f(v_3)$. For concreteness, let’s say

Then the corresponding matrix will be

The reason why this works is that matrix multiplication was designed so that if you multiply a matrix by the vector with all zeroes except a 1 in the $j$-th entry, then the result is just the $j$-th column of the matrix. You can check this for yourself. So we know that the matrix, call it $M$, works correctly when applied to (multiplied by) basis vectors. But matrices also satisfy the same properties as linear transformations, namely $M(u + v) = Mu + Mv$ and $M(cv) = c(Mv)$, where $u, v$ are vectors and $c$ is a real number. Therefore $M$ works for all vectors, so it’s the correct representation of $f$. Note that if we had chosen different vectors for the basis vectors, the matrix would look different. Therefore, matrices are not natural in the sense that they depend on what bases we choose.
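
The recipe "column $j$ is the image of the $j$-th basis vector" can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the map `f` is a made-up linear transformation from 3-space to the plane, and the helper names `matrix_of` and `mat_vec` are hypothetical. Both spaces use their standard bases.

```python
def f(v):
    """A hypothetical linear map from 3-space to the plane (illustration only)."""
    x, y, z = v
    return (x + 2 * y, 3 * z - y)


def matrix_of(func, basis):
    """Build the matrix whose j-th column is func(basis[j]), stored as a list of rows."""
    columns = [func(b) for b in basis]
    return [list(row) for row in zip(*columns)]


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return tuple(sum(entry * x for entry, x in zip(row, v)) for row in m)


standard_basis = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
M = matrix_of(f, standard_basis)
print(M)  # [[1, 2, 0], [0, -1, 3]]

# The matrix built from three basis images reproduces f on any other vector.
v = (4, 5, 6)
print(mat_vec(M, v), f(v))  # (14, 13) (14, 13)
```

Only three evaluations of `f` were needed to build `M`, yet `M` then agrees with `f` everywhere, which is the point of the linearity argument above.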

Now, finally to answer the question posed at the beginning. Why does matrix multiplication work the way it does? Let’s take a look at the two matrices we had in the beginning: and . We know that these correspond to linear functions on the plane, let’s call them and , respectively. Multiplying matrices corresponds to **composing** their functions. Therefore, doing is the same as doing for any vector . To determine what the matrix should look like, we can see how it affects the basis vectors . We have

so the first column of should be , and

so the second column of should be . Indeed, this agrees with the answer we got in the beginning by matrix multiplication! Although this is not at all a rigorous proof, since it’s just an example, it captures the idea of the reason matrix multiplication is the way it is.
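
The same check can be run mechanically. In the sketch below, the matrices `A` and `B` are hypothetical stand-ins (not the ones from the beginning of the post), and the helper names are our own. One function builds the matrix of the composition column by column, by applying `B` and then `A` to each standard basis vector. The other uses the usual row-by-column rule. The two agree.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(entry * x for entry, x in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Standard row-by-column product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def compose_columns(a, b):
    """Matrix of v -> a(b(v)): column j is a applied to b applied to e_j."""
    n = len(b[0])
    basis = [[1 if i == j else 0 for i in range(n)] for j in range(n)]
    columns = [mat_vec(a, mat_vec(b, e)) for e in basis]
    return [list(row) for row in zip(*columns)]


A = [[1, 2], [3, 4]]  # hypothetical example matrices
B = [[0, 1], [5, 6]]

print(mat_mul(A, B))                           # [[10, 13], [20, 27]]
print(compose_columns(A, B) == mat_mul(A, B))  # True
```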

Now that we understand how and why matrix multiplication works the way it does, how does matrix division work? You are probably familiar with functional inverses. The **inverse** of a function $f$ is a function $f^{-1}$ such that $f^{-1}(f(x)) = x$ and $f(f^{-1}(y)) = y$ for all $x$ and $y$. Since multiplication of matrices corresponds to composition of functions, it only makes sense that the multiplicative inverse of a matrix is the compositional inverse of the corresponding function. That’s why not all matrices have multiplicative inverses. Some functions don’t have compositional inverses! For example, the linear function mapping $\mathbb{R}^2$ to $\mathbb{R}$ defined by has no inverse, since many vectors get mapped to the same value (what would be? ? ?). This corresponds to the fact that the 1×2 matrix has no multiplicative inverse. So dividing by a matrix $A$ is just multiplication by $A^{-1}$, if it exists. There are algorithms for computing inverses of matrices, but we’ll save that for another post.
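
Here is a small Python sketch of why some matrices cannot be inverted. The 1×2 matrix below, which sends $(x, y)$ to $x + y$, is a hypothetical example chosen for illustration. It maps two different vectors to the same value, so no function can undo it. For square 2-by-2 matrices, a standard fact (not derived in this post) is that an inverse exists exactly when the determinant $ad - bc$ is nonzero. The helper names are our own.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(entry * x for entry, x in zip(row, v)) for row in m]


def is_invertible_2x2(m):
    """A 2x2 matrix [[a, b], [c, d]] is invertible exactly when a*d - b*c != 0."""
    (a, b), (c, d) = m
    return a * d - b * c != 0


row = [[1, 1]]  # hypothetical 1x2 matrix: (x, y) -> x + y

# Two different inputs collapse to the same output, so the map has no inverse.
print(mat_vec(row, [1, 0]))  # [1]
print(mat_vec(row, [0, 1]))  # [1]

print(is_invertible_2x2([[2, 1], [4, 3]]))  # True
print(is_invertible_2x2([[1, 2], [2, 4]]))  # False
```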