Let us start the topic by reviewing the concept of independent events. After all, as the name implies, the “test of independence” tests whether two events are independent or not.
- Independent Events
We previously defined two events A and B as independent if P(A) = P(A|B), that is, the chance of A is the same as the chance of A given B. The example we used involved calculating P(red card) and P(odd number card). We found that P(odd number card) = P(odd number card | red card). The probability of picking an odd number card is the same regardless of whether you pick it 1) from the entire deck of cards, P(odd number card), or 2) from only the red cards in the deck, P(odd number card | red card).
You can verify this conclusion with the counts in Table 1.
- P(odd number card) = 28/52 = 7/13
- P(odd number card | red card) = 14/26 = 7/13
Table 1. Cross-tabulation of Even/Odd Number and Color of Poker Cards
| | Odd number | Even number | Row Total |
|---|---|---|---|
| Red | 14 | 12 | 26 |
| Black | 14 | 12 | 26 |
| Column Total | 28 | 24 | 52 |
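As a quick check of the card example, the short Python sketch below recomputes both probabilities from the Table 1 counts and compares them. The variable names are illustrative only; exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction

# Counts copied from Table 1 (rows: card color, columns: odd/even number).
counts = {
    "Red": {"Odd": 14, "Even": 12},
    "Black": {"Odd": 14, "Even": 12},
}

grand_total = sum(sum(row.values()) for row in counts.values())  # 52 cards
odd_total = sum(row["Odd"] for row in counts.values())           # 28 odd cards
red_total = sum(counts["Red"].values())                          # 26 red cards

# Marginal probability of an odd card, and the same probability among red cards.
p_odd = Fraction(odd_total, grand_total)
p_odd_given_red = Fraction(counts["Red"]["Odd"], red_total)

print(p_odd, p_odd_given_red, p_odd == p_odd_given_red)  # 7/13 7/13 True
```

Because the two fractions are equal, the color of the card tells you nothing about whether its number is odd.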
When we say two events are independent, we mean that knowing one event occurred does not change the probability of the other: each event's marginal probability (i.e., the probability of the event by itself) is unaffected by the other event.
In a test of independence, we determine whether two categorical variables are independent by summarizing them in a cross-tabulation. This statement implies several boundaries and conditions for the test of independence:
- At the current stage, we will only conduct tests that involve two dimensions (e.g., black or red and odd or even) or events;
- The variables involved will be categorical data;
- You need to summarize your sample data in the cross-tabulation format.
Please see below for a snippet of a sample dataset that is well suited to this type of test.
Figure 1. A snippet of sample data
The sample data involves two events (or dimensions): Office and Yes/No. “Office” has three values: Office 1, Office 2, and Office 3. The Yes/No event has two values: Yes and No. Therefore, the cross-tabulation for this data can take the following form.
| | Office 1 | Office 2 | Office 3 | Total Yes/No |
|---|---|---|---|---|
| Yes | x1-yes | x2-yes | x3-yes | Σx-yes |
| No | x1-no | x2-no | x3-no | Σx-no |
| Total Office | Σx1 | Σx2 | Σx3 | Σx |
(* You will be provided with a data file that contains two or more variables that can be used to conduct a test of independence. Please refer to a later section for how to use a pivot table to make a cross-tabulation like the one above.)
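For readers who want to see the tallying logic outside a spreadsheet, here is a minimal Python sketch of how raw records turn into a cross-tabulation with row and column totals. The records are hypothetical and made up purely for illustration (they are not the course data file), and `cross_tab` is a helper name invented for this sketch.

```python
from collections import Counter

# Hypothetical records of (office, answer); NOT the course data file.
records = [
    ("Office 1", "Yes"), ("Office 1", "No"), ("Office 2", "Yes"),
    ("Office 2", "Yes"), ("Office 3", "No"), ("Office 3", "No"),
    ("Office 1", "Yes"),
]

cell_counts = Counter(records)  # frequency of each (office, answer) pair
offices = sorted({office for office, _ in records})
answers = ["Yes", "No"]

def cross_tab(cell_counts, answers, offices):
    """Return rows of counts; the last entry of each row is the row total."""
    table = {}
    for answer in answers:
        row = [cell_counts[(office, answer)] for office in offices]
        table[answer] = row + [sum(row)]
    # Column totals, including the grand total in the last position.
    table["Total Office"] = [
        sum(table[answer][i] for answer in answers)
        for i in range(len(offices) + 1)
    ]
    return table

table = cross_tab(cell_counts, answers, offices)
for label, row in table.items():
    print(label, row)
```

Each row lists the counts for Office 1, Office 2, and Office 3, followed by the row total, which mirrors the layout of the cross-tabulation above.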
- Expected Form of Cross-Tabulation when Two Events (dimensions) are Independent
Some books say that we are testing the independence of “two variables”. I think this is confusing because a dataset sometimes contains more than two variables (as in the previous “ideal” sample dataset). It is really two events, or two dimensions, that are under consideration in the test.
The following table is intended to show whether two events, degree and income, are independent. Each x represents the frequency of one of the four combinations. We know the column and row totals, but we do not know the specific x values.
| | High Income | Low Income | Row Total |
|---|---|---|---|
| Graduate Degree | xg-high | xg-low | 200 |
| Undergraduate Degree | xu-high | xu-low | 400 |
| Column Total | 240 | 360 | 600 |
WHEN TWO EVENTS ARE INDEPENDENT, the frequencies follow a specific pattern: even though we do not know the individual x values, we can infer them correctly as long as we know the column and row totals. Please refer to the following calculation.
If degree and income are independent, then based on the probability rules we learned earlier:
- P(High Income) = P(High Income | Graduate Degree), so 240/600 = xg-high/200
- xg-high = 200 × 240/600 = 80
- The proportion of the High Income column total (240) to the grand total (600) equals the proportion of the Graduate Degree and High Income count (xg-high = 80) to the Graduate Degree row total (200). Likewise, the proportion of the Graduate Degree row total (200) to the grand total (600) equals the proportion of xg-high (80) to the High Income column total (240).
- You can work out the remaining three x values in the same way.
Do you sense (or notice) the pattern and do you understand why?
- If income and degree have nothing to do with each other, the pattern (or the distribution) of each income group (low and high) within each degree (undergraduate and graduate) should not differ drastically.
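The pattern described above amounts to a simple rule: under independence, each expected count equals its row total times its column total, divided by the grand total. The sketch below applies that rule to the degree and income totals; `expected_counts` is a helper name chosen for this illustration.

```python
# Row and column totals from the degree/income cross-tabulation.
row_totals = {"Graduate Degree": 200, "Undergraduate Degree": 400}
col_totals = {"High Income": 240, "Low Income": 360}
grand_total = 600

def expected_counts(row_totals, col_totals, grand_total):
    """Expected frequency of each cell if the two dimensions are independent."""
    return {
        (row, col): row_total * col_total / grand_total
        for row, row_total in row_totals.items()
        for col, col_total in col_totals.items()
    }

expected = expected_counts(row_totals, col_totals, grand_total)
for cell, value in expected.items():
    print(cell, value)
```

For the Graduate Degree and High Income cell this gives 200 × 240 / 600 = 80, the same value obtained from the conditional probability argument.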
Based on this expected pattern of cross-tabulation of two independent events, we can conduct the test of independence. In this test, instead of relying on previously used sample statistics (i.e., mean, standard deviation, and proportion), we use the frequency pattern of each group.
- Measuring the Differences between Expected and Observed Pattern using Chi-square
Here are the completed expected frequencies, shown in cross-tabulation form, for the case where the two events are independent.
| | High Income | Low Income | Row Total |
|---|---|---|---|
| Graduate Degree | 80 | 120 | 200 |
| Undergraduate Degree | 160 | 240 | 400 |
| Column Total | 240 | 360 | 600 |
The following table shows the actual (or observed) frequency for each group. (Remark: it is just an example I made up.)
| | High Income | Low Income | Row Total |
|---|---|---|---|
| Graduate Degree | 160 | 40 | 200 |
| Undergraduate Degree | 80 | 320 | 400 |
| Column Total | 240 | 360 | 600 |
Let us now see how much the observed frequencies deviate from the expected frequencies. To do so, I am using a test statistic called chi-square. For each group, we square the difference between the observed and expected frequency, divide by the expected frequency, and then add up the results.
Chi-square = (160-80)^2/80 + (40-120)^2/120 + (80-160)^2/160 + (320-240)^2/240 = 80 + 53.33 + 40 + 26.67 = 200
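The calculation can also be sketched in a few lines of Python. This sketch uses the standard Pearson form of the statistic, in which each squared difference is divided by the expected frequency of that group; with that convention, the observed and expected tables above give a chi-square of 200. The function name `chi_square` is chosen for this illustration.

```python
# Observed and expected frequencies, in the same cell order:
# (Graduate, High), (Graduate, Low), (Undergraduate, High), (Undergraduate, Low)
observed = [160, 40, 80, 320]
expected = [80, 120, 160, 240]

def chi_square(observed, expected):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

statistic = chi_square(observed, expected)
print(round(statistic, 2))  # 200.0
```

The larger this statistic, the further the observed pattern is from the pattern we would expect if degree and income were independent.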