Test of Independence Study Material

Let us start the topic by reviewing the concept of independent events. After all, as the name implies, the “test of independence” tests whether two events are independent or not.

  1. Independent Events

We previously defined two events A and B as independent if P(A) = P(A|B); that is, the chance of A is the same as the chance of A given B. The example we used involved calculating P(red card) and P(odd number card). We found that P(odd number card) = P(odd number card | red card). The probability of picking an odd number card is the same regardless of whether you pick it 1) from the entire deck of cards, P(odd number card), or 2) from only the red cards in the deck, P(odd number card | red card).

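You can verify this conclusion with the counts in Table 1. The counts there imply that Ace, Jack, Queen, and King are numbered 1, 11, 12, and 13, which gives 7 odd and 6 even cards per suit.

  • P(odd number card) = 28/52 = 7/13
  • P(odd number card | red card) = 14/26 = 7/13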

Table 1. Cross-tabulation of Even/Odd Number and Color of Poker Cards

 

|              | Odd number | Even number | Row Total |
|--------------|------------|-------------|-----------|
| Red          | 14         | 12          | 26        |
| Black        | 14         | 12          | 26        |
| Column Total | 28         | 24          | 52        |

 

When we say two events are independent, we mean that knowing one event has occurred does not change the probability of the other: the marginal probability (i.e., the probability of the event by itself) equals the conditional probability.
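The card example can be checked numerically. The following is a minimal Python sketch that uses only the counts from Table 1; the dictionary layout and variable names are my own choices for illustration.

```python
from fractions import Fraction

# Counts from Table 1: (color, parity) -> number of cards
counts = {
    ("red", "odd"): 14, ("red", "even"): 12,
    ("black", "odd"): 14, ("black", "even"): 12,
}

total = sum(counts.values())
odd_total = sum(n for (_, parity), n in counts.items() if parity == "odd")
red_total = sum(n for (color, _), n in counts.items() if color == "red")

p_odd = Fraction(odd_total, total)                             # P(odd number card)
p_odd_given_red = Fraction(counts[("red", "odd")], red_total)  # P(odd number card | red card)

# The two events are independent when these two probabilities are equal
print(p_odd, p_odd_given_red, p_odd == p_odd_given_red)  # 7/13 7/13 True
```

Exact fractions are used here so that the equality check is not affected by floating-point rounding.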

In a test of independence, we determine whether two categorical variables are independent by summarizing them in a cross-tabulation. This statement contains several boundaries and conditions attached to the test of independence:

  1. At the current stage, we will only conduct tests that involve two dimensions or events (e.g., black or red, and odd or even);
  2. The variables involved must be categorical data;
  3. You need to summarize your sample data in the cross-tabulation format.

Please see below for a snippet of a sample dataset that is ideal for this type of test.


Figure 1. A snippet of sample data

The sample data involves two events (or dimensions): Office and Yes/No. “Office” has three values: Office 1, Office 2, and Office 3. The Yes/No event has two values: Yes and No. Therefore, the cross-tabulation for this data can take the following form.

 

|              | Office 1 | Office 2 | Office 3 | Total Yes/No |
|--------------|----------|----------|----------|--------------|
| Yes          | x1-yes   | x2-yes   | x3-yes   | Σx-yes       |
| No           | x1-no    | x2-no    | x3-no    | Σx-no        |
| Total Office | Σx1      | Σx2      | Σx3      | Σx           |

 

(* You will be provided with a data file that contains two or more variables that can be used to conduct a test of independence. Please refer to the later section on how to use a pivot table to make a cross-tabulation like the one above.)
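For readers who would rather script the cross-tabulation, here is a minimal Python sketch of the same counting idea. It is not the pivot table method, and the records below are hypothetical values invented purely for illustration, not the data from Figure 1.

```python
from collections import Counter

# Hypothetical raw records (made up for illustration): one (office, answer) pair per row
records = [
    ("Office 1", "Yes"), ("Office 1", "No"), ("Office 2", "Yes"),
    ("Office 2", "Yes"), ("Office 3", "No"), ("Office 3", "No"),
    ("Office 1", "Yes"), ("Office 3", "Yes"),
]

cell_counts = Counter(records)                             # frequency of each (office, answer) cell
office_totals = Counter(office for office, _ in records)   # column totals
answer_totals = Counter(answer for _, answer in records)   # row totals

offices = sorted(office_totals)
print("", offices, "Total Yes/No")
for answer in ("Yes", "No"):
    row = [cell_counts[(office, answer)] for office in offices]
    print(answer, row, answer_totals[answer])
print("Total Office", [office_totals[o] for o in offices], len(records))
```

Each cell is simply the number of records that share the same pair of values, and the row and column totals are the counts along one dimension only.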

  2. Expected Form of Cross-Tabulation when Two Events (Dimensions) are Independent

Some books say that we are testing the independence of “two variables”. I think this is confusing because a dataset sometimes contains more than two variables (e.g., the previous “ideal” sample dataset). It is really two events, or two dimensions, that are under consideration in each test.

The following table is intended to show whether two events, degree and income, are independent. Each x represents the frequency of one of the four degree and income combinations. We know the column and row totals, but not the specific x values.

 

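|                      | High Income | Low Income | Row Total |
|----------------------|-------------|------------|-----------|
| Graduate Degree      | xg-high     | xg-low     | 200       |
| Undergraduate Degree | xu-high     | xu-low     | 400       |
| Column Total         | 240         | 360        | 600       |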

 

WHEN TWO EVENTS ARE INDEPENDENT, the frequencies form a certain pattern: even though we do not know the specific value of each x, as long as we know the column and row totals, we can correctly infer those x values. Please refer to the following calculation.

If degree and income are independent, based on the previously learned probability calculation, we have:

  • P(High Income) = P(High Income | Graduate Degree), so 240/600 = xg-high/200
  • xg-high = 200 × 240/600 = 80
  • The proportion of the High Income column total (240) over the grand total (600) is the same as the graduate degree and high income count (xg-high = 80) over the Graduate Degree row total (200). Likewise, the proportion of the Graduate Degree row total (200) over the grand total (600) is the same as xg-high (80) over the High Income column total (240).
  • You can work out the remaining three x values in the same way: xg-low = 120, xu-high = 160, and xu-low = 240.

Do you sense (or notice) the pattern and do you understand why?

  • If income and degree have nothing to do with each other, the pattern (or the distribution) of the income groups (low and high) should be the same within each degree (undergraduate and graduate). Here, that means 40% high income and 60% low income in both degree groups.

Based on this expected pattern of cross-tabulation of two independent events, we can conduct the test of independence. In this test, instead of relying on previously used sample statistics (i.e., mean, standard deviation, and proportion), we use the frequency pattern of each group.
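The proportion equality used above (for example, 240/600 = xg-high/200) rearranges to: expected count = row total × column total / grand total. Below is a minimal Python sketch that applies this to the degree and income totals; the function name `expected_counts` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction

# Row and column totals from the degree/income table
row_totals = {"Graduate Degree": 200, "Undergraduate Degree": 400}
col_totals = {"High Income": 240, "Low Income": 360}
grand_total = 600

def expected_counts(row_totals, col_totals, grand_total):
    """Expected cell count under independence: row total * column total / grand total."""
    return {
        (row, col): Fraction(row_total * col_total, grand_total)
        for row, row_total in row_totals.items()
        for col, col_total in col_totals.items()
    }

expected = expected_counts(row_totals, col_totals, grand_total)
for cell, value in expected.items():
    print(cell, value)
```

For these totals the four expected counts come out to 80, 120, 160, and 240.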

  3. Measuring the Differences between Expected and Observed Patterns using Chi-square

Here is the completed table of expected frequencies, shown in cross-tabulation form, when the two events are independent.

 

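|                      | High Income | Low Income | Row Total |
|----------------------|-------------|------------|-----------|
| Graduate Degree      | 80          | 120        | 200       |
| Undergraduate Degree | 160         | 240        | 400       |
| Column Total         | 240         | 360        | 600       |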

 

The following table shows the actual (or observed) frequency for each group. (Remark: it is just an example I made up.)

 

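|                      | High Income | Low Income | Row Total |
|----------------------|-------------|------------|-----------|
| Graduate Degree      | 160         | 40         | 200       |
| Undergraduate Degree | 80          | 320        | 400       |
| Column Total         | 240         | 360        | 600       |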

 

Let us now see how much the observed frequencies deviate from the expected frequencies. To do so, I use a test statistic called chi-square, which sums, over all cells, the squared difference between the observed and expected frequency divided by the expected frequency.

Chi-square = (160-80)^2/80 + (40-120)^2/120 + (80-160)^2/160 + (320-240)^2/240 = 80 + 53.33 + 40 + 26.67 = 200
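The same calculation can be sketched in Python. This assumes the standard Pearson form of the statistic, in which each squared deviation is divided by the expected frequency of its cell; the function name `chi_square` is my own, and the observed counts are the made-up example from the table above.

```python
# Expected frequencies under independence and the made-up observed frequencies,
# laid out as rows (Graduate, Undergraduate) by columns (High Income, Low Income)
expected = [[80, 120], [160, 240]]
observed = [[160, 40], [80, 320]]

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

statistic = chi_square(observed, expected)
print(round(statistic, 2))
```

A larger value means the observed table sits further from the pattern expected under independence.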
