BA540
STATISTICAL METHODS


COURSE POLICIES / COURSE SCHEDULE / Grades (as of 5/5)

OTHER LINKS
American Statistical Association/ Journal of Statistics Education/ Fed Stats/ Teaching Applets /
http://www.statsoftinc.com/textbook/stathome.html /  Directory of Online Statistics Sources/ 

Dr. Brian Goff/414 Grise Hall
Phone (502)745-3855/brian.goff@wku.edu
Last Modified: January 12, 2007
Western Kentucky University






COURSE POLICIES

CONTACT INFORMATION
414 Grise Hall/ 745-3855
email: brian.goff@wku.edu Office Hours: 8-9, 10-11 MWF; 2-5 W
(I am in my office or on campus most days from around 8-4 except around noon)

TEXTS/SOFTWARE
Managerial Statistics: A Case-Based Approach by Klibanoff and others  (Thomson-Southwestern)
Excel & SPSS (both on WKUnet or personal copy)

GRADING
Reading Quizzes             30%
Weekly Assignments      40%
Final Exam                     20%

Assignments turned in after the beginning of the designated class period will receive a 10% per business day reduction.  Quizzes will be in-class multiple choice questions over assigned readings. "Mini" assignments are short in/out of class projects on which you will receive either full or no credit with one "grace."  I will discuss the final exam later in the course.

MISCELLANEOUS
Last day to drop course with a "W" or change from credit to audit is listed on Topnet.  Any students requiring special consideration under the provisions of the ADA should register with the ADA Compliance Office.  If you are not fluent in English or are weak in your writing abilities, you should utilize a writing "consultant" to examine your written reports before turning them in.  The WKU Writing Center is one option.  Undergraduate students willing to offer tutorial services (for a fee or free) are another. 

ATTENDANCE/MISSED ASSIGNMENTS
The combination of students with job responsibilities and a course which only meets once per week can present problems. The policy below attempts to strike a balance between accommodation and legitimate standards. Under special circumstances, and with my approval of the reason obtained in advance, one assignment may be missed and the other assignments/final weighted more heavily. Missing more than two assignments or being absent from more than 2 full sessions (or equivalent) will result in a student being dropped from the course regardless of the reason.

COURSE OBJECTIVES
The goal is to enhance your ability to envision, design, conduct, and report data-oriented or technical projects with business-related applications. The course is concerned with data analysis, but it also emphasizes building your ability to solve problems with data analysis and to communicate results.



COURSE SCHEDULE (check back for updates & changes)

Week 1:  Getting Started -- Data, Software, & Brain-Power
                Week 1 Assignment
Week 2:  Data Description & Basic Tests     Reading Quiz = Chapter 1-2; Statsoft -- Elementary Concepts
                Week 2 Assignment
Week 3:  Using Basic Data Tools      Reading Quiz = Statsoft -- Data Mining; Statsoft -- Quality Control Charts
               Week 3 Assignment                                      
Week 4:  Writing Data-Oriented Reports (Reading Quiz = UN -- Making Data Meaningful; Brief PowerPoint on Writing)
                Week 4 Assignment      (In Class:  St. Louis Fed Review  2006)  
Week 5:  Basic Regression (Reading Quiz = Chapter 3-4)
                Week 5 Assignment                      
Week 6:   Regression with Slope and Dummy Variables  (Reading Quiz = Chapter 5)  
                Week 6 Assignment
Week 7:  Regression with Non-Linear Applications (Reading = Chapter 6, 8)
                Week 7 Assignment
               
Week 8:  Spring Break -- No Class  

Week 9:  Regression and Other Methods for Limited Dependent Variables --(LPM & Logistic Regression Powerpoint)
                Week 9 Assignment
Week 10:  General Modelling Issues with Regression (Reading Quiz = Chapters 7-8)
                 Week 10 (No Assignment)
Week 11: Time-Oriented Regression Applications (Reading Quiz = Chapter 9; Reading Supplement)
                Week 11 Assignment
Week 12:  Writing a Regression-Oriented Report
                 Week 12 Assignment
Week 13:  Experimental Design; Surveys (Reading Quiz = Statsoft -- Experimental Design; Ex Design Supplement; Surveys Supplement)
                Week 13 Assignment
Week 14:  Simulations;  Factor Analysis (Reading Quiz = Simulations Supplement; Statsoft -- Factor Analysis; Factor Link)
                 Week 14 Assignment
Week 15:  Reminders  (No Reading Quiz)  -- Avoiding Big Errors; Regression Examples 
                              
Week 16:  Final Exam
 




READING SUPPLEMENT -- Time-Oriented Regression & Forecasting
Determining Patterns (components) Time Series Data

The analysis of data collected at regular time intervals represents an important area of business statistics for two inter-related reasons: 1) it permits patterns from past changes in a variable to be examined; and 2) these past patterns can be used as a tool to forecast future values. Unlike regression analysis where data must be gathered on both dependent and independent variables, time series methods can be used even if data have been collected only on the variable of interest. This provides a major advantage because it reduces the amount of data required. Chapter 14 in Siegel introduces some of the concepts of time series analysis. The discussion in this supplement covers a few of the things that Chapter 14 omits or leaves a bit fuzzy. Some of the key terms are:

Y(t) = values for variable Y (such as sales) measured at regular time intervals (t) such as monthly. Y(t) is also referred to as "the level of Y" or a "time series on Y."

Y(t-1) = value for the variable Y in the prior period. For instance, with monthly sales data, Y(t-1) refers to the prior month's sales. Y(t-1) is also called the "lagged" value of Y or the "lag of Y."

Y(t) - Y(t-1) = difference in Y; it is also known as the "first difference" of Y as well as the "change in Y" from period to period. It measures how much the variable changed from its value in the prior period. For monthly sales, it shows how much sales changed from the prior month.

Structural Models = regression models using time-based data where several other variables (X-variables) are used as explanatory variables in predicting values of Y;

Time Series Models = equations which use past values of Y, past differences in Y, and information drawn from these past values to predict current values;

Time Series Components = patterns in the past movements of Y such as a trend component, a cyclical component, a seasonal component, or a random component.
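
The lag and first-difference definitions above can be illustrated with a few lines of Python. This is only a sketch: the sales figures are hypothetical and the variable names are chosen here for illustration.

```python
# Hypothetical sales figures measured at regular intervals (illustration only).
sales = [200, 208, 215, 211, 220]

# Y(t-1): the value one period earlier; the first period has no lagged value.
lagged = [None] + sales[:-1]

# Y(t) - Y(t-1): the first difference, undefined for the first period.
diff = [None] + [sales[t] - sales[t - 1] for t in range(1, len(sales))]

for t, (y, y_lag, d) in enumerate(zip(sales, lagged, diff), start=1):
    print(t, y, y_lag, d)
```

Note that one observation is lost each time a lag or difference is taken, which matters when the series is short.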

We will use weekly Sales for Company X from 1990 through 2001 to illustrate our points. The simplest (but sometimes misleading) way to determine patterns in the past history of sales is by looking at a graph of weekly sales plotted over time (Sales on the Y-axis and weeks & years on the X-axis) to see if there are any obvious trends, repetitive cycles, big jumps or dips during certain periods, and the like.


 

At first glance, the graph above appears to show that there may be an upward trend (a trend component) over the time frame of about $10 thousand per year. Also, there are some repetitive "ups and downs" (a cyclical component). If you look closely, you can also make out big dips of $20 to $30 thousand around the end of each year and an increase of a few thousand during the summer weeks (seasonal components).

While looking at graphs like the one above is a good first step, it leaves the size of the patterns to a lot of guesswork. To more precisely quantify the patterns in weekly sales, we can estimate an equation from the data that looks similar to a regression equation but uses only information on the variable under study or on time. The data in our file would appear as below, where "t" represents a particular week:

Week (t)   Sales (in thousands of $)
1          200
2          208
3          215

In the equation estimated below we will use the following definitions:

Sales(t) = weekly sales in thousands of dollars;
Trend = time variable counting the weeks; it starts at 1 and increases by 1 unit each week of the sample; this looks for a linear "trend" component (pattern) in the data;
Sales(t-1) = Sales in the prior week; this is the simplest means of looking for past "cycles" in the data;
XMAS = a seasonal indicator variable equal to 1 if a week included December 25 and 0 otherwise;
SUMMER = a seasonal indicator variable 1 for weeks from Memorial Day to Labor Day and 0 otherwise.

The results of estimating an equation to predict weekly Sales with these components appear below:

Variable      Coefficient   Std. Error   t-value   p-value
Constant      100           6.2          15.0      0.001
Trend         0.075         0.006        13.0      0.001
Sales(t-1)    0.50          0.03         15.00     0.001
XMAS          -32.0         3.5          9.00      0.001
SUMMER        6.0           1.3          4.00      0.001

R-square = 0.82
Durbin-Watson = 2.00
Box-Pierce Q(12 lags) = 10.2 (p-value = 0.65)
Mean Weekly Sales = $250 (000); Standard deviation Sales= $30 (000)


In equation form, these regression results would be written

Sales(t) = 100 + 0.075*Trend + 0.50*Sales(t-1) - 32*XMAS + 6*SUMMER + error(t)

The coefficients in this equation are interpreted the same way as regression coefficients. For instance, the Trend coefficient of 0.075 means that each week, sales increase by about $0.075 thousand ($75), holding the other variables constant. Around Christmas, sales drop by about $32 thousand. The "Durbin-Watson" statistic (2.00 in this case) is a measure of whether the errors in our model are dependent on each other (a bad thing). Values for the Durbin-Watson between about 1.6 and 2.4 are viewed as indicating independence of the errors (a good thing). The "Box-Pierce Q-Statistic" is another measure of this same thing.
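
To make the Durbin-Watson statistic less of a black box, here is a minimal Python sketch of how it is computed from a model's errors: the sum of squared changes in the errors divided by the sum of squared errors. The two error series are hypothetical, made up only to show the extremes; the function name is also chosen here for illustration.

```python
def durbin_watson(errors):
    """Sum of squared period-to-period changes in the errors
    divided by the sum of squared errors."""
    num = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    den = sum(e ** 2 for e in errors)
    return num / den

# Hypothetical errors (in thousands of $), made up for illustration.
alternating = [1, -1, 1, -1, 1, -1]      # each error reverses the prior one
persistent = [3, 3, 2, 2, -2, -3, -3]    # each error tends to repeat the prior one

print(durbin_watson(alternating))  # well above 2
print(durbin_watson(persistent))   # well below 2
```

Errors that tend to repeat push the statistic toward 0, errors that tend to reverse push it toward 4, and independent errors give values near 2.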


Forecasts from Time Series Models

We can generate forecasts from this equation fairly easily. Suppose that for the last week of the 12 years of data (t = 624), actual sales were 300. The equation would forecast sales for the first week of 2002 (t = 625) and the second week of 2002 (t = 626) to be

Forecast Sales(t+1) = 100 + 0.075*(Trend = 625) + 0.50*(Sales(t) = 300) - 32*(XMAS = 0) + 6*(SUMMER = 0)
= 296.9

Forecast Sales(t+2) = 100 + 0.075*(Trend = 626) + 0.50*(Forecast Sales(t+1) = 296.9) - 32*(XMAS = 0) + 6*(SUMMER = 0)
= 295.4
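
The two-step forecast above can be reproduced with a short Python sketch. The coefficients come from the table of estimates; the function and variable names are chosen here for illustration.

```python
def forecast_sales(trend, prior_sales, xmas=0, summer=0):
    """Apply the estimated equation; sales are in thousands of dollars."""
    return 100 + 0.075 * trend + 0.50 * prior_sales - 32 * xmas + 6 * summer

last_actual = 300                       # actual sales in week 624
f1 = forecast_sales(625, last_actual)   # one week ahead uses the last actual value
f2 = forecast_sales(626, f1)            # two weeks ahead uses the week-625 forecast

print(round(f1, 1), round(f2, 1))
```

Because no actual value exists yet for week 625, the second forecast has to feed the first forecast back into the equation, so any error in the first step carries into the second.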

The goal in forecasting is to make accurate predictions. Errors = Actual Values - Forecasted Values. As in regression analysis, R-square is a measure of how well the patterns in the model account for past movements in the series. In addition to R-square, other measures are commonly used to evaluate how good the model is at forecasting (predicting) sales. Some of these are

Root Mean Square Error = a measure of the average size of the errors of the forecasts; it is the square root (sum of the squared errors divided by the number of errors);
Mean Absolute Error = a measure of the average size of the errors of the forecasts; it is the sum of the absolute value of the errors divided by the number of the errors;
Mean Absolute Percent Error = the average error size as a percent of the mean of the variable being forecasted; mean absolute error divided by the mean.
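
The three measures above can be sketched in Python as follows. The actual and forecasted values are hypothetical, and the percent error follows the definition given in this supplement (mean absolute error divided by the mean); some other sources average the individual percentage errors instead.

```python
import math

def forecast_diagnostics(actual, forecast):
    """Return (RMSE, MAE, MAPE) using the definitions in this supplement."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * mae / (sum(actual) / n)   # MAE as a percent of the mean
    return rmse, mae, mape

# Hypothetical weekly sales and forecasts (thousands of $), illustration only.
actual = [250, 260, 240, 250]
forecast = [245, 268, 244, 247]

rmse, mae, mape = forecast_diagnostics(actual, forecast)
print(round(rmse, 2), mae, round(mape, 1))
```

The root mean square error is never smaller than the mean absolute error because squaring gives large errors extra weight.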

"Static Forecasts" estimate the forecasts and errors using the data for the sample used to compute the time series equation. "Dynamic Forecasts" use data from outside the original sample to compute forecasts and errors.


For the model above, the forecasting diagnostics were as follows:

(Static) Forecast Diagnostics
Root Mean Square Error = $7500; Mean Absolute Error = $7000; Mean Absolute Percent Error = 2.8%
In words, the average weekly error in our forecasts was $7000 to $7500, or about 2.8% of mean weekly sales ($7000 / $250,000).

More Complex Patterns:
The example above investigated some of the simpler patterns to be found in a time series. More complicated patterns can be investigated in a number of ways. For one, trends are not necessarily linear. Using a squared Trend term can sometimes account for this. Sometimes the trends are much more complex, requiring special methods. Second, cyclical patterns may be much more subtle and complex requiring a second or third "lagged" value or even lagged error terms. Sometimes, the errors from the forecasts can be used to improve future forecasts (another cyclical pattern). Third, seasonal patterns may not be as straightforward as we estimated above.

Also, other variables can be added to a time series equation. In the example above, the existence, type, or amount of advertising done in the prior week might be included as an explanatory variable. Finally, one of the most subtle but most important issues in looking for time series patterns is to realize that sometimes patterns can be misleading. What appears to be a trend or a cycle may be nothing more than a series of random steps. For example, in class we will see a graph of a variable that appears to display a downward trend and possibly repetitive cycles around this downward trend, yet it is really nothing more than a series of random movements.


 
 
 

These kinds of time series that are a series of random steps are called "random walks." Chapter 14 discusses the idea a little. Random walks can masquerade as seeming trends or cycles in data. A random walk contains random movements from one step to the next. Suppose you select a starting point, say 50, then draw a ball at random (say 2) and put your new mark 2 places above your starting mark, so you are now at 52. Then you draw a new ball at random (say -5) and place the next mark 5 places below your prior mark, so you are now at 47. This makes the change in your position random. That is exactly how the graph shown above was generated.

Such a random walk (series of random steps) looks very different from a series where balls are drawn from a hopper where 50 is the mean and balls with numbers ranging from 45 to 55 are in the hopper and drawn at random. A series generated by this procedure is a "random series." It is also called a "white noise" series. Its graph probably appears more like the one people have in mind when they think of randomness.


 

Distinguishing a random series from one with patterns is not too difficult. The graph above has no obvious pattern. More precisely, an equation for Stock Price(t) that included lagged stock price, trends, or seasons would have coefficients near zero on all of them. Distinguishing random walks is a bit trickier. As the graph for the random walk shows, there seems to be a trend. If an equation for Stock Price(t) were estimated with a Trend, the Trend coefficient would also appear large. The key is the coefficient on the lagged stock price, Stock Price(t-1). A random walk will have a lagged coefficient near 1.0 (usually 0.9 to 1.0). The random walk above has this equation: Stock Price(t) = 0.17 + 0.98*Stock Price(t-1). The coefficient of 0.98 means that there is a random walk component in this series. Further analysis should be conducted using changes in the stock prices instead of the original levels.
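
The difference between a random walk and a random (white noise) series can be sketched with the ball-drawing procedures described above. This is a minimal illustration, not the procedure used to produce the class graphs: the step range of -5 to 5, the 500 periods, and the seed are all assumptions made here.

```python
import random

def lag_coefficient(series):
    """Slope from a least-squares fit of Y(t) on Y(t-1) with a constant."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 500

# Random walk: each value is the prior value plus a randomly drawn step.
walk = [50.0]
for _ in range(n - 1):
    walk.append(walk[-1] + rng.randint(-5, 5))

# Random (white noise) series: each value is drawn afresh from 45 to 55.
noise = [rng.randint(45, 55) for _ in range(n)]

walk_coef = lag_coefficient(walk)
noise_coef = lag_coefficient(noise)
print(round(walk_coef, 2), round(noise_coef, 2))
```

The lagged coefficient for the walk should come out close to 1.0, while the one for the white noise series should come out close to zero, which is the distinction the paragraph above relies on.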



SUPPLEMENTAL READING- EXPERIMENTAL DESIGN

Introduction: Experimental design refers to actively controlling the process by which data is generated so that the effects of one or more variables can be better isolated and more accurately estimated.  Experimental design methods are commonplace in the natural and life sciences, where many experiments are conducted within laboratories and most of the variables influencing outcomes can be controlled. However, the methods are also useful in other scientific and business settings where only some of the factors are controllable. In business, experimental design has been most widely used in production management settings to test different techniques or machinery. However, the same methods are adaptable to almost any managerial or personnel setting whether a physical product is being produced or a service is being offered.  The applications can range from very simple methods to very complicated designs.  Some of the key terms and concepts are defined below.

Factor(s): a variable(s) that wholly or partly determines changes in another variable (usually designated as the "response variable"); settings or "levels" of these factors refer to the different possible values the factor can take; in many situations, these factor levels are qualitative

Experimental Data: Data that is generated where one or more of the factors influencing the outcomes is actively manipulated so as to better isolate or eliminate its effect or the effect of other factors;

Observational Data: Data that is collected without any active manipulation of the factors that influence the outcomes

Experimental Design: The plan for manipulating factors when generating and observing outcomes; the plan may range from simply a change in the setting (level) of one factor -- a simple "intervention" -- to a very extensive design that holds some factors constant while changing the settings (levels) of other factors

Overview of Steps in an Experimental Design: While the specifics of a design vary based on the details of the industry, company, and particular issue, an organized approach to setting up the experiment should, more or less, follow the items presented below.
(Adapted from Coleman and Montgomery, Systematic Approach to Planning for a Designed Industrial Experiment, Technometrics, 1993).

1. Objectives of the experiment: a precise statement of the purpose of the experiment.  The statement should explain the factor-response relationship the experimenter desires to isolate and why.  The statement should be to the point, measurable, and relevant
2. Background: existing theoretical or statistical knowledge concerning the response variable or factors, if any, as well as how the current experiment fits in with this background
3. Response variable: identify how the variable is measured and the typical operating means and ranges (if known)
4. List factors & determine settings/controls:
a) Factors of main interest -- identify the variables influencing the response variable whose effects or combined effects (interactions) the design is intended to help isolate; determine the desired settings (levels) of these variables during the experiment including desired interactions;
b) Factors to be held constant -- identify other variables (not of main interest) that are likely to influence the response variable but whose settings can be held the same throughout the experiment.  Although these should be "held constant," small changes may be permissible.  If so, specify these "allowable" ranges of variation
c) Other controllable variables -- identify other variables (not of main interest) likely to influence the response variable that can be actively manipulated; determine the strategy for filtering out their effects (such as randomizing the variables of main interest among different settings of these factors or
d) Non-controllable factors -- identify the other variables (not of main interest) that influence the response variable and that cannot be held constant or actively controlled; if these variables can be measured, determine the specific measurement strategy; if these cannot be measured, attempt to identify the expected impact, if any on the experimental outcomes
5. Restrictions: identify and list cost, legal, managerial, or other limitations placed on the ability to actively control or manipulate factors in the experiment
6. Oversight and setup: identify responsibilities of personnel in the experiment and whether or not trial runs should be conducted
7. Analysis techniques: if possible, identify the most likely statistical methods that will be used to analyze the data produced from the experiment (such as regression, plots, ANOVA, ...)
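
As one way to picture step 4, the sketch below lists every combination of settings for a few factors of main interest and then randomizes the order of the runs. The factor names and levels are hypothetical, and running every combination (a "full factorial" layout) is just one possible design, assumed here for illustration.

```python
import itertools
import random

# Hypothetical factors and levels, invented for illustration.
factors = {
    "machine": ["A", "B"],
    "shift": ["day", "night"],
    "material": ["standard", "premium"],
}

# Every combination of factor levels: 2 x 2 x 2 = 8 runs.
runs = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Randomize the run order so that influences we cannot control
# are not systematically tied to any one setting.
rng = random.Random(1)
rng.shuffle(runs)

for i, run in enumerate(runs, start=1):
    print(i, run)
```

The number of runs multiplies quickly as factors and levels are added, which is one reason the restrictions in step 5 need to be listed before the design is fixed.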



SUPPLEMENTAL READING -- SAMPLING AND NON-SAMPLING ERROR IN SURVEYS

Error: difference between the measured/estimated value of a parameter and its accurate value
Sampling error: error which arises because a parameter is estimated with less than the full population of values; this error would not exist if the entire population were used
Random sampling: allows for the sampling error to be minimized, and the likelihood of the size of the error to be estimated; random samples provide unbiased sampling, that is, they give every member of the population the same chance of selection into the sample;
"Effective" random sampling: sampling methods that may not give every element a truly identical chance at selection but come so close as to give equivalent results
 Examples:
 Cluster Sampling -- grouping population elements and sampling within  selected groups;
 Representative Sampling -- selecting randomly within groups based on proportions of  known characteristics of the samples such as race, gender, income, political affiliation,  experience....; representative sampling may even improve on purely random samples which  suffer from non-response problems.
 Systematic Sampling -- sampling every nth person; effectively random if the elements of  the target population are mixed together very well.
Biased sampling: sampling which makes inclusion of some elements of the population more likely than other elements; such methods make estimation of the likely characteristics of the sampling error impossible
 Examples:
 Self-selected samples: sampling where the population members themselves determine  whether they will be included in the sample;
 Systematic sampling -- sampling every nth person but where the target population is not  mixed well;
Non-sampling error: any error which is present whether a sample is used or the entire population is used
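
The sampling schemes defined above can be sketched in Python. The population of 1,000 people, the 60/40 group split, and the sample size of 100 are all hypothetical values chosen here for illustration.

```python
import random

rng = random.Random(42)   # fixed seed so the run is repeatable

# Hypothetical population: 1,000 people, 60% in group "A" and 40% in group "B".
population = [{"id": i, "group": "A" if i < 600 else "B"} for i in range(1000)]

# Random sample: every member has the same chance of selection.
simple = rng.sample(population, 100)

# Systematic sample: every 10th person from a random starting point.
start = rng.randrange(10)
systematic = population[start::10]

# Representative sample: draw randomly within each group in proportion
# to its known share of the population.
representative = []
for group, share in (("A", 0.6), ("B", 0.4)):
    members = [p for p in population if p["group"] == group]
    representative += rng.sample(members, round(100 * share))

count_a = sum(p["group"] == "A" for p in simple)
print(count_a, len(systematic), len(representative))
```

The random sample will usually contain close to, but not exactly, 60 members of group "A"; the representative sample contains exactly 60 by construction.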

Sources of Error:
 Respondent-based: Inaccurate responses (lying; fatigue; embarrassment; memory ...)
 Researcher-based: Ambiguous or confusing wording/phrasing; poorly designed response  options; recording error (data entry; machine error); missing the target population ...
 Dealing with Non-sampling Error
 i)  Controlled studies of size/distribution of a particular error -- do the errors average to  zero or introduce bias?  Are errors arising because of bias?
 ii) Repeating questions with altered format of question or response
 iii) Assuring respondent anonymity
 iv) Drafting/redrafting questions/responses
 v) Memory prompts
 vi) Question ordering
 vii) Pilot studies: checking for wording problems; checking for fatigue; checking for         response options; .....


SUPPLEMENTAL READING--SIMULATIONS

Introduction:
Simulations are a growing tool used in both academic and business settings due to advances in the computational power of computers. Even through most of the 1980s, most simulations of any sophistication were conducted by academics, a few governmental agencies such as the U.S. Department of Defense, and very large businesses such as Bell Labs employing people with significant mathematical/statistical training. With various point-and-click software applications, complex and powerful simulations can now be conducted with relative ease.
Simulations, in general, are studies where a set of assumptions is combined with data to determine what outcomes would be found under those conditions. In many settings, simulations go by the name "what if" analysis because the investigator is considering what will happen if a set of hypothetical conditions or data hold true. In statistics, the predicted values from a regression are a type of simulation where the equation and coefficients are used with the data for the X-variables to generate the predicted (forecasted) values for Y. Forecasted values from a time series model are another example of simulated values. Cost estimates for a project based on identifying costs of similar projects or appraisal values for a house based on average values of similar houses are examples of very simple, simulated outcomes.
Simulations (other than just guesses) all share two parts

Simulated Value = Model & Data.

Simulated values are generated by assuming different "what if" scenarios for either the model, the data or both.

Simple Example
A simulation might start with a model as simple as the basic accounting definition for net worth which subtracts one thing from another and then generates a what if scenario by merely multiplying assets by 2 times their actual values:

Model: Net Worth = Assets - Liabilities

Data (to be used in the model): Liabilities = actual values
Assets = 2*actual values

In this simple example, the only simulated or hypothetical part is the asset data values we plug in since the actual values for liabilities are used, and the model is a basic definition and not a hypothetical relationship. Spreadsheet software such as Excel makes doing these and slightly more complex simulations relatively easy by permitting various columns of numbers to be combined together in formulas determined by the user as well as permitting users.
The complexity of a simulation increases as the model becomes more complex and more of the data and model are hypothetical. Still, no matter how sophisticated the simulation, the simulated outcomes are driven by a model (one or more equations) and data (real or hypothetical input provided by the user). Even simulations in the form of computer games -- such as Microsoft Flight Simulator -- that present pictures to the users are really just combinations of equations (model) and data.

Stochastic Simulations:
The example described above is more technically called a "deterministic" simulation because all of the numbers used are fixed at the outset, even the hypothetical values. A different class of simulations is one where the data or parameters can take on different values that, to some extent, are random. These kinds of simulations are called "stochastic" simulations or "Monte Carlo" simulations. They improve simulations by permitting the user to incorporate uncertainty more explicitly into the hypothetical scenarios.
In stochastic simulation, the idea is not to just say we don't know the future, therefore, let's just pick any number from 1 to 1 million at random. Instead, users assume that they can describe the likelihood of different outcomes but with some lingering uncertainty about the specifics. Therefore, the typical procedure is to pick some probability distribution, such as the normal distribution, that the user thinks describes the likelihood of outcomes, and then let the computer software generate hypothetical values at random that fit that probability distribution.

Example -- A Deterministic Simulation:
Suppose we are constructing a new house. To simplify matters, suppose we also know the final expense is driven by the size of the house (sq. ft. heated & cooled) and the quality of the house (Premium = 1; Standard = 0). We also know that jumps in lumber prices (% change from current date on 2x4 prices) affect the expense. Based on past experience and analysis, suppose we have the following relationship (MODEL):

Housing Expense = $10,000 + 80*(sq. ft) + 20*(Premium*sq. ft) + 20,000*(% Change Lumber)

Data: we could plug in values for (sq. ft.), (premium), and (lumber) to simulate an outcome. If we use 3000 sq. ft., Premium = 1, and Lumber = 1%, the simulated expense would be
$330,000 = 10,000 + 80*(3000) + 20*(1*3000) + 20,000*(1).

This is just like the predicted values we have generated with regression analysis and is another example of a "deterministic" simulation where the hypothetical parts of the simulation are fixed in advance.
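
The deterministic housing example can be written as a small Python function. The coefficients are those of the model above; the function and argument names are chosen here for illustration.

```python
def housing_expense(sq_ft, premium, lumber_pct_change):
    """Housing expense model from the text; lumber change is in percent."""
    return 10_000 + 80 * sq_ft + 20 * premium * sq_ft + 20_000 * lumber_pct_change

# The scenario worked out above: 3000 sq. ft., premium, lumber up 1%.
print(housing_expense(3000, 1, 1))

# A different "what if": the same house built to the standard quality level.
print(housing_expense(3000, 0, 1))
```

Each call is one "what if" scenario: the same inputs always give the same output, which is what makes the simulation deterministic.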

Example-- Simulation with Stochastic Data Values
Now suppose everything else about the housing expense model and the data are the same, but we are not really sure about the changes in lumber prices. Our best guess is that the average change will be zero, but they have often varied 1 or 2 percent up or down and occasionally a lot more. Rather than just plugging in a number such as 1% as we did above, we decide to conduct a stochastic simulation where we generate 100 different housing expenses for a 3000 sq. ft. premium house but assume that lumber price changes for these 100 cases are drawn from a normal distribution with a mean of 0% and a standard deviation of 2%, designated as Normal(0, 2). So now the setup is

Housing Expense = $10,000 + 80*(sq. ft=3000) + 20*(Premium*sq. ft=1*3000) +
20,000*(% change lumber prices: Normal(0, 2))

After the software computes housing expenses for these 100 cases, we can examine them and find out what was the average housing expense, what was the highest and lowest expense, and what was the typical range of expense. This kind of simulation provides much more information on which we could base our decisions.
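
A minimal Python sketch of this stochastic simulation follows. The model and the Normal(0, 2) assumption come from the text; the seed and the variable names are choices made here so that the run is repeatable.

```python
import random
import statistics

def housing_expense(sq_ft, premium, lumber_pct_change):
    """Housing expense model from the text; lumber change is in percent."""
    return 10_000 + 80 * sq_ft + 20 * premium * sq_ft + 20_000 * lumber_pct_change

rng = random.Random(0)   # fixed seed so the same 100 cases are drawn each run

# 100 cases for a 3000 sq. ft. premium house; only the lumber change is random.
expenses = [housing_expense(3000, 1, rng.gauss(0, 2)) for _ in range(100)]

print("average:", round(statistics.mean(expenses)))
print("lowest: ", round(min(expenses)))
print("highest:", round(max(expenses)))
print("std dev:", round(statistics.stdev(expenses)))
```

With a lumber change of zero the expense is $310,000, so the simulated average should land near that figure, with individual cases spread around it by roughly $40,000 (20,000 times the 2% standard deviation).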

Example -- Simulation with Stochastic Data Values & Model Parameters (coefficients)
To incorporate reality a little better, we could also assume that the coefficients (also called the parameters) in the model (80, 20, 20000) are, themselves, just estimates. The exact relationships are not known with certainty and can change to some extent. Suppose we think that the coefficient for sq. ft. is 80 on average with a standard deviation of 5, the premium coefficient is 20 on average with a standard deviation of 2, and the coefficient on lumber prices is 20,000 on average with a standard deviation of 1000. Now our simulation setup with a 3000 sq. ft. premium house is

Housing Expense = $10,000 + Normal(80, 5)*(Sq. ft. = 3000)
+ Normal(20, 2)*(premium*sq. ft. = 3000*1)
+ Normal(20000, 1000)*(%change lumber prices: Normal (0, 2))

In this case, the parameters (coefficients) in the model are generated by drawing numbers from normal distributions, as are the data values for lumber price changes. We could again generate, say, 100 cases and find the average housing expense, the highest and lowest expense, and the typical range of expenses.
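
The setup with random coefficients can be sketched the same way. The means and standard deviations are those given above; drawing each coefficient independently of the others is an assumption made here, as are the seed and the names.

```python
import random
import statistics

def simulate_once(rng, sq_ft=3000, premium=1):
    """One simulated expense with random coefficients and a random lumber change."""
    b_sqft = rng.gauss(80, 5)               # coefficient on sq. ft.
    b_premium = rng.gauss(20, 2)            # coefficient on premium * sq. ft.
    b_lumber = rng.gauss(20_000, 1_000)     # coefficient on lumber price change
    lumber_pct_change = rng.gauss(0, 2)     # data value, in percent
    return (10_000 + b_sqft * sq_ft + b_premium * premium * sq_ft
            + b_lumber * lumber_pct_change)

rng = random.Random(0)   # fixed seed so the run is repeatable
results = [simulate_once(rng) for _ in range(100)]

print("average:", round(statistics.mean(results)))
print("lowest: ", round(min(results)))
print("highest:", round(max(results)))
print("std dev:", round(statistics.stdev(results)))
```

Compared with the case where only lumber prices vary, the spread of outcomes should be somewhat wider because uncertainty about the coefficients is added on top of uncertainty about the data.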





Week 1 Assignment 
Accessing Excel and SPSS -- Using Klibanoff Data Files

PART I
1.  Access Excel.  (Either on your own PC or by logging on to WKUnet and double-clicking the Excel icon on the standard student network desktop.)
2.  Retrieve the file by clicking on this link: s&p1980a.xls   The file contains monthly values for the Standard & Poor's 500 Index from 1980 through March 2002.  *** Note that you can open s&p1980a.xls into Excel by clicking on the link.  If the data analysis features of Excel will not work, save the file to a disk, close Excel, reopen Excel from the desktop icon, and then open the file into Excel. ***
3.  Create a new variable, labeled PCT, that will be percentage changes in the monthly S&P 500 Index.
(To do this, type PCT in Cell D1.  Click the cursor on Cell D3 and type the following formula to compute the percent change in the S&P 500 Index -- include the = sign and parentheses
 = ((c3 - c2)/c2)*100
Now, hit enter.  Go back and highlight Cell D3 again, right click the mouse, and click "Copy".  Click on Cell D4 but keep holding down the left mouse button and drag the cursor until Column D is highlighted all the way down to the bottom of the data. Release the left mouse button and right click the mouse and select "Paste."  You should now have a column of numbers showing monthly percentage changes.)
3.a. Compute descriptive measures for PCT. ( Plug in the appropriate column  into the "Input Range," click "Labels", and click "Summary Statistics.")
b. Now, go back to the data worksheet and calculate standardized values for PCT.  To do this with Excel, you will need the mean and standard deviation from the Summary Statistics output.  Then, on the original worksheet, type a label (such as std. values) at the top of an empty column.  Highlight the cell in row 3 under the title and type the formula = (D3 - mean)/(standard deviation) where the mean and standard deviation are the numbers from your Summary Statistics output.
c.  Print the output (after you resize the output columns to fit properly and reformat the output column to present only 1 decimal place).
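
For anyone who wants to check the Excel results another way, the percent-change and standardized-value calculations can be sketched in Python. The index values below are hypothetical stand-ins for column C, not the actual S&P 500 data, and the sketch assumes the standard deviation in the Summary Statistics output is the sample standard deviation.

```python
import statistics

# Hypothetical index values standing in for column C of the worksheet.
index = [100.0, 102.0, 99.96, 104.958]

# Percent change, the same calculation as the formula =((C3-C2)/C2)*100.
pct = [(index[i] - index[i - 1]) / index[i - 1] * 100 for i in range(1, len(index))]

mean = statistics.mean(pct)
sd = statistics.stdev(pct)   # sample standard deviation (assumed to match Excel)

# Standardized value: distance from the mean measured in standard deviations.
standardized = [(p - mean) / sd for p in pct]

for p, z in zip(pct, standardized):
    print(round(p, 1), round(z, 1))
```

Standardized values always average to zero, which is a quick check that the formula was copied down the column correctly.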


1.  Access SPSS  (For help, see SPSS Help)
2.  Retrieve the file Employee data.sav ( look under c: program files/ spss/employee data.sav).  It provides data on 474 employees of a particular company.  If you have trouble finding the data, just retrieve employeedata.xls and open it into SPSS -- remember to tell SPSS that it is an Excel file.
(Note:  you can see descriptions of the variables by pointing at the variable labels in the top row or by clicking on "Variable View" at the bottom of the spreadsheet).
3.  Use Transform>Compute to create a new variable (Totexp) that is the sum of Jobtime and Prevexp.
4.  Generate a histogram, boxplot, and stem-and-leaf plot for Current Salary (Salary) and Total Experience (Totexp) by clicking Analyze>Descriptive Statistics> Explore.  Place Salary and Totexp in the "Dependent List" box.  Under the "Display" selections, choose "Plots".  Next, click on the "Plots" button (between Statistics and Options) and select "Histogram" along with Boxplot and stem-and-leaf, which should already be selected.
5.  In the output window, change the title from  Explore to  Graphs for Salary and Experience and print your output.


PART II
Depending on the first letter of your surname, go to the website of the statistical software vendor listed.  Briefly describe at least 1 product offered and at least 1 example of an application of one of the products.

A-D    SPSS (SPSS, Inc., www.spss.com)
E-H    SAS (SAS Institute, Inc., www.sas.com)
I-M    STATA (Stata Corporation, www.stata.com)
T-V    EVIEWS (Quantitative Micro Software, www.eviews.com).


Week 2 Assignment 
Understanding Descriptive Statistics & Simple Tests

1.  In Excel, open the file (restaurantstocks.xls -- Chapter 2) on the data CD with the Klibanoff book.  The file contains monthly stock returns (1984-1994) for five restaurant stocks (measured as % above or below the Treasury Bill interest rate).  Follow the instructions for #6 on page 62 to complete parts a-f.  Instead of doing all 5 stocks, pick just two.  (On part (a), go ahead and compute all descriptive statistics; name the output table and format it for easy reading).  Print your output.  Write (print) answers to the questions next to your output.

2. Open the same file into SPSS (remember, change file type to .xls).  (Click "Variable View" and change the number of decimal places to 2). 
    a. Create a variable (Split) equal to 0 for the first 66 observations and equal to 1 for observations 67-132. 
    b. Compute descriptive statistics for Dairy Queen and Wendy's for the first half and second half of the observations using Analyze>Descriptive Statistics>Explore.  Put the new variable in the "Factor List" box.  Select the button for "Statistics" (only).
    c.  Test whether the means for the first and second half of the observations for DQ and Wendy's are the same using Analyze>Compare Means>Independent Samples t-test.  Use the split variable as the "Grouping Variable".
    d.  Print the output. 
            d1. On your output, write down the difference in means for both stocks between the first and second parts of the time period.
            d2. On your output, write down the statistic that tests whether the first half and second half are equal.  What does it show?
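As a rough illustration of what the independent samples t-test computes, here is a minimal Python sketch of the equal-variance (pooled) two-sample t statistic. The two short samples are made-up numbers, not the restaurant stock returns, and the function name is illustrative. SPSS also reports a version that does not assume equal variances, which this sketch does not cover.

```python
import math
import statistics

def pooled_t_statistic(sample_a, sample_b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return mean_diff / std_error, n_a + n_b - 2

# Hypothetical returns for the two halves of a sample (placeholders)
first_half = [1.0, 2.0, 3.0, 4.0, 5.0]
second_half = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, df = pooled_t_statistic(first_half, second_half)
```

The statistic is the difference in means divided by its standard error, so a value near zero suggests little evidence that the two halves differ.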


Week 3 Assignment
Making Application of Basic Statistical Concepts

1.  Open Excel and compute probabilities for the following situations:
a.  Suppose that accounting errors due to erroneous data entry have a 1% (0.01)  probability on any given entry.  Also, suppose that  data entry errors are a binomial variable (error or no error).  Compute the likelihood of zero errors given 150 entries.
(In Excel, highlight cell A1, click the fx icon on tool bar; select “Function Category” = statistical; “Function name” = BINOMDIST;  fill in the following values in the blanks. Number_s = 0, Trials = 150, Probability_s = 0.01, Cumulative = True.)

b. Suppose that the average amount of time that customers are kept on hold by AOL customer support is 8 minutes with a standard deviation of 1.5 minutes.  If wait time is normally distributed, calculate the probability of waiting less than 10 minutes.
(In Excel, highlight cell A3, click the fx icon on tool bar; select “Function Category” = statistical; “Function name” = NORMDIST;  fill in the following values in the blanks. X = 10, Mean = 8, Standard_Dev = 1.5, Cumulative = True.)

c.  Print the Excel Spreadsheet. 
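The two Excel calculations in parts (a) and (b) can be cross-checked with a short Python sketch using only the standard library. The helper names binomial_cdf and normal_cdf are illustrative; they mirror what BINOMDIST and NORMDIST return when Cumulative = True.

```python
import math

def binomial_cdf(k, n, p):
    """P(X <= k) for a binomial variable with n trials and success probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(X < x) for a normal variable, computed from the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Part (a): zero errors in 150 entries with a 1% error probability
p_zero_errors = binomial_cdf(0, 150, 0.01)

# Part (b): waiting less than 10 minutes when mean = 8 and sd = 1.5
p_wait_under_10 = normal_cdf(10, 8, 1.5)
```

With these inputs the first probability is roughly 0.22 and the second roughly 0.91, which should agree with the spreadsheet to rounding.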

2.  Open Hotchoc.sav into SPSS. 
a.  Test whether the mean of the afternoon samples is equal to 140
b.  Test whether the mean of the afternoon samples is equal to the mean of the morning samples
c.  Compute the standard deviation of each afternoon sample as we did in class for the morning samples
d.  Produce  X-Bar and std. dev. charts for the afternoon temperature samples
e.  Print the output.  On your output, provide an answer to (a) and (b) and reasons for these answers.  Also, assess whether the afternoon temp samples are "in control."  List the steps you considered in arriving at your decision. 


Week 4 Assignment
(This assignment should be typed, single-spaced and no more than 3 pages in length).

Read the following article in the 2006 Q4 issue of the Philadelphia Federal Reserve's Business Review.  Phil. FR--Housing Boom or Bubble

1. Outline the article's main points
2.  What audience level does the author seem to be writing for? (Provide some evidence of this)
3. List 6 things that the author does well with respect to the writing guidelines in the reading.
4. Explain features of the Graphs and Tables provided by the author that make them easy (or difficult) to read and follow in the text.
5. Are there examples of where the author does a poor job of following the guidelines from the reading?



Week 5 Assignment 
Objective:  Practice with scatterplots, correlation analysis, and regression analysis
1.  Open the file airfares.xls into Excel.  The file includes several variables regarding roundtrip air fares to several major cities based on a sample of 21 cities collected on a particular day for morning weekday departures with Saturday night stay-over.  The variables in the file are City, Fare (in $), Distance (to destination city in miles), Direct SWA (=1 if Southwest Airlines flies the route directly), and Fare per Mile (Fare divided by Distance).
2.  Generate a scatterplot between Fare and Distance.  Click on the chart icon button, select XY Scatter for chart type, and then use the chart wizard to generate a scatterplot that has distance on the x-axis and fare on the y-axis.
    a) Print this scatterplot. 
    b) Draw a regression line in by hand that would best fit through the middle of the data;
    c) (Extra:  see if you can write out the equation (Fare = intercept + slope*Distance) that would go with your regression line -- this does not have to be exact.)
3. Go back to the data worksheet and click Tools>Data Analysis>Regression.  Use the pop-up window to generate a regression analysis with Fare as the dependent variable (y-variable) and Distance as the independent variable (x-variable).  Remember to click the "Labels" box and also click the "Residuals" box.
    a) Print the regression output (after formatting the table and cells to reduce decimal places and make the columns appear properly)
    b) To the side of the output, write out the regression output in equation form.  Does this equation match yours from (2c)?
    c) Calculate the first residual by hand.  Use the Excel-generated residuals to check your calculation (allowing for rounding differences)
    d)  What does a 1 mile increase in distance predict for Fare?  What about a 100 mile increase? What would Fare be if distance were 0?  Why is this really only a "hypothetical" value?
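To make the link between the regression equation and the residuals concrete, here is a minimal Python sketch of a one-variable least-squares fit done by hand. The distances and fares are made-up placeholders, not the airfares.xls data, and simple_ols is an illustrative name.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical sample (placeholders): distance in miles, fare in dollars
distance = [100.0, 200.0, 300.0, 400.0]
fare = [75.0, 85.0, 115.0, 125.0]

intercept, slope = simple_ols(distance, fare)

# Residual = actual fare minus the fare predicted by the equation
residuals = [f - (intercept + slope * d) for d, f in zip(distance, fare)]
```

For these placeholder numbers the fitted equation is Fare = 55 + 0.18*Distance, so the first residual is 75 - (55 + 0.18*100) = 2. The slope is the predicted change in fare per extra mile, and the intercept is the predicted fare at a distance of zero.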



Week 6 Assignment
Objective:  Practice with binary variables

Using the pizza.xls file on the Klibanoff CD, do the following questions on page 125:
1-a,b,e
2-a,b,c,e
4-a,b,c,d


Week 7 Assignment 
Objective:  Practice with non-linear regression

1.  Using the electricitycosts.xls file, complete #4 on pages 147-148.  (Make sure that as part of this, you estimate a quadratic model.)

2. Compute the (natural) logarithms of the cost variable.  Then estimate a semi-log regression (that is, use the log of costs as the dependent variable). 

3.  Compare the quadratic model results with the semi-log model results: 
    a.  How accurate are the predictions of one versus the other?
    b.  Roughly plot regression lines/curves for both models (either manually or using the software)
    c.  If their predictive performance differs, why might one perform better than the other?






Week 9 Assignment
Objective: Examining regression relationships with a qualitative dependent variable

1.  Open the SPSS file HOWNER

Howner contains the following variables
Howner = 1 if own home and 0 if not;
Income = income per person in household in thousands of dollars;

2) Generate linear regression estimates (OLS) using Howner as the dependent variable and Income as the independent variable.  (Save the predicted values by clicking Save and selecting “Unstandardized predicted values”).
3) Generate logistic regression estimates  of the same model (ANALYZE/REGRESSION/BINARY LOGISTIC) and fill in the boxes.  (Note that “covariates” are the same thing as independent variables.)  Again, save predicted values by using SAVE and then check “probabilities.”
4) Generate a scatter diagram (use Graphs/Scatter/Overlay/Define)  that shows three Y-X pairs (howner-income; linear predicted values-income; logistic predicted values-income)
5) Using Transform>Recode into Different Variable, generate a categorical variable from your income variable, so that anyone with an income above $18 (thousand) is in category A (high income), anyone from $10 to $18 (thousand) is in category B (medium income), and anyone below $10 (thousand) is in category C (low income).
6) Conduct a crosstabulation (Analyze>Descriptive Statistics>Crosstabs) using income category as the row variable and home owner as the column variable.  Under "Statistics" select "Chi-Square" and under "Cells" select "Expected" along with "Observed."

7) Print your output. 
8) On the back of your output
 a.  Explain the meaning of the slope coefficient from the linear regression;
 b.  Explain any problem arising from the use of linear regression in this case;
 c.  Explain how logistic regression helps solve this problem.  (This need not be fancy; just draw arrows to key parts of the graph from your explanations above, below, or beside.)
 d.  Explain the meaning of the "percent correct" statistics generated by the logistic regression
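The contrast between the two sets of saved predicted values can be sketched in a few lines of Python. All four coefficients below are hypothetical numbers chosen for illustration, not estimates from the Howner file, and the function names are illustrative.

```python
import math

def linear_prediction(income, intercept, slope):
    """Predicted value from a straight-line (OLS) equation."""
    return intercept + slope * income

def logistic_prediction(income, intercept, slope):
    """Predicted probability from a logistic equation."""
    return 1 / (1 + math.exp(-(intercept + slope * income)))

# Hypothetical coefficients (assumptions for illustration only)
OLS_INTERCEPT, OLS_SLOPE = 0.10, 0.04
LOGIT_INTERCEPT, LOGIT_SLOPE = -2.0, 0.2

incomes = [0, 10, 20, 30, 40]  # income per person, in thousands of dollars
ols_preds = [linear_prediction(x, OLS_INTERCEPT, OLS_SLOPE) for x in incomes]
logit_preds = [logistic_prediction(x, LOGIT_INTERCEPT, LOGIT_SLOPE) for x in incomes]
```

With these made-up coefficients the straight line predicts values above 1 at the higher incomes, while the logistic curve stays strictly between 0 and 1 at every income level.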


Week 11 Assignment
Objective: Understanding and explaining time series regression and forecasting results

A regression study of (monthly) checking account balances was estimated based on data from January 1990 through December 2005.  The variables included in the analysis were:

Balances = end of month checking account balances in millions of dollars;
Income  = change in national GDP from prior month in percent at annual rates (monthly figures interpolated from quarterly figures);
Trend = trend variable starting at 1 in January 1990 and increasing by 1 unit each month;
Jan = 1 if January and 0 otherwise;
Oct = 1 if October and 0 otherwise;
Nov = 1 if November and 0 otherwise;

Variable        Coefficient     Std. Error      p-value
Constant        30.0            10.0            3.0
Income          0.03            0.001           0.001
Balances(-1)    0.7             0.1             0.001
Trend           0.30            0.1             0.01
Jan             -19.0           2.0             0.01
Oct             13.0            1.0             0.01
Nov             14.0            1.0             0.01
R-square = 0.92;      Durbin-Watson = 1.84;      Box-Pierce Q(12 lags) = 10.2 (p-value = 0.65)
Mean Balances  = $150 million;       Standard deviation of Balances = $5 million

Forecast Statistics Based on 1990-2005 model forecasts for 2006:
Diagnostic                       Static Forecast      Dynamic Forecast
Root Mean Square Error           $4.3 million         $6 million
Mean Absolute Error              $3.5 million         $5.7 million
Mean Absolute Percent Error      1.7%                 3.4%
                                                 

Answer these questions about the results:

1.  How much would monthly balances increase by if income grew at its long run average of 3% for 12 months in a row?
2.  What is meant by the term Balances(-1)?
3.  Draw a time plot (time on X-axis, Balances on Y axis) that depicts the relationships implied by the coefficients on Balances(-1) and Trend.
4.  Provide a quantitative estimate of how well this regression model  explains monthly balances.
5.  Explain the meaning of the coefficients on January, October, and November.
6.  Explain, in words understandable to someone who doesn't know statistics, what the 4.3, 3.5, and 1.7 mean for the static forecast diagnostics.
7.  Using either a graph, a timeline, or an equation, simply describe the difference between how the static and dynamic forecasts are produced.
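As background for the forecast diagnostics, here is a minimal Python sketch of how the three error measures are computed and of one common way static and dynamic forecasts are generated from a model with a lagged dependent variable. The model here is deliberately simplified to a constant plus a lagged term, and the constant, the lag coefficient, and the balances are all made-up numbers, not values implied by the table above.

```python
import math

def static_forecasts(actual, const, lag_coef):
    """One step ahead: each forecast uses the ACTUAL value from the prior period."""
    return [const + lag_coef * actual[t - 1] for t in range(1, len(actual))]

def dynamic_forecasts(actual, const, lag_coef):
    """Multi-step: after the first step, each forecast uses the prior FORECAST."""
    forecasts = []
    previous = actual[0]  # only the starting value is taken from the data
    for _ in range(1, len(actual)):
        previous = const + lag_coef * previous
        forecasts.append(previous)
    return forecasts

def forecast_errors(actual, forecast):
    """Return (RMSE, MAE, MAPE in percent) for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return rmse, mae, mape

# Hypothetical balances in millions of dollars (placeholders)
balances = [150.0, 152.0, 149.0, 153.0, 151.0]
static = static_forecasts(balances, const=45.0, lag_coef=0.7)
dynamic = dynamic_forecasts(balances, const=45.0, lag_coef=0.7)
static_stats = forecast_errors(balances[1:], static)
dynamic_stats = forecast_errors(balances[1:], dynamic)
```

The first forecast is identical under both methods because both start from the last observed balance; they diverge from the second period on.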


Week 12 Assignment
Objective:  Reporting regression results in a business context

Based on the information provided in the file (Commercial Real Estate Analysis), write a report for management of Company Y.  The report should consist of two parts:

1)   a 1-page summary for top managers that addresses the key questions.

2)   a technical summary (3 page maximum) to explain in more detail the nature of the results.  

The technical summary should be written to explain key aspects of the data analysis  for individuals who have some knowledge of statistics and regression analysis.  If you desire, you may include a hand-drawn figure as part of your summary.   






Week 13 Assignment 
Objective: Developing skill in experimental design

Create and explain plans for generating data in a (semi) controlled experimental situation.  Choose a business or personal application of interest to you.  Structure your report as follows:

1.  Objective: (Describe the objective(s) of the experiment)
2.  Background: (Give some background – why is this important; what are the known relationships between factors and responses; how)
3.  Response Variable: (Specifically, describe the response variable and how it will be measured; be clear about the “experimental units” – who or what the response variable is being measured on, such as people, cars, machines, ....)
4.  Factors on Response Variable:
Factor(s) of Main Interest: (provide information about the variable(s) that the experiment is designed to isolate; describe the specifics of its measurement )
Other Factors to be Actively Controlled: (list other variables that influence the response variable which you will be able to manipulate in the design of the experiment below and, specifically, how each will be measured.  If one or more will be held constant, provide information on tolerable variation.  Also, list potential or suspected interactions between any one of these factors and the factor of main interest.)
Other Factors not Actively Controlled: (provide information on other variables likely to influence the response variable but that cannot be manipulated in the experiment; list those that can be observed, measured and recorded as well as those that cannot be observed or measured.)
5.  Restrictions: (Identify major restrictions on your experiment such as time limits, budget constraints, or other practical limitations on the design.)
6.  Design of Experiment:
 Briefly explain your preferred design and your reasons for choosing the design (such as holding factors constant, randomizing factors, using blocks, matched pairs, or multi-way designs).  Remember, the design centers on and explains how you intend to achieve your objective of better isolating the effects of the factor of main interest on the response variable.  Also, describe how your design handles possible interactions of the factor of main interest with other factors.
7.  Analysis of Results: (Briefly propose how the data from your experiment might be analyzed. Will you use ANOVA, regression, or other methods?  Briefly describe how you will do this)

You will be graded on clarity of your report, how well you identify and explain measurement of the factors, and the degree to which you are able to explain and defend your design and how it meets your stated objective.  Your report should be 3 double-spaced  pages or less.   Any graphic included (table or chart) does not count toward this limit.



Classroom Experimental Design Example (abbreviated) – Ice Cream Production
(This template is provided to help illustrate what the assignment involves.  It is not complete.  It is only intended to get you going in the right direction.)

Objective
We have designed an experiment to determine whether the pre-freezing storage of ice cream influences the taste quality of the final product ...

Background
Ice cream is produced by both commercial enterprises and households.  In either setting, a large number of factors can influence the quality of the product including ingredients, methods of freezing,  ...

Our client is interested in whether storing the ice cream liquid at cold temperatures for different lengths of time influences the taste quality.   In general, ice crystal size is related to the taste quality of ice cream ...

Response Variable
The response or dependent variable in this study is the quality of ice cream produced.  Individuals are the unit of measurement.  Each individual in the experiment will give their rating of the ice cream on a scale of 1 (poor) to ...

Factors on Response Variable
Factor of Main Interest:
 Pre-storage: Once mixed, the pre-freezing storage time for the ice cream liquid can range from 0 up to several days.  It is limited only by the spoilage of ingredients...  Three settings will be used for pre-storage ...

Other Factors to be Actively Controlled:
 Individual Preferences: Because individuals are the experimental unit, their subjective preferences for ice cream vary ... This influence will be filtered out using design strategies discussed in the next section.
Freezing Method: Several different commercial and household freezing techniques are in use.  Because we are focusing on household application, we focus on these.   Three main methods are in use at the household level ...
 Recipe: Recipes and their ingredients vary considerably and influence taste.  Recipes vary from cooked to uncooked, by different milk products (cream, condensed milk, evaporated milk) ....
 Ambient Temperature: The room temperature in which the freezing is done can influence ice crystal formation.  Therefore, we freeze all samples at 72 (F).  Fluctuations of one or two degrees around 72 are likely ...

Other Factors Not Actively Controlled:
Ingredients: The freshness of ingredients such as cream, vanilla, ... may influence taste quality.  For cream, we will record the number of days before shelf expirations.  For vanilla, we will record a qualitative measure ...

Restrictions
 This experiment will be run on a budget of $100.  This permits about 50 batches of ice cream to be produced, so we will be limited to about 50 individuals in our study.  Individuals will be volunteers ...

Design of Experiment
In order to isolate the effects of Pre-Storage length on taste quality, we developed the following design:

Because Pre-storage settings (Zero, Two, Six) may interact with Freezing method settings (HC, MCI, MCCO), we use a 2-way design between these factors permitting all combinations (see Chart 1) ... The 54 individuals will then be randomly assigned to one of the nine combinations of Pre-Storage and Freezing methods.  The goal of randomizing individuals is to filter out the effects of individual preferences. In choosing this strategy, we assume that individual tastes vary randomly within the population.... We will hold constant Recipe and Ambient Temperature as noted above....
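One way to carry out the random assignment described above is sketched below in Python. The equal split of 6 individuals per combination, the participant labels, and the seed value are assumptions made for illustration; the template itself does not specify them.

```python
import random
from collections import Counter
from itertools import product

PRE_STORAGE = ["Zero", "Two", "Six"]
FREEZING_METHOD = ["HC", "MCI", "MCCO"]

def assign_balanced(participants, treatments, seed=None):
    """Shuffle participants, then deal them out evenly across treatments."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {person: treatments[i % len(treatments)]
            for i, person in enumerate(shuffled)}

# All nine Pre-Storage x Freezing Method combinations of the 2-way design
treatments = list(product(PRE_STORAGE, FREEZING_METHOD))

# Hypothetical participant labels P01 ... P54
participants = [f"P{i:02d}" for i in range(1, 55)]

assignment = assign_balanced(participants, treatments, seed=540)
cell_counts = Counter(assignment.values())  # 6 individuals per combination
```

Fixing the seed makes the assignment reproducible, which helps when the design has to be documented in the report.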

Data Analysis
The data derived from the experiment will be analyzed using regression analysis.  The regression will estimate an equation with taste rating as the dependent variable, pre-storage as an independent variable, freezing method as an independent variable, and the interaction of pre-storage and freezing method (the product of the two) as a variable...





Week 14 Assignment 
Objective: Generating and Using Factor and Cluster Analyses

1.  Open SPSS and retrieve jobratings.sav  (you may need to save the file to a memory device and then open it into SPSS).  This is a file put together by a human resource consulting company containing information on 389 employees regarding their jobs.  Each person’s job is rated by several variables such as the knowledge-education required.  Labels are provided for each variable.  The sum of these variables was used by the consultant to evaluate job importance-difficulty and relate it to salary.  Your job is to generate evidence so that these job rating variables can be evaluated.  The questions management would like answered are,

  i) are these job rating variables really capturing distinct characteristics of jobs or are they merely measuring similar attributes in a superficially complicated way?
 ii) can employees (as far as these ratings go) be lumped into a small number of groups, and if so, how many different groups are represented?

2.  Generate a Factor Analysis and a Cluster Analysis as evidence to answer these questions.
Use  (Analyze>Data Reduction>Factor) and make the following selections:

         Descriptives (select  "univariate descriptives" and “coefficients” under “correlation matrix”)
         Extraction (maximum likelihood)
         Scores (select “save factor scores matrix”)

For the Cluster Analysis use (Analyze>Classify>Two-Step Cluster) and make the following selections:
       Output (select "Create Cluster Membership Variable")

3.  Print the output.  On the back of the output or a separate sheet, briefly answer questions i) and ii) by making use of specific results from the factor analysis and cluster analysis.

[Remember, you can use SPSS Help as well as the “Results Coach” for assistance.]




Assignment
Objective: To gain “hands on” experience in producing stochastic simulations with statistical software and to gain familiarity with availability and features of statistical software.

PART I
1.  Design a simulation for a variable of your choosing.
    Write down a model (an equation or set of equations that determine outcomes of the variable). 
    Determine how the data values for the variables in the equation(s) are to be determined. 
    Make sure that at least one part of either the model or the data is stochastic.   
2.  Using Excel or SPSS, generate simulated outcomes for at least 50 observations for the simulated variable.  Generate descriptive statistics and a histogram for the simulated variable.
3.  At the bottom of your output,  NEATLY write down the equations making up your model and identify the data or part of the model for which you generated stochastic values.
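For orientation, here is a minimal Python sketch of the kind of simulation Part I describes. The assignment itself calls for Excel or SPSS, so this is only an illustration of the logic. The profit model, every parameter value, and the function names are hypothetical choices, not part of the assignment.

```python
import random
import statistics

def simulate_monthly_profit(n_obs=50, seed=None):
    """Simulate profit = (price - unit cost) * demand - fixed cost.

    Demand is the stochastic part: a normal draw with mean 500 and sd 80.
    Price, unit cost, and fixed cost are held constant (all hypothetical).
    """
    rng = random.Random(seed)
    price, unit_cost, fixed_cost = 12.0, 7.0, 2000.0
    profits = []
    for _ in range(n_obs):
        demand = rng.gauss(500, 80)
        profits.append((price - unit_cost) * demand - fixed_cost)
    return profits

def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # top value goes in last bin
        counts[index] += 1
    return counts

profits = simulate_monthly_profit(n_obs=50, seed=1)
summary = {
    "mean": statistics.mean(profits),
    "stdev": statistics.stdev(profits),
    "min": min(profits),
    "max": max(profits),
}
counts = histogram_counts(profits)
```

The model equation, the fixed inputs, and the single random input are each stated separately, which matches what step 3 asks you to write at the bottom of your output.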