OTHER LINKS
American Statistical Association / Journal of Statistics Education / Fed Stats / Teaching Applets / http://www.statsoftinc.com/textbook/stathome.html / Directory of Online Statistics Sources
Dr. Brian Goff/414 Grise Hall
Phone (502)745-3855/brian.goff@wku.edu
Last Modified: January 12, 2007
Western Kentucky University
CONTACT INFORMATION
414 Grise Hall/ 745-3855
email: brian.goff@wku.edu Office Hours: 8-9, 10-11 MWF; 2-5 W
(I am in my office or on campus most days from around 8-4 except around
noon)
TEXTS/SOFTWARE
Managerial Statistics: A Case-Based Approach by Klibanoff and
others (Thomson-Southwestern)
Excel & SPSS (both on WKUnet or personal copy)
GRADING
Reading Quizzes 30%
Weekly Assignments 40%
Final Exam 20%
Assignments turned in after the beginning of the designated class period will receive a reduction of 10% per business day. Quizzes will be in-class multiple-choice questions over the assigned readings. "Mini" assignments are short in-class or out-of-class projects on which you will receive either full or no credit, with one "grace." I will discuss the final exam later in the course.
MISCELLANEOUS
The last day to drop the course with a "W" or change from credit to audit is listed on Topnet. Any students requiring special consideration under the provisions of the ADA should register with the ADA Compliance Office. If you are not fluent in English or are weak in your writing abilities, you should use a writing "consultant" to examine your written reports before turning them in. The WKU Writing Center is one option. Undergraduate students willing to offer tutorial services (for a fee or free) are another.
ATTENDANCE/MISSED ASSIGNMENTS
The combination of students with job responsibilities and a course that meets only once per week can present problems. The policy below attempts to strike a balance between accommodation and legitimate standards. Under special circumstances, and with my approval of the reason obtained in advance, one assignment may be missed and the other assignments and the final weighted more heavily. Failure to complete more than two assignments, or absence from more than two full sessions (or the equivalent), will result in a student being dropped from the course regardless of the reason.
COURSE OBJECTIVES
The goal is to enhance your ability to envision, design, conduct, and report data-oriented or technical projects with business-related applications. The course is concerned with data analysis, but it also emphasizes building your ability to solve problems with data analysis and to communicate the results.
Week 1: Getting Started -- Data, Software, & Brain-Power
Week 1 Assignment
Week 2: Data Description & Basic Tests (Reading Quiz = Chapters 1-2; Statsoft -- Elementary Concepts)
Week 2 Assignment
Week 3: Using Basic Data Tools (Reading Quiz = Statsoft -- Data Mining; Statsoft -- Quality Control Charts)
Week 3 Assignment
Week 4: Writing Data-Oriented Reports (Reading Quiz = UN -- Making Data Meaningful; Brief PowerPoint on Writing)
Week 4 Assignment
(In Class: St. Louis Fed Review 2006)
Week 5: Basic Regression (Reading Quiz = Chapters 3-4)
Week 5 Assignment
Week 6: Regression with Slope and Dummy Variables (Reading Quiz = Chapter 5)
Week 6 Assignment
Week 7: Regression with Non-Linear Applications (Reading = Chapters 6, 8)
Week 7 Assignment
Week 8: Spring Break -- No Class
Week 9: Regression and Other Methods for Limited Dependent Variables (LPM & Logistic Regression PowerPoint)
Week 9 Assignment
Week 10: General Modelling Issues with Regression (Reading Quiz = Chapters 7-8)
Week 10 (No Assignment)
Week 11: Time-Oriented Regression Applications (Reading Quiz = Chapter 9; Reading Supplement)
Week 11 Assignment
Week 12: Writing a Regression-Oriented Report
Week 12 Assignment
Week 13: Experimental Design; Surveys (Reading Quiz = Statsoft -- Experimental Design; Ex Design Supplement; Surveys Supplement)
Week 13 Assignment
Week 14: Simulations; Factor Analysis (Reading Quiz = Simulations Supplement; Statsoft -- Factor Analysis; Factor Link)
Week 14 Assignment
Week 15: Reminders (No Reading Quiz) -- Avoiding Big Errors; Regression Examples
Week 16: Final Exam
The analysis of data collected at regular time intervals represents an important area of business statistics for two inter-related reasons: 1) it permits patterns from past changes in a variable to be examined; and 2) these past patterns can be used as a tool to forecast future values. Unlike regression analysis, where data must be gathered on both dependent and independent variables, time series methods can be used even if data have been collected only on the variable of interest. This provides a major advantage because it reduces the amount of data required. Chapter 14 in Siegel introduces some of the concepts of time series analysis. The discussion in this supplement covers a few of the things that Chapter 14 omits or leaves a bit fuzzy. Some of the key terms are:
Y(t) = values for variable Y (such as sales) measured at regular time intervals (t) such as monthly. Y(t) is also referred to as "the level of Y" or a "time series on Y."
Y(t-1) = value for the variable Y in the prior period. For instance, with monthly sales data, Y(t-1) refers to the prior month's sales. Y(t-1) is also called the "lagged" value of Y or the "lag of Y."
Y(t) - Y(t-1) = difference in Y; it is also known as the "first difference" of Y as well as the "change in Y" from period to period. It measures how much the variable changed from its value in the prior period. For monthly sales, it shows how much sales changed from the prior month.
Structural Models = regression models using time-based data where several other variables (X-variables) are used as explanatory variables in predicting values of Y;
Time Series Models = equations which use past values of Y, past differences in Y, and information drawn from these past values to predict current values;
Time Series Components = patterns in the past movements of Y such as a trend component, a cyclical component, a seasonal component, or a random component.
We will use weekly Sales for Company X from 1990 through 2001 to illustrate our points. The simplest (but sometimes misleading) way to determine patterns in the past history of sales is to look at a graph of weekly sales plotted over time (Sales on the Y-axis and weeks & years on the X-axis) to see if there are any obvious trends, repetitive cycles, big jumps or dips during certain periods, and the like.
At first glance, the graph above appears to show that there may be an upward trend (a trend component) over the time frame of about $10 thousand per year. Also, there are some repetitive "ups and downs" (a cyclical component). If you look closely, you can also make out big dips of $20 to $30 thousand around the end of each year and an increase of a few thousand during the summer weeks (seasonal components).
While looking at graphs like the one above is a good first step, it leaves the size of the patterns to a lot of guesswork. To quantify the patterns in weekly sales more precisely, we can estimate an equation from the data that looks similar to a regression equation but uses only information on the variable under study or on time. The data in our file would appear as below, where "t" represents a particular week:
| Week (t) | Sales (in thousands of $) |
| 1 | 200 |
| 2 | 208 |
| 3 | 215 |
In the equation estimated below we will use the following definitions:
Sales(t) = weekly sales in thousands of dollars;
Trend = time variable counting the weeks; it starts at 1 and increases by 1 unit each week of the sample; this looks for a linear "trend" component (pattern) in the data;
Sales(t-1) = Sales in the prior week; this is the simplest means of looking for past "cycles" in the data;
XMAS = a seasonal indicator variable equal to 1 if a week included December 25 and 0 otherwise;
SUMMER = a seasonal indicator variable equal to 1 for weeks from Memorial Day to Labor Day and 0 otherwise.
The results of estimating an equation to predict weekly Sales with these components appear below:
| Variable | Coefficient | Std. Error | t-value | p-value |
| Constant | 100 | 6.2 | 15.0 | 0.001 |
| Trend | 0.075 | 0.006 | 13.0 | 0.001 |
| Sales(t-1) | 0.50 | 0.03 | 15.00 | 0.001 |
| Xmas | -32.0 | 3.5 | 9.00 | 0.001 |
| Summer | 6.0 | 1.3 | 4.00 | 0.001 |
R-square = 0.82
Durbin-Watson = 2.00
Box-Pierce Q(12 lags) = 10.2 (p-value = 0.65)
Mean Weekly Sales = $250 (000); Standard Deviation of Sales = $30 (000)
In equation form, these regression results would be written
Sales(t) = 100 + 0.075*Trend + 0.50*Sales(t-1) - 32*XMAS + 6*SUMMER + error(t)
The coefficients in this equation are interpreted the same way as regression coefficients. For instance, the Trend coefficient of 0.075 means that for each week, sales increase by about $0.075 thousand ($75). Around Christmas, sales drop by about $32 thousand. The "Durbin-Watson" statistic (2.00 in this case) is a measure of whether the errors in our model are dependent on each other (a bad thing). Values for the Durbin-Watson between about 1.6 and 2.4 are viewed as indicating independence of the errors (a good thing). The "Box-Pierce Q-Statistic" is another measure of this same thing.
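To make the estimation step concrete, here is a minimal Python sketch. It does not use the Company X data (which is not reproduced here); instead it simulates 624 weeks from the equation above with an assumed error standard deviation of 7.5, then re-estimates the coefficients by least squares and computes the Durbin-Watson statistic. The week numbers chosen for XMAS and SUMMER and the starting value of 200 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks = 624                          # 12 years x 52 weeks
week = np.arange(1, n_weeks + 1)       # the Trend variable
week_of_year = (week - 1) % 52         # 0..51 within each year
# Assumed calendar: last week of each year is the Christmas week,
# and weeks 21-35 of each year stand in for Memorial Day to Labor Day.
xmas = (week_of_year == 51).astype(float)
summer = ((week_of_year >= 21) & (week_of_year <= 35)).astype(float)

# Simulate sales from the equation in the text plus random errors.
sales = np.empty(n_weeks)
prev = 200.0                           # assumed starting level
for i in range(n_weeks):
    sales[i] = (100 + 0.075 * week[i] + 0.50 * prev
                - 32 * xmas[i] + 6 * summer[i] + rng.normal(0, 7.5))
    prev = sales[i]

# Regress Sales(t) on a constant, Trend, Sales(t-1), XMAS, and SUMMER.
# The first week is dropped because it has no lagged value.
y = sales[1:]
X = np.column_stack([np.ones(n_weeks - 1), week[1:], sales[:-1],
                     xmas[1:], summer[1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Durbin-Watson: sum of squared changes in the errors over sum of squared errors.
resid = y - X @ coef
durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print("Constant, Trend, Sales(t-1), XMAS, SUMMER:", np.round(coef, 3))
print("Durbin-Watson:", round(durbin_watson, 2))
```

Because the simulated errors are independent by construction, the Durbin-Watson value should land near 2, inside the 1.6 to 2.4 band described above.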
Forecasts from Time Series Models
We can generate forecasts from this equation fairly easily. Suppose that for the last week of the 12 years of data (t = 624), actual sales were 300. The equation would forecast sales for the first week of 2002 (t = 625) and the second week of 2002 (t = 626) to be
Forecast Sales(t+1) = 100 + 0.075*(Trend = 625) + 0.5*(Sales week t = 300) - 32*(XMAS = 0) + 6*(SUMMER = 0) = 296.9
Forecast Sales(t+2) = 100 + 0.075*(Trend = 626) + 0.5*(Forecast Sales week t+1 = 296.9) - 32*(XMAS = 0) + 6*(SUMMER = 0) = 295.4
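The two-step forecast can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and the coefficients are the ones from the results table. Note that the second forecast feeds the first forecast back in as the lagged sales value.

```python
def forecast_sales(trend, lagged_sales, xmas=0, summer=0):
    """One-week-ahead forecast (in thousands of dollars) from the estimated equation."""
    return 100 + 0.075 * trend + 0.50 * lagged_sales - 32 * xmas + 6 * summer

last_actual = 300                                   # actual sales in week 624
forecast_625 = forecast_sales(625, last_actual)     # uses the actual lagged value
forecast_626 = forecast_sales(626, forecast_625)    # uses the week-625 forecast as the lag

print(round(forecast_625, 1), round(forecast_626, 1))
```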
The goal in forecasting is to make accurate predictions. Errors = Actual Values - Forecasted Values. As in regression analysis, R-square is a measure of how well the patterns in the model account for past movements in the series. In addition to R-square, other measures are commonly used to evaluate how good the model is at forecasting (predicting) sales. Some of these are:
Root Mean Square Error = a measure of the average size of the errors of the forecasts; it is the square root of (the sum of the squared errors divided by the number of errors);
Mean Absolute Error = a measure of the average size of the errors of the forecasts; it is the sum of the absolute values of the errors divided by the number of errors;
Mean Absolute Percent Error = the average error size as a percent of the mean of the variable being forecasted; it is the mean absolute error divided by the mean.
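The three measures can be sketched in Python as follows. The actual and forecast values are made-up numbers for illustration only. The percent error follows the definition given in this supplement (mean absolute error divided by the mean of the series); some software instead averages each period's percent error, so results may differ slightly.

```python
import math

# Hypothetical actual and forecasted weekly sales (thousands of dollars).
actual = [250, 260, 240]
forecast = [245, 262, 250]

errors = [a - f for a, f in zip(actual, forecast)]   # Errors = Actual - Forecast
n = len(errors)

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)    # root mean square error
mae = sum(abs(e) for e in errors) / n                # mean absolute error
mean_actual = sum(actual) / n
mape = 100 * mae / mean_actual                       # percent of the mean, as defined above

print(round(rmse, 2), round(mae, 2), round(mape, 2))
```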
"Static Forecasts" estimate the forecasts and errors using the data for the sample used to compute the time series equation. "Dynamic Forecasts" use data from outside the original sample to compute forecasts and errors.
For the model above, the forecasting diagnostics were as follows:
(Static) Forecast Diagnostics
Root Mean Square Error = $7500; Mean Absolute Error = $7000; Mean Absolute Percent Error = 2.8%
In words, the average weekly error in our forecasts was $7000 to $7500, or about 2.8% of mean weekly sales.
More Complex Patterns:
The example above investigated some of the simpler patterns to be found in a time series. More complicated patterns can be investigated in a number of ways. First, trends are not necessarily linear. Using a squared Trend term can sometimes account for this. Sometimes the trends are much more complex, requiring special methods. Second, cyclical patterns may be much more subtle and complex, requiring a second or third "lagged" value or even lagged error terms. Sometimes, the errors from the forecasts can be used to improve future forecasts (another cyclical pattern). Third, seasonal patterns may not be as straightforward as we estimated above.
Also, other variables can be added to a time series equation. In the example above, the existence, type, or amount of advertising done in the prior week might be included as an explanatory variable. Finally, one of the most subtle but most important issues in looking for time series patterns is to realize that sometimes patterns can be misleading. What appears to be a trend or a cycle may be nothing more than a series of random steps. For example, in class we will see a graph of a variable that appears to display a downward trend, and possibly repetitive cycles around this downward trend, but is really nothing more than a series of random movements.
These kinds of time series that are a series of random steps are called "random walks." Chapter 14 discusses the idea a little. Random walks can masquerade as seeming trends or cycles in data. A random walk contains random movements from one step to the next. Suppose you selected a starting point, say 50, then drew a ball at random (say 2) and put your new mark 2 places above your starting mark, so you are now at 52. You then draw a new ball at random (say -5) and place the next mark 5 places below your prior mark, so you are now at 47. This makes the change in your position random. That is exactly how the graph shown above was generated.
Such a random walk (series of random steps) looks very different from a series where balls are drawn from a hopper where 50 is the mean and balls with numbers ranging from 45 to 55 are in the hopper and drawn at random. A series generated by this procedure is a "random series." It is also called a "white noise" series. Its graph probably appears more like the one people have in mind when they think of randomness.
Distinguishing a random series from one with patterns is not too difficult. The graph above has no obvious pattern. More precisely, an equation for Stock Price(t) that included lagged stock price, trends, or seasons would have all of its coefficients near zero. Distinguishing random walks is a bit trickier. As the graph for the random walk shows, there seems to be a trend. If an equation for Stock Price(t) were estimated with a Trend, the Trend coefficient would also appear large. The key is the coefficient on the lagged stock price, Stock Price(t-1). A random walk will have a lagged coefficient near 1.0 (usually 0.9 to 1.0). The random walk above has this equation: Stock Price(t) = 0.17 + 0.98*Stock Price(t-1). The coefficient of 0.98 means that there is a random walk component in this series. Further analysis should be conducted using changes in the stock prices instead of the original levels.
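The ball-drawing description can be turned into a short Python sketch. The ball range of -5 to 5 for the random walk steps, the seed, and the series length are assumptions made for illustration; the graphs used in class were not necessarily generated this way. The sketch builds a random walk and a white noise series, then estimates the coefficient on the lagged value for each.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Random walk: start at 50 and add a randomly drawn step (-5..5) each period.
steps = rng.integers(-5, 6, size=n)
random_walk = 50.0 + np.cumsum(steps)

# White noise: each value is its own draw from balls numbered 45..55.
white_noise = rng.integers(45, 56, size=n).astype(float)

def lag_coefficient(series):
    """Slope from regressing Y(t) on a constant and Y(t-1)."""
    y = series[1:]
    X = np.column_stack([np.ones(len(y)), series[:-1]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rw_coef = lag_coefficient(random_walk)
wn_coef = lag_coefficient(white_noise)
# First differences of the random walk are just the random steps again.
diff_coef = lag_coefficient(np.diff(random_walk))

print("random walk lag coefficient:", round(rw_coef, 3))
print("white noise lag coefficient:", round(wn_coef, 3))
print("differenced random walk lag coefficient:", round(diff_coef, 3))
```

The lagged coefficient for the random walk should come out close to 1.0, while the white noise series and the differenced random walk should both give coefficients near zero, which is why the text recommends working with changes rather than levels.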
Introduction: Experimental design refers to actively controlling the process by which data is generated so that the effects of one or more variables can be better isolated and more accurately estimated. Experimental design methods are commonplace in the natural and life sciences, where many experiments are conducted within laboratories and most of the variables influencing outcomes can be controlled. However, the methods are also useful in other scientific and business settings where only some of the factors are controllable. In business, experimental design has been most widely used in production management settings to test different techniques or machinery. However, the same methods are adaptable to almost any managerial or personnel setting, whether a physical product is being produced or a service is being offered. The applications can range from very simple methods to very complicated designs. Some of the key terms and concepts are defined below.
Factor(s): a variable(s) that wholly or partly determines changes in another variable (usually designated as the "response variable"); settings or "levels" of these factors refer to the different possible values the factor can take; in many situations, these factor levels are qualitative
Experimental Data: Data that is generated where one or more of the factors influencing the outcomes is actively manipulated so as to better isolate or eliminate its effect or the effect of other factors;
Observational Data: Data that is collected without any active manipulation of the factors that influence the outcomes
Experimental Design: The plan for manipulating factors when generating and observing outcomes; the plan may range from simply a change in the setting (level) of one factor -- a simple "intervention" -- to a very extensive design that holds some factors constant while changing the settings (levels) of other factors
Overview of Steps in an Experimental Design: While the specifics of a design vary based on the details of the industry, company, and particular issue, an organized approach to setting up the experiment should, more or less, follow the items presented below. (Adapted from Coleman and Montgomery, Systematic Approach to Planning for a Designed Industrial Experiment, Technometrics, 1993.)
1. Objectives of the experiment: a precise statement of the purpose of the experiment. The statement should explain the factor-response relationship the experimenter desires to isolate and why. The statement should be to the point, measurable, and relevant.
2. Background: existing theoretical or statistical knowledge concerning the response variable or factors, if any, as well as how the current experiment fits in with this background
3. Response variable: identify how the variable is measured and the typical operating means and ranges (if known)
4. List factors & determine settings/controls:
a) Factors of main interest -- identify the variables influencing the response variable whose effects or combined effects (interactions) the design is intended to help isolate; determine the desired settings (levels) of these variables during the experiment, including desired interactions;
b) Factors to be held constant -- identify other variables (not of main interest) that are likely to influence the response variable but whose settings can be held the same throughout the experiment. Although these should be "held constant," small changes may be permissible. If so, specify these "allowable" ranges of variation
c) Other controllable variables -- identify other variables (not of main interest) likely to influence the response variable that can be actively manipulated; determine the strategy for filtering out their effects (such as randomizing the variables of main interest among different settings of these factors or
d) Non-controllable factors -- identify the other variables (not of main interest) that influence the response variable and that cannot be held constant or actively controlled; if these variables can be measured, determine the specific measurement strategy; if these cannot be measured, attempt to identify the expected impact, if any, on the experimental outcomes
5. Restrictions: identify and list cost, legal, managerial, or other limitations placed on the ability to actively control or manipulate factors in the experiment
6. Oversight and setup: identify responsibilities of personnel in the experiment and whether or not trial runs should be conducted
7. Analysis techniques: if possible, identify the most likely statistical methods that will be used to analyze the data produced from the experiment (such as regression, plots, ANOVA, ...)
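As one illustration of step 4, the Python sketch below lays out a small experimental plan: every combination of the settings of the factors of main interest is run twice, and the run order is randomized so that uncontrolled influences are spread across the settings. The factor names, levels, number of replicates, and seed are all hypothetical, and a full factorial with random run order is only one of many possible designs.

```python
import itertools
import random

# Hypothetical factors of main interest and their settings (levels).
factors = {
    "machine": ["old", "new"],
    "shift": ["day", "night"],
    "material": ["standard", "premium"],
}
replicates = 2  # assumed number of runs per combination

# Full factorial: every combination of factor levels.
combinations = list(itertools.product(*factors.values()))
plan = [dict(zip(factors, combo))
        for combo in combinations
        for _ in range(replicates)]

# Randomize the run order (fixed seed so the plan can be reproduced).
rng = random.Random(42)
rng.shuffle(plan)

for run_number, settings in enumerate(plan, start=1):
    print(run_number, settings)
```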
Error: difference between the measured/estimated value of a parameter and its accurate value
Sampling error: error which arises because a parameter is estimated with less than the full population of values; this error would not exist if the entire population were used
Random sampling: allows the sampling error to be minimized and the likely size of the error to be estimated; random samples provide unbiased sampling, that is, they give every member of the population the same chance of selection into the sample
"Effective" random sampling: sampling methods that may not give every element a truly identical chance of selection but come so close as to give equivalent results
Examples:
Cluster Sampling -- grouping population elements and sampling within selected groups;
Representative Sampling -- selecting randomly within groups based on proportions of known characteristics of the samples such as race, gender, income, political affiliation, experience....; representative sampling may even improve on purely random samples which suffer from non-response problems.
Systematic Sampling -- sampling every nth person; effectively random if the elements of the target population are mixed together very well.
Biased sampling: sampling which makes inclusion of some elements of the population more likely than other elements; such methods make estimation of the likely characteristics of the sampling error impossible
Examples:
Self-selected samples -- sampling where the population members themselves determine whether they will be included in the sample;
Systematic sampling -- sampling every nth person but where the target population is not mixed well;
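The difference between random and biased sampling can be seen in a small Python simulation. Everything here is hypothetical: the population is 10,000 made-up values with a mean near 50, and the self-selection rule (members with higher values are more likely to respond) is just one assumed way a sample could be biased.

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(50, 10, size=10_000)   # hypothetical survey variable
pop_mean = population.mean()

# Simple random sample: every member has the same chance of selection.
srs = rng.choice(population, size=400, replace=False)

# Systematic sample: every 25th member from a random starting point.
# The population order here is already random, so this is "effectively" random.
start = rng.integers(0, 25)
systematic = population[start::25]

# Self-selected sample: members with larger values are more likely to respond.
p_respond = 1 / (1 + np.exp(-(population - 50) / 5))
self_selected = population[rng.uniform(size=population.size) < p_respond]

print("population mean:   ", round(pop_mean, 2))
print("simple random mean:", round(srs.mean(), 2))
print("systematic mean:   ", round(systematic.mean(), 2))
print("self-selected mean:", round(self_selected.mean(), 2))
```

The two random-type samples should land close to the population mean, while the self-selected sample overstates it no matter how many members respond.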
Non-sampling error: any error which is present even when the entire population is used
Sources of Error:
Respondent-based: Inaccurate responses (lying; fatigue; embarrassment; memory ...)
Researcher-based: Ambiguous or confusing wording/phrasing; poorly designed response options; recording error (data entry; machine error); missing the target population ...
Dealing with Non-sampling Error
i) Controlled studies of size/distribution of a particular error -- do the errors average to zero or introduce bias? Are errors arising because of bias?
ii) Repeating questions with altered format of question or response
iii) Assuring respondent anonymity
iv) Drafting/redrafting questions/responses
v) Memory prompts
vi) Question ordering
vii) Pilot studies: checking for wording problems; checking for fatigue; checking for response options;
...
Introduction:
Simulations are a growing tool used in both academic and business settings due to advances in the computational power of computers. Even through most of the 1980s, most simulations of any sophistication were conducted by academics, a few governmental agencies such as the U.S. Department of Defense, and very large businesses such as Bell Labs employing people with significant mathematical/statistical training. With various point-and-click software applications, complex and powerful simulations can now be conducted with relative ease.
Simulations, in general, are studies where a set of assumptions is combined with data to determine what outcomes would be found under those conditions. In many settings, simulations go by the name "what if" analysis because the investigator is considering what will happen if a set of hypothetical conditions or data hold true. In statistics, the predicted values from a regression are a type of simulation where the equation and coefficients are used with the data for the X-variables to generate the predicted (forecasted) values for Y. Forecasted values from a time series model are another example of simulated values. Cost estimates for a project based on identifying costs of similar projects, or appraisal values for a house based on average values of similar houses, are examples of very simple simulated outcomes.
Simulations (other than just guesses) all share two parts:
Simulated Value = Model & Data
Simulated values are generated by assuming different "what if" scenarios for either the model, the data, or both.
Simple Example
A simulation might start with a model as simple as the basic accounting definition of net worth, which subtracts one thing from another, and then generate a "what if" scenario by merely multiplying assets by 2:
Model: Net Worth = Assets - Liabilities
Data (to be used in the model): Liabilities = actual values
Assets = 2*actual values
In this simple example, the only simulated or hypothetical part is the asset data values we plug in, since the actual values for liabilities are used and the model is a basic definition, not a hypothetical relationship. Spreadsheet software such as Excel makes doing these and slightly more complex simulations relatively easy by permitting various columns of numbers to be combined in formulas determined by the user as well as permitting users.
The complexity of a simulation increases as the model becomes more
complex and more of the data and model are hypothetical. Still, no
matter how sophisticated the simulation, the simulated outcomes are
driven by a model (one or more equations) and data (real or
hypothetical input provided by the user). Even simulations in the form
of computer games -- such as Microsoft Flight Simulator -- that present
pictures to the users are really just combinations of equations (model)
and data.
Stochastic Simulations:
The example described above is more technically called a "deterministic" simulation because all of the numbers used are fixed at the outset, even the hypothetical values. A different class of simulations is one where the data or parameters can take on different values that, to some extent, are random. These kinds of simulations are called "stochastic" simulations or "Monte Carlo" simulations. They improve simulations by permitting the user to incorporate uncertainty more explicitly into the hypothetical scenarios.
In stochastic simulation, the idea is not to just say that we don't know the future and therefore let's just pick any number from 1 to 1 million at random. Instead, users assume that they can describe the likelihood of different outcomes but with some lingering uncertainty about the specifics. Therefore, the typical procedure is to pick some probability distribution, such as the normal distribution, that the user thinks describes the likelihood of outcomes, and then let the computer software generate hypothetical values at random that fit that probability distribution.
Example -- A Deterministic Simulation:
Suppose we are constructing a new house. To simplify matters, suppose we also know the final expense is driven by the size of the house (sq. ft. heated & cooled) and the quality of the house (Premium = 1; Standard = 0). We also know that jumps in lumber prices (% change from the current date in 2x4 prices) affect the expense. Based on past experience and analysis, suppose we have the following relationship (MODEL):
Housing Expense = $10,000 + 80*(sq. ft) + 20*(Premium*sq. ft) + 20,000*(% Change Lumber)
Data: we could plug in values for (sq. ft.), (Premium), and (% Change Lumber) to simulate an outcome. If we use 3000 sq. ft., Premium = 1, and Lumber = 1%, the simulated expense would be
$330,000 = 10,000 + 80*(3000) + 20*(1*3000) + 20,000*(1).
This is just like the predicted values we have generated with regression analysis and is another example of a "deterministic" simulation where the hypothetical parts of the simulation are fixed in advance.
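The same deterministic calculation, written as a short Python sketch (the function name is made up for illustration; the coefficients come from the model above):

```python
def housing_expense(sq_ft, premium, lumber_pct_change):
    """Housing expense in dollars from the model in the text.

    premium is 1 for a premium house and 0 for standard;
    lumber_pct_change is the % change in lumber prices (1 means 1%).
    """
    return 10_000 + 80 * sq_ft + 20 * (premium * sq_ft) + 20_000 * lumber_pct_change

expense = housing_expense(sq_ft=3000, premium=1, lumber_pct_change=1)
print(expense)
```

Changing any one input (say, Premium = 0) and rerunning gives a new "what if" scenario, but each run still produces a single fixed number.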
Example-- Simulation with Stochastic Data Values
Now suppose everything else about the housing expense model and the data is the same, but we are not really sure about the changes in lumber prices. Our best guess is that the average change will be zero, but prices have often varied 1 or 2 percent up or down and occasionally a lot more. Rather than just plugging in a number such as 1% as we did above, we decide to conduct a stochastic simulation where we generate 100 different housing expenses for a 3000 sq. ft. premium house but assume that lumber price changes for these 100 cases are drawn from a normal distribution with a mean of 0% and a standard deviation of 2%, designated as Normal(0, 2). So now the setup is
Housing Expense = $10,000 + 80*(sq. ft = 3000) + 20*(Premium*sq. ft = 1*3000) + 20,000*(% change lumber prices: Normal(0, 2))
After the software computes housing expenses for these 100 cases, we can examine them and find out what was the average housing expense, what was the highest and lowest expense, and what was the typical range of expense. This kind of simulation provides much more information on which we could base our decisions.
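A minimal Python sketch of this stochastic simulation follows. The seed and the choice of the 5th to 95th percentile as the "typical range" are assumptions made for illustration; any simulation software that can draw from a normal distribution would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(11)   # fixed seed so the run can be repeated
n_cases = 100

# % change in lumber prices for each case: Normal(mean 0, standard deviation 2)
lumber_change = rng.normal(0, 2, size=n_cases)

# 3000 sq. ft. premium house; only the lumber price change varies.
expense = 10_000 + 80 * 3000 + 20 * (1 * 3000) + 20_000 * lumber_change

low, high = np.percentile(expense, [5, 95])   # assumed definition of "typical range"
print("average expense:", round(expense.mean()))
print("lowest / highest:", round(expense.min()), "/", round(expense.max()))
print("typical range (5th to 95th percentile):", round(low), "to", round(high))
```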
Example -- Simulation with Stochastic Data Values
& Model Parameters (coefficients)
To incorporate reality a little better, we could also assume that the coefficients (also called the parameters) in the model (80, 20, 20000) are, themselves, just estimates. The exact relationships are not known with certainty and can change to some extent. Suppose we think that the coefficient for sq. ft. is 80 on average with a standard deviation of 5, the premium coefficient is 20 on average with a standard deviation of 2, and the coefficient on lumber prices is 20,000 on average with a standard deviation of 1000. Now our simulation setup with a 3000 sq. ft. premium house is
Housing Expense = $10,000 + Normal(80, 5)*(sq. ft. = 3000)
+ Normal(20, 2)*(Premium*sq. ft. = 1*3000)
+ Normal(20000, 1000)*(% change lumber prices: Normal(0, 2))
In this case, the parameters (coefficients) in the model are generated by drawing numbers from normal distributions, as are the data values for lumber price changes. We could again generate, say, 100 cases and find the average housing expense, the highest and lowest expense, and the typical range of expenses.
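Extending the idea to random coefficients takes only a few more lines of Python. As before, the seed is an assumption; the means and standard deviations are the ones stated above, and each of the 100 cases gets its own draw of every coefficient and of the lumber price change.

```python
import numpy as np

rng = np.random.default_rng(21)   # fixed seed so the run can be repeated
n_cases = 100

# Coefficients drawn from normal distributions (mean, standard deviation).
sq_ft_coef = rng.normal(80, 5, size=n_cases)
premium_coef = rng.normal(20, 2, size=n_cases)
lumber_coef = rng.normal(20_000, 1_000, size=n_cases)

# Data value drawn at random: % change in lumber prices, Normal(0, 2).
lumber_change = rng.normal(0, 2, size=n_cases)

# 3000 sq. ft. premium house.
expense = (10_000 + sq_ft_coef * 3000 + premium_coef * (1 * 3000)
           + lumber_coef * lumber_change)

print("average expense:", round(expense.mean()))
print("lowest / highest:", round(expense.min()), "/", round(expense.max()))
print("standard deviation:", round(expense.std(ddof=1)))
```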
Week 1 Assignment
Accessing Excel and SPSS -- Using Klibanoff Data Files
Week 2 Assignment
Understanding Descriptive Statistics & Simple Tests
1. In Excel, open the file (restaurantstocks.xls -- Chapter 2) on the data CD with the Klibanoff book. The file contains monthly stock returns (1984-1994) for five restaurant stocks (measured as % above or below the Treasury Bill interest rate). Follow the instructions for #6 on page 62 to complete parts a-f. Instead of doing all 5 stocks, pick just two. (On part (a), go ahead and compute all descriptive statistics; name the output table and format it for easy reading.) Print your output. Write (print) answers to the questions next to your output.
2. Open the same file in SPSS (remember to change the file type to .xls). (Click "Variable View" and change the number of decimal places to 2.)
a. Create a variable (Split) equal to 0 for the first 66 observations and equal to 1 for observations 67-132.
b. Compute descriptive statistics for Dairy Queen and Wendy's for the first half and second half of the observations using Analyze>Descriptives>Explore. Put the new variable in the "Factor List" box. Select the button for "Statistics" (only).
c. Test whether the means for the first and second half of the observations for DQ and Wendy's are the same using Analyze>Compare Means>Independent Samples t-test. Use the Split variable as the "Grouping Variable."
d. Print the output.
d1. On your output, write down the difference in means for both stocks between the first and second parts of the time period.
d2. On your output, write down the statistic that tests whether the first half and second half are equal. What does it show?
Week 3 Assignment
Making Application of Basic Statistical Concepts
1. Open Excel and compute probabilities for the following situations:
a. Suppose that accounting errors due to erroneous data entry have a 1% (0.01) probability on any given entry. Also, suppose that data entry errors are a binomial variable (error or no error). Compute the likelihood of zero errors given 150 entries.
(In Excel, highlight cell A1 and click the fx icon on the tool bar; select “Function Category” = Statistical and “Function name” = BINOMDIST; fill in the following values in the blanks: Number_s = 0, Trials = 150, Probability_s = 0.01, Cumulative = True.)
b. Suppose that the average amount of time that customers are kept on hold by AOL customer support is 8 minutes with a standard deviation of 1.5 minutes. If wait time is normally distributed, calculate the probability of waiting less than 10 minutes.
(In Excel, highlight cell A3 and click the fx icon on the tool bar; select “Function Category” = Statistical and “Function name” = NORMDIST; fill in the following values in the blanks: X = 10, Mean = 8, Standard_Dev = 1.5, Cumulative = True.)
c. Print the Excel spreadsheet.
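For anyone who wants to check the Excel results another way, the two probabilities can be sketched in Python using only the standard library. The helper function name is made up for illustration; it plays the role of BINOMDIST with Cumulative = True, and NormalDist(...).cdf plays the role of NORMDIST with Cumulative = True.

```python
from math import comb
from statistics import NormalDist

def binomial_cdf(k, trials, p):
    """Probability of k or fewer successes in the given number of trials."""
    return sum(comb(trials, i) * p ** i * (1 - p) ** (trials - i)
               for i in range(k + 1))

# a. Zero data entry errors in 150 entries with a 1% error probability.
p_zero_errors = binomial_cdf(0, 150, 0.01)

# b. Waiting less than 10 minutes when wait time is Normal(mean 8, sd 1.5).
p_wait_under_10 = NormalDist(mu=8, sigma=1.5).cdf(10)

print(round(p_zero_errors, 4), round(p_wait_under_10, 4))
```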
Week 6 Assignment
Objective: Practice with binary variables
Using the pizza.xls file on the Klibanoff CD, do the following
questions on page 125:
1-a,b,e
2-a,b,c,e
4-a,b,c,d
1. Open the SPSS file HOWNER
Howner contains the following variables
Howner = 1 if own home and 0 if not;
Income = income per person in household in thousands of dollars;
2) Generate linear regression estimates (OLS) using Howner as the
dependent variable and Income as the independent variable. (Save
the predicted values by clicking Save and selecting “Unstandardized
predicted values.”)
3) Generate logistic regression estimates of the same model
(ANALYZE/REGRESSION/BINARY LOGISTIC) and fill in the boxes. (Note
that “covariates” are the same thing as independent variables.)
Again, save predicted values by using SAVE and then check
“probabilities.”
4) Generate a scatter diagram (use Graphs/Scatter/Overlay/Define)
that shows three Y-X pairs (howner-income; linear predicted
values-income;
logistic predicted values-income)
5) Using Transform>Recode into Different Variable, generate a
categorical variable from your income variable, so that anyone with an
income above $18 (thousand) is in category A (high income), anyone from
$10 to $18 (thousand) is in category B (medium income), and anyone
below $10 (thousand) is in category C (low income).
6) Conduct a crosstabulation (Analyze>Descriptives>Crosstabs)
using income category as the row variable and home owner as the column
variable. Under "Statistics" select "Chi-Square" and under
"Cells" select "Expected" along with "observed."
7) Print your output.
8) On the back of your output
a. Explain the meaning of the slope coefficient from the
linear regression;
b. Explain any problem arising from the use of linear
regression in this case;
c. Explain how
logistic regression helps solve this problem. (This need not be
fancy; just draw arrows to key parts of the graph from your
explanations above, below, or beside.)
d. Explain the meaning of the "percent correct" statistics
generated by the logistic regression.
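To see why a straight line and a logistic curve behave differently with a 0/1 dependent variable, the sketch below fits a simple regression to a small made-up data set and compares its predictions with a logistic curve. This is only an illustration: the data are hypothetical (not the HOWNER file), and the logistic coefficients are picked by hand rather than estimated.

```python
import math

def ols_fit(x, y):
    """Closed-form simple regression: returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def logistic(z):
    """Logistic function; its output always lies strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical data, NOT the HOWNER file: income in $ thousands, howner 0/1.
income = [5, 8, 10, 12, 15, 18, 22, 30]
howner = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = ols_fit(income, howner)
# Linear predictions at a very low and a very high income.
linear_pred = [intercept + slope * x for x in (2, 40)]

# Illustrative logistic coefficients chosen by hand, not estimated.
logit_pred = [logistic(-4.0 + 0.3 * x) for x in (2, 40)]
print(linear_pred, logit_pred)
```

On these made-up numbers the linear predictions fall below 0 at the low income and above 1 at the high income, while the logistic values stay inside the 0 to 1 range.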
Week 11 Assignment
Objective: Understanding and explaining time series regression and
forecasting results
A regression study of (monthly) checking account balances was estimated based on data from January 1990 through December 2005. The variables included in the analysis were:
Balances = end of month checking account balances in millions of
dollars;
Income = change in national GDP from prior month in percent at
annual rates (monthly figures interpolated from quarterly figures);
Trend = trend variable starting at 1 in January 1990 and increasing
by 1 unit each month;
Jan = 1 if January and 0 otherwise;
Oct = 1 if October and 0 otherwise;
Nov = 1 if November and 0 otherwise;
Variable        Coefficient   Std. Error   p-value
Constant        30.0          10.0         3.0
Income          0.03          0.001        0.001
Balances(-1)    0.7           0.1          0.001
Trend           0.30          0.1          0.01
Jan             -19.0         2.0          0.01
Oct             13.0          1.0          0.01
Nov             14.0          1.0          0.01
R-square = 0.92; Durbin-Watson = 1.84; Box-Pierce Q(12 lags) = 10.2
(p-value = 0.65)
Mean Balances = $150 million; Standard deviation of Balances = $5
million
Forecast Statistics Based on 1990-2005 model forecasts for 2006:
                              Static Forecast   Dynamic Forecast
Root Mean Square Error        $4.3 million      $6 million
Mean Absolute Error           $3.5 million      $5.7 million
Mean Absolute Percent Error   1.7%              3.4%
Answer these questions about the results:
1. How much would monthly balances increase by if income grew
at its long run average of 3% for 12 months in a row?
2. What is meant by the term Balances(-1)?
3. Draw a time plot (time on X-axis, Balances on Y axis) that
depicts the relationships implied by the coefficients on Balances(-1)
and Trend.
4. Provide a quantitative estimate of how well this regression
model explains monthly balances.
5. Explain the meaning of the coefficients on January, October,
and November.
6. Explain, in words understandable to someone who doesn't know
statistics, what the 4.3, 3.5, and 1.7 mean for the static forecast
diagnostics.
7. Using either a graph, a timeline, or an equation, simply
describe the difference between how the static and dynamic forecasts are
produced.
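The forecast diagnostics and the static/dynamic distinction can be made concrete with a few lines of code. The sketch below is a simplified illustration, not the model from the table: it keeps only a constant and the lagged balance, takes just the 0.7 coefficient from the table, and uses a made-up constant and made-up 2006 balances.

```python
import math

def forecast_errors(actual, forecast):
    """Return (RMSE, MAE, MAPE in percent) for paired actual/forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return rmse, mae, mape

def static_forecast(actual, last_known, a, b):
    """One-step-ahead: each forecast uses the ACTUAL prior-month value."""
    lagged = [last_known] + actual[:-1]
    return [a + b * y_prev for y_prev in lagged]

def dynamic_forecast(steps, last_known, a, b):
    """Multi-step: each forecast feeds on the PREVIOUS FORECAST instead."""
    forecasts, y_prev = [], last_known
    for _ in range(steps):
        y_prev = a + b * y_prev
        forecasts.append(y_prev)
    return forecasts

# Stripped-down model Balances = a + b * Balances(-1).
# Only b = 0.7 comes from the table; a and all balances are hypothetical.
a, b = 45.0, 0.7
last_2005 = 150.0
actual_2006 = [152.0, 149.0, 151.0, 153.0]

static = static_forecast(actual_2006, last_2005, a, b)
dynamic = dynamic_forecast(len(actual_2006), last_2005, a, b)
print(forecast_errors(actual_2006, static))
print(forecast_errors(actual_2006, dynamic))
```

The two forecasts agree for the first month, since both start from the last known balance. They diverge afterward because the dynamic forecast never sees the actual 2006 values.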
1) a 1-page
summary for top managers that addresses the key questions.
2) a technical summary (3 page maximum)
to explain in more
detail the nature of the results.
Create and explain plans for generating data in a (semi) controlled experimental situation. Choose a business or personal application of interest to you. Structure your report as follows:
1. Objective: (Describe the objective(s) of the
experiment)
2. Background: (Give some background – why is this
important; what are the known relationships between factors and
responses; how)
3. Response Variable: (Specifically, describe the response
variable and how it will be measured; be clear about the “experimental
units,” that is, who or what the response variable is measured on,
such as people, cars, machines, ....)
4. Factors on Response Variable:
Factor(s) of Main Interest: (provide information about the variable(s)
that the experiment is designed to isolate; describe the specifics of
its measurement )
Other Factors to be Actively Controlled: (list other variables that
influence the response variable which you will be able to manipulate
in the design of the experiment below and, specifically, how each will
be measured. If one or more will be held constant, provide
information on tolerable variation. Also, list potential or
suspected interactions between any one of these factors and the factor
of main interest.)
Other Factors not Actively Controlled: (provide information on other
variables likely to influence the response variable but that cannot be
manipulated in the experiment; list those that can be observed,
measured, and recorded as well as those that cannot be observed or
measured.)
5. Restrictions: (Identify major restrictions on your
experiment such as time limits, budget constraints, or other practical
limitations on the design.)
6. Design of Experiment:
Briefly explain your preferred design and your reasons for
choosing it (such as holding factors constant,
randomizing factors, using blocks, matched pairs, or multi-way
designs). Remember, the design centers on and explains how you
intend to achieve your objective of better isolating the effects of the
factor of main interest on the response variable. Also, describe
how your design handles possible interactions of the factor of main
interest with other factors.
7. Analysis of Results: (Briefly propose how the data from
your experiment might be analyzed. Will you use ANOVA, regression, or
other methods? Briefly describe how you will do this.)
You will be graded on clarity of your report, how well you identify
and explain measurement of the factors, and the degree to which you are
able to
explain and defend your design and how it meets your stated
objective. Your report should be 3 double-spaced pages or
less. Any graphic included (table or chart) does not count
toward this limit.
Classroom Experimental Design Example (abbreviated) – Ice Cream Production
(This template is provided to help illustrate what the assignment
involves. It is not complete. It is only intended to get
you going in the right direction.)
Objective
We have designed an experiment to determine whether the pre-freezing
storage of ice cream influences the taste quality of the final product
...
Background
Ice cream is produced by both commercial enterprises and
households. In either setting, a large number of factors can
influence the quality of the product including ingredients, methods of
freezing, ...
Our client is interested in whether storing the ice cream liquid at cold temperatures for different lengths of time influences the taste quality. In general, ice crystal size is related to the taste quality of ice cream ...
Response Variable
The response or dependent variable in this study is the quality of ice
cream produced. Individuals are the unit of measurement.
Each individual in the experiment will give their rating of the ice
cream on a scale of 1 (poor) to ...
Factors on Response Variable
Factor of Main Interest:
Pre-storage: Once mixed, the pre-freezing storage time for
the ice cream liquid can range from 0 up to several days. It is
limited only by the spoilage of ingredients... Three settings
will be used for
pre-storage ...
Other Factors to be Actively Controlled:
Individual Preferences: Because individuals are the
experimental unit, their subjective preferences for ice cream vary ...
This influence
will be filtered out using design strategies discussed in the next
section.
Freezing Method: Several different commercial and household
freezing techniques are in use. Because we are focusing on
household application, we focus on these. Three main
methods are in use at the household level ...
Recipe: Recipes and their ingredients vary
considerably and influence taste. Recipes vary from cooked to
uncooked, by different milk products (cream, condensed milk, evaporated
milk) ....
Ambient Temperature: The room temperature in which the
freezing is done can influence ice crystal formation. Therefore,
we freeze all samples at 72 (F). Fluctuations of one or two
degrees around 72 are likely ...
Other Factors Not Actively Controlled:
Ingredients: The freshness of ingredients such as cream, vanilla, ...
may influence taste quality. For cream, we will record the number
of days before shelf expiration. For vanilla, we will record a
qualitative measure ...
Restrictions
This experiment will be run on a budget of $100. This
permits about 50 batches of ice cream to be produced, so we will be
limited to about 50 individuals in our study. Individuals will be
volunteers ...
Design of Experiment
In order to isolate the effects of Pre-Storage length on taste quality,
we developed the following design:
Because Pre-storage settings (Zero, Two, Six) may interact with Freezing method settings (HC, MCI, MCCO), we use a 2-way design between these factors permitting all combinations (see Chart 1). .. The 54 individuals will then be randomly assigned to one of the nine combinations of Pre-Storage and Freezing method. The goal of randomizing individuals is to filter out the effects of individual preferences. In choosing this strategy, we assume that individual tastes vary randomly within the population.... We will hold constant Recipe and Ambient Temperature as noted above....
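The random assignment step in this design can be sketched in a few lines. The sketch assumes a balanced layout (six individuals in each of the nine cells), which the example does not state outright. The function name assign, the seed, and the numeric IDs for the volunteers are placeholders of my own.

```python
import random
from collections import Counter
from itertools import product

pre_storage = ["Zero", "Two", "Six"]   # settings named in the design
freezing = ["HC", "MCI", "MCCO"]       # settings named in the design
cells = list(product(pre_storage, freezing))  # the nine combinations

def assign(n_individuals, cells, seed=None):
    """Randomly assign individuals to cells with equal counts per cell."""
    per_cell, remainder = divmod(n_individuals, len(cells))
    if remainder:
        raise ValueError("individuals must divide evenly across cells")
    # One slot per individual, then shuffle the slots into random order.
    slots = [cell for cell in cells for _ in range(per_cell)]
    random.Random(seed).shuffle(slots)
    # Individual IDs 1..n are hypothetical labels for the volunteers.
    return {person: slot for person, slot in enumerate(slots, start=1)}

assignment = assign(54, cells, seed=2007)
print(Counter(assignment.values()))
```

Shuffling a fixed set of slots, rather than drawing a cell at random for each person, keeps the cell sizes equal while still leaving who lands where to chance.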
Data Analysis
The data derived from the experiment will be analyzed using regression
analysis. The regression will estimate an equation with taste
rating as the dependent variable, pre-storage as an independent
variable, freezing method as an independent variable, and the
interaction of pre-storage and freezing method (the product of the two)
as a variable...
1. Open SPSS and retrieve jobratings.sav (you may need to save the file to a memory device and then open it in SPSS). This is a file put together by a human resource consulting company containing information on 389 employees regarding their jobs. Each person’s job is rated by several variables such as the knowledge-education required. Labels are provided for each variable. The sum of these variables was used by the consultant to evaluate job importance-difficulty and relate it to salary. Your job is to generate evidence so that these job rating variables can be evaluated. The questions management would like answered are:
i) are these job rating variables really capturing distinct
characteristics
of jobs or are they merely measuring similar attributes in a
superficially
complicated way?
ii) can employees (as far as these ratings go) be
lumped
into a small number of groups, and if so, how many different groups are
represented?
2. Generate a Factor Analysis and a Cluster Analysis as
evidence to answer these questions.
Use (Analyze>Data Reduction>Factor) and make the following
selections:
Descriptives
(select "univariate descriptives" and “coefficients” under
“correlation matrix”)
Extraction (maximum
likelihood)
Scores (select “save factor
scores matrix”)
For the Cluster Analysis use (Analyze>Classify>Two-Step
Cluster) and make the following selections:
Output (select "Create Cluster
Membership Variable")
3. Print the output. On the back of the output or a
separate sheet, briefly answer questions i) and ii) by making use of
specific
results from the factor analysis and cluster analysis.
[Remember, you can use SPSS Help as well as the “Results Coach” for assistance.]
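The correlation matrix requested under Descriptives is a useful starting point for question i): rating variables that are very highly correlated are probably measuring much the same attribute. The sketch below shows how such a matrix is built. The ratings and the variable names are made up for illustration and do not come from jobratings.sav.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings, NOT from jobratings.sav; variable names are made up.
ratings = {
    "knowledge": [1, 2, 3, 4, 5, 6],
    "complexity": [2, 4, 6, 8, 10, 12],  # exactly twice "knowledge"
    "contacts": [3, 1, 4, 1, 5, 2],
}
names = list(ratings)
corr = {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}
for a in names:
    print(a.ljust(12), "  ".join(f"{corr[(a, b)]:6.2f}" for b in names))
```

In this made-up example the first two variables are perfectly correlated, so they carry the same information, while the third is nearly unrelated to the first and captures something distinct.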