ASSIGNMENT ON
STATISTICS FOR
MANAGEMENT
BY RAHUL GUPTA
Question 1: What do you mean by sample survey?
What are the different sampling methods? Briefly
describe them?
Answer
Introduction:
In statistics, survey sampling describes the process of selecting a sample of elements
from a target population in order to conduct a survey.
A survey may refer to many different types or techniques of observation, but in the
context of survey sampling it most often refers to a questionnaire used to measure the
characteristics and/or attitudes of people. The purpose of sampling is to reduce the cost
and/or the amount of work that it would take to survey the entire target population. A
survey that measures the entire target population is called a census.
Probability Sampling:
RAHUL GUPTA, MBAHCS (1ST SEM), SUBJECT CODE-MB0024, SET-2 Page
1
STATISTICS FOR MANAGEMENT
In a probability sample (also called "scientific" or "random" sample) each member of the
target population has a known and non-zero probability of inclusion in the sample. A
survey based on a probability sample can in theory produce statistical measurements of
the target population that are:
Unbiased, i.e. the expected value of the sample mean is equal to the population mean,
$E(\bar{x}) = \mu$, and
Have a measurable sampling error, which can be expressed as a confidence
interval, or margin of error.
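As a rough illustration of the margin of error mentioned above, the following Python sketch estimates a mean and its margin of error. It assumes a simple random sample and a normal approximation with z = 1.96 (roughly 95% confidence); the function name and the response data are hypothetical.

```python
import math
import statistics

def mean_with_margin(sample, z=1.96):
    """Return (sample mean, margin of error) under a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * std_err

# Hypothetical survey responses, for illustration only.
responses = [4, 6, 5, 7, 3, 5, 6, 4, 5, 5]
mean, margin = mean_with_margin(responses)
print(f"estimate: {mean:.2f} +/- {margin:.2f}")
```

The confidence interval is then the estimate minus and plus the margin. For small samples a t-based multiplier would normally replace 1.96.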
Non-Probability Sampling:
Many surveys are not based on a probability sample, but rather on finding a suitable
collection of respondents to complete the survey. Some common examples of
non-probability sampling are:
In non-probability samples the relationship between the target population and the survey
sample is immeasurable and potential bias is unknowable. Sophisticated users of non-
probability survey samples tend to view the survey as an experimental condition, rather
than a tool for population measurement, and examine the results for internally consistent
relationships.
Sampling Methods:
Random sampling is the purest form of probability sampling. Each member of the
population has an equal and known chance of being selected. When there are very large
populations, it is often difficult or impossible to identify every member of the population,
so the pool of available subjects becomes biased.
Systematic sampling is often used instead of random sampling. It is also called an Nth
name selection technique. After the required sample size has been calculated, every Nth
record is selected from a list of population members. As long as the list does not contain
any hidden order, this sampling method is as good as the random sampling method. Its
only advantage over the random sampling technique is simplicity.
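The two selection rules described above can be sketched in Python as follows. This is a minimal sketch: the population list, the function names and the random starting point used for systematic sampling are assumptions made for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every k-th record after a random start, with k = len(population) // n."""
    k = len(population) // n
    if k < 1:
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    start = rng.randrange(k)          # random start within the first interval
    return population[start::k][:n]

# Hypothetical list of population members.
members = [f"member_{i:03d}" for i in range(100)]
print(simple_random_sample(members, 5, seed=1))
print(systematic_sample(members, 5, seed=1))
```

If the list has a hidden periodic order that lines up with the interval k, the systematic sample can be badly unrepresentative, which is the caveat noted in the text.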
to select a sufficient number of subjects from each stratum. "Sufficient" refers to a sample
size large enough for us to be reasonably confident that the stratum represents the
population.
Snowball sampling is a special non-probability method used when the desired sample
characteristic is rare. It may be extremely difficult or cost prohibitive to locate
respondents in these situations. Snowball sampling relies on referrals from initial
subjects to generate additional subjects.
Correlation:
[Figure: several sets of (x, y) points, with the correlation coefficient of x and y for each set. The
correlation reflects the noisiness and direction of a linear relationship (top row),
but not the slope of that relationship (middle), nor many aspects of nonlinear
relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the
correlation coefficient is undefined because the variance of Y is zero.]
In statistics, correlation (often measured as a correlation coefficient, $\rho$) indicates the strength and
direction of a relationship between two random variables. The commonest use refers to a
linear relationship, but the concept of nonlinear correlation is also used. In general
statistical usage, correlation or co-relation refers to the departure of two random
variables from independence. In this broad sense there are several coefficients,
measuring the degree of correlation, adapted to the nature of the data.
Pearson's product-moment coefficient:
A number of different coefficients are used for different situations. The best known is the
Pearson product-moment correlation coefficient, which is obtained by dividing the
covariance of the two variables by the product of their standard deviations. Karl Pearson
developed the coefficient from a similar but slightly different idea by Francis Galton.
Regression analysis:
In statistics, regression analysis includes any techniques for modeling and analyzing
several variables, when the focus is on the relationship between a dependent variable and
one or more independent variables. More specifically, regression analysis helps us
understand how the typical value of the dependent variable changes when any one of the
independent variables is varied, while the other independent variables are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the
dependent variable given the independent variables that is, the average value of the
dependent variable when the independent variables are held fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the
dependent variable given the independent variables. In all cases, the estimation target is a
function of the independent variables called the regression function. In regression
analysis, it is also of interest to characterize the variation of the dependent variable
around the regression function, which can be described by a probability distribution.
Mathematical properties:
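The correlation coefficient $\rho_{X,Y}$ between two random variables X and Y with expected
values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y},$$
where E is the expected value operator and cov means covariance. A widely used
alternative notation is
$$\operatorname{corr}(X,Y) = \rho_{X,Y}.$$
Since $\mu_X = E(X)$, $\sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$ and likewise for Y, and since
$$E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y),$$
we may also write
$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}.$$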
The correlation is defined only if both of the standard deviations are finite and both of
them are nonzero. It is a corollary of the Cauchy-Schwarz inequality that the correlation
cannot exceed 1 in absolute value.
If the variables are independent then the correlation is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two
variables. Here is an example: Suppose the random variable X is uniformly distributed on
the interval from −1 to 1, and $Y = X^2$. Then Y is completely determined by X, so that X
and Y are dependent, but their correlation is zero; they are uncorrelated. However, in the
special case when X and Y are jointly normal, uncorrelatedness is equivalent to
independence.
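The example of a variable and its square can be checked numerically. The sketch below replaces the continuous uniform distribution with an evenly spaced symmetric grid on the interval from −1 to 1 (an assumption made so that the arithmetic is exact) and computes the covariance with rational numbers.

```python
from fractions import Fraction

def covariance(xs, ys):
    """Population covariance computed exactly with rational arithmetic."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Symmetric grid standing in for "uniform on [-1, 1]": -1, -0.9, ..., 0.9, 1.
xs = [Fraction(k, 10) for k in range(-10, 11)]
ys = [x * x for x in xs]      # Y is completely determined by X

cov_xy = covariance(xs, ys)
print(cov_xy)                 # a zero covariance means a zero correlation
```

The covariance vanishes because the positive and negative halves of the grid cancel, even though knowing x fixes y completely.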
Sample correlation:
If we have a series of n measurements of X and Y written as $x_i$ and $y_i$ where i = 1, 2, ...,
n, then the Pearson product-moment correlation coefficient can be used to estimate the
correlation of X and Y. The Pearson coefficient is also known as the "sample correlation
coefficient". The Pearson correlation coefficient is then the best estimate of the
correlation of X and Y. The Pearson correlation coefficient is written:
$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, $s_x$ and $s_y$ are the sample standard
deviations of X and Y and the sum is from i = 1 to n. As with the population correlation,
we may rewrite this as
$$r_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)\, s_x s_y} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\,\sqrt{n\sum y_i^2 - (\sum y_i)^2}}.$$
Again, as is true with the population correlation, the absolute value of the sample
correlation must be less than or equal to 1. The above formula conveniently suggests a
single-pass algorithm for calculating sample correlations, but, depending on the numbers
involved, it can sometimes be numerically unstable.
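To make the contrast concrete, here is a small Python sketch of both ways of computing the sample correlation: a two-pass version based on deviations from the sample means, and a single-pass version based on running sums of x, y, x², y² and xy. The function names and the data are hypothetical.

```python
import math

def pearson_two_pass(xs, ys):
    """Sample correlation from deviations about the sample means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_single_pass(xs, ys):
    """Sample correlation from running sums gathered in one pass over the data."""
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, y in zip(xs, ys):
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson_two_pass(xs, ys), pearson_single_pass(xs, ys))
```

With floating-point data whose values are large compared with their spread, the subtractions in the single-pass version cancel most significant digits, which is the source of the instability; the two-pass version avoids this at the cost of reading the data twice.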
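The square of the sample correlation coefficient, which is also known as the coefficient
of determination, is the fraction of the variance in $y_i$ that is accounted for by a linear fit of
$x_i$ to $y_i$. This is written
$$r_{xy}^2 = 1 - \frac{s_{y|x}^2}{s_y^2},$$
where $s_{y|x}^2$ is the square of the error of a linear regression of $x_i$ on $y_i$ by the equation y = a
+ bx:
$$s_{y|x}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - a - b x_i)^2,$$
and $s_y^2$ is the variance of y:
$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2.$$
Note that since the sample correlation coefficient is symmetric in $x_i$ and $y_i$, we will get
the same value for a fit of $y_i$ to $x_i$:
$$r_{xy}^2 = 1 - \frac{s_{x|y}^2}{s_x^2}.$$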
This equation also gives an intuitive idea of the correlation coefficient for higher
dimensions. Just as the sample correlation coefficient described above is the fraction of
variance accounted for by the fit of a 1-dimensional linear submanifold to a set of
2-dimensional vectors $(x_i, y_i)$, so we can define a correlation coefficient for a fit of an
m-dimensional linear submanifold to a set of n-dimensional vectors. For example, if we fit
a plane z = a + bx + cy to a set of data $(x_i, y_i, z_i)$ then the correlation coefficient of z to
x and y is
$$r^2 = 1 - \frac{s_{z|xy}^2}{s_z^2}.$$
The distribution of the correlation coefficient has been examined by R. A. Fisher and A.
K. Gayen.
Geometric interpretation:
For centered data (i.e., data which have been shifted by the sample mean so as to have an
average of zero), the correlation coefficient can also be viewed as the cosine of the angle
between the two vectors of samples drawn from the two random variables.
As an example, suppose five countries are found to have gross national products of 1, 2,
3, 5, and 8 billion dollars, respectively. Suppose these same five countries (in the same
order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be
ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8) and y = (0.11,
0.12, 0.13, 0.15, 0.18).
By the usual procedure for finding the angle between two vectors (see dot product), the
uncentered correlation coefficient is:
$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{2.93}{\sqrt{103}\,\sqrt{0.0983}} \approx 0.9208.$$
Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 +
0.01 x. The Pearson correlation coefficient must therefore be exactly one. Centering the
data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2,
4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which
$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{0.308}{\sqrt{30.8}\,\sqrt{0.00308}} = 1 = \rho_{xy},$$
as expected.
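The geometric reading can be reproduced with a few lines of Python using the five-country data given above. The helper names are hypothetical; the only ingredients are the dot product and the vector norms.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centered(v):
    """Shift a vector by its mean so that its average is zero."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

x = [1, 2, 3, 5, 8]                   # gross national product, billions of dollars
y = [0.11, 0.12, 0.13, 0.15, 0.18]    # poverty rate

uncentered = cosine(x, y)
pearson = cosine(centered(x), centered(y))
print(round(uncentered, 4), round(pearson, 4))
```

Centering is what turns the plain cosine into the Pearson coefficient; without it the value also reflects how far the data sit from the origin.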
Then, the equation of the least-squares line can be derived to be of the form:
As we go from each pair to the next pair x increases, and so does y. This relationship is
perfect, in the sense that an increase in x is always accompanied by an increase in y. This
means that we have a perfect rank correlation, and both Spearman's and Kendall's
correlation coefficients are 1, whereas in this example Pearson's product moment
correlation coefficient is 0.456, indicating that the points are far from lying on a straight
line. In the same way if y always decreases when x increases, the rank correlation
coefficients will be −1, while the product moment correlation coefficient may or may not
be close to −1, depending on how close the points are to a straight line. Although in the
extreme cases of perfect rank correlation the two coefficients are both equal (being both
+1 or both −1) this is not in general so, and values of the two coefficients cannot
meaningfully be compared. For example, for the three pairs (1, 1) (2, 3) (3, 2)
Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
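The two rank coefficients quoted for the three pairs can be verified with the short Python sketch below. It assumes there are no tied values, uses the usual rank-difference formula for Spearman's coefficient and the concordant-minus-discordant count for Kendall's coefficient; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def ranks(values):
    """Rank values from 1 to n; assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - Fraction(6 * d2, n * (n * n - 1))

def kendall(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return Fraction(concordant - discordant, n * (n - 1) // 2)

xs = [1, 2, 3]
ys = [1, 3, 2]
print(spearman(xs, ys), kendall(xs, ys))   # 1/2 1/3
```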
The Pearson correlation coefficient indicates the strength of a linear relationship between
two variables, but its value generally does not completely characterize their relationship.
In particular, if the conditional mean of Y given X, denoted E (Y|X), is not linear in X, the
correlation coefficient will not fully determine the form of E (Y|X).
The image on the right shows scatter plots of Anscombe's quartet, a set of four different
pairs of variables created by Francis Anscombe. The four y variables have the same mean
(7.5), variance (4.12), correlation (0.816) and regression line (y = 3 + 0.5x).
However, as can be seen on the plots, the distribution of the variables is very different.
The first one (top left) seems to be distributed normally, and corresponds to what one
would expect when considering two variables correlated and following the assumption of
normality. The second one (top right) is not distributed normally; while an obvious
relationship between the two variables can be observed, it is not linear, and the Pearson
correlation coefficient is not relevant. In the third case (bottom left), the linear
relationship is perfect, except for one outlier which exerts enough influence to lower the
correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows
another example when one outlier is enough to produce a high correlation coefficient,
even though the relationship between the two variables is not linear.
where $E(X)$ and $E(Y)$ are the expected values of X and Y, respectively, and $\sigma_X$ and $\sigma_Y$ are the
standard deviations of X and Y, respectively.
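Regression equation of Y on X:
$Y - 158.8 = b_{yx}(X - 25.25)$, where $b_{yx} = \dfrac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}$
$b_{yx} = \dfrac{8 \times 130 - (-7)(7)}{8 \times 955 - (-7)^2}$
$b_{yx} = \dfrac{1040 + 49}{7640 - 49}$
$b_{yx} = \dfrac{1089}{7591}$
$b_{yx} \approx 0.1435$
$Y - 158.8 = 0.1435(X - 25.25)$
$Y = 0.1435X + 155.1766$
Regression equation of X on Y:
$X - 25.25 = b_{xy}(Y - 158.8)$, where $b_{xy} = \dfrac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}$
$b_{xy} = \dfrac{8 \times 130 - (-7)(7)}{8 \times 3701 - (7)^2}$
$b_{xy} = \dfrac{1040 + 49}{29608 - 49}$
$b_{xy} = \dfrac{1089}{29559}$
$b_{xy} \approx 0.0368$
$X - 25.25 = 0.0368(Y - 158.8)$
$X = 0.0368Y + 19.4062$
Regression equation of Y on X:
$Y = 0.1435X + 155.1766$
Regression equation of X on Y:
$X = 0.0368Y + 19.4062$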
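The deviation-sum procedure used in this answer can be sketched in Python as follows. The data set and the assumed means below are hypothetical, since the original observations are not reproduced here; the sketch only shows how the two regression coefficients follow from the sums of the deviations, their squares and their products.

```python
def regression_lines(xs, ys, assumed_x, assumed_y):
    """Return ((b_yx, a_yx), (b_xy, a_xy)) using deviations from assumed means.

    Y on X:  Y = a_yx + b_yx * X
    X on Y:  X = a_xy + b_xy * Y
    """
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)

    numerator = n * sum_dxdy - sum_dx * sum_dy
    b_yx = numerator / (n * sum_dx2 - sum_dx ** 2)
    b_xy = numerator / (n * sum_dy2 - sum_dy ** 2)

    # Both lines pass through the point of actual means.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return (b_yx, mean_y - b_yx * mean_x), (b_xy, mean_x - b_xy * mean_y)

# Hypothetical observations and assumed means, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
(b_yx, a_yx), (b_xy, a_xy) = regression_lines(xs, ys, assumed_x=2, assumed_y=5)
print(f"Y = {b_yx:.2f}X + {a_yx:.2f}")
print(f"X = {b_xy:.2f}Y + {a_xy:.2f}")
```

The coefficients do not depend on which assumed means are chosen, because shifting the origin changes the numerator and the denominators by terms that cancel; the assumed means only make the hand arithmetic smaller.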
Introduction:
Business forecasting has always been one component of running an enterprise.
However, forecasting traditionally was based less on concrete and comprehensive data
than on face-to-face meetings and common sense. In recent years, business forecasting
has developed into a much more scientific endeavor, with a host of theories, methods,
and techniques designed for forecasting certain types of data. The development of
information technologies and the Internet propelled this development into overdrive, as
companies not only adopted such technologies into their business practices, but into
forecasting schemes as well. In the 2000s, projecting the optimal levels of goods to buy
or products to produce involved sophisticated software and electronic networks that
incorporate mounds of data and advanced mathematical algorithms tailored to a
company's particular market conditions and line of business. Business forecasting
involves a wide range of tools, including simple electronic spreadsheets; enterprise
resource planning (ERP) and electronic data interchange (EDI) networks, advanced
supply chain management systems, and other Web-enabled technologies. The practice
attempts to pinpoint key factors in business production and extrapolate from given data
sets to produce accurate projections for future costs, revenues, and opportunities. This
normally is done with an eye toward adjusting current and near-future business practices
to take maximum advantage of expectations.
In the Internet age, the field of business forecasting was propelled by three
interrelated phenomena. First, the Internet provided a new series of tools to aid the
science of business forecasting. Second, business forecasting had to take the Internet
itself into account in trying to construct viable models and make predictions. Finally, the
Internet fostered vastly accelerated transformations in all areas of business that made the
job of business forecasters that much more exacting. By the 2000s, as the Internet and its
myriad functions highlighted the central importance of information in economic activity,
more and more companies came to recognize the value, and often the necessity, of
business forecasting techniques and systems. Business forecasting is indeed big business,
with companies investing tremendous resources in systems, time, and employees aimed
at bringing useful projections into the planning process. According to a survey by the
Hudson, Ohio-based AnswerThink Consulting Group, which specializes in studies of
business planning, the average U.S. company spends more than 25,000 person-days on
business forecasting and related activities for every billion dollars of revenue.
Forecasting systems draw on several sources for their forecasting input, including
databases, e-mails, documents, and Web sites. After processing data from various
sources, sophisticated forecasting systems integrate all the necessary data into a single
spreadsheet, which the company can then manipulate by entering various projections,
such as different estimates of future sales, that the system will incorporate into a new
readout.
a supply hierarchy, a geography hierarchy, and so on. To return a useful forecast, the
system can't simply allocate down each hierarchy separately, but must account for the
ways in which those dimensions interact with each other. Moreover, the degree of this
interaction varies according to the type of business in which a company is engaged. Thus,
businesses need to fine-tune their allocation algorithms in order to receive useful
forecasts.
The second forecasting model is cause-and-effect. In this model, one assumes a cause, or
driver of activity, that determines an outcome. For instance, a company may assume that,
for a particular data set, the cause is an investment in information technology, and the
effect is sales. This model requires the historical data not only of the factor with which
one is concerned (in this case, sales), but also of that factor's determined cause (here,
information technology expenditures). It is assumed, of course, that the cause-and-effect
relationship is relatively stable and easily quantifiable.
The third primary forecasting model is known as the judgmental model. In this case, one
attempts to produce a forecast where there is no useful historical data. A company might
choose to use the judgmental model when it attempts to project sales for a brand new
product, or when market conditions have qualitatively changed, rendering previous data
obsolete. In addition, according to the Journal of Business Forecasting Methods &
Systems, this model is useful when the bulk of sales derive only from a relative handful
of customers. To proceed in the absence of historical data, alternative data is collected by
way of experts in the field, prospective customers, trade groups, business partners, or any
other relevant source of information. Business forecasting systems often work hand-in-
hand with supply chain management systems. In such systems, all partners in the supply
chain can electronically oversee all movement of components within that supply chain
and gear the chain toward maximum efficiency.
The Internet has proven to be a panacea in this field, and business forecasting systems
allow partners to project the optimal flow of components into the future so that
companies can try to meet optimal levels rather than continually catch up to them.
Topics
2. Judgmental methods:
Judgmental forecasting methods incorporate intuitive judgments, opinions and subjective
probability estimates.
Composite forecasts
Surveys
Delphi method
Scenario building
Technology forecasting
Forecast by analogy
3. Other methods:
Simulation
Prediction market
Probabilistic forecasting and Ensemble forecasting
Reference class forecasting
4. Forecasting accuracy:
The forecast error is the difference between the actual value and the forecast value for the
corresponding period:
$$E_t = Y_t - F_t$$
where $E_t$ is the forecast error at period t, $Y_t$ is the actual value at period t, and $F_t$ is the
forecast for period t.
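A minimal Python sketch of this definition is given below. The sales figures are hypothetical, and the mean absolute error at the end is one common way of summarizing the errors that is added here as an assumption; it is not part of the definition above.

```python
def forecast_errors(actual, forecast):
    """Forecast error per period: actual value minus forecast value."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must cover the same periods")
    return [y - f for y, f in zip(actual, forecast)]

# Hypothetical actual and forecast sales for four periods.
actual = [120, 135, 128, 140]
forecast = [115, 138, 130, 136]

errors = forecast_errors(actual, forecast)
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
print(errors, mean_abs_error)
```

A positive error means the forecast was too low for that period, and a negative error means it was too high.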
One of the most essential elements of being a high-performing manager is the ability to
lead effectively one's own life, then to model those leadership skills for employees in the
organization. This site comprehensively covers theory and practice of most topics in
forecasting and economics. I believe such a comprehensive approach is necessary to fully
understand the subject. A central objective of the site is to unify the various forms of
business topics to link them closely to each other and to the supporting fields of statistics
and economics. Nevertheless, the topics and coverage do reflect choices about what is
important to understand for business decision making. Almost all managerial decisions
are based on forecasts. Every decision becomes operational at some point in the future,
so it should be based on forecasts of future conditions. Forecasts are needed throughout
an organization -- and they should certainly not be produced by an isolated group of
forecasters. Neither is forecasting ever "finished". Forecasts are needed continually, and
as time moves on, the impact of the forecasts on actual performance is measured; original
forecasts are updated; and decisions are modified, and so on.
For example, many inventory systems cater for uncertain demand. The inventory
parameters in these systems require estimates of the demand and forecast error
distributions. The two stages of these systems, forecasting and inventory control, are
often examined independently. Most studies tend to look at demand forecasting as if this
were an end in itself or at stock control models as if there were no preceding stages of
computation. Nevertheless, it is important to understand the interaction between demand
forecasting and inventory control since this influences the performance of the inventory
system. This integrated process is shown in the following figure:
3. What is a System? Systems are formed with parts put together in a particular
manner in order to pursue an objective. The relationship between the parts
determines what the system does and how it functions as a whole. Therefore, the
relationships in a system are often more important than the individual parts. In
general, systems that are building blocks for other systems are called subsystems.
4. The Dynamics of a System: A system that does not change is a static system.
Many business systems are dynamic systems, which means their states
change over time. We refer to the way a system changes over time as the system's
behavior. And when the system's development follows a typical pattern, we say
the system has a behavior pattern. Whether a system is static or dynamic depends
on which time horizon you choose and on which variables you concentrate. The
time horizon is the time period within which you study the system. The variables
are changeable values on the system.
5. Resources: Resources are the constant elements that do not change during the
time horizon of the forecast. Resources are the factors that define the decision
problem. Strategic decisions usually have longer time horizons than both the
Tactical and the Operational decisions.
6. Forecasts: Forecast inputs come from the decision maker's environment.
Uncontrollable inputs must be forecasted or predicted.
7. Decisions: Decision inputs are the known collection of all possible courses of
action you might take.
8. Interaction: Interactions among the above decision components are the logical,
mathematical functions representing the cause-and-effect relationships among
inputs, resources, forecasts, and the outcome.
Interactions are the most important type of relationship involved in the decision-
making process. When the outcome of a decision depends on the course of action,
we change one or more aspects of the problematic situation with the intention of
bringing about a desirable change in some other aspect of it. We succeed if we
have knowledge about the interaction among the components of the problem.
There may also be sets of constraints which apply to each of these components.
Therefore, they do not need to be treated separately.
9. Actions: Action is the ultimate decision and is the best course of strategy to
achieve the desirable goal.
The forecast for time period t + 1 is the forecast for all future time periods. However, this
forecast is revised only when new data becomes available. You may like using
Forecasting by Smoothing JavaScript, and then performing some numerical
experimentation for a deeper understanding of these concepts.
Where the weights are any positive numbers such that $w_1 + w_2 + w_3 = 1$. A typical
choice of weights for this example is $w_1 = 3/(1 + 2 + 3) = 3/6$, $w_2 = 2/6$, and $w_3 = 1/6$.
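The weighting scheme can be illustrated with a short Python sketch. The demand figures are hypothetical, and the sketch assumes that the largest weight is applied to the most recent observation, which is the usual convention for weights of this form.

```python
def moving_average(series, order):
    """Forecast for the next period: plain mean of the last `order` values."""
    if len(series) < order:
        raise ValueError("not enough observations")
    return sum(series[-order:]) / order

def weighted_moving_average(series, weights):
    """Forecast for the next period; weights[0] applies to the latest value."""
    if len(series) < len(weights):
        raise ValueError("not enough observations")
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1")
    recent_first = series[::-1][:len(weights)]
    return sum(w * v for w, v in zip(weights, recent_first))

# Hypothetical demand for the last three periods, oldest first.
demand = [20, 24, 27]
weights = [3 / 6, 2 / 6, 1 / 6]

ma = moving_average(demand, 3)
wma = weighted_moving_average(demand, weights)
print(round(ma, 2), round(wma, 2))
```

Because the weights favor recent periods, the weighted forecast reacts faster to a rising or falling series than the plain moving average does.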
An illustrative numerical example: The moving average and weighted moving average of
order five are calculated in the following table.
Moving Averages with Trends: Any method of time series analysis involves a different
degree of model complexity and presumes a different level of comprehension about the
underlying trend of the time series. In many business time series, the trend in the
smoothed series using the usual moving average method indicates evolving changes in
the series level to be highly nonlinear.
In order to capture the trend, we may use the Moving-Average with Trend (MAT)
method. The MAT method uses an adaptive linearization of the trend by means of
incorporating a combination of the local slopes of both the original and the smoothed
time series.
In making a forecast, it is also important to provide a measure of how accurate one can
expect the forecast to be. The statistical analysis of the error terms, known as the residual
time series, provides a measurement tool and a decision process for model selection. In
applying the MAT method, sensitivity analysis is needed to determine the optimal value of
the moving average parameter n, i.e., the optimal number of periods. The error time
series allows us to study many of its statistical properties for goodness-of-fit decisions.
Therefore it is important to evaluate the nature of the forecast error by using the
appropriate statistical tests. The forecast error should be a random variable distributed
normally with mean close to zero and a constant variance across time.
For computer implementation of the Moving Average with Trend (MAT) method one
may use the forecasting (FC) module of WinQSB which is commercial grade stand-alone
software. WinQSB's approach is to first select the model and then enter the parameters
and the data. With the Help features in WinQSB there is no learning curve; one just needs
a few minutes to master its useful features.
Answer
Introduction:
Statistics is considered by some to be a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and presentation of data, while others
consider it to be a branch of mathematics concerned with collecting and interpreting data.
Statisticians improve the quality of data with the design of experiments and survey
sampling. Statistics also provides tools for prediction and forecasting using data and
statistical models. Statistics is applicable to a wide variety of academic disciplines,
including natural and social sciences, government, and business.
data are gathered and correlations between predictors and response are investigated. An
example of an observational study is one that explores the correlation between smoking
and lung cancer. This type of study typically uses a survey to collect observations about
the area of interest and then performs statistical analysis. In this case, the researchers
would collect observations of both smokers and non-smokers, perhaps through a case-
control study, and then look for the number of cases of lung cancer in each group.
Levels of measurement:
There are four types of measurements or levels of measurement or measurement scales
used in statistics:
Nominal.
Ordinal.
Interval.
Ratio.
Characteristics of Statistics:
Some of its important characteristics are given below:
(2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of
study.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and
graphic form for an easy and clear comprehension of the data.
(6) Statistics helps in drawing valid inference, along with a measure of their reliability
about the population parameters from the sample data.
Limitations of Statistics:
The important limitations of statistics are:
(1) Statistical laws are true on average. Statistics are aggregates of facts, so a single
observation is not a statistic; statistics deals with groups and aggregates only.
(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.
(5) Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.
(6) Some errors are possible in statistical decisions. Particularly the inferential statistics
involves certain errors. We do not know whether an error has been committed or not.
Introduction:
Statistical surveys are used to collect quantitative information about items in a
population. Surveys of human populations and institutions are common in political
polling and government, health, social science and marketing research. A survey may
focus on opinions or factual information depending on its purpose, and many surveys
involve administering questions to individuals. When the questions are administered by a
researcher, the survey is called a structured interview or a researcher-administered
survey. When the questions are administered by the respondent, the survey is referred to
as a questionnaire or a self-administered survey.
Serial surveys:
Serial surveys are those which repeat the same questions at different points in time,
producing time-series data. They typically fall into two types:
Cross-sectional surveys which draw a new sample each time. In a sense any one-
off survey will also be cross-sectional.
Longitudinal surveys where the sample from the initial survey is re-contacted at a
later date to be asked the same questions.
Advantages:
It is an efficient way of collecting information from a large number of
respondents. Very large samples are possible. Statistical techniques can be used to
determine validity, reliability, and statistical significance.
Surveys are flexible in the sense that a wide range of information can be
collected. They can be used to study attitudes, values, beliefs, and past behaviors.
Because they are standardized, they are relatively free from several types of
errors.
They are relatively easy to administer.
There is an economy in data collection due to the focus provided by standardized
questions. Only questions of interest to the researcher are asked, recorded,
codified, and analyzed. Time and money are not spent on tangential questions.
Cheaper to run.
Disadvantages:
They depend on subjects' motivation, honesty, memory, and ability to respond.
Subjects may not be aware of their reasons for any given action. They may have
forgotten their reasons. They may not be motivated to give accurate answers; in
fact, they may be motivated to give answers that present themselves in a
favorable light.
Structured surveys, particularly those with closed ended questions, may have low
validity when researching affective variables.
Although the chosen survey individuals are often a random sample, errors due to
nonresponse may exist. That is, people who choose to respond to the survey may
be different from those who do not respond, thus biasing the estimates.
Survey question answer-choices could lead to vague data sets because at times
they are relative only to a personal abstract notion concerning "strength of
choice". For instance the choice "moderately agree" may mean different things to
different subjects, and to anyone interpreting the data for correlation. Even yes or
no answers are problematic because subjects may for instance put "no" if the
choice "only once" is not available.
4. Whether to use data collected from primary or secondary source should be determined
in advance.
5. The organization of the investigation is the final step in the process. It encompasses the
determination of the number of investigators required, their training, the supervision work
needed, and the funds required.
a. Telephone:
Use of interviewers encourages sample persons to respond, leading to higher
response rates.
Interviewers can increase comprehension of questions by answering respondents'
questions.
Fairly cost efficient, depending on local call charge structure.
Good for large national (or international) sampling frames.
Some potential for interviewer bias (e.g. some people may be more willing to
discuss a sensitive issue with a female interviewer than with a male one).
Cannot be used for non-audio information (graphics, demonstrations, taste/smell
samples).
Unreliable for consumer surveys in rural areas where telephone penetration is
low.
Three types:
o traditional telephone interviews
o computer assisted telephone dialing
o computer assisted telephone interviewing (CATI)
b. Mail:
The questionnaire may be handed to the respondents or mailed to them, but in all
cases it is returned to the researcher via mail.
Cost is very low, since bulk postage is cheap in most countries.
Long time delays, often several months, before the surveys are returned and
statistical analysis can begin.
Not suitable for issues that may require clarification.
Respondents can answer at their own convenience (allowing them to break up
long surveys; also useful if they need to check records to answer a question).
No interviewer bias introduced.
Large amount of information can be obtained: some mail surveys are as long as
50 pages.
Response rates can be improved by using mail panels:
o Members of the panel have agreed to participate.
o Panels can be used in longitudinal designs where the same respondents are
surveyed several times.
c. Online surveys:
Can use web or e-mail.
Web is preferred over e-mail because interactive HTML forms can be used.
Often inexpensive to administer.
Very fast results.
Easy to modify.
Response rates can be improved by using online panels - members of the panel
have agreed to participate.
If not password-protected, easy to manipulate by completing multiple times to
skew results.
Data creation, manipulation and reporting can be automated and/or easily
exported into a format which can be read by PSPP, DAP or other statistical
analysis software.
Data sets created in real time.
Some are incentive based (such as Survey Vault or YouGov).
May skew sample towards a younger demographic compared with CATI.
Often difficult to determine/control selection probabilities, hindering quantitative
analysis of data.
Use in large scale industries.
Emotional appeals.
Bids for sympathy.
Convince respondent that they can make a difference.
Guarantee anonymity.
Legal compulsion (certain government-run surveys).
Collected data in raw form would be voluminous and not comprehensible. Therefore it
should be condensed and simplified for better understanding and usefulness.
Classification is the first stage in simplification. It can be defined as a systematic
grouping of the units according to their common characteristics. Each of these groups is
called a class. For example, in a survey of the industrial workers of a particular industry,
workers can be classified as unskilled, semiskilled and skilled, each of which forms a class.
Types of classification:
The very important types are:
Methods of Classification:
Classification done according to a single attribute or variable is known as one-way
classification.
Classification done according to two attributes or variables is known as two-way
classification.
Classification done according to more than two attributes or variables is known as
manifold classification.
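The three kinds of classification can be sketched in Python by counting records under one, two, or more attributes. The worker records below are made-up, and the choice of attributes (skill level, gender, shift) is an assumption that extends the industrial workers example.

```python
from collections import Counter

# Hypothetical survey records: (skill level, gender, shift). Made-up data.
workers = [
    ("skilled", "male", "day"),
    ("unskilled", "female", "day"),
    ("semiskilled", "male", "night"),
    ("skilled", "female", "night"),
    ("unskilled", "male", "day"),
    ("skilled", "male", "day"),
]

# One-way classification: group by a single attribute (skill level).
one_way = Counter(skill for skill, _, _ in workers)

# Two-way classification: group by two attributes (skill level and gender).
two_way = Counter((skill, gender) for skill, gender, _ in workers)

# Manifold classification: group by more than two attributes.
manifold = Counter(workers)

print(one_way)
print(two_way)
print(manifold)
```

Each key of a counter is one class, and its count is the number of units falling in that class.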
Examples:
Where the feature vector input is $\vec{x}$, and the function $f$ is typically parameterized
by some parameters $\vec{\theta}$. In the Bayesian approach to this problem, instead of
choosing a single parameter vector $\vec{\theta}$, the result is integrated over all possible
thetas, with the thetas weighted by how likely they are given the training data $D$:
$$P(\text{class} \mid \vec{x}) = \int f(\vec{x}; \vec{\theta}) \, P(\vec{\theta} \mid D) \, d\vec{\theta}$$
The third problem is related to the second, but the problem is to estimate the
class-conditional probabilities $P(\vec{x} \mid \text{class})$ and then use Bayes' rule to
produce the class probability as in the second problem.
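The idea of weighting each parameter value by how likely it is given the training data can be sketched numerically. Everything specific here is an assumption made for illustration: a one-parameter logistic model stands in for f, the training data D are made-up, the prior is uniform, and the integral is approximated by a sum over a coarse grid of theta values.

```python
import math


def f(x, theta):
    """Hypothetical model: logistic curve giving P(class = 1 | x) for one theta."""
    return 1.0 / (1.0 + math.exp(-theta * x))


# Made-up training data D: (x, class label) pairs.
D = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 1), (2.0, 1)]

# Grid of candidate theta values from -5 to 5 (assumed range and step).
thetas = [i * 0.25 for i in range(-20, 21)]


def likelihood(theta):
    """Probability of observing the labels in D under the model with this theta."""
    p = 1.0
    for x, label in D:
        prob = f(x, theta)
        p *= prob if label == 1 else 1.0 - prob
    return p


# With a uniform prior, P(theta | D) is the normalized likelihood on the grid.
unnormalized = [likelihood(t) for t in thetas]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]


def predict(x):
    """Approximate P(class = 1 | x) by averaging f(x; theta) over P(theta | D)."""
    return sum(f(x, t) * w for t, w in zip(thetas, posterior))


print(predict(1.0))
print(predict(-1.0))
```

A finer grid or a different prior would change the numbers; the sketch only shows the averaging step, not a recommended estimation method.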
Table:
In relational databases and flat file databases, a table is a set of data elements (values)
that is organized using a model of vertical columns (which are identified by their name)
and horizontal rows. A table has a specified number of columns, but can have any
number of rows. Each row is identified by the values appearing in a particular column
subset which has been identified as a candidate key. Table is another term for relation,
although there is a difference: a table is usually a multi-set (bag) of rows, whereas
a relation is a set and does not allow duplicates. Besides the actual data rows, tables
generally have associated with them some meta-information, such as constraints on the
table or on the values within particular columns. The data in a table does not have to be
physically stored in the database. Views are also relational tables, but their data are
calculated at query time. Another example is nicknames, which represent a pointer to a
table in another database.
Unlike a spreadsheet, the data type of a field is ordinarily defined by the schema describing
the table. Some relational systems are less strict about field data type definitions.
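The difference between a table (a bag of rows) and a relation (a set of rows), and the role of a candidate key, can be sketched in plain Python. The rows and the helper is_candidate_key are hypothetical; the helper checks uniqueness only, whereas a true candidate key must also be minimal.

```python
# Hypothetical table rows: (employee_id, name, department). Made-up data.
rows = [
    (1, "Asha", "Sales"),
    (2, "Ravi", "Finance"),
    (3, "Asha", "Sales"),
    (1, "Asha", "Sales"),  # exact duplicate of the first row
]

# A table behaves like a multi-set (bag): the duplicate row is kept.
table = list(rows)

# A relation is a set: duplicate rows collapse into one.
relation = set(rows)


def is_candidate_key(data, column_indexes):
    """Return True if the chosen columns take a distinct value in every row.

    Checks uniqueness only; minimality is not tested.
    """
    keys = [tuple(row[i] for i in column_indexes) for row in data]
    return len(keys) == len(set(keys))


print(len(table), len(relation))
print(is_candidate_key(relation, [0]))
print(is_candidate_key(relation, [1, 2]))
```

In the relation, the first column identifies every row, while name and department together do not, because two different employees share them.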
Tabulation:
Tabulation follows classification. It is a logical listing of related data in rows and
columns. Objectives of tabulation are:
To simplify complex data.
To highlight important characteristics.
To present data in minimum space.
To facilitate comparison.
To bring out trends and tendencies.
To facilitate further analysis.
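Tabulation of classified data can be sketched as follows. The records are made-up, and build_table is a hypothetical helper that arranges two-way counts in rows and columns and adds totals to make comparison easier.

```python
from collections import Counter

# Hypothetical classified data: (skill level, gender) per worker. Made-up data.
records = [
    ("skilled", "male"), ("skilled", "female"), ("skilled", "male"),
    ("semiskilled", "male"), ("unskilled", "female"), ("unskilled", "male"),
]

counts = Counter(records)
row_labels = ["unskilled", "semiskilled", "skilled"]
col_labels = ["male", "female"]


def build_table(counts, row_labels, col_labels):
    """Return the table as a list of rows, including row and column totals."""
    table = [["Skill level"] + col_labels + ["Total"]]
    for r in row_labels:
        cells = [counts[(r, c)] for c in col_labels]
        table.append([r] + cells + [sum(cells)])
    col_totals = [sum(counts[(r, c)] for r in row_labels) for c in col_labels]
    table.append(["Total"] + col_totals + [sum(col_totals)])
    return table


table = build_table(counts, row_labels, col_labels)
for row in table:
    # Pad every cell to a fixed width so the columns line up.
    print("".join(f"{str(cell):<14}" for cell in row))
```

The totals row and column condense the six raw records into a form where the distribution by skill and by gender can be read at a glance.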