2. Ch 6- The Analysis Sample
3. Ch 7- Analyzing and Manipulating Customer Data
Ch 6- The Analysis Sample
• New direct marketing product or promotional
tests (format, creative, price) begin with
obtaining a sample of names from the database
or list broker. They receive the test promotion,
product or offer and the results are collected
(percentage of responders, payment rate or
• A sample is a subset of customer records and a random
selection from the universe of interest on the direct
marketer’s database. It must be a random and representative
sample of the whole population.
• This is a big part of inferential statistics. We want to estimate
population parameters or test hypotheses about such
• Obtaining information on the entire population is expensive.
• It is impossible to examine every member of the population.
• The only names eliminated from test samples are those in
rollouts, DMA do-not-promote names, fraud or credit risk
accounts, names in cities with strict promotional restrictions
(if there are any), and names promoted for other marketing tests.
• A random sample is one in which every member
of the population is equally likely to be chosen.
• To ensure this, direct marketers use nth selects
• Ex. Out of 10,000,000 names, we test a random
sample of 10,000 names (you pick a random starting
name on the database, then take every 1,000th name
thereafter). This is also called systematic sampling.
• We test in order to:
• Evaluate new product offerings
• Gauge reaction to price changes
• Determine impact of new promotional format change
on response, payment, or conversion rates
• Identify target market for a new product test by
reviewing characteristics of responders
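The nth select described above can be sketched in a few lines. This is a minimal illustration, not a production selection routine: the function name `nth_select`, the `seed` parameter, and the stand-in list of customer IDs are all assumptions made for the example.

```python
import random

def nth_select(names, sample_size, seed=None):
    """Systematic (nth) select: random start, then every nth record."""
    if sample_size <= 0 or sample_size > len(names):
        raise ValueError("sample_size must be between 1 and len(names)")
    step = len(names) // sample_size             # e.g. 10,000,000 // 10,000 = 1,000
    start = random.Random(seed).randrange(step)  # random start inside the first interval
    return names[start::step][:sample_size]

# Hypothetical universe: 100,000 customer IDs standing in for the database.
universe = list(range(1, 100_001))
sample = nth_select(universe, 100, seed=42)
```

The random start is what keeps the select from always picking the same records. If the file is sorted in a way that repeats every nth record, a systematic select can still be skewed, so the sort order of the file is worth checking first.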
Creation of Analysis Sample
• To determine the unique characteristics that distinguish
responders from non-responders, we use a frozen file.
• At the same exact time customers are selected for a
promotion, a special file must be created containing all
customers selected for the promotion along with their
customer data on the database.
• This includes name, address, primary key, all internal RFM
(recency, frequency, and monetary) data elements
(promotion, purchase, payment history and all
• This is a snapshot of each customer’s record, or a frozen
file. This must be done in order to analyze anything.
How to create a frozen file
1. Select the names (including all customer data) to be tested
2. Use a unique key code to mark the names
3. Create a file of the selected names with their
address and customer id information only and send
it to the letter shop for promotion
4. Create a file of the selected names with all
customer data for later analysis (the frozen file)
5. Once the responses come in, update the database records
with response info using the unique key code, and
update the frozen file with the same response information.
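The five steps above can be sketched with plain dictionaries. Everything here is hypothetical (the field names, the key code `TEST-A`, the three customers); the point is only that the frozen file is a copy taken at selection time, so later changes to the live database do not leak into it.

```python
import copy

# Hypothetical marketing database keyed by customer id.
database = {
    101: {"name": "A. Smith", "address": "1 Main St", "orders": 4, "total_paid": 120.0},
    102: {"name": "B. Jones", "address": "2 Oak Ave", "orders": 1, "total_paid": 15.0},
    103: {"name": "C. Lee", "address": "3 Elm Rd", "orders": 7, "total_paid": 310.0},
}

KEY_CODE = "TEST-A"        # hypothetical unique key code for this test
selected_ids = [101, 103]  # step 1: names selected for the test

# Steps 2 and 4: mark each name with the key code and freeze a full copy of its data.
frozen_file = {
    cid: {**copy.deepcopy(database[cid]), "key_code": KEY_CODE, "responded": None}
    for cid in selected_ids
}

# Step 3: the letter shop file carries only id, name and address.
mail_file = [
    {"customer_id": cid, "name": rec["name"], "address": rec["address"], "key_code": KEY_CODE}
    for cid, rec in frozen_file.items()
]

# The live database keeps changing after the selection...
database[101]["orders"] += 1

# Step 5: post responses back to the frozen file using the key code.
responses = [(101, "TEST-A")]  # (customer id, key code) pairs from returned orders
for rec in frozen_file.values():
    rec["responded"] = 0
for cid, key_code in responses:
    if key_code == KEY_CODE and cid in frozen_file:
        frozen_file[cid]["responded"] = 1
```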
• This should be created for all product and list
tests (creative, format, price, product and list).
The files will be needed to determine the target
market definitions for large scale rollouts.
• We want to watch out for skewed results and
see if the names selected were representative of
• What are some reasons it could be skewed?
Methods of Saving point in time
• Saving a snapshot for every test promoted is
• Some companies save an entire copy of their
marketing database regularly (quarterly) and
when analysis is required, they create an
analysis sample on the fly.
• What are some problems with this method?
Saving point in time sample data
• Customer data to be appended to the test
names may not reflect customers’ status near
the time of the promotion.
• Don’t merge a promotion file with a version of
the database reflecting the customers’ status
after the promotion. The results could be misleading.
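As a rough illustration of the quarterly-copy approach, the sketch below picks the latest saved snapshot taken on or before the promotion date and reports how stale it is. The snapshot dates, the promotion date, and the helper name `snapshot_for` are assumptions for the example.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical quarterly copies of the marketing database (sorted by date).
snapshot_dates = [date(2023, 1, 1), date(2023, 4, 1), date(2023, 7, 1), date(2023, 10, 1)]

def snapshot_for(promotion_date, snapshots):
    """Return the latest snapshot taken on or before the promotion date."""
    i = bisect_right(snapshots, promotion_date)
    if i == 0:
        raise ValueError("no snapshot precedes the promotion date")
    return snapshots[i - 1]

promotion_date = date(2023, 5, 15)
chosen = snapshot_for(promotion_date, snapshot_dates)
staleness_days = (promotion_date - chosen).days  # how out of date the appended data is
```

Choosing a snapshot from before the promotion avoids appending post-promotion status, but the staleness figure shows the cost: the appended data can be weeks older than the selection itself.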
Analysis and validation samples
• Before any analysis is done, the sample is split into
two. The analysis is done on two thirds of the sample
and the results validated/calibrated on the remaining
one third of the sample.
• This is used to minimize error variance.
• Ex. We can use the validation sample to check
correlation between variables.
• Based on the results of a frozen file, you can select
names from the entire database meeting your
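The two-thirds / one-third split described in this section can be sketched as follows. The function name `split_sample`, the fixed seed, and the list of 9,000 record IDs are assumptions for illustration only.

```python
import random

def split_sample(records, analysis_share=2 / 3, seed=0):
    """Randomly split records into an analysis part and a validation part."""
    shuffled = records[:]                  # copy so the frozen file order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * analysis_share)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical frozen file of 9,000 customer IDs.
frozen_ids = list(range(9000))
analysis, validation = split_sample(frozen_ids)
```

Shuffling before cutting matters: splitting a sorted file at the two-thirds mark would put systematically different customers in each part.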
Ch 7- Analyzing and Manipulating Customer Data
• Getting to know your data is very important.
• Data dictionaries are a great way to get started
when you take up a position as a database
Good questions to ask
• Does a partial payment for a product order
update the Total Products Paid field? Does it do
so for all product lines?
• If a customer cancels immediately, is the order in
the Total Products Paid field eliminated or
• Are you querying with the right specifications?
Ways to analyze frozen files
• Univariate tabulations
• Cross tabulations
• Logic counter variables
• Ratio variables
• Longitudinal variables
Univariate tabulations
• This is the most common form of analysis for
determining a select, building a target model, or
segmenting a customer file.
• This involves looking at one variable.
Rule of thumb: There should be at
least 500 names per category.
Response Rate: Number of
Index to total: (Group response
rate / entire sample response rate) × 100
What does 175 mean? Promoting young people under
30 will give a response rate 75%
higher than if you had gone with
all the groups. This is called gain.
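A small worked example of the index to total, using made-up counts (the age groups, promoted counts, and responder counts below are hypothetical and chosen so that the under-30 group indexes at 175):

```python
# Hypothetical univariate tabulation: group -> (names promoted, responders).
groups = {
    "under 30": (2000, 140),
    "30-49": (4000, 160),
    "50+": (4000, 100),
}

total_promoted = sum(n for n, _ in groups.values())
total_responders = sum(r for _, r in groups.values())

index_to_total = {}
for name, (promoted, responders) in groups.items():
    # (group response rate / entire sample response rate) * 100,
    # rearranged so the division happens once.
    index_to_total[name] = 100 * responders * total_promoted / (promoted * total_responders)

# Rule of thumb from the notes: at least 500 names per category.
too_small = [name for name, (promoted, _) in groups.items() if promoted < 500]
```

An index of 100 means the group responds at the same rate as the whole sample; values above 100 are gains and values below 100 are losses relative to promoting everyone.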
• What is it that you want to measure? Assuming a
breakeven of 3% you can take a look at the
response rates to see who to promote.
• Selection criteria depend on the objective. We can
lose money to gain more customers.
• The response rates increase as orders do. This
may not always be the case. Some may reflect a
slight bouncing of the response rates but there
should be a pattern. Otherwise must proceed
Now across all lines of Total
number of promotions ever, which
group will profitably promote?
Is a 3% response rate
enough? We need key
info like how long the
customer has been on
file and how many orders
have been generated.
Never use a counter
variable on its own.
Cross tabulations are the key
• This highlights the interrelationships between
two or more variables. It can take a data element
that is weak as a predictor on its own and turn it
into a more powerful predictor.
By cross tabulating, in
addition to the 1,409
count, we can add names
from the 11-15 promotions
category (assuming a 3%
breakeven): 1,409 + 462
+ 956 = 2,827, or 28.27%.
Big change from 14.09%.
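To show why a cross tabulation can rescue names that a univariate view would drop, here is a minimal sketch with invented cell counts (they are not the figures from the slide). The second variable, number of orders, and the 3% breakeven are assumptions for the example.

```python
BREAKEVEN = 0.03

# Hypothetical cross tab: (total promotions, orders) -> (names promoted, responders).
cells = {
    ("1-10", "0-1"): (1000, 35),
    ("1-10", "2+"): (1000, 60),
    ("11-15", "0-1"): (1500, 15),
    ("11-15", "2+"): (500, 25),
}

# Univariate view: collapse the cells down to the promotions variable only.
by_promotions = {}
for (promotions, _orders), (promoted, responders) in cells.items():
    totals = by_promotions.setdefault(promotions, [0, 0])
    totals[0] += promoted
    totals[1] += responders

# Names selected when whole promotion groups must clear breakeven.
univariate_select = sum(n for n, r in by_promotions.values() if r / n >= BREAKEVEN)

# Names selected when each cell of the cross tab is judged on its own.
crosstab_select = sum(n for n, r in cells.values() if r / n >= BREAKEVEN)
```

In this made-up data the 11-15 group as a whole responds at 2% and would be dropped, but the customers in it with 2 or more orders respond at 5% and can be added back.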
Logic counter variables
• These are binary variables used to combine
similar data elements into one, strengthen the
predictive power of low coverage data elements
like enhancement data, reduce data, and help
create strong models.
• These are easy to code. Ex. Do you have a pet? 1 if
yes or 0 if no.
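A minimal sketch of a logic counter variable, assuming three hypothetical low-coverage enhancement fields (`dog_owner`, `cat_owner`, `pet_supplies_buyer`) that are combined into a single pet flag:

```python
# Hypothetical enhancement fields that all point at the same trait.
PET_FIELDS = ("dog_owner", "cat_owner", "pet_supplies_buyer")

def has_pet(record):
    """Return 1 if any pet-related field is set, otherwise 0."""
    # Missing or empty fields count as 0 here, which is itself an assumption.
    return 1 if any(record.get(field) for field in PET_FIELDS) else 0

customers = [
    {"id": 1, "dog_owner": 1},
    {"id": 2, "cat_owner": 0, "pet_supplies_buyer": 1},
    {"id": 3},
]
flags = [has_pet(c) for c in customers]
```

Each source field may cover only a small share of the file, but the combined flag covers every customer who has at least one of them populated.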
Fill in the missing blanks
Ratio variables
• A ratio variable is the result of dividing one data element by another.
Ratio is also the strongest measurement scale.
• Ex. Total products paid for each customer divided by total products
• Total book products paid divided by total products paid
• Estimate of average payment rate per customer
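The book-products example above can be sketched as follows. The field names and the two customers are hypothetical, and returning `None` when the denominator is zero is a choice made for this sketch, not something the notes specify.

```python
def ratio(numerator, denominator):
    """Divide one data element by another; None when the denominator is 0."""
    return numerator / denominator if denominator else None

customers = [
    {"id": 1, "book_products_paid": 3, "total_products_paid": 12},
    {"id": 2, "book_products_paid": 0, "total_products_paid": 0},
]

for customer in customers:
    customer["book_share_of_paid"] = ratio(
        customer["book_products_paid"], customer["total_products_paid"]
    )
```

Customers with nothing in the denominator need a deliberate rule (missing, zero, or a separate category), because they cannot be placed on the ratio scale.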
Who is most likely to respond to
Longitudinal variables
• These allow direct marketers to view a particular data element for
each customer over time. This is similar to time series variables.
• Ex. Customer's response (order, pay, silent, etc.) to their last 3
promotions sent, which measures action on the next promotion sent
• Customer's action (pay, return, bad debt) on their last 3 orders placed,
which estimates performance on the next order placed
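One way to build the "last 3 promotions" variable is sketched below. The action labels, the `"none"` padding for customers with fewer than three promotions, and the function name are assumptions for illustration.

```python
def last_n_actions(history, n=3):
    """Return the customer's actions on their last n promotions, oldest first."""
    recent = history[-n:]
    # Pad on the left so every customer gets exactly n positions.
    return tuple(["none"] * (n - len(recent)) + recent)

# Hypothetical promotion histories, oldest promotion first.
long_history = ["silent", "order", "pay", "silent"]
short_history = ["order"]

pattern_long = last_n_actions(long_history)
pattern_short = last_n_actions(short_history)

# A simple summary of the pattern: how many of the last 3 drew any activity.
active_count = sum(action in ("order", "pay") for action in pattern_long)
```

Keeping the positions in time order preserves the difference between a customer who responded recently and one who responded three promotions ago, which a plain count would lose.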
Who would you select for an
Time alignment of key events
• Direct marketers need to align customers in time to best leverage
the customer data for businesses like clubs, collectible marketers,
frequent buyer clubs, financial services, etc. The point in time
used can make a big difference in decision making.
• You should first group all customers on the date of initial enrollment
of service. Over time, measure days to order the first product following
enrollment, longitudinal variables measuring the ratio of dollars paid
to total number of promotions, and counter variables representing total
number of promotions with no activity. Other factors may include
time to pay for the first upsell product and acceptance/nonacceptance of
the first upsell event.
• Failing to align customers to first upsell can result in comparing the
time to pay for the first up sell product for customers who were given
• You can also align based on life stage events like customers who have
paid their mortgage within the same time period and create
longitudinal variables such as ratio of dollars paid on the home
mortgage to the total months of home ownership.
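A minimal sketch of aligning customers on enrollment date rather than calendar date. The three customers and the field names are hypothetical; the measure shown is days from enrollment to first order.

```python
from datetime import date

# Hypothetical customers who enrolled at very different calendar times.
customers = [
    {"id": 1, "enrolled": date(2022, 1, 10), "first_order": date(2022, 2, 9)},
    {"id": 2, "enrolled": date(2023, 6, 1), "first_order": date(2023, 6, 11)},
    {"id": 3, "enrolled": date(2023, 3, 1), "first_order": None},
]

def days_to_first_order(customer):
    """Days between enrollment and first order; None if no order yet."""
    if customer["first_order"] is None:
        return None
    return (customer["first_order"] - customer["enrolled"]).days

aligned = {c["id"]: days_to_first_order(c) for c in customers}
```

Measured this way, customers 1 and 2 can be compared directly even though they enrolled more than a year apart.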
Correlation
• We want to measure how strongly a set of variables relates to the action to be predicted.
• We want to know which variables are the most correlated with the
action to be predicted (order, payment, renewal)
• The possible values of the correlation coefficient are -1 to 1
• Usually an absolute value of 0-0.4 represents a weak relationship
• 0.4-0.7 represents a moderate relationship
• Anything more than 0.7 is a strong relationship
• Show in terms of hypothesis testing
• P values are also very powerful. Is the observed sample correlation
for each variable significantly different from 0?
• The p value is the probability of seeing a correlation at least this large
if the true correlation were 0. It should be as close to 0 as possible.
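The sketch below computes a Pearson correlation, labels its strength with the thresholds above, and estimates a p value. Note the assumption: the p value here comes from a permutation test (shuffling one variable many times), which is a choice made for this sketch so it runs with the standard library only; the notes do not say which test to use. The data are invented.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, trials=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

def strength(r):
    """Label |r| using the 0.4 and 0.7 cutoffs from the notes."""
    size = abs(r)
    return "weak" if size < 0.4 else "moderate" if size <= 0.7 else "strong"

# Hypothetical data: products paid in the last 36 months vs ordered (1) or not (0).
products_paid = [0, 1, 2, 3, 4, 5, 6, 7]
ordered = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson_r(products_paid, ordered)
label = strength(r)
p_value = permutation_p_value(products_paid, ordered)
```

With only eight records the p value is coarse; on a real frozen file with thousands of names the same correlation would be far more clearly different from 0.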
Questions to ask
• Why is Total products paid in the last 36 months positively correlated
with ordering the music title PRUSA?
• Reason: Past behavior is best predictor of future behavior
• Why is the classical category negatively correlated?
• This is a study on pop/rock music
Database Management and Modeling
In addition to the textbook problems, please answer the following in Excel:
1. Sort the dataset in ascending order by Sales.
2. I want some basic summary statistics on at least 2 of the variables. Interpret the results.
a. Now select all the sales information for franchises whose owners are older than 40. Compare that
with those whose owners are 40 or younger. What do you notice?
b. Select all the franchises where advertising spend was more than $8,200 and sales
were greater than $200,000. Do you think it was worth that money?
c. Make a new column called “Owner age” and label the data “very young” if the owner's age
is between 20 and 30, “young” if 31-40 and “old” if 41-55.
d. Make a new ratio variable based on 2 columns and add it as another column.
e. Create a scatterplot of franchise sales and inventory. What is the correlation?
Interpret the results.
FRANCHISE NUMBER  SALES  SQFT  INVENTORY  ADVERTISING  FAMILIES  STORES  OWNER AGE OF FRANCHISE
The data are for each franchise store.
SALES = annual net sales/$1000
SQFT = number sq. ft./1000
INVENTORY = inventory/$1000
ADVERTISING = amount spent on advertising/$1000
FAMILIES = size of sales district/1000 families
STORES = number of competing stores in district
OWNER AGE OF FRANCHISE
[Figure: Uber vs Lyft waiting times (axes: Uber waiting time, Lyft waiting time). Notes on the chart: strong negative relationship; standard deviation 13.2966161.]