  

Description
College of Computing and Informatics
Project
Deadline: Monday 13/2/2023 @ 23:59
[Total Mark is 14]
Student Details:
CRN:
Name:
Name:
Name:
ID:
ID:
ID:
Instructions:
• You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on
Blackboard via the allocated folder. These files must not be in compressed format.
• It is your responsibility to check and make sure that you have uploaded both the correct files.
• Zero mark will be given if you try to bypass the SafeAssign (e.g. misspell words, remove spaces between
words, hide characters, use different character sets, convert text into image or languages other than English
or any kind of manipulation).
• Email submission will not be accepted.
• You are advised to make your work clear and well-presented. This includes filling your information on the cover
page.
• You must use this template, failing which will result in zero mark.
• You MUST show all your work, and text must not be converted into an image, unless specified otherwise by
the question.
• Late submission will result in ZERO mark.
• The work should be your own, copying from students or other resources will result in ZERO mark.
• Use Times New Roman font for all your answers.
Description and Instructions
The objective of this project is to use the machine learning software Weka to experiment with
different data preprocessing techniques (data cleaning, data reduction, normalization, etc.)
and different data mining techniques (Frequent pattern mining techniques, supervised
learning techniques, unsupervised learning techniques) on a selected dataset.

You should use the same project document to prepare your answer. A Word file
and a PDF file should be provided.

The number of students per project is 3 or 4 (maximum). You should fill the work
distribution table provided at the end of this project.

Using/copying ideas from previous years will result in zero mark.

You should select one of the datasets from any Machine Learning Repositories:


http://archive.ics.uci.edu/ml/

https://www.kaggle.com/datasets


The dataset must meet the following requirements:

Number of instances: at least 300

Number of attributes: at least 10

The dataset should contain different types of attributes, some null values,
some missing values, different scales of attributes…
Note: You can modify the dataset manually by adding some attributes, removing
some values, adding some data discrepancies …
Question One [2 Marks]
Learning Outcome(s) 1: Define different data mining tasks, problems and the algorithms most appropriate for addressing them.
1. Prepare a CSV OR ARFF format data file of the data. Add a screenshot of the
prepared dataset.
2. Load the dataset in Weka.
3. Using Weka, describe in detail:
a. The selected dataset: number of attributes, number of instances,
anomalies in the dataset…
b. Each attribute of the dataset (description, type, range, number of missing
values, min and max for the numeric values,…)
Add screenshots of the exploration of the data.
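Before loading the file into Weka, a quick Python check of the prepared file can confirm the instance and attribute counts and the missing values. This is only a sketch: the file name dataset.csv is a hypothetical placeholder for whichever dataset you select.

```python
# Minimal sketch: summarize a candidate dataset before loading it into Weka.
# "dataset.csv" is a hypothetical file name; replace it with your own file.
import pandas as pd

df = pd.read_csv("dataset.csv")

print("Instances:", df.shape[0])           # should be at least 300
print("Attributes:", df.shape[1])          # should be at least 10
print("\nAttribute types:\n", df.dtypes)
print("\nMissing values per attribute:\n", df.isnull().sum())
print("\nMin/max and other statistics for numeric attributes:\n", df.describe())
```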
Question Two [2 Marks]
Learning Outcome(s) 1: Define different data mining tasks, problems and the algorithms most appropriate for addressing them.

Apply at least two data preprocessing techniques (data cleaning, data reduction, normalization, etc.) to the selected dataset.
Explain in detail why you chose these techniques, and add screenshots of the data before and after applying each technique.
Question Three [2 Marks]
Learning Outcome(s) 3: Employ data mining and data warehousing techniques to solve real-world problems.
Learning Outcome(s) 4: Choose data mining algorithms for problems they are specifically designed for.

Apply the Apriori algorithm to your dataset with different support and confidence values (at least 3 different scenarios), and discuss the generated rules.
Note:
• You can use only a subset of the attributes of your dataset to apply the Apriori algorithm.
• Add screenshots of the output of the algorithm for each scenario.
• Choose the best scenario.
Question Four [4 Marks]
Learning Outcome(s) 3: Employ data mining and data warehousing techniques to solve real-world problems.
Learning Outcome(s) 4: Choose data mining algorithms for problems they are specifically designed for.

Apply two supervised machine learning techniques (decision trees, linear regression, SVM, neural networks, etc.) on the selected dataset.
For each selected technique:
• Give a brief description of the technique.
• Provide the results and accuracies of the constructed model for different parameters (at least 2 scenarios) and discuss them with supporting screenshots.
Question Five [4 Marks]
Learning Outcome(s) 3: Employ data mining and data warehousing techniques to solve real-world problems.
Learning Outcome(s) 4: Choose data mining algorithms for problems they are specifically designed for.

Apply two unsupervised machine learning techniques (K-means, hierarchical clustering, etc.) on the selected dataset.
For each selected technique:
• Give a brief description of the technique.
• Provide the result of each technique using different settings (at least 2 scenarios) and discuss them with supporting screenshots.
Work Distribution
Fill in the following table.

Student ID | Student Name | Questions

Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
Week 1: Introduction
Contents
1. Saudi Vision 2030
2. Saudi Digital Library
3. Git and GitHub
4. Stack Overflow
Weekly Learning Outcomes
• Demonstrate the Saudi Digital Library and searching for knowledge
• Understand the major objectives of Saudi Vision 2030
5
Weekly Learning Outcome: Saudi Vision 2030
6
Saudi Vision 2030
• Under the leadership of the Custodian of the Two Holy Mosques, Vision
2030 was launched, a roadmap drawn up by His Royal Highness the Crown
Prince, to harness the strengths God bestowed upon us – our strategic
position, investment power and place at the center of Arab and Islamic
worlds. The full attention of the Kingdom, and our Leadership, is on
harnessing our potential to achieve our ambitions.
7
Reference: https://www.vision2030.gov.sa/
Saudi Vision 2030
• Vision 2030 Draws on The Nation’s Intrinsic Strengths
1. Saudi Arabia is the land of the Two Holy Mosques which positions the Kingdom at
the heart of the Arab and Islamic worlds
2. Saudi Arabia is using its investment power to create a more diverse and
sustainable economy
3. The Kingdom is using its strategic location to build its role as an integral driver of
international trade and to connect three continents: Africa, Asia and Europe
8
Reference: https://www.vision2030.gov.sa/
Weekly Learning Outcome: Saudi Digital Library
9
Saudi Digital Library
• The Saudi Digital Library is the largest academic aggregation of information sources in the Arab world, with more than 310,000 scholarly references covering all academic disciplines. Its content is updated continuously, building up a large body of knowledge over time. The library has contracted with more than 300 global publishers, and in 2010 it won the Arab Federation for Libraries and Information "Know" award for outstanding projects in the Arab world.
10
Reference: https://sdl.edu.sa/sdlportal/en/publishers.aspx
Saudi Digital Library
• Advantages:
• A single central administration manages this large content and keeps it constantly updated.
• Shared benefit: resources acquired by any university become available to all other universities, in every scientific field.
• Enhances the standing of universities in academic accreditation evaluations, through rich, up-to-date sources from the best global publishers.
• Bridges the gap between Saudi universities, since emerging universities receive the same service as the major Saudi universities.
11
Reference: https://sdl.edu.sa/sdlportal/en/publishers.aspx
Weekly Learning Outcome: Git and GitHub
12
What is GIT?
• Git
• is a version control system for tracking changes in computer files and
coordinating work on those files among multiple people.
• It is primarily used for source code management in software development and
was initially created by Linus Torvalds for development of the Linux Kernel.
• Git is not Github.
• Git is the version control software
• Github is a git repository hosting service which offers all the source code
management provided in git.
• Github is where you upload your git repository.
13
Reference: https://medium.com/@lucasmaurer/git-gud-the-working-tree-staging-area-and-local-repo-a1f0f4822018
Version control
• Version control is:
• a system that records changes to a file or set of files over time so that you can
recall specific versions later.
• If you are a graphic or web designer and want to keep every version of an
image or layout (which you would most certainly want to), a Version Control
System (VCS) is a very wise thing to use.
• It allows you to revert selected files back to a previous state, revert the entire
project back to a previous state, compare changes over time, see who last
modified something that might be causing a problem, who introduced an issue
and when, and more.
• Using a VCS also generally means that if you lose files, you can easily recover.
14
Reference: https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
What Is GitHub
• GitHub is
• a for-profit company that offers a cloud-based Git repository hosting
service.
• Essentially, it makes it a lot easier for individuals and teams to use Git for
version control and collaboration.
• GitHub’s interface is user-friendly enough so even novice coders
can take advantage of Git.
15
Reference: https://guides.github.com/activities/hello-world/
Create a Repository
16
Reference: https://guides.github.com/activities/hello-world/
Make and commit changes
17
Reference: https://guides.github.com/activities/hello-world/
Weekly Learning Outcome: Stack Overflow
18
Stack overflow
• Stack Overflow is
• a question-and-answer website for professional and enthusiast programmers.
• It is the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood
and Joel Spolsky.
• It features questions and answers on a wide range of topics in computer programming.
• Stack Overflow only accepts questions about programming that are tightly focused on a
specific problem.
• Questions of a broader nature–or those inviting answers that are inherently a matter of
opinion– are usually rejected by the site’s users and marked as closed.
https://stackoverflow.com/
19
Reference: https://en.wikipedia.org/wiki/Stack_Overflow
Thank You
Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
Week 2: Chapter 1 (Introduction) and Chapter 2 (Getting to Know Your Data)
Week Learning Outcomes
• Define data mining and its applications.
• Recognize kinds of data and patterns that can be mined
• Describe data objects and attribute types
4
Chapter 1: Introduction 1.1 – 1.5
• Why Data Mining?
• What Is Data Mining?
• What Kinds of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• Which Technologies Are Used?
• What Kinds of Applications Are Targeted?
• Major Issues in Data Mining
Chapter 2: Getting to Know Your Data 2.1 – 2.3
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
Why Data Mining?
• The rapid growth of Data
• Data collection and availability caused by Digital Transformation and
Automation
• Across all fields
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• As a result, we are drowning in data; data mining helps find the knowledge hidden within it.
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
6
What Is Data Mining?
• Data mining has multiple names depending on time and field
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Data mining is the process of extracting knowledge from data, regardless of why it is performed or what that knowledge is used for.
• In other words, the extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns or knowledge
from data.
7
What Is Data Mining?
• Watch out: not everything is data mining.
• It is easy to overstate what data mining is and include similar but distinct tasks such as:
• Simple search and query processing
• Expert systems that do not rely on data
8
Knowledge Discovery (KDD) Process
[Figure: the KDD process. Data in databases undergo data cleaning and data integration into a data warehouse; selection produces the task-relevant data; data mining extracts patterns; pattern evaluation identifies the knowledge.]
9
Knowledge Discovery (KDD) Process
• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the
database)
• Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied to extract
data patterns or knowledge)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures—see Section 1.4.6)
• Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
10
What kinds of Data can be Mined?
• Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web
11
What Kinds of Patterns Can Be Mined?
Data Mining Techniques Used
1) Generalization
• Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet region
2) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets) e.g. What items are
frequently purchased together in your supermarket?
• Association, correlation vs. causality
• A typical association rule: Bread -> Milk [0.5%, 75%] (support,
confidence)
• Are strongly associated items also strongly correlated?
• Are all patterns interesting?
12
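To make the support and confidence notation above concrete, here is a minimal sketch that computes both measures for the rule Bread -> Milk over a small, made-up list of transactions (the transactions are illustrative only and do not come from the slides).

```python
# Sketch: support and confidence of the rule Bread -> Milk
# over a small, made-up list of market-basket transactions.
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Butter"},
    {"Bread", "Milk"},
    {"Eggs"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"Bread", "Milk"} <= t)
bread = sum(1 for t in transactions if "Bread" in t)

support = both / n          # fraction of all transactions containing Bread and Milk
confidence = both / bread   # fraction of Bread transactions that also contain Milk

print(f"support(Bread -> Milk)    = {support:.2f}")    # 2/5 = 0.40
print(f"confidence(Bread -> Milk) = {confidence:.2f}") # 2/3 = 0.67
```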
What Kinds of Patterns Can Be Mined?
Data Mining Techniques Used
3) Classification (supervised learning)
• Build a model that can classify objects based on their characteristics. The model learns to do so from previous examples (referred to as training data).
• Applications include credit card fraud detection and patient diagnosis.
4) Cluster Analysis (unsupervised learning)
• Group data objects based on their characteristics, so that objects within a group are similar and objects in different groups are dissimilar.
• Applications include grouping online shoppers or news readers to better target them with advertisements.
13
What Kinds of Patterns Can Be Mined?
Data Mining Techniques Used
5) Outlier Analysis
• Outlier: a data object that does not comply with the general behavior of the data.
• Find the characteristics that a group of data objects share; objects that do not closely share those characteristics are considered outliers.
• Noise vs. outliers.
• Outlier analysis can use classification or clustering techniques.
• Applications include transaction fraud detection and rare-event analysis.
14
Which Technologies Are Used?
[Figure: data mining at the confluence of multiple disciplines: database technology, machine learning, pattern recognition, statistics, visualization, algorithms, high-performance computing, and applications.]
15
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
• Algorithms must be highly scalable to handle, for example, terabytes of data
• High-dimensionality of data
• Micro-array may have tens of thousands of dimensions
• High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structure data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
16
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological network
analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
17
Major Issues in Data Mining (1)
• Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multi-dimensional space
• Data mining: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive mining
• Incorporation of background knowledge
• Presentation and visualization of data mining results
18
Major Issues in Data Mining (2)
• Efficiency and Scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
• Handling complex types of data
• Mining dynamic, networked, and global data repositories
• Data mining and society
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining
19
Data Objects and Attributes Types
• Data sets are made up of data objects.
• A data object represents an entity—
• in a sales database, the objects may be customers, store items, and
sales;
• in a medical database, the objects may be patients;
• in a university database, the objects may be students, professors, and
courses.
• Data objects are typically described or represented by attributes.
• Data objects can also be referred to as samples, examples, instances, data
points, or objects.
20
Attributes
• An attribute is a data field, representing a characteristic or feature of a data
object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature.
• Statisticians prefer the term variable.
• Data mining and database professionals commonly use the term
attribute, and we do in this course.
21
Attributes Types
• The type of an attribute is determined by the set of possible values the
attribute can have.
• These are the four types:
1) Nominal Attributes:
2) Binary Attributes
3) Ordinal Attributes
4) Numeric Attributes
• Interval-Scaled Attributes
• Ratio-Scaled Attributes
22
Attributes Types
• The type of an attribute is determined by the set of possible values the
attribute can have. These are the four types:
1) Nominal Attributes: Each value represents some kind of category,
code, or state, and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order. E.g. Hair
color, marital status, occupation, ID numbers, zip codes
2) Binary Attributes: A binary attribute is a nominal attribute with only
two categories or states: 0 or 1, where 0 typically means that the
attribute is absent, and 1 means that it is present. Binary attributes
are referred to as Boolean if the two states correspond to true and
false. E.g. Medical test result or gender
23
Attributes Types
3) Ordinal Attributes: an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known. E.g.
• Size = {small, medium, large}
• Grades = {A, B, C, D, F}
• Army rankings … Etc
24
Attributes Types
4) Numeric Attributes: quantitative, i.e., a measurable quantity, represented by integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
• Interval-Scaled Attributes are measured on a scale of equal-size units and have no true zero point. E.g., temperature in C° or F°, calendar dates.
• Ratio-Scaled Attributes are numeric attributes with an inherent zero point, so the ratio between two objects' attribute values is meaningful. E.g., area, weight, height, length, counts, monetary quantities.
25
Discrete vs. Continuous Attributes
• Discrete Attribute (Nominal, Binary and Ordinal)
• Has only a finite or countably infinite set of values E.g., zip codes,
profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Continuous or Numeric Attribute (Ratio and Interval)
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a
finite number of digits
• Continuous attributes are typically represented as floating-point
variables
26
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data
• How?
• by measuring data’s central tendency and distribution (variation and
spread).
• Measuring the Central Tendency characteristics
• Mean, Median, Mode and Midrange.
• Measuring the Data dispersion (or distribution) characteristics
• Range, max, min, quantiles, outliers, variance and standard deviation.
27
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Mean
• The mean is the average of all the values and serves as a center of the data. It is calculated by dividing the sum of all values by the sample size:
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
• Trimmed mean
• The mean can also be calculated on trimmed data, i.e., after removing the extreme values.
• Weighted average (weighted arithmetic mean)
• Differs from the regular mean by giving each value a weight that reflects its significance or importance:
  \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
28
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Median
• After sorting the data, the median is the middle value if the number of values is odd; otherwise it is the average of the two middle values.
• Sorting can be computationally expensive; without sorting, the median can be approximated by interpolation:
  median \approx L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Mode is the value that occurs most frequently in the data.
• Sometimes several values share the same highest frequency (unimodal vs. multimodal, e.g., bimodal, trimodal):
• one value with the highest frequency = unimodal
• two equally frequent highest values = bimodal
• three equally frequent highest values = trimodal
• For unimodal, moderately skewed data, the mode can be approximated using
  mean - mode \approx 3 \times (mean - median)
30
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Midrange is another measure of central tendency. It is simply the average
of the min and max values of the data.
• This is easy to compute using the SQL aggregate functions, max() and
min().
• When data have a symmetric distribution all central tendency measure
return the same center value.
• But data usually do not!
31
Measuring the Central Tendency characteristics – Example
• Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 696/12 = 58
• Trimmed mean
• In this example, remove 30, 36 and 110, then recalculate: 520/9 \approx 57.8
• Median: average of the two middle values, (52 + 56)/2 = 54
• Mode
• 52 and 70 are the modes (bimodal)
• Midrange
• (30 (min) + 110 (max)) / 2 = 70
32
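The same figures can be checked with a few lines of Python's standard library; a minimal sketch using the salary values above (in thousands of dollars):

```python
# Sketch: central tendency measures for the salary example (values in thousands of dollars).
from statistics import mean, median, multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("Mean:", mean(salaries))                            # 58
print("Trimmed mean:", mean(sorted(salaries)[2:-1]))      # drop 30, 36 and 110 -> ~57.8
print("Median:", median(salaries))                        # (52 + 56) / 2 = 54
print("Modes:", multimode(salaries))                      # [52, 70] (bimodal)
print("Midrange:", (min(salaries) + max(salaries)) / 2)   # 70
```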
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Data in real-world applications tend to have an asymmetric distribution, i.e., positively or negatively skewed data.
[Figure: symmetric, positively skewed, and negatively skewed distributions]
33
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance and standard deviation (sample: s, population: σ)
• Variance (algebraic, scalable computation):
  \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
• Standard deviation σ (or s) is the square root of the variance σ² (or s²)
34
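A short sketch of the dispersion measures above, reusing the earlier salary example; the outlier rule follows the 1.5 x IQR convention from the slide:

```python
# Sketch: dispersion measures for the salary example using NumPy.
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, med, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1

print("Five-number summary:", salaries.min(), q1, med, q3, salaries.max())
print("IQR:", iqr)
print("Outliers (1.5 * IQR rule):",
      salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)])
print("Population variance:", salaries.var())   # divides by N, matching the formula above
print("Standard deviation:", salaries.std())
```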
Dispersion (distribution) of Data
• Popular visualization plots visualize data distribution
• Boxplot: graphic display of five-number summary
• Histogram: the x-axis shows values, the y-axis shows frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i% of the data are below or equal to x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
plane
35
Boxplot
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
• The median is marked by a line within the
box
• Whiskers: two lines outside the box
extended to Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
36
Histogram
• Histograms (or frequency histograms) are at least a century old and are widely used.
• The height of each bar indicates the frequency (i.e., count) of the values that fall within the range of the bar. When the attribute is categorical, the resulting graph is more commonly known as a bar chart.
37
Quantile plot
• Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information: For a data xi data sorted in increasing order, fi indicates
that approximately 100 fi% of the data are below or equal to the value xi
38
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there is a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit
prices of items sold at Branch 1 tend to be lower than those at Branch 2.
39
Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
40
Scatter Plot – Correlation
Positively Correlated
Negatively Correlated
41
No Correlation
Data Visualization – Pixel-oriented
• Data visualization
• aims to communicate data clearly and effectively through graphical
representation. Data visualization has been
• used extensively in many applications—for example, at work for
reporting, managing business operations, and tracking progress of tasks.
• used to discover data relationships that are otherwise not easily
observable by looking at the raw data.
• Provide a visual proof of computer representations derived
42
Data Visualization
• Categorization of visualization methods:
• Pixel-oriented visualization techniques
• Geometric projection visualization techniques
• Icon-based visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations
43
Pixel-oriented visualization techniques
• Each window represents an attribute (a feature)
• Each data object is represented as a pixel in each window
• The density of the pixel color is proportional to the attribute value
[Figure: pixel-oriented visualization of four attributes: (a) income, (b) credit limit, (c) transaction volume, (d) age]
44
Laying Out Pixels in Circle Segments
• To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment
Representing a data record (a)
in circle segment
(b) Laying out pixels in circle segment
45
Geometric projection visualization techniques
• Visualization of geometric transformations and projections of the data
• Methods
• Direct visualization
• Scatterplot and scatterplot matrices
• Landscapes
• Projection pursuit technique: Help users find meaningful projections of
multidimensional data
• Prosection views
• Hyperslice
• Parallel coordinates
46
Scatterplot matrices and Direct visualization
47
3D Scatterplot and 2D Scatterplot with Cartesian Coordinates
48
Icon-based visualization techniques
• Visualization of the data values as features of icons
• Typical visualization methods
• Chernoff Faces
• Stick Figures
• General techniques
• Shape coding: Use shape to represent certain information encoding
• Color icons: Use color icons to encode more information
• Tile bars: Use small icons to represent the relevant feature vectors in
document retrieval
49
Icon-based visualization techniques
50
Hierarchical visualization techniques
• Visualization of the data using a hierarchical partitioning into
subspaces
• Methods
• Dimensional Stacking
• Worlds-within-Worlds
• Tree-Map
• Cone Trees
• InfoCube
51
Hierarchical visualization techniques – Dimensional Stacking
52
Hierarchical visualization techniques – InfoCube
53
Visualizing complex data and relations
• Visualizing non-numerical data: text and social networks
• Tag cloud: visualizing user-generated tags
• The importance of a tag is represented by font size/color.
• Besides text data, there are also methods to visualize relationships, such as visualizing social networks.
[Figure: Newsmap, Google News stories in 2005]
54
Required Reading
1. Data Mining: Concepts and Techniques, Chapter 1 (introduction)
2. Data Mining: Concepts and Techniques, Chapter 2 (Getting to
Know Your Data)
Recommended Reading
1. Data Mining: Practical Machine Learning Tools and Techniques Chapter 1 (What is
it all about?)
2. Data Mining: Practical Machine Learning Tools and Techniques Chapter 2 (Input:
concept, instances, attributes)
This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)
Thank You
Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
Week 3: Chapter 3 (Data Preprocessing)
3
Week Learning Outcomes
▪ Recognize major tasks in data preprocessing.
▪ Perform data cleaning.
▪ Perform data integration.
4
Chapter 3: Data Preprocessing
▪ Data Preprocessing Overview
▪ Data Cleaning
▪ Data Integration
▪ Preprocessing Practices on Weka
5
Data Quality: Why Preprocess the Data?
• Preprocessing improves data quality, including:
• Accuracy: correctness of the data; incorrect attribute values may come from faulty instruments or human error (sometimes intentional).
• Completeness: all needed information is available; incomplete data can occur when values are not always collected, such as customer information for sales transaction data.
• Consistency: data from all sources agree; for example, two different sources may use different codes or record times in different time zones.
• Timeliness: the data are up to date; for example, some store branches may be delayed in syncing their sales.
• Believability: reflects how much the data are trusted by users.
• Interpretability: reflects how easily the data are understood.
6
Major Tasks in Data Preprocessing
• Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
• Data integration is the process of integrating data from multiple sources (databases, data cubes, or files).
• Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) mining results.
• Data transformation converts the data into forms appropriate for mining, for better results.
7
Data Preprocessing Overview
8
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument
faulty, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
9
Incomplete Data (Missing Values)
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred
10
How to Handle Missing Values?
• Ignore the sample: not effective when the percentage of ignored samples is too high.
• Fill in the missing value manually: tedious and often infeasible.
• Fill it in automatically with:
• a global constant, e.g., "unknown" (effectively a new class?!)
• the attribute mean over all data
• the attribute mean over all samples belonging to the same class
• the most probable value, based on a statistical model
11
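For reference, the automatic fill-in strategies map directly onto pandas operations. A minimal sketch; the DataFrame and its column names (class, income) are hypothetical:

```python
# Sketch: common ways to fill missing values with pandas.
# The DataFrame and column names ("class", "income") are hypothetical examples.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [50.0, np.nan, 30.0, np.nan],
})

# 1) Global constant
df["income_const"] = df["income"].fillna(-1)

# 2) Attribute mean over all data
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Attribute mean per class
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```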
Noisy Data
• Noise: random error in data
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
12
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
13
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators
(e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheels)
14
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id = B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources
are different
• Possible reasons: different representations, different scales, e.g., metric
vs. British units
15
Handling Redundancy in Data Integration
• Redundant data occur often when integration of multiple databases
• Object identification: The same attribute or object may have different
names in different databases
• Derivable data: One attribute may be a “derived” attribute in another
table, e.g., quarters revenue and annual revenue
• Redundant attributes may be able to be detected by correlation analysis
and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
16
Correlation Analysis (Nominal Data)
• Correlation analysis is used on attributes (features) to determine their dependency on one another.
• The χ² (chi-square) test finds the correlation (dependency) between two nominal features:
  \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
17
Chi-Square Calculation: An Example
|                          | Play chess | Not play chess | Sum (row) |
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

• χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):
  \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93
• It shows that like_science_fiction and play_chess are correlated in the group
18
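The same value can be reproduced with SciPy; correction=False disables the Yates continuity correction so the result matches the plain χ² formula above. A sketch:

```python
# Sketch: chi-square test for the chess / science-fiction contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("Expected counts:\n", expected)   # [[90, 360], [210, 840]]
print("Chi-square:", round(chi2, 2))    # 507.93
print("p-value:", p_value)
```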
Visually Evaluating Correlation
Scatter plots showing the
similarity from –1 to 1.
19
Covariance (Numeric Data)
• Covariance is similar to correlation:
  Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})
  Correlation coefficient: r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}
• where n is the number of tuples, \bar{A} and \bar{B} are the respective means (expected values) of A and B, and σ_A and σ_B are their respective standard deviations.
• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true: some pairs of random variables may have a covariance of 0 without being independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Covariance: An Example
• The computation can be simplified as Cov(A,B) = E[A \cdot B] - \bar{A}\bar{B}.
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together, since Cov(A,B) > 0.
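The stock example can be checked with NumPy; bias=True makes np.cov divide by n (population covariance), matching the slide's calculation. A sketch:

```python
# Sketch: covariance of the two stock price series from the example.
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Direct simplified formula: E[AB] - E[A]E[B]
cov_manual = (A * B).mean() - A.mean() * B.mean()

# Same value from NumPy; bias=True divides by n instead of n - 1.
cov_np = np.cov(A, B, bias=True)[0, 1]

print(cov_manual, cov_np)   # 4.0 4.0
print("Rise together" if cov_np > 0 else "Move in opposite directions")
```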
21
Weka Tool
• WEKA is an open source tool that provides
• many machine learning algorithms implementations
• variety of datasets
• a variety of tools for transforming datasets
• Ability to preprocess a dataset and train a model
• Ability to analyze the resulting model and its performance
• all of the above without writing any program code.
• WEKA is available from http://www.cs.waikato.ac.nz/ml/weka
• For more information: https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf
22
Weka Tool
23
Weka Tool
24
Preprocessing Practices on Weka
• Practical in class exercise:
• Pick a dataset and load it on Weka
• Look at the data graphs
• Apply some preprocessing techniques
• Look how the data has changed
• Choose another preprocessing technique
• Look at the data graphs again.
25
Required Reading
1. Data Mining: Concepts and Techniques, Chapter 3: Data
Preprocessing 3.1, 3.2 and 3.3
Recommended Reading
1. Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques,
Chapter 2- Input: concepts, instances, attributes
26
This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)
Thank You
Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
Week 4: Chapter 3 (Data Preprocessing II)
3
Week Learning Outcomes
▪ Describe various data reduction and transformations
techniques.
▪ Explain the rationale for data reduction and transformation
▪ Employ data reduction and transformation techniques
▪ Employ several types of similarity measures
4
Chapter 3: Data Preprocessing II
▪ Data Reduction
▪ Data Transformation
▪ Data Discretization
▪ Measuring Data Similarity and Dissimilarity (Chapter 2.4)
▪ Preprocessing Practices II on Weka
5
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression (lossy/lossless)
6
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
7
What Is Wavelet Transform?
• Decomposes a signal into different
frequency subbands
• Applicable to n-dimensional signals
• Data are transformed to preserve relative
distance between objects at different
levels of resolution
• Allow natural clusters to become more
distinguishable
• Used for image compression
8
Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution
analysis
• Compressed approximation: store only a small fraction of the strongest of the
wavelet coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in
space
• Method:
• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
• Each transform has 2 functions: smoothing, difference
• Applies to pairs of data, resulting in two set of data of length L/2
• Applies two functions recursively, until reaches the desired length
Haar2
Daubechie4
9
Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical decomposition of
functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
• Compression: many small detail coefficients can be replaced by 0’s, and only
the significant coefficients are retained
10
Haar Wavelet Coefficients
[Figure: hierarchical decomposition structure (a.k.a. "error tree") for the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4]: the overall average 2.75 at the root, detail coefficients −1.25, 0.5, 0, 0, −1, −1, 0 at the inner nodes, and the coefficient "supports" shown alongside.]
11
Why Wavelet Transform?
• Use hat-shape filters
• Emphasize region where points cluster
• Suppress weaker information in their boundaries
• Effective removal of outliers
• Insensitive to noise, insensitive to input order
• Multi-resolution
• Detect arbitrary shaped clusters at different scales
• Efficient
• Complexity O(N)
• Only applicable to low dimensional data
12
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in the data
• The original data are projected onto a much smaller space, resulting in dimensionality reduction
• We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the principal eigenvector e indicating the direction of largest variation]
13
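For intuition, here is a small sketch of PCA on made-up 3-D data using scikit-learn; the data are synthetic, and in the project the same idea applies to your own numeric attributes (or Weka's PCA filter):

```python
# Sketch: dimensionality reduction with PCA on made-up 3-D data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 points that mostly vary along one direction, plus a little noise.
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(scale=0.1, size=(100, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # project onto the top 2 eigenvectors

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)  # (100, 2)
```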
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more
other attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students’ ID is often irrelevant to the task of predicting students’
GPA
14
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption:
choose by significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, …
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking
15
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation, manifold
approaches (not covered)
• Attribute construction
• Combining features (see: discriminative frequent patterns in Chapter 7)
• Data discretization
16
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models—obtain value at a point in m-D space as the
product on appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
17
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of
multidimensional feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions
18
Regression Analysis
• Regression analysis: A collective name for
techniques for the modeling and analysis
of numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or
more independent variables (aka.
explanatory variables or predictors)
• The parameters are estimated so as to give
a “best fit” of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
19
Y1
Y1’
y=x+1
X1
x
• Used for prediction (including
forecasting of time-series data),
inference, hypothesis testing,
and modeling of causal
relationships
Regress Analysis and Log-Linear Models
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be estimated
by using the data at hand
• Using the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional space for a
set of discretized attributes, based on a smaller subset of dimensional
combinations
• Useful for dimensionality reduction and data smoothing
20
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
[Figure: equal-width histogram of prices from 0 to 100,000 in buckets of 10,000, with bar counts between 5 and 40]
21
Clustering for Data Reduction
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and clustering
algorithms
• Cluster analysis will be studied in depth in Chapter 10
22
Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in
the presence of skew
• Develop adaptive sampling methods, e.g., stratified
sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
23
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
24
Sampling: With or without Replacement
[Figure: raw data sampled with and without replacement]
25
Sampling: Cluster or Stratified Sampling
[Figure: raw data partitioned into clusters/strata, with a sample drawn from each partition]
26
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set
of replacement values s.t. each old value can be identified with one of the new
values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
27
Normalization
• Min-max normalization, to [new_min_A, new_max_A]:
  v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A
• Ex.: Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716
• Z-score normalization (μ_A: mean, σ_A: standard deviation):
  v' = \frac{v - \mu_A}{\sigma_A}
• Ex.: Let μ = 54,000 and σ = 16,000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
• Normalization by decimal scaling:
  v' = \frac{v}{10^j}
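The rules above translate directly into a few lines of code; a minimal sketch reusing the income example (v = 73,600):

```python
# Sketch: min-max, z-score, and decimal-scaling normalization of a single value.
import math

v = 73_600.0
min_a, max_a = 12_000.0, 98_000.0
mu, sigma = 54_000.0, 16_000.0

min_max = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # 0.716
z_score = (v - mu) / sigma                                    # 1.225

# For this attribute, the smallest j with max(|v'|) < 1 is 5 (max value 98,000).
j = math.ceil(math.log10(max_a))
decimal_scaled = v / 10 ** j                                  # 0.736

print(round(min_max, 3), round(z_score, 3), j, decimal_scaled)
```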
where j is the smallest integer such that max(|v'|) < 1
28
Discretization
• Three types of attributes:
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real values
• Discretization: divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
29
Data Discretization Methods
• Typical methods (all can be applied recursively):
• Binning (top-down split, unsupervised)
• Histogram analysis (top-down split, unsupervised)
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
30
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: a uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
• The most straightforward, but outliers may dominate the presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
31
Simple Discretization: Binning (Example)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
32
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure: the same data discretized by equal-frequency binning and by K-means clustering; K-means clustering leads to better results]
34
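The equal-frequency binning and smoothing steps in the price example above can be reproduced with a short script; a sketch:

```python
# Sketch: equal-frequency binning and smoothing for the sorted price example.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins

bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
print("Equal-frequency bins:", bins)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print("Smoothed by bin means:", by_means)

# Smoothing by bin boundaries: every value is replaced by the closest boundary.
by_bounds = [[b[0] if abs(v - b[0]) <= abs(v - b[-1]) else b[-1] for v in b]
             for b in bins]
print("Smoothed by bin boundaries:", by_bounds)
```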
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: given class labels, e.g., cancerous vs. benign
• Using entropy to determine the split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ²-based discretization)
• Supervised: uses class information
• Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
• Merge performed recursively, until a predefined stopping condition
35
Concept Hierarchy Generation for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
• street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
• {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
• E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values
• E.g., for the set of attributes {street, city, state, country}
36
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of the hierarchy
• Exceptions, e.g., weekday, month, quarter, year
• country: 15 distinct values
• province_or_state: 365 distinct values
• city: 3567 distinct values
• street: 674,339 distinct values
37
Measuring Data Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0, 1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
38
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes
  \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}
• Dissimilarity matrix
• n data points, but registers only the distances
• A triangular matrix
• Single mode
  \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}
39
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
• Method 1: simple matching
• m: # of matches, p: total # of variables
  d(i,j) = \frac{p - m}{p}
• Method 2: use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states
40
Proximity Measure for Binary Attributes
• A contingency table for binary data (q, r, s, t count the attributes per combination of values of object i and object j):

|              | Object j = 1 | Object j = 0 | sum   |
| Object i = 1 | q            | r            | q + r |
| Object i = 0 | s            | t            | s + t |
| sum          | q + s        | r + t        | p     |

• Distance measure for symmetric binary variables: d(i,j) = \frac{r + s}{q + r + s + t}
• Distance measure for asymmetric binary variables: d(i,j) = \frac{r + s}{q + r + s}
• Jaccard coefficient (similarity measure for asymmetric binary variables): sim_{Jaccard}(i,j) = \frac{q}{q + r + s}
• Note: the Jaccard coefficient is the same as "coherence"
41
Dissimilarity between Binary Variables
• Example

| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
| Jack | M      | Y     | N     | P      | N      | N      | N      |
| Mary | F      | Y     | N     | P      | N      | P      | N      |
| Jim  | M      | Y     | P     | N      | N      | N      | N      |

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0
  d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33
  d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67
  d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75
42
Standardizing Numeric Data
• Z-score: z = \frac{x - \mu}{\sigma}
• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean, in units of the standard deviation
• negative when the raw score is below the mean, positive when above
• An alternative way: calculate the mean absolute deviation
  s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|), where m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})
• standardized measure (z-score): z_{if} = \frac{x_{if} - m_f}{s_f}
• Using the mean absolute deviation is more robust than using the standard deviation
43
Example: Data Matrix and Dissimilarity Matrix
Data matrix:

| point | attribute1 | attribute2 |
| x1    | 1          | 2          |
| x2    | 3          | 5          |
| x3    | 2          | 0          |
| x4    | 4          | 5          |

Dissimilarity matrix (with Euclidean distance):

|    | x1   | x2  | x3   | x4 |
| x1 | 0    |     |      |    |
| x2 | 3.61 | 0   |      |    |
| x3 | 2.24 | 5.1 | 0    |    |
| x4 | 4.24 | 1.0 | 5.39 | 0  |

44
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure
  d(i,j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}
  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
45
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
      d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
• E.g., the Hamming distance: the number of bits that are different between two binary vectors
• h = 2: Euclidean (L2 norm) distance
      d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)
• h → ∞: “supremum” (Lmax norm, L∞ norm) distance
      d(i, j) = max_f |x_if − x_jf|
• This is the maximum difference between any component (attribute) of the vectors
46
Example: Data Matrix and Dissimilarity Matrix
• Data Matrix
      point   attribute1   attribute2
      x1      1            2
      x2      3            5
      x3      2            0
      x4      4            5
• Manhattan (L1) distance matrix
             x1    x2    x3    x4
      x1     0
      x2     5     0
      x3     3     6     0
      x4     6     1     7     0
• Euclidean (L2) distance matrix
             x1      x2      x3      x4
      x1     0
      x2     3.61    0
      x3     2.24    5.1     0
      x4     4.24    1       5.39    0
• Supremum (L∞) distance matrix
             x1    x2    x3    x4
      x1     0
      x2     3     0
      x3     2     5     0
      x4     3     1     5     0
47
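To make the three special cases concrete, the following is a minimal Python sketch (not part of the original slides) that recomputes the matrices above for the four sample points; the function and variable names are illustrative only.

```python
# Minimal sketch: Manhattan (L1), Euclidean (L2) and supremum (L-infinity)
# distances for the four sample points used on this slide.

def minkowski(x, y, h):
    """L-h norm distance between two equal-length numeric vectors."""
    if h == float("inf"):                      # supremum distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for name, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", float("inf"))]:
    print(name)
    for i, xi in points.items():
        row = [round(minkowski(xi, xj, h), 2) for xj in points.values()]
        print(f"  {i}: {row}")
# Expected, e.g., d(x1, x2) = 5, 3.61 and 3 for L1, L2 and supremum respectively.
```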
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace x_if by their rank r_if ∈ {1, …, M_f}
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
      z_if = (r_if − 1) / (M_f − 1)
• compute the dissimilarity using methods for interval-scaled variables
48
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects
      d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
  where the indicator δ_ij^(f) is 0 if x_if or x_jf is missing, and 1 otherwise
• f is binary or nominal:
      d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal:
      compute ranks r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled
49
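A minimal sketch (not from the slides) of the weighted mixed-type formula above; the attribute names, ranges and the two sample objects are illustrative assumptions only.

```python
# Minimal sketch: combined dissimilarity for objects with mixed attribute types,
# following the weighted formula above (missing values get delta = 0).

def mixed_dissimilarity(x, y, attr_types, num_ranges=None, ord_levels=None):
    num_ranges = num_ranges or {}
    ord_levels = ord_levels or {}
    total, weight = 0.0, 0.0
    for f, kind in attr_types.items():
        xi, yj = x.get(f), y.get(f)
        if xi is None or yj is None:          # delta = 0: skip missing values
            continue
        if kind == "nominal":
            d = 0.0 if xi == yj else 1.0
        elif kind == "numeric":               # normalized |difference| / range
            lo, hi = num_ranges[f]
            d = abs(xi - yj) / (hi - lo)
        elif kind == "ordinal":               # rank -> [0, 1], then treat as numeric
            M = ord_levels[f]
            d = abs((xi - 1) / (M - 1) - (yj - 1) / (M - 1))
        else:
            raise ValueError(kind)
        total += d
        weight += 1.0
    return total / weight if weight else 0.0

x = {"color": "red", "income": 30_000, "grade": 2}    # grade is a rank out of 5
y = {"color": "blue", "income": 70_000, "grade": 4}
types = {"color": "nominal", "income": "numeric", "grade": "ordinal"}
print(mixed_dissimilarity(x, y, types,
                          num_ranges={"income": (0, 100_000)},
                          ord_levels={"grade": 5}))   # (1 + 0.4 + 0.5) / 3 ≈ 0.633
```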
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, …
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d
50
Example: Cosine Similarity
• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
      d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
      d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
      d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
      ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
      ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
      cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
51
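A tiny Python sketch (not from the slides) that reproduces the cosine value worked out above:

```python
# Minimal sketch: cosine similarity for the two term-frequency vectors above.
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))   # 0.94
```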
Required Reading
1. Data Mining: Concepts and Techniques, Chapter 3: Data
Preprocessing 3.4-end
2. Data Mining: Concepts and Techniques, Chapter 2 (Getting to
Know Your Data) 2.4
Recommended Reading
1. Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques,
Chapter 8 – Data Transformation
52
This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)
Thank You
53
Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
2
IT446
Data Mining and Data Warehousing
Chapter 4 (Data Warehousing and On-line Analytical
Processing)
Week 5
3
Week Learning Outcomes
▪ Recognize data warehouse definition and how it differs from other
database systems
▪ Demonstrate data warehouse modeling: data cube and OLAP.
▪ Describe data warehouse design and usage.
▪ Perform data warehouse implementation.
▪ Apply data generalization by attribute-oriented induction.
4
Chapter 4: Data Warehousing and On-line Analytical Processing
▪ Basic Concepts of Data Warehouse
▪ Data Warehouse Modeling: Data Cube and OLAP
▪ Data Warehouse Design and Usage
▪ Data Warehouse Implementation
▪ Data Generalization by Attribute-Oriented Induction
5
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database
• Support information processing by providing a solid platform of consolidated,
historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”—W. H. Inmon
• Data warehousing:
• The process of constructing and using data warehouses
6
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
7
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
8
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”
9
Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the operational
environment
• Operational update of data does not occur in the data warehouse
environment
• Does not require transaction processing, recovery, and concurrency
control mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data
10
OLTP and OLAP
• The two terms look similar but refer to different kinds of
systems.
• Online transaction processing (OLTP) captures, stores, and
processes data from transactions in real time.
• Online analytical processing (OLAP) uses complex queries to
analyze aggregated historical data from OLTP systems.
11
OLTP vs. OLAP
12
Why a Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do not typically
maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes and formats
which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on relational databases
13
Data Warehouse: A Multi-Tiered Architecture
• Data Sources: operational DBs and other (external) sources
• Data Storage tier: data is extracted, transformed, loaded and refreshed, coordinated by a monitor & integrator with a metadata repository, into the data warehouse and data marts
• OLAP Engine tier: an OLAP server that serves multidimensional queries over the warehouse
• Front-End Tools: query, reports, analysis, and data mining tools
14
Three Data Warehouse Models
• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to specific, selected groups, such as a
marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
15
Extraction, Transformation, and Loading (ETL)
• Data extraction
• get data from multiple, heterogeneous, and external sources
• Data cleaning
• detect errors in the data and rectify them when possible
• Data transformation
• convert data from legacy or host format to warehouse format
• Load
• sort, summarize, consolidate, compute views, check integrity, and build indices
and partitions
• Refresh
• propagate the updates from the data sources to the warehouse
16
Metadata Repository
• Meta data is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
• Operational meta-data
• data lineage (history of migrated data and transformation path), currency of data (active, archived, or
purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
• business terms and definitions, ownership of data, charging policies
17
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data
cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter,
year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid,
which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms
a data cube.
18
Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time, item, location, supplier
• 2-D cuboids: the pairs, e.g., (time, location), (item, location), (location, supplier), (time, supplier), (item, supplier), (time, item)
• 3-D cuboids: the triples, e.g., (time, location, supplier), (time, item, supplier), (item, location, supplier), (time, item, location)
• 4-D (base) cuboid: (time, item, location, supplier)
19
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of
dimension tables
• Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
• Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema or fact
constellation
20
Example of Star Schema
• Fact table: Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension table time (time_key, day, day_of_the_week, month, quarter, year)
• Dimension table item (item_key, item_name, brand, type, supplier_type)
• Dimension table branch (branch_key, branch_name, branch_type)
• Dimension table location (location_key, street, city, state_or_province, country)
21
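For readers who want to experiment, here is a minimal Python/SQLite sketch (not from the slides) that creates the star schema above and runs one roll-up style aggregate query; the table and column names follow the slide, while the sample rows and query are illustrative assumptions.

```python
# Minimal sketch: the sales star schema built in an in-memory SQLite database,
# plus one aggregate (roll-up style) query over the joined dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INT, day_of_the_week TEXT,
                       month INT, quarter INT, year INT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE sales_fact (time_key INT REFERENCES time, item_key INT REFERENCES item,
                         branch_key INT REFERENCES branch, location_key INT REFERENCES location,
                         units_sold INT, dollars_sold REAL, avg_sales REAL);
""")

# A couple of illustrative rows.
con.execute("INSERT INTO time VALUES (1, 15, 'Mon', 1, 1, 2023)")
con.execute("INSERT INTO item VALUES (1, 'TV', 'BrandX', 'electronics', 'wholesale')")
con.execute("INSERT INTO branch VALUES (1, 'B1', 'retail')")
con.execute("INSERT INTO location VALUES (1, 'Main St', 'Riyadh', 'Riyadh', 'KSA')")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 10, 4500.0, 450.0)")

# Roll sales up to (item type, country, year).
for row in con.execute("""
    SELECT i.type, l.country, t.year, SUM(s.dollars_sold)
    FROM sales_fact s
    JOIN item i     ON s.item_key = i.item_key
    JOIN location l ON s.location_key = l.location_key
    JOIN time t     ON s.time_key = t.time_key
    GROUP BY i.type, l.country, t.year"""):
    print(row)
```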
Example of Snowflake Schema
• Fact table: Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension table time (time_key, day, day_of_the_week, month, quarter, year)
• Dimension table item (item_key, item_name, brand, type, supplier_key), normalized into supplier (supplier_key, supplier_type)
• Dimension table branch (branch_key, branch_name, branch_type)
• Dimension table location (location_key, street, city_key), normalized into city (city_key, city, state_or_province, country)
22
Example of Fact Constellation
• Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped)
• Shared dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), location (location_key, street, city, province_or_state, country)
• Other dimension tables: branch (branch_key, branch_name, branch_type), shipper (shipper_key, shipper_name, location_key, shipper_type)
23
A Concept Hierarchy: Dimension (location)
• all > region > country > city > office
• Example: all; regions Europe and North_America; countries Germany, Spain (Europe) and Canada, Mexico (North_America); cities Frankfurt, Vancouver, Toronto, …; offices L. Chan, M. Wind, …
24
Data Cube Measures: Three Categories
• Distributive: if the result derived by applying the function to n aggregate values is the
same as that derived by applying the function on all the data without partitioning
• E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M arguments (where M
is a bounded integer), each of which is obtained by applying a distributive aggregate
function
• E.g., avg(), min_N(), standard_deviation()
• Holistic: if there is no constant bound on the storage size needed to describe a
subaggregate.
• E.g., median(), mode(), rank()
25
Multidimensional Data
• Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
• Hierarchical summarization paths:
      Product dimension:  Industry > Category > Product
      Location dimension: Region > Country > City > Office
      Time dimension:     Year > Quarter > Month > Day (alternatively Quarter > Week > Day)
26
A Sample Data Cube
• Figure: a 3-D sales data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR) and Country (U.S.A, Canada, Mexico), plus sum cells along each dimension
• Example aggregate cell: the total annual sales of TVs in the U.S.A.
27
Cuboids Corresponding to the Cube
• 0-D (apex) cuboid: all
• 1-D cuboids: product, date, country
• 2-D cuboids: (product, date), (product, country), (date, country)
• 3-D (base) cuboid: (product, date, country)
28
Typical OLAP Operations
• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed data, or
introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
29
OLAP Operations Examples
30
OLAP Operations Examples
31
OLAP Operations Examples
32
OLAP Operations Examples
33
A Star-Net Query Model
• A star-net model consists of radial lines emanating from a central point; each line represents a dimension and each circle (abstraction level) on a line is called a footprint
• Example dimensions and footprints for an order analysis:
      Customer Orders: CONTRACTS, ORDER
      Shipping Method: AIR-EXPRESS, TRUCK
      Time: ANNUALLY, QTRLY, DAILY
      Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
      Location: REGION, COUNTRY, DISTRICT, CITY
      Organization: DIVISION, SALES PERSON
      Promotion
34
Browsing a Data Cube
• Visualization
• OLAP capabilities
• Interactive manipulation
35
Design of Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
• Top-down view
• allows selection of the relevant information necessary for the data warehouse
• Data source view
• exposes the information being captured, stored, and managed by operational
systems
• Data warehouse view
• consists of fact tables and dimension tables
• Business query view
• sees the perspectives of data in the warehouse from the view of end-user
36
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
• Waterfall: structured and systematic analysis at each step before
proceeding to the next
• Spiral: rapid generation of increasingly functional systems, with short
turnaround times
• Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measure that will populate each fact table record
37
Data Warehouse Development: A Recommended Approach
• Recommended approach: define a high-level corporate data model first; then, through incremental model refinement, build data marts and the enterprise data warehouse; finally integrate them into a multi-tier data warehouse with distributed data marts
38
Data Warehouse Usage
• Three kinds of data warehouse applications
• Information processing
• supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts and graphs
• Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
• Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing classification
and prediction, and presenting the mining results using visualization tools
39
From On-Line Analytical Processing (OLAP) to On Line Analytical
Mining (OLAM)
• Why online analytical mining?
• High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
• Available information processing structure surrounding data warehouses
• ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
• OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
• On-line selection of data mining functions
• Integration and swapping of multiple mining functions, algorithms,
and tasks
40
Efficient Data Cube Computation
• Data cube can be viewed as a lattice of cuboids
• The bottom-most cuboid is the base cuboid
• The top-most cuboid (apex) contains only one cell
• How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
      T = ∏_{i=1}^{n} (L_i + 1)
• Materialization of data cube
• Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
• Selection of which cuboids to materialize
• Based on size, sharing, access frequency, etc.
41
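As a quick check of the formula, here is a tiny Python sketch (not from the slides); the three dimensions and their level counts are made-up values.

```python
# Minimal sketch: number of cuboids T = prod_i (L_i + 1), where L_i counts the
# concept-hierarchy levels of dimension i (the "+1" accounts for 'all').
from math import prod

levels = {"time": 4, "item": 3, "location": 4}   # hypothetical level counts
T = prod(L + 1 for L in levels.values())
print(T)   # (4+1) * (3+1) * (4+1) = 100 cuboids
```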
The “Compute Cube” Operator
• Cube definition and computation in DMQL
      define cube sales [item, city, year]: sum (sales_in_dollars)
      compute cube sales
• Transform it into a SQL-like language (with a new operator CUBE BY, introduced by Gray et al. ’96)
      SELECT item, city, year, SUM (amount)
      FROM SALES
      CUBE BY item, city, year
• This computes the whole lattice of group-bys of (item, city, year): (), (item), (city), (year), (city, item), (city, year), (item, year), (city, item, year)
• E.g., for dimensions (date, product, customer) we need to compute the following group-bys:
      (date, product, customer), (date, product), (date, customer), (product, customer),
      (date), (product), (customer), ()
42
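To make the CUBE BY expansion concrete, a small Python sketch (not from the slides) enumerating every group-by a cube over three dimensions implies:

```python
# Minimal sketch: enumerate all 2^n group-bys that CUBE BY over n dimensions
# implies, from the base cuboid down to the apex ().
from itertools import combinations

def cube_groupbys(dims):
    for k in range(len(dims), -1, -1):            # n-D, (n-1)-D, ..., apex
        for combo in combinations(dims, k):
            yield combo

for g in cube_groupbys(("item", "city", "year")):
    print(g)
# ('item', 'city', 'year'), ('item', 'city'), ..., ('year',), ()
```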
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• not suitable for high cardinality domains

A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high
cardinality domains as well [Wu, et al. TODS’06]
• Base table
      Cust   Region    Type
      C1     Asia      Retail
      C2     Europe    Dealer
      C3     Asia      Dealer
      C4     America   Retail
      C5     Europe    Dealer
• Index on Region
      RecID   Asia   Europe   America
      1       1      0        0
      2       0      1        0
      3       1      0        0
      4       0      0        1
      5       0      1        0
• Index on Type
      RecID   Retail   Dealer
      1       1        0
      2       0        1
      3       0        1
      4       1        0
      5       0        1
43
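A minimal Python sketch (not from the slides) that builds the two bitmap indexes above and answers "Asia AND Retail" with a single bitwise AND; names are illustrative.

```python
# Minimal sketch: bitmap indexes for the base table above; each distinct value
# gets one bit vector (stored here as a Python int), and queries are bit-ops.

rows = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]

def bitmap_index(rows, col):
    """Map each distinct value of column `col` to an int used as a bit vector."""
    index = {}
    for i, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << i)   # set the i-th bit
    return index

region_idx = bitmap_index(rows, 1)
type_idx = bitmap_index(rows, 2)

hits = region_idx["Asia"] & type_idx["Retail"]       # bit-op is fast
print([rows[i][0] for i in range(len(rows)) if hits >> i & 1])   # ['C1']
```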
Indexing OLAP Data: Join Indices
• Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
• Traditional indices map the values to a list of record ids
• It materializes relational join in JI file and speeds up
relational join
• In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table.
• E.g. fact table: Sales and two dimensions city and product
• A join index on city maintains for each distinct city a
list of R-IDs of the tuples recording the Sales in the city
• Join indices can span multiple dimensions
44
Efficient Processing OLAP Queries
• Determine which operations should be performed on the available cuboids
• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
• Determine which materialized cuboid(s) should be selected for OLAP op.
• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are
4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
• Explore indexing structures and compressed vs. dense array structs in MOLAP
45
OLAP Server Architectures
• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware
• Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional
tools and services
• Greater scalability
• Multidimensional OLAP (MOLAP)
• Sparse array-based multidimensional storage engine
• Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
• Flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers (e.g., Redbricks)
• Specialized support for SQL queries over star/snowflake schemas
46
Attribute-Oriented Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data or particular measures
• How is it done?
• Collect the task-relevant data (initial relation) using a relational database
query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts
• Interaction with users for knowledge presentation
47
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate students in the University
database
• Step 1. Fetch relevant set of data using an SQL statement, e.g.,
Select * (i.e., name, gender, major, birth_place, birth_date,
residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
• Step 2. Perform attribute-oriented induction
• Step 3. Present results in generalized relation, cross-tab, or rule forms
48
Class Characterization: An Example
• Initial relation (task-relevant data):
      Name             Gender   Major     Birth-Place               Birth_date   Residence                  Phone #    GPA
      Jim Woodman      M        CS        Vancouver, BC, Canada     8-12-76      3511 Main St., Richmond    687-4598   3.67
      Scott Lachance   M        CS        Montreal, Que, Canada     28-7-75      345 1st Ave., Richmond     253-9106   3.70
      Laura Lee        F        Physics   Seattle, WA, USA          25-8-70      125 Austin Ave., Burnaby   420-5232   3.83
      …                …        …         …                         …            …                          …          …
• Generalization plan: Name and Phone # removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place generalized to Country (Birth_region); Birth_date generalized to Age_range; Residence generalized to City; GPA generalized to {Excl, VG, …}
• Prime generalized relation:
      Gender   Major     Birth_region   Age_range   Residence   GPA         Count
      M        Science   Canada         20-25       Richmond    Very-good   16
      F        Science   Foreign        25-30       Burnaby     Excellent   22
      …        …         …              …           …           …           …
• Cross-tabulation (Gender vs. Birth_Region):
                 Canada   Foreign   Total
      M          16       14        30
      F          10       22        32
      Total      26       36        62
49
Basic Principles of Attribute-Oriented Induction
• Data focusing: task-relevant data, including dimensions, and the result is the initial
relation
• Attribute-removal: remove attribute A if there is a large set of distinct values for A
but (1) there is no generalization operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes
• Attribute-generalization: If there is a large set of distinct values for A, and there
exists a set of generalization operators on A, then select an operator and generalize
A
• Attribute-threshold control: typical 2-8, specified/default
• Generalized relation threshold control: control the final relation/rule size
50
Attribute-Oriented Induction: Basic Algorithm
• InitialRel: Query processing of task-relevant data, deriving the initial relation.
• PreGen: Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal? or how high to
generalize?
• PrimeGen: Based on the PreGen plan, perform generalization to the right level to
derive a “prime generalized relation”, accumulating the counts.
• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping
into rules, cross tabs, visualization presentations.
51
Presentation of Generalized Results
• Generalized relation:
• Relations where some or all attributes are generalized, with counts or other aggregation values
accumulated.
• Cross tabulation:
• Mapping results into cross tabulation form (similar to contingency tables).
• Visualization techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
• Mapping generalized result into characteristic rules with quantitative information associated
with it, e.g.,
      grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
52
Mining Class Comparisons

• Comparison: comparing two or more classes
• Method:
• Partition the set of relevant data into the target class and the contrasting class(es)
• Generalize both classes to the same high-level concepts
• Compare tuples with the same high-level descriptions
• Present for every tuple its description and two measures
• support – distribution within a single class
• comparison – distribution between classes
• Highlight the tuples with strong discriminant features
• Relevance Analysis:
• Find attributes (features) which best distinguish different classes
53
Concept Description vs. Cube-Based OLAP
• Similarity:
• Data generalization
• Presentation of data summarization at multiple levels of abstraction
• Interactive drilling, pivoting, slicing and dicing
• Differences:
• OLAP has systematic preprocessing, is query independent, and can drill down to
rather low levels
• AOI has automated desired level allocation, and may perform dimension
relevance analysis/ranking when there are many relevant dimensions
• AOI can work on data which are not in relational form
54
Required Reading
1. Data Mining: Concepts and Techniques, chapter 4: Data
Warehousing and Online Analytical Processing
Recommended Reading
1. W. H. Inmon. 2002. Building the Data Warehouse, 3rd Edition (3rd. ed.). John Wiley
& Sons, Inc., USA.
2. Data Warehouse Tutorial for Beginners: Learn in 7 Days
https://www.guru99.com/data-warehousing-tutorial.html
55
This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)
Thank You
56
Saudi Electronic University
College of Computing and Informatics
Bachelor of Science in Information Technology
IT446
Data Mining and Data Warehousing
2
IT446
Data Mining and Data Warehousing
Chapter 6 (Mining Frequent Patterns, Associations, and
Correlations)
Week 6
3
Week Learning Outcomes
▪ Describe types of frequent itemsets.
▪ Explain the process of generating association rules.
▪ Employ Frequent itemsets algorithms.
4
Chapter 6: Mining Frequent Patterns, Associations, and Correlations
▪ Basic Concepts of Mining Frequent Patterns
▪ Frequent Itemset Mining Methods
▪ Which Patterns Are Interesting?—Pattern Evaluation
Methods
5
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
• What products were often purchased together?— Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream)
analysis, and DNA sequence analysis
6
Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
• Classification: discriminative, frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression: fascicles
• Broad applications
7
Basic Concepts: Association Rules
• Transaction database:
      Tid   Items bought
      10    Beer, Nuts, Diaper
      20    Beer, Coffee, Diaper
      30    Beer, Diaper, Eggs
      40    Nuts, Eggs, Milk
      50    Nuts, Coffee, Diaper, Eggs, Milk
• Find all the rules X → Y with minimum support and confidence
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
• Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules (among many more):
      Beer → Diaper (support 60%, confidence 100%)
      Diaper → Beer (support 60%, confidence 75%)
8
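A minimal Python sketch (not from the slides) that recomputes the support and confidence of the two rules above directly from the five example transactions:

```python
# Minimal sketch: support and confidence from the example transaction database.

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%
```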
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
  C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
• Closed pattern is a lossless compression of frequent patterns
• Reducing the # of patterns and rules
9
Closed Patterns and Max-Patterns
• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
• Min_sup = 1
• What is the set of closed itemsets?
• <a1, …, a100>: 1
• <a1, …, a50>: 2
• What is the set of max-patterns?
• <a1, …, a100>: 1
• What is the set of all patterns?
• All nonempty subsets of {a1, …, a100} — far too many to enumerate!
10
Computational Complexity of Frequent Itemset Mining
• How many itemsets may potentially be generated in the worst case?
• The number of frequent itemsets to be generated is sensitive to the minsup threshold
• When minsup is low, there exist potentially an exponential number of frequent itemsets
• The worst case: M^N, where M: # distinct items and N: max length of transactions
• The worst-case complexity vs. the expected probability
• Ex. Suppose Walmart has 10^4 kinds of products
• The chance to pick up one product: 10^-4
• The chance to pick up a particular set of 10 products: ~10^-40
• What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9
transactions?
11
Scalable Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data Format
12
The Downward Closure Property and Scalable Mining
Methods
• The downward closure property of frequent patterns
• Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant@VLDB’94)
• Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
• Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
13
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested! (Agrawal & Srikant @VLDB’94,
Mannila, et al. @ KDD’ 94)
• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be generated
14
The Apriori Algorithm—An Example
• Database TDB (min_sup = 2):
      Tid   Items
      10    A, C, D
      20    B, C, E
      30    A, B, C, E
      40    B, E
• 1st scan — C1 (candidate 1-itemsets with support counts):
      {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
• L1 (frequent 1-itemsets):
      {A}: 2, {B}: 3, {C}: 3, {E}: 3
• C2 (candidate 2-itemsets generated from L1):
      {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
• 2nd scan — C2 with support counts:
      {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
• L2 (frequent 2-itemsets):
      {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
• C3 (candidate 3-itemsets): {B, C, E}
• 3rd scan — L3 (frequent 3-itemsets): {B, C, E}: 2
15
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
16
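The pseudo-code above translates fairly directly into Python. The following is a minimal sketch (not the textbook's reference implementation); candidate generation is done by self-joining Lk and pruning, as described on the next slide, and the sample database matches the worked example.

```python
# Minimal Apriori sketch: frequent itemsets via level-wise candidate
# generation (self-join + prune) and support counting.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]

    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {c: s for c, s in count(items).items() if s >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # self-join: union pairs of frequent k-itemsets into (k+1)-candidates
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c: s for c, s in count(cands).items() if s >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
# Ends with {'B', 'C', 'E'} 2, matching L3 in the worked example above.
```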
Implementation of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
17
How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
• The total number of candidates can be very huge
• One transaction may contain many candidates
• Method:
• Candidate itemsets are stored in a hash-tree
• Leaf node of hash-tree contains a list of itemsets and counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a transaction
18
Counting Supports of Candidates Using Hash Tree
• Figure: a hash tree for candidate 3-itemsets, using the subset function's hash buckets 1,4,7 / 2,5,8 / 3,6,9 at each interior node
• Leaf nodes hold candidate itemsets, e.g., {2,3,4}, {5,6,7}, {1,4,5}, {1,3,6}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {3,4,5}, {3,5,6}, {3,5,7}, {6,8,9}, {3,6,7}, {3,6,8}
• The transaction {1, 2, 3, 5, 6} is decomposed recursively (1 + 2356, 12 + 356, 13 + 56, …) to locate all candidates it contains
19
Candidate Generation: An SQL Implementation
• SQL implementation of candidate generation
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
      insert into Ck
      select p.item1, p.item2, …, p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
      forall itemsets c in Ck do
          forall (k-1)-subsets s of c do
              if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation
  [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98]
20
Further Improvement of the Apriori Method
• Major computational challenges
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink the number of candidates
• Facilitate support counting of candidates
21
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database and find local frequent patterns
• Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
• Figure: DB is split into partitions DB1 + DB2 + … + DBk = DB; an itemset i with sup_m(i) < σ·|DB_m| in every partition cannot satisfy sup(i) ≥ σ·|DB| globally
22
DHP: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Candidates: a, b, c, d, e; frequent 1-itemsets: a, b, d, e
• Hash table entries (bucket of itemsets → count), e.g., {ab, ad, ae}: 35; {bd, be, de}: 88; …; {yz, qs, wt}: 102
• ab is not a candidate 2-itemset if the count of the bucket holding {ab, ad, ae} is below the support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
23
Sampling for Frequent Patterns
• Select a sample of the original database, mine frequent patterns within the sample using Apriori
• Scan the database once to verify frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
• Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. In VLDB’96
24
DIC: Reduce Number of Scans
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
• Figure: the itemset lattice over {A, B, C, D} (1-itemsets, 2-itemsets, …, ABCD); unlike Apriori, DIC starts counting longer itemsets while it is still scanning transactions
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
25
Construct FP-tree from a Transaction Database
• min_support = 3
      TID   Items bought                   (ordered) frequent items
      100   {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
      200   {a, b, c, f, l, m, o}          {f, c, a, b, m}
      300   {b, f, h, j, o, w}             {f, b}
      400   {b, c, k, s, p}                {c, b, p}
      500   {a, f, c, e, l, p, m, n}       {f, c, a, m, p}
• Steps:
      1. Scan DB once, find frequent 1-itemsets (single-item patterns)
      2. Sort frequent items in frequency-descending order, the f-list; here F-list = f-c-a-b-m-p
         (header table: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3)
      3. Scan DB again, construct the FP-tree by inserting each transaction's ordered frequent items
• Figure: the resulting FP-tree rooted at {}: f:4 has children c:3 and b:1; c:3 → a:3; a:3 has children m:2 (→ p:2) and b:1 (→ m:1); a second root branch c:1 → b:1 → p:1; the header table links all nodes of each item
26
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to the f-list
• F-list = f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
• …
• Patterns having c but no a nor b, m, p
• Pattern f
• Completeness and non-redundancy
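As a companion to the construction steps above, here is a compact Python sketch (not the textbook's code) that builds an FP-tree for the five example transactions; the class and function names are my own, and ties in item frequency may order f-list items slightly differently.

```python
# Minimal FP-tree construction sketch: find frequent items, order each
# transaction by the f-list, and insert it as a path into a prefix tree.
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fptree(transactions, min_support):
    counts = Counter(i for t in transactions for i in t)
    flist = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None)
    for t in transactions:
        # keep only frequent items, ordered by descending global frequency
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, flist

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

tdb = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, flist = build_fptree(tdb, min_support=3)
print("F-list:", flist)   # e.g., ['f', 'c', 'a', 'b', 'm', 'p'] (ties may reorder)
show(root)                # e.g., f:4 -> c:3 -> a:3 -> {m:2 -> p:2, b:1 -> m:1}, f:4 -> b:1, c:1 -> b:1 -> p:1
```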
