Learn Data Science: Data Preprocessing in Data Mining & Machine Learning | Part 1
Data Preprocessing refers to the steps applied to make data more suitable for data mining. The steps used for Data Preprocessing usually fall into two categories:
Selecting the data objects and attributes for the analysis.
Creating or changing the attributes.
In this discussion we are going to talk about the following approaches of Data Preprocessing:
Aggregation - Part 1
Sampling - Part 1
Dimensionality Reduction - Part 1
Feature Subset Selection - Part 1 & 2
Feature Creation - Part 2
Discretization and Binarization - Part 2
Variable Transformation - Part 2
What is Aggregation?
→ In simple terms, aggregation refers to combining two or more attributes (or objects) into a single attribute (or object).
Aggregation serves the following purposes:
→ Data Reduction: Reduce the number of objects or attributes. This results in smaller data sets that require less memory and processing time; consequently, aggregation may permit the use of more expensive data mining algorithms.
→ Change of Scale: Aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view. For example,
Cities aggregated into regions, states, countries etc.
Days aggregated into weeks, months and years.
→ More “Stable” Data: Aggregated Data tends to have less variability.
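As a brief sketch of aggregation as a change of scale (the data and column names here are invented for illustration), daily records can be rolled up into monthly ones with pandas:

```python
import pandas as pd

# 90 daily sales records (illustrative values: 0, 1, 2, ...).
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregation: 90 daily objects collapse into 3 monthly objects.
monthly = daily.resample("MS", on="date")["sales"].sum()
print(monthly)
```

Note how the result is both smaller (data reduction) and at a higher-level scale; monthly totals also vary less, relative to their magnitude, than the daily values.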
What is Sampling?
→ Sampling is a commonly used approach for selecting a subset of the data objects to be analysed.
→ The key aspect of sampling is to use a sample that is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data. If the mean (average) of the data objects is the property of interest, then a sample is representative if it has a mean that is close to that of the original data.
Types of Sampling
Simple Random Sampling:
→ There is an equal probability of selecting any particular item.
→ Sampling without replacement: As each item is selected, it is removed from the population.
→ Sampling with replacement: Objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked more than once.
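A minimal sketch of both variants with pandas (the DataFrame contents are made up for illustration):

```python
import pandas as pd

population = pd.DataFrame({"value": range(100)})

# Without replacement: each object can appear at most once in the sample.
without = population.sample(n=10, replace=False, random_state=0)

# With replacement: the same object may be picked more than once.
with_repl = population.sample(n=10, replace=True, random_state=0)

print(without["value"].is_unique)  # always True without replacement
```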
Stratified sampling: Split the data into several partitions, then draw random samples from each partition.
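A hedged sketch of stratified sampling with pandas, using an invented imbalanced grouping column; drawing the same fraction from each partition preserves the group proportions in the sample:

```python
import pandas as pd

# An imbalanced population: 80 objects in group "a", 20 in group "b".
df = pd.DataFrame({
    "group": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})

# Draw 10% from each stratum, keeping the 80/20 ratio intact.
sample = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)
print(sample["group"].value_counts())
```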
Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained.
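One way to sketch progressive sampling (my own simplified stopping rule, not from the article): keep doubling the sample size until the sample mean, the property of interest here, stabilises between rounds.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)

size, prev_mean = 100, None
while True:
    # Draw a fresh sample of the current size and compute its mean.
    mean = rng.choice(population, size=size, replace=False).mean()
    # Stop once consecutive sample means agree closely (a stand-in
    # for "a sample of sufficient size has been obtained").
    if prev_mean is not None and abs(mean - prev_mean) < 0.1:
        break
    prev_mean = mean
    size = min(size * 2, len(population))

print(size)
```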
What is Dimensionality Reduction?
→ The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes.
Purpose:
→ Avoid the curse of dimensionality. To learn more about this, visit my earlier article explaining it in detail.
→ Reduce amount of time and memory required by data mining algorithms.
→ Allow data to be more easily visualised.
→ May help to eliminate irrelevant features or reduce noise.
Techniques:
→ Principal Components Analysis (PCA)
→ Singular Value Decomposition
These techniques are too vast to cover fully in this post, but you can learn more about them online. I have added YouTube links to both, in case you want to learn from those videos.
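To give a flavour of how the two techniques relate, here is a minimal NumPy-only sketch (the data is synthetic and the variable names are mine): PCA can be computed by centring the data and taking its SVD, then projecting onto the leading components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated attribute

Xc = X - X.mean(axis=0)                  # centre each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

reduced = Xc @ Vt[:2].T                  # project onto top 2 components
explained = S**2 / (S**2).sum()          # variance ratio per component
print(reduced.shape)                     # (200, 2)
```

Because one attribute nearly duplicates another, the first principal component captures most of the variance, which is exactly why the reduced representation loses little information.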
What is Feature Subset Selection?
→ It is another way to reduce dimensionality of data by only using a subset of the features available. While it might seem that such an approach would lose information, this is not the case if redundant and irrelevant features are present.
Redundant features:
→ Duplicate much or all of the information contained in one or more other attributes. Example: purchase price of a product and the amount of sales tax paid.
Irrelevant features:
→ Contain no information that is useful for the data mining task at hand. Example: students’ ID is often irrelevant to the task of predicting students’ GPA.
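One simple filter-style sketch of removing redundant features (my example, loosely mirroring the price/sales-tax case above; the 0.95 threshold and column names are assumptions): drop any attribute that is almost perfectly correlated with an earlier one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": price * 0.08,       # redundant: duplicates "price"
    "student_id": np.arange(200),    # irrelevant to most prediction tasks
    "rating": rng.uniform(1, 5, size=200),
})

# Upper triangle of the absolute correlation matrix (each pair once).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag columns that near-duplicate an earlier column.
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
kept = df.drop(columns=redundant)
print(redundant)  # ['sales_tax']
```

Note this only catches redundant features; irrelevant ones (like `student_id` here) need a measure tied to the task, such as correlation with the target.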
