Feature Learning , Deep Learning and Machine learning

Machine learning is a very successful technology but applying it today often requires spending substantial effort hand-designing features. This is true for applications in vision, audio, and text

Any machine learning algorithm performs as good as provided features are. Let’s understand this by using image classification example. When we try to classify an image into “motorcycle” and “Not Motorcycle”.

Algorithm needs features so that it can draw information from them.

Earlier Several researchers have spent decades to hand design these features . Below image shows sample process of creating features.


fig 1 – Feature vector creation


In the case of images, audio, and text , coming up with features is difficult , time-consuming and it requires expert knowledge. When we work on applications of learning, we spend a lot of time in tuning these features.

So this is where usual machine learning fails us. What about if we can automate this feature learning task instead of hand -engineering them ?

Self-taught learning/ Unsupervised Feature Learning

In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from ”unlabeled” data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former—for example, by downloading random unlabeled images/audio clips/text documents off the internet—and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input.

Deep Learning

“Deep Learning” algorithms  can automatically learn feature representations (often from unlabeled data) thus avoiding a lot of time-consuming engineering. These algorithms are based on building massive artificial neural networks that were loosely inspired by cortical (brain) computations. Below image shows comparison of deep learning feature discovery process among other algorithms.


fig 2 –  Deep Learning Feature creation


To simulate the brain’s visual processing, sparse coding was developed to explain early visual processing in the brain(edge – detection).

Input: Images x (1) , x (2) , …, x (m) (each in Rn* n)

Learn: Dictionary of bases f1 , f2 , …, fk (also Rn*n), so that each input x can be approximately decomposed as:

x  =  å aj fj                           s.t. aj ’s are mostly zero (“sparse”)



Sparse coding algorithm automatically learns to represent an image in terms of the edges that appear in it. It gives a more succinct, higher-level representation than the raw pixels.

Lets understand this by using  below example.



fig 3 – Face detection using deep  learning


In the first layer, deep learning algorithm uses sparse coding and express images in succinct, higher-level representation. Rectangles are shown in each layer look like the same size but higher level up features look at the bigger version of the image. In 2nd layer ,  we can say that some neuron detects an eye which looks like that, similarly, some neuron found ear feature too. And in highest layer, neurons find a way to detect faces.

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need  deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas.

Conclusion :

Usual Machine learning is simply a curve fitting. It is capable of producing great results but we spent a lot of time in feature discovery.  While deep learning is closer to AI. It automatically learns features instead to creating it manually. We might miss some important features while creating them but deep learning tries to learn higher level features by itself.


we can use Theono(Python)  for implementation of deep learning.

Theano is a Python library that lets you to define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.

To know more about Theono follow this link – http://deeplearning.net/software/theano/introduction.html

References :

fig 1- http://www.cs.stanford.edu/people/ang//slides/DeepLearning-Mar2013.pptx

fig 2 – http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/

fig 3 – http://www.cs.stanford.edu/people/ang//slides/DeepLearning-Mar2013.pptx



A brief introduction to Outliers and Outlier Removal methods

What is an Outlier?

 Simply speaking, Outlier is an observation that appears far away and diverges from an overall   pattern in a sample.

Let’s take an example, we do customer profiling and find out that the average annual income of customers is $0.8 million. But, there are two customers having annual income of $4 and $4.2 million. These two customers annual income is much higher than rest of the population. These two observations will be seen as Outliers.

What are the types of Outliers?

Outlier can be of two types: Univariate and Multivariate. Above, we have discussed the example of univariate outlier. These outliers can be found when we look at distribution of a single variable. Multi-variate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multi-dimensions.

Let us understand this with an example. Let us say we understand the relationship between height and weight. Below, we have univariate and bivariate distribution for Height, Weight. Take a look at the box plot. We do not have any outlier (above and below 1.5*IQR, most common method). Now look at the scatter plot. Here, we have two values below and one above the average in a specific segment of weight and height.

What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out the reason of having these outliers. The method to deal with them would then depend on the reason of their occurrence. Causes of outliers can be classified in two broad categories:

  1. Artificial (Error) / Non-natural
  2. Natural.

Let’s understand various types of outliers in more detail:

  • Data Entry Errors:- Human errors such as errors caused during data collection, recording, or entry can cause outliers in data. For example: Annual income of a customer is $100,000. Accidentally, the data entry operator puts an additional zero in the figure. Now the income becomes $1,000,000 which is 10 times higher. Evidently, this will be the outlier value when compared with rest of the population.
  • Measurement Error: It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine will be higher / lower than the rest of people in the group. The weights measured on faulty machine can lead to outliers.
  • Experimental Error: Another cause of outliers is experimental error. For example: In a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call which caused him to start late. Hence, this caused the runner’s run time to be more than other runners. His total run time can be an outlier.
  • Intentional Outlier: This is commonly found in self-reported measures that involves sensitive data. For example: Teens would typically under report the amount of alcohol that they consume. Only a fraction of them would report actual value. Here actual values might look like outliers because rest of the teens is under reporting the consumption.
  • Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
  • Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
  • Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: In my last assignment with one of the renowned insurance company, I noticed that the performance of top 50 financial advisors was far higher than rest of the population. Surprisingly, it was not due to any error. Hence, whenever we perform any data mining activity with advisors, we used to treat this segment separately.

What is the Impact of Outliers on a dataset?

Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

  • It increases the error variance and reduces the power of statistical tests
  • If the outliers are non-randomly distributed, they can decrease normality
  • They can bias or influence estimates that may be of substantive interest
  • They can also impact the basic assumption of Regression, ANOVA and other statistical model assumptions.

To understand the impact deeply, let’s take an example to check what happens to a data set with and without outliers in the data set.

As you can see, data set with outliers has significantly different mean and standard deviation. In the first scenario, we will say that average is 5.45. But with the outlier, average soars to 30. This would change the estimate completely.

How to detect Outliers?

Most commonly used method to detect outliers is visualization. We use various visualization methods, like Box-plot, Histogram, Scatter Plot (above, we have used box plot and scatter plot for visualization). Some analysts also various thumb rules to detect outliers. Some of them are:

  • Any value, which is beyond the range of -1.5 x IQR to 1.5 x IQR
  • Use capping methods. Any value which out of range of 5th and 95th percentile can be considered as outlier
  • Data points, three or more standard deviation away from mean are considered outlier
  • Outlier detection is merely a special case of the examination of data for influential data points and it also depends on the business understanding
  • Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
  • In SAS, we can use PROC Univariate, PROC SGPLOT. To identify outliers and influential observation, we also look at statistical measure like STUDENT, COOKD, RSTUDENT and others.

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods of missing values like deleting observations, transforming them, binning them, treat them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if it is due to data entry error, data processing error or outlier observations are very small in numbers. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. Decision Tree algorithm allows to deal with outliers well due to binning of variable. We can also use the process of assigning weights to different observations.

Imputing: Like imputation of missing values, we can also impute outliers. We can use mean, median, mode imputation methods. Before imputing values, we should analyse if it is natural outlier or artificial. If it is artificial, we can go with imputing values. We can also use statistical model to predict values of outlier observation and after that we can impute it with predicted values.

Treat separately: If there are significant number of outliers, we should treat them separately in the statistical model. One of the approach is to treat both groups as two different groups and build individual model for both groups and then combine the output.

Outlier treatment methods:

1.Box plot:


histogramconsists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.


In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars showing the number of eruptions classified according to their durations.


Find the histogram of the eruption durations in faithful.


We apply the hist function to produce the histogram of the eruptions variable.

> duration = faithful$eruptions 
> hist(duration,    # apply the hist function 
+   right=FALSE)    # intervals closed on the left

R detection methods in R

1. Using  “outliers” package.
outlier_tf = outlier(data_full$target column,logical=TRUE)
#This gives an array with all values False, except for the outlier (as defined in the package documentation “Finds value with largest difference between it and sample mean, which can be an outlier”).  That value is returned as True. 
find_outlier = which(outlier_tf==TRUE,arr.ind=TRUE)
#This finds the location of the outlier by finding that “True” value within the “outlier_tf” array. 
data_new = data_full[-find_outlier,]
#This creates a new dataset based on the old data, removing the one row that contains the outlier .

2. using DMwR Package       

Quartile Method

The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set

The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Inter Quartile Range(IQR) refers to the difference between third and first quartile

To find the oultier n,

n> Q3+1.5*IQR


n<Q1 -1.5*IQR

The same can be done in R using Box Whisker Plots


Outliers are found based on their local neighbourhoods,more specifically on the local densities.

To calculate the local outlier factor scores


k is the number of neighbours

To plot density plots


To find the  data points with the highest outlier scores(greater the score greater the chance of the data point being an outlier)

For example the top 5 scores

outliers <- order(outlier.scores, decreasing=T)[1:5]