Transformers: Why everyone is talking about it?

Transformers!! yay 😀

Update 11th Dec 2021: see what Andrej Karpathy is saying about transformers

I had been planning to read about the Transformers for a while, but I had to wait until I had a few more days to get my hands on it. This is a seminal concept in NLP and I am going to try to explain it in laymen terms as possible. I was overwhelmed with the level of details in transformers whenever I tried reading about it. So I will give you the building blocks of transformers so that you don’t hesitate to read further.

Why Transformers?

Its important to understand why Transformers are so popular. Recurrent Neural Networks,long short-term memory and gated RNNs are the popularly approaches used for Sequence Modelling tasks such as machine translation and language modeling. However, RNN/CNN handle sequences word-by-word in a sequential fashion. This sequentiality is an obstacle toward parallelization of the process. Moreover, when such sequences are too long, the model is prone to forgetting the content of distant positions in sequence or mix it with following positions’ content.
Recent works have achieved significant improvements in computational efficiency and model performance through factorization tricks and conditional computation. But they are not enough to eliminate the fundamental constraint of sequential computation. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance inthe input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.But, the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Now we know why we are using Transformers.

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution

The main characteristics are:
Non-Sequential: sentences are processed as a whole rather than word by word.

(Masked)Self Attention: this is the newly introduced ‘unit’ used to compute similarity scores between words in a sentence.

Positional encoding: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information related to a specific position of a token in a sentence.

Layer Normalization: a normalization technique that is used to stabilize the variance of activations in a layer. Neural net layers work best when input vectors have a uniform mean and std in each dimension.
Self-attention is a key component of the TransformersLet us distil how it works.


Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Say the following sentence is an input sentence we want to translate:
”The animal didn’t cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

So far, we have understood why transformers are being used and what are main components. lets talk about model architecture.

Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequenceof continuous representations z = (z1, …, zn). Given z, the decoder then generates an outputsequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.The Transformer follows this overall architecture using stacked self-attention and point-wise, fullyconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,respectively.

Transformers Architecture


The idea of this post was to explain the what and why of Transformers and how it works. This was the seminal concept in NLP but it is so generic that it is expanding into other domains like vision as well. It allows you cross-pollinate your ideas even while working on completely different domains. I am pretty excited about the future of Transformers. For the longest time, I did not know why we are using transformers and why it outperforms other techniques. Once I understood it, it became go-to technique for NLP tasks. I hope it helped you to understand the concept and how it can be used in your own projects.
I highly recommend you to read the below sources as I have skipped a lot of details not to scare you away from transformers like I was scared whenever I tried to understand transformers.  


[Attention all you need-*The seminal paper on transformers*](
[Transformers- Highly recommended for deep understanding](
[Transformers vs RNN/LSTM](

A concise overview of cost function optimizers!

Photo by Jon Tyson on Unsplash

Optimizers are needed to find the optimal solution for the given task. Optimizers associate themselves with cost function and model parameters together by updating the model. i.e. when you want to identify weights that minimize your mean squared error in linear regression, you need to use some function to find parameters such that mean squared error is minimum, this function is called optimizer. So you use the optimizer function to reach global minima with respect to the cost function.

Types of optimizers:

1. Gradient Descent

2. Momentum

3. Nesterov Momentum

4. Adagrad

5. RMSProp

6. Adam

Gradient Descent

Gradient Descent, is one of the simplest optimization algorithms. It uses just one static learning rate for all parameters during the entire training phase.

The static learning rate does not imply an equal update after every minibatch. As the optimizers approach an (sub)optimal value, their gradients start to decrease.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Types of gradient descent method:

a. Batch Gradient Descent

b. Stochastic Gradient Descent

c. Mini-batch gradient Descent


1. It does not guarantee convergence and slower than other newer methods.

2. It can stuck in local minima.

3. Choosing a right learning rate is difficult.


This method has literal meaning. Its a method that helps accelerate SGD in the relavant direction and dampen oscillations.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation. The momentum helps to avoid local minima.

Read more about momentum [here](

Nesterov Momentum

It is same as Momentum but with one additional information of notional momentum.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.


Adagrad is a gradient based algorithm that adapts the learning rate to the parameters. In momentum based optimizers we adapted our updates to the slope of error function and speed up SGD. While adagrad updates based on learning rate.

It adapts the learning rate to the parameters, performing smaller updates

(i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.


RMSProp was developed to resolve the weakness of adagrad’s radically diminishing learning rates.

To combat that problem, RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, RMSprop behaves like moving average. RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients.


Adam is the latest state of the art of first order optimization method that’s widely used in the real world. It’s a modification of RMSprop. Loosely speaking, Adam is RMSprop with momentum. So, Adam tries to combine the best of both world of momentum and adaptive learning rate.

Which optimizer to use?

So, which optimizer should you now use? If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates.Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, and Adam are very similar algorithms that do well in similar circumstances. its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.


In this post we looked at the optimization algorithms beyond SGD. We looked at two classes of algorithms: momentum based and adaptive learning rate methods.

Further Reading

Deep Learning Explained in laymen terms

Deep learning is pattern recognition via so-called neural networks. Neural networks are a set of algorithms, modeled after the human brain. They are sensors: a form of machine perception. Deep learning is a name for a certain type of stacked neural network composed of several node layers. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer.

Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data is passed in a multistep process of pattern recognition. Three or more including input and output is deep learning. Anything less is simply machine learning.


Deep learning is motivated by intuition, theoretical arguments from circuit theory, empirical results, and current knowledge of neuroscience.


  • The main concept in deep learning algorithms is automating the extraction of representations (abstractions) from the data.


  • A key concept underlying Deep Learning methods is distributed representations of the data, in which a large number of possible configurations of the abstract features of the input data is feasible, allowing for a compact representation of each sample and leading to a richer generalization.


  • Deep learning algorithms lead to abstract representations because more abstract representations are often constructed based on less abstract ones.An important advantage of more abstract representations is that they can be invariant to the local changes in the input data.
  • Deep learning algorithms are actually Deep architectures of consecutive layers.


  • Stacking up the nonlinear transformation layers is the basic idea in deep learning algorithms.


  • It is important to note that the transformations in the layers of deep architecture are non-linear transformations which try to extract underlying explanatory factors in the data.


  • The final representation of data constructed by the deep learning algorithm (output of the final layer) provides useful information from the data which can be used as features in building classifiers, or even can be used for data indexing and other applications which are more efficient when using abstract representations of data rather than high dimensional sensory data.



Let’s understand in layman’s terms-

Imagine you’re building a shopping recommendation engine, and you discover that if an item is trending and a user has browsed the category of that item in the last day, they are very likely to buy the trending item.

These two variables are so accurate together that you can combine them into a new single variable, or feature (Call it “interested_in_trending_category”, for example).

Finding connections between variables and packaging them into a new single variable is called feature engineering
Deep learning is automated feature engineering.


auto-feature deep learning




Setting up a GPU based Deep Learning Machine

Using GPU for deep learning has seen a tremendous performance. It has been reported that execution time using GPU is 10x -50x times faster than CPU-based deep learning and It is also a lot cheaper than CPU-based system. You can see this below in the picture.


I was curious to check deep learning performance on my laptop which has GeForce GT 940M GPU.
Today I will walk you through how to set up GPU based deep learning machine to make use of GPUs. I have used Tensorflow for deep learning on a windows system. Using GPU in windows system is really a pain. You can’t get it to work if you don’t follow correct steps. But if you follow the steps it will be very easy to set up Tensorflow with GPU for windows.


  • Python 3.5 – Currently Tensorflow on windows doesn’t support python 2.7.
  • nvidia cuda GPU


  • CUDA toolkit
    Use this link to install cuda-
    According to your windows version, you can install this toolkit.
    Recommended version: Cuda Toolkit 8.0
  • cuDNN
    Use this link to install cuDNN -
    You need to register to install this. You need to choose cuDNN v5.1. I have tried latest version but it didn’t work out.After downloading, You need to copy and replace these filescuDNN
    into this location C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

    Now you also need to set path for environment variables. Check below snapshots and make the required changes. If they are not there you have to do it manually.

  • Python
    Install using anaconda . Use whatever anaconda python 2.7 or 3.5 you want to use for your daily tasks because we will create a separate environment for python 3.5 .
  • Tensorflow with GPU
    Create a virtual environment for tensorflow

    conda create --name tensorflow-gpu python=3.5  

    Then activate this virtual environment:

    activate tensorflow-gpu  

    And finally, install TensorFlow with GPU support:

    pip install tensorflow-gpu  

    Test the TensorFlow installation

    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    >>> print(
    Hello, TensorFlow!  

    If you run into any error check below link-
    Any other link might lead you to different problems.

  • Let’s play with Tensorflow GPU 

    Let’s check performance on MNIST data using convolution neural network.
    download the code- lets run it and check its performance

  • GPU based Tensorflowtensorflow_gpu

We can see  each step is taking roughly around ~40 ms. Now we want to see if this gpu performanvce worth or not.

  • CPU Tensorflowtensorflow_cpu

Let’s take a look at CPU performance. Really? Each step is taking ~370 ms . Wow what a performance!! Tensorflow with GPU is 10x faster than Tensorflow with CPU.

Next steps:

Further, You can install Keras library to do more advance things in deep learning. Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation.Keras uses Tensorflow as backend. Keras also work seamlessly on CPU and GPU. Follow below commands. Install jupyter notebook too if you love working with notebooks.

conda install jupyter
conda install scipy pandas 
conda install mingw libpython (theano dependencies) 
conda install theano 
pip install keras

In case of any trouble, leave comments and let me know your thoughts about this articles.

Happy hunting with deep learning !!

How to improve performance of Neural Networks


Neural networks have been the most promising field of research for quite some time. Recently they have picked up more pace. In earlier days of neural networks, it could only implement single hidden layers and still we have seen better results.
Deep learning methods are becoming exponentially more important due to their demonstrated success at tackling complex learning problems. At the same time, increasing access to high-performance computing resources and state-of-the-art open-source libraries are making it more and more feasible for enterprises, small firms, and individuals to use these methods.

Neural network models have become the center of attraction in solving machine learning problems.

Now, What’s the use of knowing something when we can’t apply our knowledge intelligently. There are various problems with neural networks when we implement them and if we don’t know how to deal with them, then so-called “Neural Network” becomes useless.

Some Issues with Neural Network:

  1. Sometimes neural networks fail to converge due to low dimensionality.
  2. Even a small change in weights can lead to significant change in output. sometimes results may be worse.
  3. The gradient may become zero . In this case , weight optimization fails.
  4. Data overfitting.
  5. Time complexity is too high. Sometimes algorithm runs for days even on small data set.
  6. We get the same output for every input when we predict.


So what next!!


One day I sat down(I am not kidding!)  with neural networks to check What can I do for better performance of neural networks.  I have tried and tested various use cases to discover solutions.
Let’s dig deeper now. Now we’ll check out the proven way to improve the performance(Speed and Accuracy both) of neural network models:

1. Increase hidden Layers

we have always been wondering what happens if we can implement more hidden layers!! In theory, it has been established that many of the functions will converge in a higher level of abstraction. So it seems more layers better results

Multiple hidden layers for networks are created using the mlp function in the RSNNS package and neuralnet in the neuralnet package. As far as I know, these are the only neural network functions in R that can create multiple hidden layers(I am not talking about Deep Learning here). All others use a single hidden layer. Let’s start exploring the neural net package first.

I won’t go into the details of the algorithms. You can google it yourself about their training process.  I have used a data set and want to predict Response/Target  variable. Below is a sample code for 4 layers.

R code

     A. Neuralnet Package



multi_net = neuralnet(action_click~ FAL_DAYS_last_visit_index+NoofSMS_30days_index+offer_index+Days_last_SMS_index+camp_catL3_index+Index_weekday , algorithm= ‘rprop+’, data=train, hidden = c(6,9,10,11) ,stepmax=1e9 , err.fct = “ce”   ,linear.output =F)


I have tried several iteration. Below are the confusion matrix of some of  the results


    B. RSNNS Package


a = mlp(train[,2:7], train$action_click, size = c(5,6), maxit = 5000,

initFunc = “Randomize_Weights”, initFuncParams = c(-0.3, 0.3),

learnFunc = “Std_Backpropagation”, learnFuncParams = c(0.2,0),

hiddenActFunc = “Act_Logistic”, shufflePatterns = TRUE, linOut = FALSE )


I have tried several iteration. Below are the confusion matrix of some of  the results.


From my experiment, I have concluded that when you increase layers, it may result in better accuracy but it’s not a thumb rule. You have to just test it with a different number of layers. I have tried several data set with several iterations and it seems neuralnet package performs better than RSNNS.  Always start with single layer then gradually increase if you don’t have performance improvement .


Figure 2 . A multi layered Neural Network


2. Change Activation function

Changing activation function can be a deal breaker for you. I have tested results with sigmoid, tanh and Rectified linear units. Simplest and most successful activation function is rectified linear unit. Mostly we use sigmoid function network.  Compared to sigmoid, the gradients of ReLU does not approach zero when x is very big. ReLU also converges faster than other activation function. You should know how to use these activation function i.e. when you use “tanh” activation function you should categorize your binary classes into “-1” and “1”.  The classes encoded in 0 and 1 , won’t work in tanh activation function.


3. Change Activation function in Output layer

I have experimented with trying a different activation function in output layer than that of in hidden layers. In some cases, results were better so its better to try with different activation function in output neuron.

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression). However, in multi-layered NN, it is generally desirable for the hidden units to have nonlinear activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.

4. Increase number of neurons

If an inadequate number of neurons are used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long, and, worse, the network may overfit the data. When overfitting $ occurs, the network will begin to model random noise in the data. The result is that the model fits the training data extremely well, but it generalizes poorly to new, unseen data. Validation must be used to test for this.

There is no rule of thumb in choosing number of neurons but you can consider this one –

N is number of hidden neurons-

  • N = 2/3 the size of the input layer, plus the size of the output layer.
  • N < twice the size of the input layer


5. Weight initialization

While training neural networks, first-time weights are assigned randomly. Although weight updation does take place, but sometimes neural network can converge in local minima. When we use multilayered architecture, random weights does not perform well. We can supply optimal initial weights. You should try with different random seed to generate different random weights then choose the seed number which works well for your problem.
You can use methods like Adaptive weight initialization, Xavier weight initialization etc  to initialize weights.

The random values of initial synaptic weights generally lead to a big error. So learning is finding a proper value for the synaptic weights, in order to find the minimum value for output error. below figure shows being trapped in local minima in order to find optimal weights-


Figure 3: Local minima problem due to random initialization of weights

6. More data

When We have lots of data , then neural network generalizes well. otherwise, it may overfits data. So it’s better to have more data. Overfitting is a general problem when using neural networks. The amount of data needed to train a neural network is very much problem-dependent. The quality of training data (i.e., how well the available training data represents the problem space) is as important as the quantity (i.e., the number of records, or examples of input-output pairs). The key is to use training data that generally span the problem data space. For relatively small datasets (fewer than 20 input variables, 100 to several thousand records) a minimum of 10 to 40 records (examples) per input variable is recommended for training. For relatively large datasets (more than 20 000 records), the dataset should be sub-sampled to obtain a smaller dataset that contains 30 – 50 records per input variable. In either case, any “extra” records should be used for validating the neural networks produced.

7. Normalizing/Scaling data

Most of the times scaling/normalizing your input data can lead to improvement. There are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs. When NN use gradient descent to optimize parameters , standardizing covariates may speed up convergence (because when you have unscaled covariates, the corresponding parameters may inappropriately dominate the gradient).

8. Change learning algorithm parameters

Try different learning rates (0.01 to 0.9). Also try different momentum parameters, if your algorithm supports it (0.1 to 0.9). Changing learning rate parameter can help us to identify if we are getting stuck in local minima.

The two plots below nicely emphasize the importance of choosing learning rate by illustrating two most common problems with gradient descent:

(i) If the learning rate is too large, gradient descent will overshoot the minima and diverge.

(ii) If the learning rate is too small, the algorithm will require too many epochs to converge and can become trapped in local minima more easily.


Figure 4 : Effect of learning rate parameter values


9. Deep learning for auto feature generation

Machine learning is one of the fastest-growing and most exciting fields out there, and deep learning represents its true bleeding edge. Usual neural networks are not efficient in creating features. Like other machine learning models, Neural networks algorithm’s performance also depends on the quality of features. If we have better features then we would have better accuracy. When we use deep architecture then features are created automatically and every layer refines the features. i.e.


auto-feature deep learning


10. Misc- You can try with a different number of epoch and different random seed. Various parameters like dropout ratio, regularization weight penalties, early stopping etc can be changed while training neural network models.

To improve generalization on small noisy data, you can train multiple neural networks and average their output or you can also take a weighted average. There are various types of neural network model and you should choose according to your problem. i.e. while doing stock prediction you should first try Recurrent Neural network models.


Figure 5 : After dropout, insignificant neurons do not participate in training




Feature Learning , Deep Learning and Machine learning

Machine learning is a very successful technology but applying it today often requires spending substantial effort hand-designing features. This is true for applications in vision, audio, and text

Any machine learning algorithm performs as good as provided features are. Let’s understand this by using image classification example. When we try to classify an image into “motorcycle” and “Not Motorcycle”.

Algorithm needs features so that it can draw information from them.

Earlier Several researchers have spent decades to hand design these features . Below image shows sample process of creating features.


fig 1 – Feature vector creation


In the case of images, audio, and text , coming up with features is difficult , time-consuming and it requires expert knowledge. When we work on applications of learning, we spend a lot of time in tuning these features.

So this is where usual machine learning fails us. What about if we can automate this feature learning task instead of hand -engineering them ?

Self-taught learning/ Unsupervised Feature Learning

In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from ”unlabeled” data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former—for example, by downloading random unlabeled images/audio clips/text documents off the internet—and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input.

Deep Learning

“Deep Learning” algorithms  can automatically learn feature representations (often from unlabeled data) thus avoiding a lot of time-consuming engineering. These algorithms are based on building massive artificial neural networks that were loosely inspired by cortical (brain) computations. Below image shows comparison of deep learning feature discovery process among other algorithms.


fig 2 –  Deep Learning Feature creation


To simulate the brain’s visual processing, sparse coding was developed to explain early visual processing in the brain(edge – detection).

Input: Images x (1) , x (2) , …, x (m) (each in Rn* n)

Learn: Dictionary of bases f1 , f2 , …, fk (also Rn*n), so that each input x can be approximately decomposed as:

x  =  å aj fj                           s.t. aj ’s are mostly zero (“sparse”)



Sparse coding algorithm automatically learns to represent an image in terms of the edges that appear in it. It gives a more succinct, higher-level representation than the raw pixels.

Lets understand this by using  below example.



fig 3 – Face detection using deep  learning


In the first layer, deep learning algorithm uses sparse coding and express images in succinct, higher-level representation. Rectangles are shown in each layer look like the same size but higher level up features look at the bigger version of the image. In 2nd layer ,  we can say that some neuron detects an eye which looks like that, similarly, some neuron found ear feature too. And in highest layer, neurons find a way to detect faces.

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need  deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas.

Conclusion :

Usual Machine learning is simply a curve fitting. It is capable of producing great results but we spent a lot of time in feature discovery.  While deep learning is closer to AI. It automatically learns features instead to creating it manually. We might miss some important features while creating them but deep learning tries to learn higher level features by itself.


we can use Theono(Python)  for implementation of deep learning.

Theano is a Python library that lets you to define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.

To know more about Theono follow this link –

References :

fig 1-

fig 2 –

fig 3 –