MindMap Gallery Deep learning theoretical knowledge
Part of the content is collapsed; a total of 1216 modules are included. Based on Yasuki Saito's two books, "Introduction to Deep Learning: Theory and Implementation Based on Python" and "Advanced Deep Learning: Natural Language Processing" (translator: Lu Yujie). This is the most suitable book for getting started with deep learning that I have ever read. I highly recommend reading it before studying "Hands-On Deep Learning" by Li Mu. The content requires no prior knowledge: it is taught from scratch and can be understood by high school students.
Edited at 2024-02-04 00:57:48
Deep learning theoretical knowledge
introduction
basic concepts
The deep learning problem is a machine learning problem, which refers to summarizing general rules through algorithms from limited examples and applying them to new unknown data.
Unlike traditional machine learning, the models used in deep learning are generally more complex.
The data flow from the original input to the output target passes through multiple linear or nonlinear components. Each component processes information and in turn affects subsequent components.
When we finally get the output, we do not know exactly how much each component contributed to it. This is called the contribution allocation problem.
The contribution allocation problem is also commonly known as the credit assignment problem (CAP).
The contribution allocation problem is a very critical issue, which is related to how to learn the parameters in each component.
At present, the model that best solves the contribution allocation problem is the artificial neural network (ANN).
Neural networks and deep learning are not equivalent. Deep learning can use neural network models or other models (for example, a deep belief network is a probabilistic graphical model).
AI
AI basic concepts
Concept of intelligence
natural intelligence
definition
Refers to the intellectual and behavioral abilities of humans and some animals
natural human intelligence
It is the comprehensive ability of human beings in understanding the objective world that is manifested by thinking processes and mental activities.
Different views and hierarchies of intelligence
View
theory of mind
Intelligence comes from thinking activities
knowledge threshold theory
Intelligence depends on applicable knowledge
evolutionary theory
Intelligence can be achieved by gradual evolution
Hierarchy
Characteristic capabilities included in intelligence
Perception
memory and thinking skills
learning and adaptability
capacity
Artificial intelligence concept
explain
Use artificial methods to achieve intelligence on machines
Study how to construct intelligent machines or systems that simulate and extend human intelligence
Turing test
Basic content of AI research
The subject position of artificial intelligence
The intersection of natural sciences and social sciences
Core: Thinking and Intelligence
Basic subjects: mathematics, thinking science, computer
Interdisciplinary research with brain science and cognitive science
Research on methods and technologies of intelligent simulation
machine perception
Vision
hearing
machine thinking
machine learning
machine behavior
Domain classification
Perception: that is, simulating human perception ability to perceive and process external stimulus information (visual and speech, etc.). Main research areas include speech information processing and computer vision.
Learning: Simulating human learning ability, mainly studying how to learn from examples or interacting with the environment. Main research areas include supervised learning, unsupervised learning and reinforcement learning.
Cognition: Simulates human cognitive abilities. The main research areas include knowledge representation, natural language understanding, reasoning, planning, decision-making, etc.
history
Different schools of AI research
symbolism
Symbolism, also known as logicism, the psychology school or the computer school, analyzes the functions of human intelligence and then realizes these functions on computers.
Basic assumptions
Information can be represented using symbols
Symbols can be manipulated through explicit rules (such as logical operations)
Human cognitive processes can be viewed as symbolic manipulation processes. In the reasoning period and knowledge period of artificial intelligence, the symbolic method is more popular and has achieved a lot of results.
connectionism
Connectionism, also known as the bionic school or the physiological school, is a type of information processing methods and theories in the field of cognitive science.
In the field of cognitive science, human cognitive process can be regarded as an information processing process. Connectionism believes that human cognitive processes are information processing processes in neural networks composed of a large number of simple neurons, rather than symbolic operations.
Therefore, the main structure of the connectionist model is an interconnected network composed of a large number of simple information processing units, which has the characteristics of nonlinearity, distribution, parallelization, local computing and adaptability.
Behaviorism
Behaviorism believes that artificial intelligence stems from cybernetics.
In addition to deep learning, there is currently another exciting technology in the field of machine learning, reinforcement learning.
Let an agent continuously take different actions, change its state, and interact with the environment to obtain different rewards. We only need to design appropriate reward rules, and the agent can learn an appropriate strategy through continuous trial and error.
Neural Networks
brain neural network
Artificial neural networks
The development history of neural networks
Model proposed
The period from 1943 to 1969 was the first climax period of the development of neural networks. During this period, scientists proposed many neuron models and learning rules.
In 1943, psychologist Warren McCulloch and mathematician Walter Pitts first described an idealized artificial neural network and constructed a computing mechanism based on simple logical operations. The neural network model they proposed is called the MP model.
ice age
The period from 1969 to 1983 was the first trough in the development of neural networks. During this period, research on neural networks stagnated and remained at a low ebb for many years.
In 1969, Marvin Minsky published the book "Perceptrons", pointing out two key flaws of neural networks: first, the perceptron cannot handle the XOR problem; second, the computers of that time could not provide the computing power required to process large neural networks.
In 1974, Paul Werbos of Harvard University invented the backpropagation algorithm (BP), but it did not receive the attention it deserved at that time.
The renaissance caused by the backpropagation algorithm
1983~1995. The backpropagation algorithm has reignited interest in neural networks.
Caltech physicist John Hopfield proposed a neural network for associative memory and optimization calculations, called the Hopfield network. The Hopfield network achieved the best results at the time on the traveling salesman problem and caused a sensation.
David Rumelhart and James McClelland provide a comprehensive discussion of the application of connectionism to computer simulations of neural activity and reinvent the backpropagation algorithm.
Decline in popularity
1995~2006. Support vector machines and other simpler methods (such as linear classifiers) are gradually surpassing neural networks in popularity in the field of machine learning.
The rise of deep learning
2006 to the present. Multi-layer feedforward neural networks can be learned effectively by pre-training them layer by layer and then fine-tuning them with the backpropagation algorithm.
machine learning
Data preprocessing
Preprocess the raw data, for example by removing noise. In text classification, for example, stop words are removed.
Feature extraction
Extract some effective features from raw data. For example, in image classification, extract edges, scale invariant feature transform (SIFT) features, etc.
Feature transformation
Perform further processing on the features, such as reducing or increasing their dimensionality. Dimensionality reduction includes two approaches: feature extraction and feature selection. Commonly used feature transformation methods include principal component analysis (PCA), linear discriminant analysis (LDA), etc.
predict
The core part of machine learning, making predictions through a function
Representation learning
In order to improve the accuracy of machine learning systems, convert input information into effective features
If there is an algorithm that can automatically learn effective features and improve the performance of the final machine learning model, then this kind of learning can be called representation learning.
Representation methods
local representation
One way to represent colors is to name different colors by different names
The dimension is high and cannot be expanded. The similarity between different colors is 0.
distributed representation
RGB values to represent colors
To learn a good high-level semantic representation (generally distributed representation), it is usually necessary to start from the low-level features and go through multiple steps of non-linear transformation to obtain it.
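The difference between the two representations can be sketched as follows. This is a minimal illustration; the color names, RGB values and the `cosine` helper are assumptions chosen for the example and do not come from the source.

```python
import math

# Local (one-hot) representation: one dimension per color name.
names = ["red", "orange", "blue"]
one_hot = {n: [1.0 if i == j else 0.0 for j in range(len(names))]
           for i, n in enumerate(names)}

# Distributed representation: every color is a point in 3-D RGB space.
rgb = {"red": [1.0, 0.0, 0.0],
       "orange": [1.0, 0.65, 0.0],
       "blue": [0.0, 0.0, 1.0]}

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One-hot vectors of different colors are always orthogonal (similarity 0),
# while RGB vectors place red closer to orange than to blue.
sim_local = cosine(one_hot["red"], one_hot["orange"])
sim_distributed = cosine(rgb["red"], rgb["orange"])
```

Adding a new color to the local representation requires a new dimension, whereas the distributed representation keeps its three dimensions.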
deep learning
step
Contribution distribution problem
Different from "shallow learning", the key problem that deep learning needs to solve is the distribution of contribution
Take the game of Go as an example. Whenever a game is played, the final result is either a win or a loss. We then ask which moves led to the final victory and which moves led to the final defeat. Judging the contribution of each move is the contribution distribution problem, which is also a very difficult problem.
In a sense, deep learning can also be regarded as a kind of reinforcement learning (RL). Each internal component cannot obtain supervision information directly, but has to obtain it through the final supervision information (reward) of the entire model, and with a certain delay.
The neural network model can use the error back propagation algorithm, which can better solve the contribution distribution problem.
End-to-end learning
traditional learning style
In some complex tasks, traditional machine learning methods need to artificially cut the input and output of a task into many sub-modules (or multiple stages), and each sub-module is learned separately.
For example, a natural language understanding task generally requires steps such as word segmentation, part-of-speech tagging, syntactic analysis, semantic analysis, and semantic reasoning.
There are two problems with this way of learning
First, each module needs to be optimized separately, and its optimization goals and the overall mission goals are not guaranteed to be consistent.
The second is error propagation, that is, errors in the previous step will have a great impact on subsequent models. This increases the difficulty of practical application of machine learning methods.
New way of learning
End-to-end learning, also known as end-to-end training, means that the overall goal of the task is optimized directly during learning, without training in separate modules or stages.
Generally, there is no need to explicitly give the functions of different modules or stages, and no human intervention is required in the intermediate process.
Most deep learning using neural network models can also be regarded as an end-to-end learning.
Commonly used deep learning frameworks
Theano: A Python toolkit from the University of Montreal, used to efficiently define, optimize and evaluate mathematical expressions over multidimensional arrays. Theano can transparently use GPUs and efficient symbolic differentiation. The Theano project is currently out of maintenance.
Caffe: The full name is Convolutional Architecture for Fast Feature Embedding. It is a computing framework for convolutional network models. The network structure to be implemented can be specified in a configuration file and does not require coding. Caffe is implemented in C++ and Python and is mainly used for computer vision.
TensorFlow: A Python toolkit developed by Google that can run on any device with a CPU or GPU. TensorFlow's calculation process is represented using data flow graphs. TensorFlow's name comes from the fact that the objects operated on in its calculation process are multi-dimensional arrays, that is, tensors.
Chainer: One of the earliest neural network frameworks to use dynamic computational graphs. Its core development team is Preferred Networks, a machine learning startup from Japan. Compared with the static computational graphs used by TensorFlow, Theano, Caffe and other frameworks, dynamic computational graphs are constructed at runtime, so they are very suitable for some complex decision-making or reasoning tasks.
PyTorch: A deep learning framework developed and maintained by Facebook, NVIDIA, Twitter and other companies. Its predecessor is Torch, written in Lua. PyTorch is also a framework based on dynamic computational graphs, which has obvious advantages in tasks that require dynamically changing the structure of neural networks.
Organization of this book
perceptron
A perceptron is an algorithm with inputs and outputs. Given an input, a given value will be output.
The perceptron sets weights and biases as parameters
Logic circuits such as AND gates and OR gates can be represented using perceptrons.
An XOR gate cannot be represented by a single layer perceptron.
An XOR gate can be represented using a 2-layer perceptron
Single-layer perceptrons can only represent linear spaces, while multi-layer perceptrons can represent nonlinear spaces.
A 2-layer perceptron can (in theory) represent a computer.
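The points above can be sketched in a few lines. The weights and biases below are one possible choice that happens to work, assumed here for illustration; many other values represent the same gates.

```python
def perceptron(x1, x2, w1, w2, b):
    """Output 1 if the weighted sum plus the bias is positive, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Single-layer perceptrons: each gate differs only in its parameters.
def AND(x1, x2):
    return perceptron(x1, x2, 0.5, 0.5, -0.7)

def NAND(x1, x2):
    return perceptron(x1, x2, -0.5, -0.5, 0.7)

def OR(x1, x2):
    return perceptron(x1, x2, 0.5, 0.5, -0.2)

def XOR(x1, x2):
    # 2-layer perceptron: layer 1 computes NAND and OR, layer 2 ANDs the results.
    return AND(NAND(x1, x2), OR(x1, x2))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [XOR(a, b) for a, b in inputs]
```

No single choice of `w1`, `w2` and `b` reproduces `xor_table`, because the two classes cannot be separated by one straight line.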
Neural Networks
Perceptrons and Neural Networks
"Naive perceptron" refers to a single-layer network and a model that uses the step function as an activation function.
"Multilayer perceptron" refers to a neural network, that is, a multilayer network that uses smooth activation functions such as the sigmoid function or the ReLU function.
Operation: inner product of neural network
Y = np.dot(X, W)
Neural networks can be implemented efficiently by using matrix operations.
Affine layer
The matrix product operation performed in the forward propagation of the neural network is called "affine transformation" in the field of geometry
Affine transformation includes a linear transformation and a translation, which respectively correspond to the weighted sum operation and offset operation of the neural network.
Y = sigmoid(Y)
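A minimal runnable version of the two fragments above, assuming a bias vector `b` and small hypothetical values for `X` and `W`:

```python
import numpy as np

def sigmoid(x):
    """Smooth activation function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def affine(X, W, b):
    # Linear transformation (weighted sum) followed by a translation (bias).
    return np.dot(X, W) + b

X = np.array([[1.0, 0.5]])             # one sample with two features (hypothetical)
W = np.array([[0.1, 0.3, 0.5],
              [0.2, 0.4, 0.6]])        # 2 inputs -> 3 neurons
b = np.array([0.1, 0.2, 0.3])

A = affine(X, W, b)                    # weighted sums, shape (1, 3)
Y = sigmoid(A)                         # activations, shape (1, 3)
```

Because `X` can hold several rows, the same two calls process a whole batch at once.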
output layer
Activation function: Identity function is used for regression problems, and softmax function is used for classification problems.
Identity function
The input signal will be output unchanged
softmax function
Assume that the output layer has n neurons in total with input signals a_1, ..., a_n. The output of the k-th neuron is y_k = \exp(a_k) / \sum_{i=1}^{n} \exp(a_i).
Features: The sum of the output values of the output layer is 1
Note: overflow issues
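The overflow issue can be handled by subtracting the largest input before exponentiating, which does not change the result. A minimal sketch (the input values are hypothetical):

```python
import numpy as np

def softmax(a):
    # Subtracting the maximum leaves the output unchanged
    # but keeps np.exp from overflowing on large inputs.
    shifted = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# np.exp(1010.0) alone would overflow to inf; the shifted version stays finite.
y = softmax(np.array([1010.0, 1000.0, 990.0]))
```

The outputs sum to 1, so they can be read as class probabilities, and the ordering of the inputs is preserved.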
Number of output-layer neurons
Classification problem
Generally set to the number of categories
Handwritten digit recognition
The input layer has 28*28=784 neurons, and the output layer has 10 neurons. There are also two hidden layers, and the number of neurons can be any value.
Batch processing
Enter multiple sets of data at once
Neural network learning
loss function
Introduce concepts
When looking for the optimal parameters (weights and biases), we look for the parameters that make the value of the loss function as small as possible, so we need the derivatives of the loss function with respect to the parameters (the gradient, to be precise)
Why not directly use recognition accuracy as an indicator?
The derivative of the recognition accuracy with respect to the parameters is 0 in most places, so the parameters cannot be updated
For the same reason, the step function is not used as an activation function
type
mean square error
cross entropy error
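Both loss functions can be sketched as below. The factor 1/2 in the mean square error and the small constant `eps` are common conventions assumed here; the example vectors are hypothetical.

```python
import numpy as np

def mean_squared_error(y, t):
    # Half the sum of squared differences between output y and label t.
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-7):
    # eps avoids log(0); t is a one-hot label vector.
    return -np.sum(t * np.log(y + eps))

t = np.array([0, 0, 1, 0])                  # correct class is index 2
good = np.array([0.1, 0.05, 0.8, 0.05])     # confident and correct
bad = np.array([0.7, 0.1, 0.1, 0.1])        # confident and wrong

mse_good, mse_bad = mean_squared_error(good, t), mean_squared_error(bad, t)
ce_good, ce_bad = cross_entropy_error(good, t), cross_entropy_error(bad, t)
```

Both functions return a smaller value for the better prediction, which is what makes them usable as an indicator during learning.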
mini-batch
Randomly extract part of the training data
gradient
The vector made up of the partial derivatives with respect to all variables is called the gradient
The gradient points in the direction in which the function value at each point increases the most, so the negative gradient is the direction in which it decreases the most.
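A gradient can be approximated by numerical differentiation, one variable at a time. The sketch below uses central differences; the function `f` and the step size `h` are assumptions made for the example.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference estimate of the gradient of f at x (x is restored afterwards)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        f_plus = f(x)
        x.flat[i] = orig - h
        f_minus = f(x)
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
        x.flat[i] = orig          # restore the original value
    return grad

def f(x):
    return np.sum(x ** 2)         # f(x0, x1) = x0^2 + x1^2

x = np.array([3.0, 4.0])
g = numerical_gradient(f, x)      # close to the analytic gradient (2*x0, 2*x1)
x_next = x - 0.1 * g              # one step against the gradient lowers f
```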
hyperparameters
Manual setting
Learning rate η
minibatch size
Update times iters_num
Obtained by training
Weight w and bias theta
Neural Networks
The gradient of the loss function with respect to the weight parameters
epoch
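One epoch is the number of updates after which all training data has been used once: iterations per epoch = training set size / mini-batch size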
Stochastic Gradient Descent (SGD)
Perform gradient descent on randomly selected data
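The hyperparameters, the epoch and the update rule fit together as in the following sketch. It trains a one-variable linear model rather than a neural network to stay short; the data, the model and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: t = 2x + 1 plus a little noise.
x_train = rng.uniform(-1, 1, size=(200, 1))
t_train = 2.0 * x_train + 1.0 + 0.01 * rng.standard_normal((200, 1))

# Parameters obtained by training.
w, b = np.zeros((1, 1)), np.zeros(1)

# Hyperparameters set by hand.
learning_rate, batch_size, iters_num = 0.1, 20, 1000
iter_per_epoch = x_train.shape[0] // batch_size   # 10 updates make up 1 epoch

for i in range(iters_num):
    # Mini-batch: a random subset of the training data.
    idx = rng.choice(x_train.shape[0], batch_size, replace=False)
    x, t = x_train[idx], t_train[idx]

    y = x @ w + b
    grad_y = (y - t) / batch_size            # gradient of 0.5 * mean squared error
    w -= learning_rate * (x.T @ grad_y)      # gradient descent update
    b -= learning_rate * grad_y.sum(axis=0)
```

After the loop, `w` and `b` should be close to the values 2 and 1 used to generate the data.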
error back propagation method
Although numerical differentiation is simple and easy to implement, its disadvantage is that it is time-consuming to calculate. There is a method that can efficiently calculate the gradient of weight parameters: error back propagation method
Computational graph
By using calculation graphs, you can intuitively grasp the calculation process
Forward propagation of computational graphs performs general computations. By backpropagating the computational graph, the derivatives of each node can be calculated
The error term of layer l can be calculated from the error term of layer l+1. This is the back propagation of error.
formula
calculate
Quantities shown in yellow are values obtained during backpropagation
Quantities shown in green are known quantities
By implementing the constituent elements of a neural network as layers, gradients can be calculated efficiently
By comparing the results obtained by numerical differentiation and the error back propagation method, you can confirm whether the implementation of the error back propagation method is correct (gradient confirmation)
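A gradient check can be sketched as follows for a tiny network (matrix product, sigmoid, squared error). The data shapes, the loss and the function names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))      # hypothetical batch: 4 samples, 3 features
T = rng.standard_normal((4, 2))      # hypothetical targets
W = rng.standard_normal((3, 2))

def loss(W):
    Y = 1.0 / (1.0 + np.exp(-(X @ W)))       # matrix product -> sigmoid
    return 0.5 * np.sum((Y - T) ** 2)

def backprop_gradient(W):
    Y = 1.0 / (1.0 + np.exp(-(X @ W)))       # forward pass
    dY = Y - T                               # derivative of the loss w.r.t. Y
    dA = dY * Y * (1.0 - Y)                  # back through the sigmoid
    return X.T @ dA                          # back through the matrix product

def numerical_gradient(f, W, h=1e-4):
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + h
        f_plus = f(W)
        W[idx] = orig - h
        f_minus = f(W)
        W[idx] = orig                        # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# A very small mean absolute difference suggests the backward pass is implemented correctly.
diff = np.mean(np.abs(backprop_gradient(W) - numerical_gradient(loss, W)))
```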
Reference video
https://www.bilibili.com/video/BV1LM411J7cW/?spm_id_from=333.788&vd_source=048c7bdfe54313b8b3ee1483d9d07e38
convolutional neural network
Everything should be as simple as possible, but not too simple. [Albert Einstein]
the whole frame
Compared
Network based on fully connected layer (Affine layer)
CNN based network
link order
Convolution[convolution layer]-ReLU-(Pooling[pooling layer])
The layer close to the output uses the previous Affine [affine transformation] - ReLU combination
The final output layer uses the previous Affine-Softmax combination
convolution layer
Convolution concept
Problems with the fully connected layer
The shape of the data is "ignored". An image is usually a 3-dimensional shape in the height, width, and channel directions, but the 3-dimensional data has to be flattened into 1-dimensional data when it is input to a fully connected layer.
The image is a 3-dimensional shape, and this shape should contain important spatial information.
Spatially adjacent pixels have similar values
The RGB channels are closely related to each other.
There is little correlation between pixels that are far apart
The convolutional layer can keep the shape unchanged
definition
The input and output data of the convolution layer are called feature maps
The input data of the convolutional layer is called the input feature map
The output data is called the output feature map
Convolution operation
The convolution operation is equivalent to the filter operation in image processing
The main function of convolution is to slide a convolution kernel (ie filter) on an image (or some feature) and obtain a new set of features through the convolution operation.
two-dimensional
Given an image X ∈ R^{M×N} and a filter W ∈ R^{m×n}, generally with m << M and n << N, the convolution is y_{ij} = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv} x_{i-u+1, j-v+1}
three dimensional
Correlation
In the process of calculating convolution, it is often necessary to flip the convolution kernel.
Flip is to reverse the order in two dimensions (top to bottom, left to right), that is, rotate 180 degrees.
In terms of specific implementation, cross-correlation operations are used instead of convolutions, which will reduce some unnecessary operations or overhead.
Cross-Correlation is a function that measures the correlation between two series, usually implemented using a sliding window dot product calculation.
Given an image X ∈ R^{M×N} and a convolution kernel W ∈ R^{m×n}, their cross-correlation is y_{ij} = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv} x_{i+u-1, j+v-1}
The difference between cross-correlation and convolution is only whether the convolution kernel is flipped. Cross-correlation can also be called non-flip convolution.
Convolution is used in neural networks for feature extraction. Whether the convolution kernel is flipped has nothing to do with its feature extraction capability. Especially when the convolution kernel is a learnable parameter, convolution and cross-correlation are equivalent.
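The relationship between the two operations can be sketched directly: cross-correlation is a sliding-window dot product, and convolution is the same operation with the kernel rotated by 180 degrees. The function names and the example arrays are assumptions; no padding is used and the stride is 1.

```python
import numpy as np

def cross_correlate2d(X, W):
    """Sliding-window dot product of image X with kernel W (no padding, stride 1)."""
    M, N = X.shape
    m, n = W.shape
    Y = np.zeros((M - m + 1, N - n + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + m, j:j + n] * W)
    return Y

def convolve2d(X, W):
    # Flip the kernel top-to-bottom and left-to-right, then cross-correlate.
    return cross_correlate2d(X, W[::-1, ::-1])

X = np.arange(16, dtype=float).reshape(4, 4)
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])

corr = cross_correlate2d(X, W)   # each entry is X[i, j] - X[i+1, j+1]
conv = convolve2d(X, W)          # same values with the opposite sign for this kernel
```

When the kernel is learned, the network can simply learn the flipped kernel, which is why frameworks can skip the flip.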
Variants of convolution
Zero padding
In order to keep the space size constant, the input data needs to be padded
Stride
The interval of positions at which the filter is applied is called the stride
Commonly used convolutions
Narrow Convolution: Step size s = 1, no zero padding at both ends (p = 0), and the output length after convolution is n − m + 1.
Wide Convolution: Step size s = 1, zero padding at both ends p = m − 1, and the output length after convolution is n + m − 1.
Equal-Width Convolution: Step size s = 1, zero padding at both ends p = (m − 1)/2, and the output length after convolution is n.
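The three cases follow from the general output-length formula (n − m + 2p)/s + 1. A small sketch, with the function name and the example sizes chosen for illustration:

```python
def conv_output_length(n, m, p=0, s=1):
    """Output length for input length n, filter size m, padding p per side, stride s."""
    return (n - m + 2 * p) // s + 1

n, m = 7, 3
narrow = conv_output_length(n, m, p=0)               # n - m + 1
wide = conv_output_length(n, m, p=m - 1)             # n + m - 1
equal = conv_output_length(n, m, p=(m - 1) // 2)     # n
strided = conv_output_length(n, m, p=0, s=2)         # a larger stride shrinks the output
```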
Mathematical properties of convolution
Convolution operation on 3D data
The input data and filter channel numbers should be set to the same value.
Multiple convolution operations
Regarding the filters of the convolution operation, the number of filters must also be considered. Therefore, as 4-dimensional data, the weight data of the filter should be written in the order of (output_channel, input_channel, height, width). For example, if there are 20 filters with a channel number of 3 and a size of 5 × 5, it can be written as (20, 3, 5, 5).
Batch processing
We hope that the convolution operation also corresponds to batch processing. To do this, the data passed between each layer needs to be saved as 4-dimensional data. Specifically, the data is saved in the order of (batch_num, channel, height, width).
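The 4-dimensional shapes can be traced with a deliberately naive forward pass. This is a readable sketch, not an efficient implementation; the function name `conv_forward` and the example sizes are assumptions.

```python
import numpy as np

def conv_forward(x, w, b, stride=1, pad=0):
    """x: (batch_num, channel, height, width); w: (output_channel, input_channel, fh, fw)."""
    N, C, H, W_ = x.shape
    FN, FC, FH, FW = w.shape
    assert C == FC, "input and filter channel counts must match"
    out_h = (H + 2 * pad - FH) // stride + 1
    out_w = (W_ + 2 * pad - FW) // stride + 1
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))   # zero padding
    out = np.zeros((N, FN, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[:, :, i * stride:i * stride + FH, j * stride:j * stride + FW]
            # Contract channel, height and width: (N, C, FH, FW) with (FN, C, FH, FW) -> (N, FN)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3])) + b
    return out

x = np.random.default_rng(0).standard_normal((10, 3, 28, 28))   # batch of 10 images
w = np.ones((20, 3, 5, 5))                                       # 20 filters, 3 channels, 5 x 5
b = np.zeros(20)
out = conv_forward(x, w, b)                                      # shape (10, 20, 24, 24)
```

The number of filters becomes the number of output channels, and the batch dimension passes through unchanged.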
Properties of convolutional layers
Local connection: Each neuron in the convolutional layer (assumed to be layer l) is connected only to the neurons within a local window of the previous layer (layer l−1), forming a locally connected network. The number of connections between the convolutional layer and the previous layer is greatly reduced, from n(l) × n(l−1) to n(l) × m, where m is the filter size.
Weight sharing: The filter w(l) as a parameter is the same for all neurons in layer l.
Due to local connections and weight sharing, the parameters of the convolutional layer consist only of an m-dimensional weight w(l) and a 1-dimensional bias b(l), a total of m + 1 parameters.
The number of neurons in layer l is not chosen arbitrarily, but satisfies n(l) = n(l−1) − m + 1.
Pooling layer
Also called aggregation layer, subsampling layer
Pooling performs feature selection: it shrinks the feature maps in the height and width directions, reducing the number of features and therefore the number of parameters in subsequent layers.
Commonly used aggregation functions
Maximum pooling (Max): Generally, the maximum value of all neurons in a region is taken.
Average pooling (Mean): Generally, the average value of all neurons in a region is taken.
A typical pooling layer divides each feature map into non-overlapping regions of 2×2 size, and then uses maximum pooling for downsampling.
The pooling layer can also be regarded as a special convolutional layer
In some early convolutional networks (such as LeNet-5), nonlinear activation functions were sometimes used in the pooling layer.
Y′^{d} = f(w^{d} · Y^{d} + b^{d}), where Y′^{d} is the output of the pooling layer, Y^{d} is the feature map after subsampling, f(·) is the nonlinear activation function, and w^{d} and b^{d} are learnable scalar weights and biases.
Characteristics of the pooling layer
There are no parameters to learn
The number of channels does not change
Robust to small position changes (robustness)
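Pooling over non-overlapping 2×2 regions can be sketched with a reshape. The function names and the example array are assumptions, and the height and width are assumed to be even.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling over non-overlapping 2x2 regions of a (channel, height, width) array."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def mean_pool_2x2(x):
    """Average pooling over the same regions."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

x = np.array([[[1, 2, 0, 1],
               [3, 0, 1, 2],
               [0, 1, 4, 0],
               [2, 0, 0, 1]]], dtype=float)   # 1 channel, 4 x 4

pooled_max = max_pool_2x2(x)     # shape (1, 2, 2)
pooled_mean = mean_pool_2x2(x)   # shape (1, 2, 2)
```

Neither function has weights to learn, and the channel dimension is passed through unchanged while height and width are halved.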
parameter learning
Calculation of error terms
Visualization of CNN
Visualization of layer 1 weights
The filters before learning are randomly initialized, so there is no pattern in their shades of black and white, but the filters after learning become regular images. Through learning, the filters are updated into regular filters, such as filters that change gradually from white to black, filters that contain blocky areas (called blobs), and filters that respond to horizontal and vertical edges.
It can be seen that the filters of the convolutional layer extract primitive information such as edges or blobs. The CNN just implemented passes this primitive information on to subsequent layers.
Information extraction based on hierarchical structure
Information extracted from the convolutional layers of CNN. The neurons in layer 1 respond to edges or patches, layer 3 responds to texture, layer 5 responds to object parts, and the final fully connected layer responds to the category of the object (dog or car).
If multiple convolutional layers are stacked, as the layers deepen, the extracted information becomes more complex and abstract. This is a very interesting part of deep learning. As the layers deepen, neurons change from simple shapes to "high-level" information. In other words, just as we understand the "meaning" of things, the objects of response gradually change.
Typical convolutional neural network
LeNet-5
LeNet was proposed in 1998 as a network for handwritten digit recognition. It has consecutive convolutional layers and pooling layers, and finally outputs the results through a fully connected layer.
Excluding the input layer, LeNet-5 has a total of 7 layers.
Input layer: The input image size is 32 × 32 = 1024.
Convolutional layer: Using six 5 × 5 filters, six groups of feature maps with a size of 28 × 28 = 784 are obtained. Therefore, the number of neurons in layer C1 is 6 × 784 = 4,704, the number of trainable parameters is 6 × 25 + 6 = 156, and the number of connections is 156 × 784 = 122,304 (including biases, the same below).
Pooling layer: The sampling window is 2 × 2, average pooling is used, and a nonlinear function is applied. The number of neurons is 6 × 14 × 14 = 1,176, the number of trainable parameters is 6 × (1 + 1) = 12, and the number of connections is 6 × 196 × (4 + 1) = 5,880.
Convolutional layer: A connection table is used in LeNet-5 to define the dependency between the input and output feature maps. As shown in the figure, a total of 60 filters of size 5 × 5 are used to obtain 16 groups of feature maps with a size of 10 × 10. The number of neurons is 16 × 100 = 1,600, the number of trainable parameters is (60 × 25) + 16 = 1,516, and the number of connections is 100 × 1,516 = 151,600.
Pooling layer: The sampling window is 2 × 2, and 16 feature maps of size 5 × 5 are obtained. The number of trainable parameters is 16 × 2 = 32, and the number of connections is 16 × 25 × (4 + 1) = 2,000.
Convolutional layer: Using 120 × 16 = 1,920 filters of size 5 × 5, 120 groups of feature maps with a size of 1 × 1 are obtained. The number of neurons in layer C5 is 120, the number of trainable parameters is 1,920 × 25 + 120 = 48,120, and the number of connections is 120 × (16 × 25 + 1) = 48,120.
Fully connected layer: There are 84 neurons, and the number of trainable parameters is 84 × (120 + 1) = 10,164. The number of connections is the same as the number of trainable parameters, 10,164.
Output layer: The output layer consists of 10 Euclidean radial basis functions
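The parameter and connection counts above can be reproduced with simple arithmetic. The helper `conv_params` and the variable names are assumptions introduced for this check.

```python
def conv_params(num_filters, k, num_maps):
    """Trainable parameters: one k x k kernel per filter plus one bias per output map."""
    return num_filters * k * k + num_maps

c1_params = conv_params(6, 5, 6)                  # 6 * 25 + 6
c1_connections = c1_params * 28 * 28              # every parameter is used at each output position

s2_params = 6 * (1 + 1)                           # one weight and one bias per feature map
s2_connections = 6 * 14 * 14 * (2 * 2 + 1)        # 2 x 2 window plus bias

c3_params = conv_params(60, 5, 16)                # 60 filters selected by the connection table
c3_connections = c3_params * 10 * 10

s4_params = 16 * 2
s4_connections = 16 * 5 * 5 * (2 * 2 + 1)

c5_params = conv_params(120 * 16, 5, 120)         # 1,920 filters, 120 output maps of size 1 x 1
c5_connections = 120 * (16 * 25 + 1)

f6_params = 84 * (120 + 1)                        # fully connected: weights plus biases
```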
Connection table
A fully connected relationship between the input and output feature maps of the convolutional layer is not necessary, and non-shared convolution can be used
Define a connection table T to describe the connection relationship between the input and output feature maps.
If the p-th output feature map depends on the d-th input feature map, then T_{p,d} = 1, otherwise T_{p,d} = 0.
AlexNet
It was proposed in 2012 and uses many technical methods of modern deep convolutional networks.
Parallel training using GPUs
The activation function uses ReLU
Use Dropout to prevent overfitting
Use data augmentation to improve model accuracy
Use LRN (Local Response Normalization) layer for local normalization
Inception network
Inception module: A convolutional layer contains multiple convolution operations of different sizes
The Inception network is built by stacking multiple Inception modules and a small number of pooling layers.
V1 version
The earliest version of the Inception network, v1, is the famous GoogLeNet [Szegedy et al., 2015], which won the 2014 ImageNet image classification competition.
Residual network ResNet
Information propagation efficiency is improved by adding direct (shortcut) connections around the nonlinear convolutional layers.
Nonlinear unit
Can be one or more convolutional layers
Let this nonlinear unit f(x, θ) approximate an objective function h(x)
A nonlinear unit composed of a neural network has sufficient capacity to approximate either the original objective function or the residual function, but in practice the latter is easier to learn
Conclusion: Let the nonlinear unit f(x, θ) approximate the residual function h(x) − x, and use f(x, θ) + x to approximate h(x).
The residual network is a very deep network composed of many residual units connected in series.
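A residual unit can be sketched as below. For brevity the nonlinear unit f(x, θ) is built from two matrix products with a ReLU in between instead of convolutional layers; that substitution, the shapes and the names are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_unit(x, W1, W2):
    # f(x, theta): stands in for one or more convolutional layers.
    f = relu(x @ W1) @ W2
    # Direct edge: the unit outputs f(x, theta) + x.
    return f + x

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))                   # hypothetical batch of 2 inputs
W1 = 0.1 * rng.standard_normal((4, 4))
W2 = 0.1 * rng.standard_normal((4, 4))

y = residual_unit(x, W1, W2)
# With all-zero weights f is 0 and the unit reduces to the identity mapping.
identity_out = residual_unit(x, np.zeros((4, 4)), np.zeros((4, 4)))
```

Because the identity mapping is available for free, stacking many such units does not force the signal through every nonlinear layer.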
Other convolution methods
transposed convolution
Atrous convolution (dilated convolution)
deep learning
Deepen the network
Heading towards a deeper network
This network refers to VGG which will be introduced in the next section.
Convolutional layer based on 3×3 small filters
The activation function is ReLU
The Dropout layer is used behind the fully connected layer.
Adam-based optimization
Use He initialization for the initial weight values
Recognition accuracy is 99.38%
Further improve recognition accuracy
Ensemble learning
learning rate decay
Data Augmentation
Increase the number of images by applying small changes such as rotation, vertical or horizontal movement, cropping, flipping, increasing brightness, etc.
Deeper motivation
Improve recognition performance
The importance of deepening can be seen from the results of large-scale image recognition competitions represented by ILSVRC. The results of this competition show that the top methods recently are mostly based on deep learning and have a tendency to gradually deepen the layers of the network. In other words, it can be seen that the deeper the layer, the higher the recognition performance.
Reduce the number of parameters of the network
The advantage of stacking small filters to deepen the network is that it reduces the number of parameters while expanding the receptive field (the local spatial region of the input that affects a neuron). For example, one 5×5 convolution uses 25 weights, while two stacked 3×3 convolutions cover the same 5×5 region with only 2 × 9 = 18 weights. Moreover, stacking layers places activation functions such as ReLU between the convolutional layers, which further improves the expressiveness of the network, because each activation function adds nonlinearity. By composing nonlinear functions, more complex things can be expressed.
Make learning more efficient
Compared with a network without deepening layers, by deepening the layers, the learning data can be reduced and learning can be performed efficiently.
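The parameter saving from stacking small filters can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions and counts weights per input/output channel pair, ignoring biases; the function name stacked_conv_stats is hypothetical.

```python
def stacked_conv_stats(kernel_size, num_layers):
    """Weights (per channel pair) and receptive field of stacked stride-1 convolutions."""
    params = num_layers * kernel_size * kernel_size
    receptive_field = 1 + num_layers * (kernel_size - 1)
    return params, receptive_field

one_5x5 = stacked_conv_stats(5, 1)     # a single 5x5 filter
two_3x3 = stacked_conv_stats(3, 2)     # two stacked 3x3 filters, same receptive field
one_7x7 = stacked_conv_stats(7, 1)     # a single 7x7 filter
three_3x3 = stacked_conv_stats(3, 3)   # three stacked 3x3 filters, same receptive field
```

For the same receptive field, the stacked 3×3 variants need fewer weights, and the gap widens as the receptive field grows.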
Deep learning network structure
ILSVRC competition
ImageNet contains a variety of images, and each image is associated with a label (category name). The ILSVRC Image Recognition Competition using this huge dataset is held every year.
The large-scale image recognition competition ILSVRC was held in 2012. In that year's competition, the method based on deep learning (commonly known as AlexNet) won overwhelmingly, completely subverting previous image recognition methods. In subsequent competitions, deep learning has been active at the center of the stage.
In particular, 2015's ResNet (a deep network with more than 150 layers) reduced the false recognition rate to 3.5%. It is said that this result even exceeds the recognition ability of ordinary people.
VGG
VGG is a basic CNN composed of convolutional layers and pooling layers. Its distinguishing feature is depth: it stacks layers with weights (convolutional layers or fully connected layers) up to 16 layers (or 19 layers), and is accordingly sometimes called "VGG16" or "VGG19".
GoogLeNet
The network not only has depth vertically, but also breadth horizontally, which is called the Inception structure.
ResNet
Has a deeper structure than previous networks
We already know that greater depth is important for improving performance. However, in deep learning, if the network is made too deep, learning often does not proceed smoothly, resulting in poor final performance. To solve this kind of problem, ResNet introduces a "shortcut structure" (also called a "shortcut" or "path"). With this shortcut structure, performance can keep improving as the layers deepen (although there is of course a limit to how deep the layers can go).
In practice, the weight data learned using the huge data set of ImageNet is often flexibly applied. This is called transfer learning. The learned weights (parts) are copied to other neural networks for re-learning (fine tuning). For example, prepare a network with the same structure as VGG, use the learned weights as initial values, and use the new data set as the object to re-learn. Transfer learning is very effective when the data set at hand is small.
Speeding up deep learning
Problems that need to be solved
The time ratio of each layer in AlexNet's forward processing: the left side is when using GPU, and the right side is when using CPU. In the figure, "conv" corresponds to the convolutional layer, "pool" to the pooling layer, "fc" to the fully connected layer, and "norm" to the normalization layer
The processing time of the convolutional layers accounts for 95% of the total time on the GPU and 89% of the total time on the CPU.
GPU-based speedup
GPUs are mainly provided by two companies, NVIDIA and AMD. Although both GPUs can be used for general numerical calculations, NVIDIA's GPU is more "close" to deep learning. In fact, most deep learning frameworks only benefit from NVIDIA's GPUs. This is because the deep learning framework uses CUDA, a comprehensive development environment for GPU computing provided by NVIDIA.
Distributed learning
Distributed computing on multiple GPUs or multiple machines
Google's TensorFlow and Microsoft's CNTK attach great importance to distributed learning during the development process
The horizontal axis is the number of GPUs; the vertical axis is the speedup rate compared to a single GPU.
Reducing the bit width of arithmetic precision
Regarding numerical precision (the number of bits used to represent a value), we already know that deep learning does not need high numerical precision. This is an important property of neural networks, and it stems from their robustness.
In the future, half-precision floating point numbers will be used as a standard, and it is expected to achieve speeds up to approximately 2 times that of the previous generation of GPUs.
Deep learning application cases
Object detection
Determine the type of object and the location of the object from the image
Among the methods of using CNN for object detection, there is a method called R-CNN
Image segmentation
Classify images at the pixel level
FCN classifies all pixels through one forward process.
FCN literally means "a network composed entirely of convolutional layers". Compared with the general CNN containing fully connected layers, FCN replaces the fully connected layers with convolutional layers that play the same role.
Image caption generation
A representative method for generating image captions based on deep learning is called NIC
NIC is composed of deep CNN and RNN (Recurrent Neural Network) that processes natural language.
The future of deep learning
Image style transformation
Image generation
Autopilot
reinforcement learning
Distributed representation of natural language and words
Marty: “This is heavy.” Dr. Brown: “In the future, things are so heavy?” —The movie "Back to the Future"
What is natural language processing
Our language is made up of words, and the meaning of language is built from the meanings of words. In other words, a word is the smallest unit of meaning.
Three ways to get computers to understand the meaning of words
Thesaurus-based approach
count-based approach
Inference-based approach (word2vec)
Thesaurus-based approach
Consider manually defining word meanings
Currently widely used is the synonym dictionary
A diagram based on the hypernym-hyponym relationship according to the meaning of each word
WordNet
The most famous synonym dictionary
effect
Get synonyms for a word
Calculate similarity between words
Used via the NLTK module
Problems
New words continue to appear, making it difficult to adapt to changes in the times
High labor cost
Unable to express subtle differences in words
count-based approach
corpus
A corpus is a large amount of text data
Corpora used in the field of natural language processing sometimes add additional information to text data. For example, each word of the text data can be marked with a part-of-speech. Here, it is assumed that the corpus we use has no tags added.
Python-based corpus preprocessing
famous corpora
Wikipedia and Google News
preprocessing
Uppercase -> Lowercase
text.lower()
Process punctuation
text.replace('.', ' .')
re.split('(\W+)', text)
\W: Matches non-word characters (not letters, numbers, or underscores)
+: Indicates matching the previous pattern "\W" repeated one or more times
Create word IDs and correspondence tables
Convert a list of words into a list of word IDs
corpus = [word_to_id[w] for w in words]
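The preprocessing steps above can be combined into one function. This is a minimal sketch, assuming simple whitespace splitting after separating the period; the function name preprocess and the id_to_word reverse table are illustrative choices, and the example sentence is the one used for the co-occurrence matrix in these notes.

```python
import numpy as np

def preprocess(text):
    """Lowercase the text, split it into words, and map each word to an ID."""
    text = text.lower().replace('.', ' .')   # treat the period as its own token
    words = text.split(' ')
    word_to_id, id_to_word = {}, {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)         # IDs are assigned in order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess('You say goodbye and I say hello.')
```

The word "say" appears twice, so its ID appears twice in corpus, while the vocabulary tables contain it only once.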
Distributed representation of words
Construct compact and reasonable vector representations in the word domain
Distribution hypothesis
The meaning of a word is formed by the words surrounding it
Context refers to the words surrounding a centered word
The size of the context is called the window size
The window size is 1 and the context contains 1 word on the left and right
co-occurrence matrix
The simplest way to represent a word as a vector is to count how many times each other word appears around it.
text = 'You say goodbye and I say hello.'
Set window size to 1
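A co-occurrence matrix for this sentence can be built as follows. This is a sketch under the assumption that the sentence has already been converted to the word-ID list shown in the comment (period as a separate token); the function name create_co_matrix is illustrative.

```python
import numpy as np

def create_co_matrix(corpus, vocab_size, window_size=1):
    """Count, for every word, how often each other word appears within the window."""
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                co_matrix[word_id, corpus[left]] += 1
            if right < len(corpus):
                co_matrix[word_id, corpus[right]] += 1
    return co_matrix

# Word IDs for: you say goodbye and i say hello .
corpus = [0, 1, 2, 3, 4, 1, 5, 6]
C = create_co_matrix(corpus, vocab_size=7, window_size=1)
```

Each row of C is the count-based vector of one word; for example, row 0 is the vector for "you", whose only neighbor is "say".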
similarity between vectors
cosine similarity
Sorting of similar words
Get the word vector of the query word
Obtain the cosine similarity between the word vector of the query word and all other word vectors respectively.
Results based on cosine similarity, showing their values in descending order
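The ranking procedure can be sketched as below. The matrix C is the window-size-1 co-occurrence matrix of "you say goodbye and i say hello ." written out by hand; the function names cos_similarity and most_similar and the small eps added to avoid division by zero are assumptions of this sketch.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine similarity; eps guards against division by zero for all-zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))

def most_similar(query_id, word_matrix, top=3):
    """Return (word_id, similarity) pairs for the words most similar to the query."""
    query_vec = word_matrix[query_id]
    sims = np.array([cos_similarity(query_vec, v) for v in word_matrix])
    sims[query_id] = -np.inf                 # exclude the query word itself
    order = np.argsort(-sims)[:top]          # descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Rows: you, say, goodbye, and, i, hello, .
C = np.array([[0, 1, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 0]], dtype=np.float32)
ranking = most_similar(0, C, top=3)          # query word: "you" (ID 0)
```

For the query "you", the three closest words are "goodbye", "i" and "hello", which all share the neighbor "say".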
Improvements in count-based methods
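Pointwise mutual information
In the co-occurrence matrix, a common word like "the" would be considered strongly related to a noun like "car" simply because it occurs so often
PMI
$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$, where P(x) represents the probability of x occurring, P(y) represents the probability of y occurring, and P(x, y) represents the probability of x and y occurring simultaneously.
PMI based on co-occurrence matrix
$\mathrm{PMI}(x, y) = \log_2 \frac{C(x, y) \cdot N}{C(x)\,C(y)}$, where C(x, y) is the number of co-occurrences of x and y, C(x) and C(y) are the numbers of occurrences of x and y, and N is the number of words in the corpus.
Shortcoming
When the number of co-occurrences of two words is 0, $\log_2 0 = -\infty$
Positive pointwise mutual information (PPMI): $\mathrm{PPMI}(x, y) = \max(0, \mathrm{PMI}(x, y))$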
Get the PPMI matrix based on the co-occurrence matrix
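A vectorized sketch of this conversion is shown below. It assumes that every word occurs at least once (so no column sum is zero), approximates the occurrence count of a word by its column sum in the co-occurrence matrix, and adds a small eps inside the logarithm to avoid log2(0); the function name ppmi is illustrative.

```python
import numpy as np

def ppmi(C, eps=1e-8):
    """PPMI(x, y) = max(0, log2(C(x, y) * N / (C(x) * C(y))))."""
    C = np.asarray(C, dtype=np.float64)
    N = C.sum()                     # total number of co-occurrences
    S = C.sum(axis=0)               # occurrence count of each word (column sums)
    pmi = np.log2(C * N / np.outer(S, S) + eps)
    return np.maximum(0.0, pmi)     # negative PMI (including the C = 0 case) becomes 0

C = np.array([[0, 2],
              [2, 0]])
M = ppmi(C)
```

Entries where the co-occurrence count is 0 end up as 0 instead of negative infinity.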
Dimensionality reduction
We need to observe the distribution of data and find important "axes"
Singular value decomposition (SVD)
SVD decomposes any matrix X into the product of 3 matrices: $X = U S V^{\mathsf{T}}$
where U and V are orthogonal matrices whose column vectors are orthogonal to each other, and S is a diagonal matrix in which all but the diagonal elements are 0
The original matrix can be approximated by removing the redundant column vectors in the matrix U
U, S, V = np.linalg.svd(W)
If the matrix size is N*N, the computational complexity of SVD will reach O(N^3). Therefore, faster methods such as Truncated SVD are often used. Truncated SVD achieves high speed by truncating the parts with smaller singular values.
from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
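Dimensionality reduction with SVD can be illustrated on a small random matrix. This sketch uses only NumPy; the matrix W here is random stand-in data rather than a real PPMI matrix, and note that np.linalg.svd returns the third factor already transposed (named Vt below).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))                  # stand-in for a PPMI matrix

U, S, Vt = np.linalg.svd(W)             # S holds singular values in descending order
W_full = U @ np.diag(S) @ Vt            # exact reconstruction

k = 2
word_vecs = U[:, :k]                    # k-dimensional word vectors (first k columns of U)
W_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation of W
approx_error = np.linalg.norm(W - W_approx)
```

The approximation error depends only on the discarded singular values, which is why dropping the small ones loses little information.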
PTB dataset
The PTB corpus is often used as a benchmark for evaluating proposed methods
Preprocessing of PTB corpus
Replace rare words with the special token <unk>
Replace specific numbers with "N"
The preprocessing I did
Concatenate all the sentences and treat it as one big time series data. At this time, a special character <eos> is inserted at the end of each sentence
Hyperparameter assignment
window_size = 2
wordvec_size = 100
Evaluation based on PTB data set
For the query word you, you can see that personal pronouns such as i and we are ranked first. These are words with the same usage in grammar.
The query word year has synonyms such as month and quarter.
The query word car has synonyms such as auto and vehicle.
When using toyota as the query term, car manufacturer names or brand names such as nissan, honda and lexus appeared.
Summarize
Use the corpus to count the words that appear in each word's context, convert the counts into a PPMI matrix, and then obtain good word vectors through SVD-based dimensionality reduction.
word2vec
“If you don’t have a basis for judgment, don’t reason.” ——Arthur Conan Doyle, "A Scandal in Bohemia"
word embedding
Word2Vec is an algorithm for generating "word embeddings"
In addition to Word2Vec, there are other methods for generating word embeddings, such as GloVe (Global Vectors for Word Representation), FastText, etc. These methods may use different strategies and algorithms, but they all aim to effectively capture the semantic information of words in vector space.
Inference-based methods and neural networks
Problems with count-based methods
In the real world, corpora deal with a very large number of words. For example, it is said that the English vocabulary has over 1 million words. If the vocabulary size exceeds 1 million, then using the count-based method requires generating a huge matrix of 1 million × 1 million, but it is obviously unrealistic to perform SVD on such a large matrix.
Inference-based approaches using neural networks
Learning on mini-batch data. That is, using part of the data to learn and repeatedly updating the weights.
The learning of neural networks can be performed in parallel using multiple machines and multiple GPUs, thus accelerating the entire learning process.
Summary of inference-based methods
Target
Predict what words will come in the middle when given the surrounding words (context), like a cloze
reasoning method
Input context, and the model outputs the occurrence probability of each word.
As a product of model learning, we will get a distributed representation of the word
How to process words in neural networks
Convert words to vectors
Neural networks cannot directly process words like you or say. To use neural networks to process words, you need to first convert the words into fixed-length vectors.
Conversion method
one-hot vector
Only one element is 1, the other elements are 0
Neural Networks
input layer
Fully connected layer
The initial weights are random
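The effect of feeding a one-hot vector into a fully connected layer can be sketched as follows. The sizes (7 words, 3 hidden units) and the name to_one_hot are assumptions, and the bias is omitted.

```python
import numpy as np

def to_one_hot(word_id, vocab_size):
    """Return a vector with a single 1 at position word_id."""
    v = np.zeros(vocab_size, dtype=np.float32)
    v[word_id] = 1.0
    return v

vocab_size, hidden_size = 7, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)  # random initial weights

c = to_one_hot(0, vocab_size)   # one-hot vector for word ID 0
h = c @ W                       # fully connected layer without bias
```

Multiplying a one-hot vector by W simply selects the corresponding row of W, which is why the rows of the weight matrix can later be read as word vectors.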
Simple word2vec
Inference of CBOW model
structure
Features
There are two input layers
The transformation from the input layer to the intermediate layer is completed by the same fully connected layer (weight W(in))
The transformation from the intermediate layer to the output layer neurons is completed by another fully connected layer (weight W(out))
The neurons in the middle layer are the "average" of the values obtained by each input layer after being transformed by the fully connected layer.
The neurons in the output layer hold the score of each word; the larger the value, the higher the occurrence probability of the corresponding word.
CBOW model learning
Convert scores into probabilities using the Softmax function
Find the cross-entropy error between these probabilities and the supervised labels
Learn it as a loss
Weights and distributed representations in word2vec
W(in) weight is the distributed representation of the word we want
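The forward pass and loss of a simple CBOW model can be sketched like this. It is an illustration under stated assumptions, not the reference implementation: word IDs index rows of W_in directly (equivalent to multiplying one-hot vectors by W_in), and the names softmax, cbow_loss, W_in and W_out are chosen for this sketch.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()          # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cbow_loss(context_ids, target_id, W_in, W_out):
    """Forward pass of a simple CBOW model for one (context, target) pair."""
    h = W_in[context_ids].mean(axis=0)      # average of the transformed context words
    scores = h @ W_out                      # one score per vocabulary word
    probs = softmax(scores)                 # scores -> probabilities
    loss = -np.log(probs[target_id] + 1e-7) # cross-entropy against the target word
    return probs, loss

rng = np.random.default_rng(0)
vocab_size, hidden_size = 7, 3
W_in = 0.01 * rng.standard_normal((vocab_size, hidden_size))
W_out = 0.01 * rng.standard_normal((hidden_size, vocab_size))
probs, loss = cbow_loss([0, 2], 1, W_in, W_out)   # context: IDs 0 and 2, target: ID 1
```

Before training, the probabilities are close to uniform, so the loss is close to log(7); learning lowers it by adjusting W_in and W_out.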
Preparing study data
context and target words
Convert to one-hot representation
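Generating contexts and targets from a corpus of word IDs, and converting them to one-hot form, can be sketched as below. The corpus is the ID list for "you say goodbye and i say hello ."; the function names create_contexts_target and convert_one_hot are illustrative.

```python
import numpy as np

def create_contexts_target(corpus, window_size=1):
    """Every word except the edges is a target; its neighbors form the context."""
    target = corpus[window_size:-window_size]
    contexts = []
    for idx in range(window_size, len(corpus) - window_size):
        cs = [corpus[idx + t] for t in range(-window_size, window_size + 1) if t != 0]
        contexts.append(cs)
    return np.array(contexts), np.array(target)

def convert_one_hot(ids, vocab_size):
    """Replace each word ID with its one-hot vector (adds one trailing axis)."""
    return np.eye(vocab_size, dtype=np.int32)[np.asarray(ids)]

corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, target = create_contexts_target(corpus, window_size=1)
contexts_1h = convert_one_hot(contexts, vocab_size=7)
target_1h = convert_one_hot(target, vocab_size=7)
```

With 8 words and a window size of 1, there are 6 training pairs, each with 2 context words.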
Implementation of CBOW model
Additional information
CBOW models and probability
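The probability that $w_t$ occurs given that $w_{t-1}$ and $w_{t+1}$ have occurred: $P(w_t \mid w_{t-1}, w_{t+1})$
Loss function L (negative log likelihood) of CBOW model: $L = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, w_{t+1})$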
skip-gram model
word2vec has two models
CBOW
skip-gram
Skip-gram is a model that inverts the context and target words processed by the CBOW model.
skip-gram network structure diagram
skip-gram models and probability
The skip-gram model has only one input layer, and the number of output layers is equal to the number of words in the context. First, the losses of each output layer must be calculated separately and then added together as the final loss.
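Predict the context $w_{t-1}$ and $w_{t+1}$ based on the middle word (target word) $w_t$: $P(w_{t-1}, w_{t+1} \mid w_t)$
The loss function of the skip-gram model can be expressed as $L = -\frac{1}{T} \sum_{t=1}^{T} \left( \log P(w_{t-1} \mid w_t) + \log P(w_{t+1} \mid w_t) \right)$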
Loss function comparison
The number of predictions of the skip-gram model is as many as the number of context words, so its loss function requires the sum of the losses corresponding to each context word. The CBOW model only requires the loss of the target word.
Judging from the accuracy of the distributed representation of words, the skip-gram model gives better results in most cases.
Counting-based vs. inference-based
Scenarios where new words need to be added to the vocabulary and the distributed representation of the words updated
Count-based methods require calculations from scratch
Inference-based methods allow for incremental learning of parameters
Properties of distributed representations of words
Count-based methods mainly encode the similarity of words
Inference-based methods can understand complex patterns between words
king − man + woman = queen
Accuracy of distributed representations of words
In terms of accuracy, there is no clear winner between inference-based methods and count-based methods
Speeding up word2vec
Don't try to know everything, or you will know nothing. ——Democritus (ancient Greek philosopher)
Improve
study
other
Application of word2vec
The distributed representation of words obtained using word2vec can be used to find approximate words
transfer learning
Knowledge learned in one field can be applied to other fields
When solving natural language processing tasks, word2vec is generally not used to learn the distributed representation of words from scratch. Instead, it is first trained on a large-scale corpus (text data such as Wikipedia, Google News, etc.), and then the learned distributed representation is applied to the individual task.
In natural language processing tasks such as text classification, text clustering, part-of-speech tagging, and sentiment analysis, the first step of word vectorization can use the distributed representation of learned words.
Distributed representations of words work great in almost all types of natural language processing tasks!
Using distributed representations of words, it is also possible to convert documents (sequences of words) into fixed-length vectors.
If you can convert natural language into vectors, you can use many machine learning methods
How to evaluate word vectors
Evaluate using a human-created word similarity evaluation set
The similarity between cat and animal is 8, and the similarity between cat and car is 2... Similar to this, the similarity between words is manually scored with a score from 0 to 10.
Compare the scores given by people and the cosine similarity given by word2vec to examine the correlation between them
in conclusion
Different models have different accuracies (choose the best model based on the corpus)
The bigger the corpus, the better the results (big data is always needed)
The dimensionality of word vectors must be moderate (too large will lead to poor accuracy)
RNN
I just remember me meowing and crying in a dark and humid place. ——Natsume Soseki's "I am a Cat"
Probability and Language Models
A simple feedforward network cannot fully learn the properties of time series data. As a result, RNN (Recurrent Neural Network) came into being.
word2vec from a probabilistic perspective
Can the original purpose of the CBOW model "predict target words from context" be used for something? Can P(wt|wt−2, wt−1) play a role in some practical scenarios?
The windows considered so far were all symmetric; from here on, only the left window is used as the context.
language model
Use probability to evaluate the likelihood that a sequence of words will occur, that is, the extent to which a sequence of words is natural.
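Probability representation
$P(w_1, \ldots, w_m) = \prod_{t=1}^{m} P(w_t \mid w_1, \ldots, w_{t-1})$, which follows from repeatedly applying the multiplication theorem P(A,B) = P(A|B)*P(B) = P(B|A)*P(A)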
Using the CBOW model as a language model
Markov chain
When the probability of an event depends only on the N events preceding it, it is called an "N-order Markov chain."
Limiting the context to the 2 words on the left is a second-order Markov chain
insufficient
If the window is too short, the context cannot be combined
If the window is too long, the order of words in the context will be ignored.
CBOW is short for Continuous Bag-Of-Words. Bag-Of-Words means "a bag of words," which means that the order of the words in the bag is ignored.
RNN has a mechanism that can remember context information no matter how long the context is. Therefore, time series data of arbitrary length can be processed using RNN.
word2vec is a method aimed at obtaining distributed representation of words, and is generally not used in language models.
RNN
recurrent neural network
The structure of the RNN layer
The input at time t is xt, which implies that time series data (x0, x1, ··· , xt, ···) will be input into the layer. Then, in the form corresponding to the input, output (h0, h1, ··· , ht, ···)
The output has two forks, which means the same thing was copied. A fork in the output will become its own input.
unroll loop
We use the word "moment" to refer to the index of time series data (that is, the input data at time t is xt). Expressions such as "the t-th word" and "the t-th RNN layer" are used, as well as expressions such as "the word at time t" or "the RNN layer at time t".
The RNN layer at each moment receives two values, which are the input passed to this layer and the output of the previous RNN layer.
RNN has two weights, namely the weight Wx that converts the input x into the output h and the weight Wh that converts the output of the previous RNN layer into the output at the current moment. Additionally, there is bias b. The output at time t is computed as $h_t = \tanh(h_{t-1} W_h + x_t W_x + b)$
From another perspective, RNN has a state h, which is continuously updated through the formula above. So h can be called a hidden state or a hidden state vector
The two schematic drawing methods are equivalent
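The unrolled RNN can be sketched as a loop that applies the same update at each time step. The sizes N, D, H and T, the random weights and the function name rnn_step are assumptions of this sketch; the update itself is h_t = tanh(h_{t-1} Wh + x_t Wx + b).

```python
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(h_prev @ Wh + x @ Wx + b)

rng = np.random.default_rng(0)
N, D, H, T = 2, 4, 3, 5                 # batch size, input size, hidden size, time steps
Wx = rng.standard_normal((D, H)) / np.sqrt(D)
Wh = rng.standard_normal((H, H)) / np.sqrt(H)
b = np.zeros(H)

xs = rng.standard_normal((N, T, D))     # time series data (x0, x1, ..., x4)
h = np.zeros((N, H))                    # initial hidden state
hs = []
for t in range(T):
    h = rnn_step(xs[:, t, :], h, Wx, Wh, b)   # the same weights are reused at every step
    hs.append(h)
hs = np.stack(hs, axis=1)               # outputs (h0, h1, ..., h4), shape (N, T, H)
```

The loop makes the "fork" explicit: each h is both appended to the outputs and passed into the next step.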
Backpropagation Through Time
time-based backpropagation
To find the gradient based on BPTT, the intermediate data of the RNN layer at each moment must be saved in memory. Therefore, as the time series data gets longer, the computer's memory usage (not just calculations) also increases.
Truncated BPTT
In order to solve the above problem, when processing long time series data, the common practice is to cut the network connection into an appropriate length.
Networks that are too long in the direction of the time axis are truncated at appropriate locations to create multiple small networks, and then the error backpropagation method is performed on the cut-out small networks. This method is called Truncated BPTT (truncated BPTT).
In Truncated BPTT, the backward propagation connection of the network is cut off, but the forward propagation connection is still maintained.
Processing order
The first thing to do is to feed the input data of block 1 (x0, ... , x9) into the RNN layer.
Perform forward propagation first and then back propagation to get the desired gradient.
Next, the input data of the next block (x10, x11, ··· , x19) are fed into the RNN layer. The calculation of this forward propagation requires the last hidden state h9 of the previous block, so that the forward propagation connection can be maintained.
Mini-batch learning of Truncated BPTT
At the beginning of the input data, an "offset" needs to be made within individual batches.
Notice
To enter data in order
To shift the starting position of each batch (each sample) of input data
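The two points above (sequential order and a shifted start per sample) can be sketched as a small batch generator. The generator name truncated_bptt_batches and the use of np.arange as stand-in data are assumptions; the offsets spread the samples of a batch evenly over the data.

```python
import numpy as np

def truncated_bptt_batches(xs, batch_size, time_size, num_iters):
    """Yield (batch_size, time_size) blocks; each row reads the data sequentially from its own offset."""
    data_size = len(xs)
    jump = data_size // batch_size
    offsets = [i * jump for i in range(batch_size)]   # start position of each sample
    time_idx = 0
    for _ in range(num_iters):
        batch = np.empty((batch_size, time_size), dtype=xs.dtype)
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                batch[i, t] = xs[(offset + time_idx) % data_size]
            time_idx += 1                             # continue where the last block stopped
        yield batch

xs = np.arange(20)                                    # stand-in for a corpus of word IDs
batches = list(truncated_bptt_batches(xs, batch_size=2, time_size=5, num_iters=2))
```

Row 0 of consecutive blocks covers positions 0 to 4 and then 5 to 9, while row 1 covers 10 to 14 and then 15 to 19, so each row stays in order across blocks.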
Implementation of RNN
Considering learning based on Truncated BPTT, the target neural network receives time series data of length T (T is any value), and these T states can be regarded as a layer
Call a layer that processes T steps at a time a "Time RNN layer"
The layer that performs the single-step processing in the Time RNN layer is called the "RNN layer"
Like Time RNN, layers that process time series data holistically are named starting with the word "Time", which is the naming convention laid out in this book. After that, we will also implement the Time Affine layer, Time Embedding layer, etc.
Implementation of RNN layer
forward propagation
Backpropagation
Implementation of Time RNN layer
forward propagation
Time RNN layer saves the hidden state h in a member variable to inherit the hidden state between blocks
The parameter stateful controls whether the hidden state h is maintained across calls. In forward propagation, when stateful is False, the hidden state of the first RNN layer is initialized to the zero matrix.
Backpropagation
We store the gradient flowing to the hidden state at the previous moment in the member variable dh. This is because we will use it when we discuss seq2seq (sequence-to-sequence) in Chapter 7
Implementation of layers for processing time series data
Full picture of RNNLM
RNN-based language models are called RNNLM
structure
Layer 1 is the Embedding layer, which converts word IDs into distributed representations of words (word vectors). This word vector is fed into the RNN layer.
The RNN layer outputs the hidden state to the next layer (top), and also outputs the hidden state to the next RNN layer (right).
The hidden state output by the RNN layer upward passes through the Affine layer and is passed to the Softmax layer.
Sample
you say goodbye and i say hello
The first word, you with word ID 0, is entered. At this time, looking at the probability distribution output by the Softmax layer, we can see that the probability of say is the highest, which indicates that the word that appears after you is correctly predicted to be say.
The second word, say, is entered. At this time, the output of the Softmax layer has higher probabilities at goodbye and hello, because the RNN layer saves the past information "you say" as a compact hidden state vector. The job of the RNN layer is to pass this information to the Affine layer above and to the RNN layer at the next moment.
Implementation of Time layer
Target neural network structure
Time Affine
The Time Affine layer does not simply use T Affine layers, but uses matrix operations to achieve efficient overall processing.
Time Softmax
The loss is computed together with Softmax in a Softmax with Loss layer, in which the Cross Entropy Error layer calculates the cross-entropy error.
The T Softmax with Loss layers each calculate a loss; these losses are added together and averaged, and the resulting value is used as the final loss.
Learning and evaluation of RNNLM
Implementation of RNNLM
Language model evaluation
A single input
Perplexity is often used as an indicator to evaluate the prediction performance of language models. For a single input, perplexity = 1 / probability, that is, the reciprocal of the probability assigned to the correct word
Multiple inputs
$L = -\frac{1}{N} \sum_{n} \sum_{k} t_{nk} \log y_{nk}$, $\text{perplexity} = e^{L}$. Here, it is assumed that the amount of data is N. $t_n$ is the correct solution label in the form of a one-hot vector, $t_{nk}$ represents the k-th value of the n-th data, and $y_{nk}$ represents the probability distribution (the output of Softmax in a neural network). L is the loss of the neural network
The higher the probability, the better; the lower the perplexity, the better.
Perplexity represents the number of options that can be chosen next. If the perplexity is 1.25, it means that the number of candidates for the next word is about 1.
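Perplexity can be computed directly from the probabilities a model assigns to the correct words. In this sketch, the function name perplexity is illustrative and the probability values are made-up inputs; it computes e^L, where L is the average cross-entropy loss.

```python
import numpy as np

def perplexity(correct_probs):
    """exp of the average negative log probability assigned to the correct words."""
    correct_probs = np.asarray(correct_probs, dtype=np.float64)
    L = -np.mean(np.log(correct_probs))   # cross-entropy loss with one-hot labels
    return float(np.exp(L))

ppl_good = perplexity([0.8])         # confident model: about 1.25 candidates
ppl_bad = perplexity([0.2])          # uncertain model: about 5 candidates
ppl_mixed = perplexity([0.8, 0.2])   # geometric-mean behavior over two predictions
```

For a single prediction this reduces to 1 / probability, matching the single-input case.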
Learning of RNNLM
Trainer class of RNNLM
Encapsulate the above operations into classes
Extended to graph structures
Recursive neural network
There are three hidden layers h1, h2 and h3, where h1 is calculated from two inputs x1 and x2, h2 is calculated from two other input layers x3 and x4, and h3 is calculated from two hidden layers h1 and h2.
graph network
Gated RNN
Take off your baggage and travel light. ——Nietzsche
When we say RNN, we refer more to the LSTM layer than the RNN from the previous chapter. When we need to explicitly refer to the RNN from the previous chapter, we say "simple RNN" or "Elman".
Problems with simple RNN
During the learning process, the RNN layer learns dependencies in the time direction by passing "meaningful gradients" to the past. But the gradient of learning is difficult to control, which can lead to gradient disappearance or gradient explosion.
reason
activation function
tanh
As the graph of the derivative of tanh shows, its value is at most 1.0 and gets smaller as x moves away from 0. This means that each time the backpropagated gradient passes through a tanh node, its value becomes smaller. Therefore, after passing through the tanh function T times, the gradient has also been reduced T times.
ReLU
Gradient does not degrade
MatMul (Matrix Product) Node
gradient explosion
As shown in the figure, the size of the gradient increases exponentially with the time step. If a gradient explosion occurs, it will eventually lead to overflow and values such as NaN (Not a Number, non-numeric value). As a result, the learning of the neural network will not work correctly.
Vanishing gradient
As shown in the figure, the size of the gradient decreases exponentially with the time step. When the gradient vanishes, it quickly becomes very small. Once the gradient is small, the weights can no longer be updated and the model cannot learn long-term dependencies.
Reason for change
The matrix Wh is repeatedly multiplied T times. If Wh were a scalar, the problem would be simple: when Wh is greater than 1, the gradient increases exponentially; when Wh is less than 1, the gradient decreases exponentially.
If Wh is a matrix, the singular values of the matrix become the indicator. Simply put, the singular values of a matrix represent the degree of dispersion of the data. Depending on whether the largest singular value is greater than 1, one can predict how the magnitude of the gradient will change (a largest singular value greater than 1 is a necessary condition for exponential growth, not a sufficient one).
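The effect of repeatedly multiplying by Wh can be observed numerically. This sketch deliberately ignores the tanh nodes and uses scaled identity matrices as Wh so that the outcome is easy to verify; the function name grad_norms is hypothetical.

```python
import numpy as np

def grad_norms(Wh, T):
    """Backpropagate a gradient of ones through T MatMul nodes and record its norm."""
    dh = np.ones((1, Wh.shape[0]))
    norms = []
    for _ in range(T):
        dh = dh @ Wh.T                    # backward pass of the MatMul node
        norms.append(float(np.linalg.norm(dh)))
    return norms

H, T = 3, 20
Wh_large = 2.0 * np.eye(H)                # largest singular value 2.0
Wh_small = 0.5 * np.eye(H)                # largest singular value 0.5
max_sv_large = np.linalg.svd(Wh_large, compute_uv=False).max()
max_sv_small = np.linalg.svd(Wh_small, compute_uv=False).max()
exploding = grad_norms(Wh_large, T)       # norm doubles at every step
vanishing = grad_norms(Wh_small, T)       # norm halves at every step
```

In both cases the change is exponential in the number of time steps, which is what makes long sequences difficult for a simple RNN.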
Countermeasures against exploding gradients
There is an established method to solve gradient explosion, which is called gradient clipping.
It is assumed here that the gradients of all parameters used by the neural network are combined into a single variable, denoted by g, and that the threshold is denoted by threshold. If the L2 norm $\lVert g \rVert$ of the gradient is greater than or equal to the threshold, the gradient is corrected as $g \leftarrow \frac{\text{threshold}}{\lVert g \rVert}\, g$.
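Gradient clipping can be sketched as follows. The function name clip_grads, the list-of-arrays representation of the gradients and the small constant 1e-6 in the denominator are assumptions of this sketch; the gradients are rescaled in place only when their combined L2 norm exceeds max_norm.

```python
import numpy as np

def clip_grads(grads, max_norm):
    """Rescale all gradients in place if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for g in grads:
            g *= rate               # in-place update keeps the arrays shared with the caller
    return total_norm

grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]   # combined norm is sqrt(84), about 9.17
norm_before = clip_grads(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in grads))

small = [np.array([0.1, 0.2])]                    # already below the threshold
clip_grads(small, max_norm=5.0)
```

Clipping changes only the magnitude of the gradient, not its direction, so the update still points the same way.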
Vanishing gradients and LSTM
In RNN learning, gradient disappearance is also a big problem. In order to solve this problem, the structure of the RNN layer needs to be fundamentally changed. Here the topic of this chapter, Gated RNN, is about to appear. Many Gated RNN frameworks (network structures) have been proposed, among which LSTM and GRU are the representative ones.
LSTM interface
LSTM is the abbreviation of Long Short-Term Memory, which means that it can maintain short-term memory (Short-Term Memory) for a long time.
First express the calculation of tanh(h(t−1)*Wh + xt*Wx + b) as a rectangular node tanh (ht−1 and xt are row vectors)
Let’s compare the interface (input and output) of LSTM and RNN
The difference between the interface of LSTM and RNN is that LSTM also has path c. This c is called a memory unit (or simply "unit"), which is equivalent to the dedicated memory department of LSTM.
The characteristic of the memory unit is that it only receives and transmits data within the LSTM layer. That is to say, from the side receiving the output of the LSTM, the output of the LSTM only has the hidden state vector h. Memory unit c is invisible to the outside world, and we don't even need to consider its existence.
The structure of the LSTM layer
ct stores the memory of the LSTM at time t, which can be considered to contain all necessary information from the past up to time t. Then, based on this memory-carrying ct, the hidden state ht is output.
calculate
The current memory unit ct is calculated based on the three inputs c(t−1) h(t−1) and xt through "some kind of calculation" (described later).
The hidden state ht is calculated using the updated ct, the formula is ht = tanh(ct)
Gate
The opening and closing degree of the door is also automatically learned from the data. The opening and closing degree is represented by a real number from 0.0 to 1.0 (1.0 is fully open)
output gate
The hidden state ht only applies the tanh function to the memory unit ct, and we consider applying gates to tanh(ct). Since this gate manages the output of the next hidden state ht, it is called an output gate.
The output gate is calculated as follows, where the sigmoid function is represented by σ(): $o = \sigma(x_t W_x^{(o)} + h_{t-1} W_h^{(o)} + b^{(o)})$
ht is calculated as the product of o and tanh(ct): $h_t = o \odot \tanh(c_t)$. The product here is element-wise, that is, the product of corresponding elements, also called the Hadamard product.
Forget gate
Now, we add a gate on the memory unit c(t−1) that forgets unnecessary memories, which is called the forget gate here.
The forget gate is calculated as follows: $f = \sigma(x_t W_x^{(f)} + h_{t-1} W_h^{(f)} + b^{(f)})$
ct is obtained as the element-wise product of this f and the previous memory unit ct−1: $c_t = f \odot c_{t-1}$
new memory unit
Now we also want to add some new information to this memory unit that should be remembered, for this we add a new tanh node
The result g calculated by the tanh node, $g = \tanh(x_t W_x^{(g)} + h_{t-1} W_h^{(g)} + b^{(g)})$, is added to the memory unit ct−1 from the previous moment.
The role of this tanh node is not to gate, but to add new information to the memory unit. Therefore, it does not use the sigmoid function as the activation function, but uses the tanh function.
input gate
We add a gate to g in Figure 6-17. This newly added gate is called the input gate here.
The input gate determines the value of each element of the new information g. Input gates do not add new information without consideration; rather, they make choices about which information to add. In other words, the input gate adds weighted new information.
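The input gate is calculated as follows: i = σ(xt Wx^(i) + h(t−1) Wh^(i) + b^(i)). The element-wise product of i and g is then added to the memory unit.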
LSTM gradient flow
Backpropagation through the memory unit only flows through the "+" and "×" nodes. The "+" node passes the upstream gradient on as it is, so the gradient does not change (no degradation). The "×" node computes not a matrix product but the element-wise product (Hadamard product), and a different gate value is used at each time step, so vanishing (or exploding) gradients are unlikely to occur.
Implementation of LSTM
The four affine transformations of the form x Wx + h Wh + b (one for each of f, g, i, and o) can be integrated into a single formula.
Matrix libraries are generally faster when computing "large matrices", and managing the weights together keeps the source code cleaner.
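The gate calculations and the fused affine transformation described above can be sketched in NumPy as follows. This is a minimal sketch rather than the book's implementation: the function name lstm_step, the argument names, and the column order f, g, i, o inside the stacked weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step.

    x: (N, D), h_prev and c_prev: (N, H)
    Wx: (D, 4H), Wh: (H, 4H), b: (4H,)  -- four weight sets stacked side by side
    """
    H = h_prev.shape[1]
    A = x @ Wx + h_prev @ Wh + b       # one affine transformation for all four parts
    f = sigmoid(A[:, :H])              # forget gate
    g = np.tanh(A[:, H:2 * H])         # new information (not a gate, so tanh)
    i = sigmoid(A[:, 2 * H:3 * H])     # input gate
    o = sigmoid(A[:, 3 * H:])          # output gate
    c_next = f * c_prev + g * i        # element-wise (Hadamard) products
    h_next = o * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
x = rng.standard_normal((N, D))
h0 = np.zeros((N, H))
c0 = np.zeros((N, H))
Wx = rng.standard_normal((D, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h1, c1 = lstm_step(x, h0, c0, Wx, Wh, b)
```

Only h1 would be passed to the layer above; c1 stays inside the LSTM layer and is handed to the next time step.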
Language model using LSTM
The language model implemented here is almost the same as the previous chapter. The only difference is that where the Time RNN layer was used in the previous chapter, the Time LSTM layer is used this time.
Further improvements to RNNLM
Multi-layering of LSTM layers
Deepening the LSTM layer (stacking multiple LSTM layers) often works well.
How many layers are appropriate?
Because the number of layers is a hyperparameter, it needs to be determined based on the complexity of the problem to be solved and the size of the training data that can be provided.
In the case of learning a language model on the PTB data set, better results can be obtained when the number of LSTM layers is 2 to 4
The GNMT model used in Google Translate is a network that stacks 8 LSTM layers.
Suppress overfitting based on Dropout
By deepening the depth, more expressive models can be created, but such models often suffer from overfitting. To make matters worse, RNNs are more prone to overfitting than conventional feedforward neural networks, so overfitting countermeasures for RNNs are very important.
Countermeasures
Add training data
Reduce model complexity
Regularization
Dropout
Dropout
Dropout randomly selects a subset of neurons, then ignores them and stops transmitting signals forward.
Dropout layer insertion position
Regular Dropout
Incorrect structure
If Dropout is inserted in the time series direction, information will be gradually lost over time as the model learns.
Correct structure
Insert the Dropout layer in the depth (vertical) direction
Variational Dropout
By sharing the mask between Dropouts of the same layer, the mask is "fixed". In this way, the way information is lost is also "fixed", so the exponential information loss that occurs with regular Dropout can be avoided.
weight sharing
Weight tying can be literally translated as "weight binding".
The technique of binding (sharing) the weights of the Embedding layer and the Affine layer is called weight tying. Sharing the weights between these two layers greatly reduces the number of parameters to learn and, in addition, improves accuracy.
Better RNNLM implementation
Frontier Research
Generate text based on RNN
There is no such thing as perfect writing, just as there is no such thing as perfect despair. [Haruki Murakami, "Hear the Wind Sing"]
Generate text using language models
How to generate the next new word
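Deterministic method: select the word with the highest probability. The result is uniquely determined.
Probabilistic method: sample a word according to the probability distribution. Words with high probability are easily selected and words with low probability are rarely selected, so the result can differ each time.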
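The two ways of choosing the next word can be sketched as below. The function name pick_next_word and its arguments are hypothetical; probs stands for the probability distribution over the vocabulary that the language model outputs.

```python
import numpy as np

def pick_next_word(probs, deterministic=False, rng=None):
    """Return the ID of the next word given a probability distribution."""
    probs = np.asarray(probs, dtype=float)
    if deterministic:
        # Always the most probable word: the same input gives the same output.
        return int(np.argmax(probs))
    if rng is None:
        rng = np.random.default_rng()
    # Sample in proportion to probability: the output can vary between runs.
    return int(rng.choice(len(probs), p=probs))

probs = [0.1, 0.7, 0.2]
greedy_id = pick_next_word(probs, deterministic=True)
sampled_id = pick_next_word(probs, rng=np.random.default_rng(0))
```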
Implementation of text generation
Better text generation
Use better language models
seq2seq model
Seq2Seq (Sequence to Sequence, sequence to sequence model)
Models for converting time series data into other time series data
The principle of seq2seq
This model has two modules - Encoder and Decoder. The encoder encodes the input data and the decoder decodes the encoded data.
seq2seq consists of two LSTM layers, the encoder LSTM and the decoder LSTM.
The hidden state h of the LSTM is a fixed-length vector. The only difference from the language model in the previous section is that the decoder's LSTM layer receives the vector h. This single, small change turns an ordinary language model into a decoder that can handle translation.
A simple attempt at time series data conversion
Trying to get seq2seq to do addition calculations
Variable length time series data
Padding
Fill the original data with invalid (meaningless) data so that all the data have the same length.
When using padding you need to add some padding-specific processing to seq2seq
When padding is input to the decoder, its loss should not be calculated (this can be solved by adding a mask function to the Softmax with Loss layer)
When input padding in the encoder, the LSTM layer should output the input from the previous moment as is
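The decoder-side rule (no loss for padding) can be sketched as a masked cross-entropy. This is an illustrative sketch, not the book's Softmax with Loss layer: the function name, the pad_id argument, and the small constant 1e-7 are assumptions. The encoder-side rule is not covered here.

```python
import numpy as np

def masked_cross_entropy(probs, targets, pad_id):
    """Average cross-entropy over the non-padding positions only.

    probs: (T, V) softmax outputs, targets: (T,) correct word IDs.
    """
    targets = np.asarray(targets)
    mask = targets != pad_id                       # False where the target is padding
    picked = probs[np.arange(len(targets)), targets]
    losses = -np.log(picked + 1e-7)
    return float((losses * mask).sum() / max(mask.sum(), 1))

probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.98, 0.01, 0.01]])
targets = [1, 2, 0]          # the last position is padding (ID 0 in this toy example)
loss = masked_cross_entropy(probs, targets, pad_id=0)
```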
Addition data set
Implementation of seq2seq
Improvements to seq2seq
Reverse input data
In many cases, learning progresses faster and final accuracy improves after using this technique.
Intuitively, the propagation of gradients can be smoother and more effective after inverting the data.
Peeky
The decoder using this improvement is called the Peeky Decoder, and the seq2seq using the Peeky Decoder is called Peeky seq2seq.
The encoder converts the input sentence into a fixed-length vector h, which concentrates all the information required by the decoder. It is the only source of information for the decoder.
The output h of the encoder, which concentrates important information, can be assigned to other layers of the decoder
When two vectors are input to the LSTM layer or the Affine layer at the same time, what is actually input is the concatenation of the two vectors.
Application of seq2seq
Machine Translation: Convert "text in one language" to "text in another language"
Autosummary: Convert "a long text" into a "short summary"
Question and answer system: convert "question" into "answer"
Email auto-reply: Convert "received email text" to "reply text"
chatbot
algorithm learning
Automatic image description
Attention
Attention is all you need. [title of the paper by Vaswani et al.]
Attention is undoubtedly one of the most important technologies in the field of deep learning in recent years. The goal of this chapter is to understand the structure of Attention at the code level, and then apply it to practical problems to experience its wonderful effects.
Attention structure
Problems with seq2seq
An encoder is used in seq2seq to encode the time series data, and then the encoded information is passed to the decoder. At this point, the output of the encoder is a fixed-length vector.
A fixed-length vector means that the input sentence is converted into a vector of the same length no matter how long it is.
Encoder improvements
The length of the encoder's output should change accordingly based on the length of the input text
Because the encoder processes from left to right, strictly speaking, the vector for "猫" (cat) contains only the information of the three words "吾輩", "は", and "猫". Considering the overall balance, it is better for the vector to include the information around the word "猫" evenly. In this case, a bidirectional RNN (or bidirectional LSTM), which processes time series data from both directions, is more effective.
Decoder improvements
Previously we put the "last" hidden state of the encoder's LSTM layer into the "initial" hidden state of the decoder's LSTM layer.
The decoder in the previous chapter only used the last hidden state of the encoder's LSTM layer. If using hs, only the last row is extracted and passed to the decoder. Next we improve the decoder to be able to use all hs.
We focus on a certain word (or set of words) and convert that word as needed. This allows seq2seq to learn the correspondence between words in the input and words in the output, that is, which words are related to which.
Example
吾輩 [わがはい] = I
猫[ねこ] = cat
Many studies exploit knowledge of word correspondences such as "猫 = cat". Information indicating the correspondence between words (or phrases) is called alignment. So far, alignment has mainly been done by hand, but the Attention technique introduced here brings the idea of alignment into seq2seq automatically. This is an evolution from manual work to mechanical automation.
structural changes
How to calculate
Can the operation of "selection" be replaced by a differentiable operation? Instead of "single selection", it is better to "select all". We separately calculate the weight representing the importance (contribution value) of each word.
The weight a is like a probability distribution: each element is a scalar from 0.0 to 1.0, and the elements sum to 1. The weighted sum of the word vectors hs, using these importance weights, gives the target vector.
When processing sequence data, the network should pay more attention to the important parts of the input and ignore the unimportant parts. By learning a weight for each part, it explicitly weights the important parts of the input sequence so that the model can focus on the information relevant to the output. The key to the Attention mechanism is to compute the weight of each position in the input sequence dynamically, so that at each time step the parts of the input sequence are weighted and summed to obtain the output of that time step. When generating each output, the decoder pays a different amount of attention to each part of the input sequence.
Learning of weight a
Our goal is to express numerically how "similar" this h is to the individual word vectors of hs.
Here we use the simplest vector inner product.
There are several ways to calculate vector similarity. In addition to inner products, there is also the practice of using small neural networks to output scores.
Next, the scores s are normalized using the familiar Softmax function
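The steps above (inner-product scores, Softmax normalization, weighted sum) can be sketched for a single decoder time step as follows. The function names and the toy vectors are assumptions for illustration; batching is omitted to keep the shapes easy to follow.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(hs, h):
    """hs: (T, H) encoder hidden states, h: (H,) decoder hidden state."""
    s = hs @ h        # similarity of h to each row of hs (inner product)
    a = softmax(s)    # weights in [0, 1] that sum to 1
    c = a @ hs        # weighted sum of the rows of hs: the context vector
    return c, a

hs = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
h = np.array([2.0, 0.0])
c, a = attention(hs, h)
```

Every step is differentiable, which is what allows the weights to be learned by backpropagation instead of making a hard "selection" of a single word.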
Integrate
Implementation of seq2seq with Attention
Attention's evaluation
We confirm the effect of seq2seq with Attention on the "date format conversion" problem
Date format conversion problem
Learning of seq2seq with Attention
Visualization of Attention
Other topics about Attention
Bidirectional RNN
If we consider the overall balance, we want the vector to contain information around the word "cat" more evenly.
Bidirectional LSTM adds an LSTM layer processing in the opposite direction on top of the previous LSTM layer.
Concatenate the hidden states of the two LSTM layers at each time step and use the result as the final hidden state vector (besides concatenation, "sum" or "average" can also be used)
How to use the Attention layer
The output of the attention layer (context vector) is connected to the input of the LSTM layer at the next moment. Through this structure, the LSTM layer is able to use the information of the context vector. In contrast, the model we implemented uses context vectors in the Affine layer.
Deepening of seq2seq and skip connection
Deepen the RNN layer
seq2seq with attention using 3 layers of LSTM layer
residual connection
At the junction of the residual connection, two outputs are added.
Because addition propagates gradients "as is" during backpropagation, gradients at a residual connection reach the previous layer unchanged. In this way, even when the network is deepened, gradients can propagate without vanishing (or exploding), and learning can proceed smoothly.
Application of Attention
GNMT
The history of machine translation
Rule-based translation
Example-based translation
Statistics-based translation
Neural Machine Translation
Since 2016, Google Translate has used neural machine translation in its production service. The machine translation system is called GNMT (Google Neural Machine Translation).
GNMT requires large amounts of data and computing resources. One model was trained on a large amount of training data using nearly 100 GPUs for 6 days. In addition, GNMT tries to improve accuracy further with techniques such as ensemble learning, which trains 8 models in parallel, and reinforcement learning.
Transformer
RNN can handle variable-length time series data well. However, RNN also has shortcomings, such as parallel processing problems.
RNN needs to be calculated step by step based on the calculation results of the previous moment, so it is (basically) impossible to calculate RNN in parallel in the time direction. This will become a big bottleneck when performing deep learning in a parallel computing environment using GPUs, so we have the motivation to avoid RNN.
Transformer does not use RNN, but uses Attention for processing. Let's take a brief look at this Transformer.
Self-Attention
Transformer is based on Attention and relies on an important technique called Self-Attention. Self-Attention literally means "attention to oneself": it is Attention computed within a single time series, and it looks at the relationship between each element of the series and the other elements.
The Feed Forward layer is a fully connected neural network with one hidden layer and ReLU activation. In addition, "Nx" in the figure means that the elements surrounded by the gray background are stacked N times.
NTM
NTM (Neural Turing Machine)
Neural networks can also gain additional capabilities using external storage devices.
Based on Attention, the encoder and decoder implement the "memory operations" of a computer. In other words, the encoder writes the necessary information to memory, and the decoder reads the necessary information from memory.
To imitate a computer's memory operations, NTM uses two kinds of Attention: content-based Attention and position-based Attention.
Content-based Attention is the same as the Attention we introduced before, and is used to find similar vectors of a certain vector (query vector) from memory.
Position-based Attention is used to move forward and backward from the memory address (the weight of each location in the memory) that was focused on at the last moment. Here we omit the discussion of its technical details, which can be achieved through one-dimensional convolution operation. The movement function based on the memory location can reproduce the unique computer activity of "reading while advancing (one memory address)".
NTM has successfully solved long-term memory problems, sorting problems (arranging numbers from large to small), etc.
Network optimization and regularization
No mathematical trick can compensate for the lack of information [Cornelius Lanczos]
Two major difficulties
Optimization
Difficult to optimize and computationally intensive
generalization problem
The fitting ability is too strong and it is easy to overfit.
Network Optimization
Difficulties in network optimization
Network structure diversity
It is difficult to find a general optimization method. Different optimization methods also have relatively large differences in different network structures.
Difficulties with low-dimensional spaces
How to choose initialization parameters
Escape from the local optimum
Difficulties with high-dimensional spaces
How to choose initialization parameters
How to escape from a saddle point
In some dimensions it is the highest point, in other dimensions it is the lowest point
Flat minima
There are many parameters in deep neural networks and there is a certain degree of redundancy, which results in each single parameter having a relatively small impact on the final loss.
stuck in local minimum
optimization
Gradient descent method type
batch gradient descent
stochastic gradient descent
mini-batch gradient descent
In batch gradient descent, computing the gradient over the entire training data at every iteration requires a lot of computing resources. In addition, the data in large training sets is often highly redundant, so there is no need to compute gradients over the entire training set.
learning rate decay
The learning rate should be kept larger at the beginning to ensure the convergence speed, and smaller when it converges to near the optimal point to avoid back and forth oscillations.
type
Inverse time decay
exponential decay
natural exponential decay
β is the decay rate, generally set to 0.96.
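The formulas for the three decay types did not survive in these notes, so the sketch below assumes their commonly used forms, with lr0 the initial learning rate, beta the decay rate, and t the iteration count. The function names are hypothetical.

```python
import math

def inverse_time_decay(lr0, beta, t):
    # lr_t = lr0 / (1 + beta * t)
    return lr0 / (1.0 + beta * t)

def exponential_decay(lr0, beta, t):
    # lr_t = lr0 * beta^t, with beta < 1 (for example 0.96)
    return lr0 * beta ** t

def natural_exp_decay(lr0, beta, t):
    # lr_t = lr0 * exp(-beta * t)
    return lr0 * math.exp(-beta * t)

schedule = [exponential_decay(0.1, 0.96, t) for t in range(5)]
```

In all three cases the learning rate starts at lr0 and shrinks as t grows, matching the idea of large steps early and small steps near the optimum.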
There are also methods for adaptively adjusting the learning rate, such as AdaGrad, RMSprop, AdaDelta, etc.
AdaGrad method
Among the effective techniques for learning rate is a method called learning rate decay, which gradually decreases the learning rate as learning proceeds.
AdaGrad takes this idea further, adjusting the learning rate appropriately for each element of the parameters while learning at the same time
Ada comes from the English word Adaptive, which means "adjusting to suit the situation"
As with SGD, W represents the weight parameter to be updated, the partial derivative of the loss with respect to W represents the gradient, and η represents the learning rate.
But a new variable h appears, which stores the sum of the squares of all previous gradient values. Therefore, the deeper the learning, the smaller the update amplitude.
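The role of h can be sketched as follows: h accumulates the squared gradients, and each update is divided by the square root of h. This is a minimal sketch; the dictionary-based interface and the small constant 1e-7 (to avoid division by zero) are assumptions.

```python
import numpy as np

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {k: np.zeros_like(v) for k, v in params.items()}
        for k in params:
            self.h[k] += grads[k] * grads[k]    # sum of squares of all past gradients
            params[k] -= self.lr * grads[k] / (np.sqrt(self.h[k]) + 1e-7)

params = {"W": np.array([1.0, 2.0])}
grads = {"W": np.array([0.5, 0.5])}
opt = AdaGrad(lr=0.1)

before = params["W"].copy()
opt.update(params, grads)
step1 = before - params["W"]     # size of the first update

before = params["W"].copy()
opt.update(params, grads)
step2 = before - params["W"]     # size of the second update, same gradient
```

With the same gradient applied twice, the second step is smaller than the first because h has grown.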
RMSProp method
With AdaGrad, if learning continues endlessly, the update amount eventually becomes 0
The RMSProp method does not add all the past gradients equally, but gradually forgets the past gradients and reflects more information about the new gradients when doing the addition operation.
Technically speaking, this operation is called "exponential moving average", which exponentially reduces the scale of past gradients.
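The exponential moving average can be sketched as below. The class layout, the name decay_rate, and the constant 1e-7 are assumptions for illustration.

```python
import numpy as np

class RMSProp:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {k: np.zeros_like(v) for k, v in params.items()}
        for k in params:
            # Old squared gradients are scaled down; the newest one is mixed in.
            self.h[k] = (self.decay_rate * self.h[k]
                         + (1.0 - self.decay_rate) * grads[k] * grads[k])
            params[k] -= self.lr * grads[k] / (np.sqrt(self.h[k]) + 1e-7)

params = {"W": np.array([1.0])}
grads = {"W": np.array([0.5])}
opt = RMSProp(lr=0.01, decay_rate=0.9)
for _ in range(200):
    opt.update(params, grads)
```

Under a constant gradient, h settles near the squared gradient instead of growing without bound, so the update amount does not shrink toward 0.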
Gradient direction optimization
Momentum method
In mini-batch gradient descent, if the number of samples selected each time is relatively small, the loss will decrease in an oscillating manner.
The momentum method uses the average gradient over the most recent period, instead of the gradient at the current moment, as the direction of the parameter update.
also called momentum method
Disadvantages of SGD
f(x, y) = (1/20) x^2 + y^2
Optimized update path based on SGD: moving towards the minimum value (0, 0) in a zigzag shape, low efficiency
ways to improve
As with SGD, W represents the weight parameter to be updated, the partial derivative of the loss with respect to W represents the gradient, and η represents the learning rate.
A new variable v appears, which corresponds to velocity in physics. The gradient term can be understood as a force exerted on the object in the gradient direction, and this force changes the velocity.
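The velocity update can be sketched on the function f(x, y) = (1/20) x^2 + y^2 from above. The update form v ← αv − η·grad, W ← W + v is the commonly used one; the starting point (−7, 2), the 30 steps, and the values lr=0.1 and momentum=0.9 are assumptions chosen for illustration.

```python
import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        for k in params:
            # The gradient acts like a force that changes the velocity v.
            self.v[k] = self.momentum * self.v[k] - self.lr * grads[k]
            params[k] += self.v[k]

def f(p):
    return p[0] ** 2 / 20.0 + p[1] ** 2

def grad_f(p):
    return np.array([p[0] / 10.0, 2.0 * p[1]])

params = {"pos": np.array([-7.0, 2.0])}
start_value = f(params["pos"])
opt = Momentum(lr=0.1, momentum=0.9)
for _ in range(30):
    opt.update(params, {"pos": grad_f(params["pos"])})
end_value = f(params["pos"])
```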
Adam method
Momentum moves according to the physical rules of a ball rolling in a bowl, and AdaGrad adjusts the update pace appropriately for each element of the parameter. Combining them is Adam's idea
There is (currently) no method that performs well on all problems. Each of the four methods (SGD, Momentum, AdaGrad, Adam) has its own characteristics, with problems it is good at solving and problems it is not.
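The Adam idea described above can be sketched as follows. This follows the commonly published form of Adam with bias correction; the dictionary interface and the default values are assumptions, and the book's own implementation may differ in detail.

```python
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0
        self.m = None   # moving average of gradients (Momentum-like)
        self.v = None   # moving average of squared gradients (AdaGrad/RMSProp-like)

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(p) for k, p in params.items()}
            self.v = {k: np.zeros_like(p) for k, p in params.items()}
        self.t += 1
        for k in params:
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * grads[k]
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * grads[k] ** 2
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

params = {"W": np.array([1.0, -1.0])}
grads = {"W": np.array([0.3, -2.0])}
opt = Adam(lr=0.001)
opt.update(params, grads)
```

After the first update each element has moved by roughly lr in the direction opposite to its gradient, regardless of the gradient's magnitude, which shows the per-element scaling at work.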
Gradient clipping
If the gradient suddenly becomes large, updating the parameters with it moves them far away from the optimal point.
When the norm of the gradient exceeds a certain threshold, the gradient is clipped.
The norm of the gradient is limited to an interval, and the gradient is clipped when its norm falls below or above this interval.
type
Clip by value
gt = max(min(gt, b), a), which limits each element to the interval [a, b].
Clip by norm
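Both types can be sketched in a few lines. The function names are hypothetical, and the norm-based rule assumes the common form in which the gradient is rescaled to norm b whenever its norm exceeds b.

```python
import numpy as np

def clip_by_value(g, a, b):
    # Element-wise: g = max(min(g, b), a)
    return np.clip(g, a, b)

def clip_by_norm(g, b):
    # Rescale the whole gradient so that its norm is at most b.
    norm = np.linalg.norm(g)
    if norm > b:
        return g * (b / norm)
    return g

by_value = clip_by_value(np.array([-2.0, 0.5, 3.0]), -1.0, 1.0)
by_norm = clip_by_norm(np.array([3.0, 4.0]), 1.0)
```

Clipping by norm keeps the direction of the gradient, while clipping by value can change it because each element is limited independently.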
Parameter initialization
Gaussian distribution initialization
The Gaussian initialization method is the simplest initialization method. The parameters are randomly initialized from a Gaussian distribution with a fixed mean (such as 0) and a fixed variance (such as 0.01).
When the number of input connections of a neuron is n_in, its input connection weights can be initialized from the Gaussian distribution N(0, sqrt(1/n_in)).
If the number of output connections n_out is also considered, the weights can be initialized from the Gaussian distribution N(0, sqrt(2/(n_in + n_out))).
Uniformly distributed initialization
Uniform distribution initialization uses uniform distribution to initialize parameters within a given interval [−r, r]. The setting of the hyperparameter r can also be adjusted adaptively according to the number of connections of neurons.
Activation function type
logistic function
tanh
Xavier initial value
We tried using the weight initial values recommended in the paper by Xavier Glorot et al.
If the number of nodes in the previous layer is n, the initial value uses a Gaussian distribution with a standard deviation of (1/sqrt(n))
ReLU weight initial value
When the activation function uses ReLU, it is generally recommended to use the initial value dedicated to ReLU, which is the initial value recommended by Kaiming He et al., also known as the "He initial value".
When the number of nodes in the previous layer is n, the He initial value uses a Gaussian distribution with a standard deviation of sqrt(2/n)
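The two initial values can be compared with a short sketch, using a standard deviation of 1/sqrt(n) for Xavier and sqrt(2/n) for He, where n is the number of nodes in the previous layer. The function names and the 400 x 400 layer size are assumptions for illustration.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Standard deviation 1 / sqrt(n_in): suited to sigmoid and tanh.
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out, rng):
    # Standard deviation sqrt(2 / n_in): suited to ReLU.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W_xavier = xavier_init(400, 400, rng)
W_he = he_init(400, 400, rng)
```

The He initial value is sqrt(2) times wider than the Xavier one, which compensates for ReLU setting the negative half of its inputs to 0.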
Data preprocessing
different units
The different sources and measurement units of each dimensional feature will cause the distribution range of these feature values to be very different. When we calculate the Euclidean distance between different samples, features with a large value range will play a dominant role.
scaling normalization
The value range of each feature is normalized to [0, 1] or [−1, 1] by scaling.
standard normalization
Also called z-score normalization
Each feature dimension is rescaled to have a mean of 0 and a standard deviation of 1 (the same mean and standard deviation as the standard normal distribution).
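The two normalization methods can be sketched as follows, with rows as samples and columns as features. The function names and the toy data are assumptions; a feature that is constant across all samples would cause a division by zero and needs separate handling.

```python
import numpy as np

def min_max_scale(X):
    # Scaling normalization: map each feature to [0, 1].
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def z_score(X):
    # Standard normalization: mean 0 and standard deviation 1 per feature.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features measured in very different units.
X = [[1.0, 100.0],
     [2.0, 200.0],
     [3.0, 300.0]]
X_scaled = min_max_scale(X)
X_standard = z_score(X)
```

After either transformation the two columns have the same range, so neither dominates a Euclidean distance.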
Data redundancy
After the input data is whitened, the correlation between features is low and all features have the same variance.
One of the main ways to achieve whitening is to use principal component analysis to remove the correlation between components.
Layer-by-layer normalization
When using stochastic gradient descent to train a network, each parameter update will cause the distribution of inputs to each layer in the middle of the network to change. The deeper the layer, the more obviously the distribution of its input will change.
batch normalization
Also called Batch Normalization, BN method
In order to give each layer an appropriate spread of activations, the distribution of activation values is "forcibly" adjusted.
Normalization is performed so that the mean of the data distribution is 0 and the variance is 1.
Any intermediate layer in the neural network can be normalized.
advantage
Can make learning happen quickly (can increase the learning rate)
Less dependent on initial values (not so sensitive to initial values)
Suppress overfitting (reduce the need for Dropout, etc.)
Batch Norm layer
Affine->Batch Norm->ReLU
layer normalization
Limitations of batch normalization
Batch normalization is a normalization operation on a single neuron in an intermediate layer, so the number of mini-batch samples must not be too small, otherwise it will be difficult to calculate the statistical information of a single neuron.
If the distribution of a neuron's net input changes dynamically in a neural network, such as a recurrent neural network, then the batch normalization operation cannot be applied
Layer normalization normalizes all neurons in an intermediate layer.
Batch normalization is very effective in convolutional neural networks (CNN), while layer normalization is more common in recurrent neural networks (RNN) and Transformer networks.
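The difference between the two methods comes down to the axis over which the mean and variance are computed. The sketch below shows only the normalization step for an input of shape (batch, features); the learnable scale and shift, and the running statistics used at inference time, are omitted. The function names and eps value are assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-7):
    # Statistics per neuron (column), computed across the mini-batch.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-7):
    # Statistics per sample (row), computed across all neurons in the layer.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 5)) * 3.0 + 2.0
bn = batch_norm(x)
ln = layer_norm(x)
```

Because layer_norm never looks across samples, it works with a mini-batch of size 1 and with sequences of varying length, which is why it fits RNNs and Transformers.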
Hyperparameter optimization
composition
Network structure
connections between neurons
Number of layers
Number of neurons per layer
Type of activation function
Optimization parameters
Network optimization methods
learning rate
Sample size for small batches
regularization coefficient
Validation data (validation set)
Hyperparameter performance cannot be evaluated using test data
If the test data is used to confirm the "goodness" of the hyperparameter value, it will cause the hyperparameter value to be adjusted to only fit the test data.
The training data is used for parameter learning (weights and biases), and the validation data is used for performance evaluation of hyperparameters. In order to confirm the generalization ability, the test data should be used at the end (ideally only once)
Optimization
grid search
A method that finds a suitable hyperparameter configuration by trying all combinations of hyperparameter values.
Choose several "experience" values. For example, learning rate α, we can set α ∈ {0.01, 0.1, 0.5, 1.0}.
random search
Set the range of hyperparameters and randomly sample from the set range of hyperparameters
Evaluate recognition accuracy through validation data (but set epoch very small)
Repeat the above (100 times, etc.) and narrow the range of hyperparameters based on the results of their recognition accuracy
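The random search steps above can be sketched as a small loop. Sampling on a log scale (10 to a uniformly drawn exponent) is an assumption, as are the function names; toy_validation_score is a hypothetical stand-in for "train for a few epochs and return the validation accuracy".

```python
import math
import random

def random_search(evaluate, log10_ranges, n_trials=100, seed=0):
    """Sample configurations at random and return them sorted best first."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        config = {name: 10 ** rng.uniform(lo, hi)
                  for name, (lo, hi) in log10_ranges.items()}
        results.append((evaluate(config), config))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

def toy_validation_score(config):
    # Hypothetical score that peaks at a learning rate of 1e-2.
    return -(math.log10(config["lr"]) + 2.0) ** 2

results = random_search(toy_validation_score, {"lr": (-6.0, -1.0)})
best_score, best_config = results[0]
```

Inspecting the top few entries of results shows which part of the range works well, and the next round can then sample from that narrower range.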
Bayesian optimization
Dynamic resource allocation
network regularization
Purpose: To suppress overfitting
weight decay
Weight decay is a method that has been frequently used to suppress overfitting. This method penalizes large weights during the learning process.
Simply put, a penalty term on the weights, such as the L2 norm term (1/2)λ‖W‖², is added to the loss function
λ is a hyperparameter that controls the strength of regularization
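A minimal sketch of weight decay, assuming the L2 penalty (1/2)λ‖W‖² summed over all weight matrices: the penalty is added to the loss, and its derivative λW is added to each weight gradient. The function names are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lam):
    # Term added to the loss: 0.5 * lambda * (sum of squared weights).
    return 0.5 * lam * sum(float(np.sum(W ** 2)) for W in weights)

def add_weight_decay(weights, grads, lam):
    # Derivative of the penalty with respect to W is lambda * W.
    return [g + lam * W for W, g in zip(weights, grads)]

weights = [np.array([1.0, 2.0])]
grads = [np.zeros(2)]
penalty = l2_penalty(weights, lam=0.1)
decayed_grads = add_weight_decay(weights, grads, lam=0.1)
```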
discard method
Dropout Method
If the network model becomes very complex, it will be difficult to deal with it using only weight decay.
Dropout randomly deletes neurons during learning, choosing the neurons to discard at random each time. The simplest way is to set a fixed probability p and retain each neuron with probability p.
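A sketch of a Dropout layer using the retention probability p described above. During training a random mask is drawn on every forward pass; at test time all neurons are kept and the output is scaled by p so that its expected size matches training. The class layout and the name keep_prob are assumptions.

```python
import numpy as np

class Dropout:
    def __init__(self, keep_prob=0.5, seed=None):
        self.keep_prob = keep_prob
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x, train=True):
        if train:
            # True where the neuron is retained; a new mask is drawn each call.
            self.mask = self.rng.random(x.shape) < self.keep_prob
            return x * self.mask
        return x * self.keep_prob

    def backward(self, dout):
        # Gradients flow only through the neurons that were retained.
        return dout * self.mask

layer = Dropout(keep_prob=0.8, seed=0)
x = np.ones((100, 100))
y_train = layer.forward(x, train=True)
y_test = layer.forward(x, train=False)
```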
data augmentation
Rotate, flip, scale, translate, add noise
label smoothing
Add noise to output labels to avoid model overfitting
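One common form of label smoothing can be sketched as below: a fraction eps of the probability mass is spread uniformly over all K classes. Other variants exist (for example, spreading eps over the K − 1 incorrect classes only), so the exact formula here is an assumption.

```python
def smooth_labels(one_hot, eps=0.1):
    # (1 - eps) * original label + eps / K for each of the K classes.
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0, 1, 0, 0], eps=0.1)
```

The smoothed vector still sums to 1, but the correct class no longer has a target of exactly 1, which discourages overconfident outputs.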
Model independent learning method
Ensemble learning
Integrate multiple models through a certain strategy to improve decision-making accuracy through group decision-making. The primary issue in ensemble learning is how to integrate multiple models. The more commonly used integration strategies include direct average, weighted average, etc.
Self-training and co-training
All belong to semi-supervised learning
self training
Self-training first uses labeled data to train a model and then uses this model to predict the labels of unlabeled samples. Samples with relatively high prediction confidence are added to the training set together with their predicted pseudo-labels, a new model is retrained, and the process is repeated.
Co-training
Co-training is an improved method of self-training
Two classifiers based on different views promote each other. Much data has several views that are relatively independent of one another.
Because the views are conditionally independent, models trained on different views understand the problem from different angles and are complementary to a certain degree. Co-training is a method that uses this complementarity to perform self-training. First, two models f1 and f2 are trained on the training set according to different views. Then f1 and f2 are used to predict on the unlabeled data set, and samples with relatively high prediction confidence are added to the training set. The two models for the different views are then retrained, and the process is repeated.
multi-task learning
General machine learning models are aimed at a single specific task, such as handwritten digit recognition, object detection, etc. Models for different tasks are learned separately on their respective training sets.
If two tasks are related, there will be some shared knowledge between them, and this knowledge will be helpful to both tasks. These shared knowledge can be representations (features), model parameters, or learning algorithms, etc.
type
transfer learning
Suppose there is a related task that already has a large amount of training data. Although the distribution of that data differs from the distribution of the target task, the data is relatively large in scale, so we assume that some generalizable knowledge can be learned from it and that this knowledge will be of some help to the target task. How to transfer the generalizable knowledge in the training data of related tasks to the target task is the problem that transfer learning aims to solve.
Transfer learning refers to the process of knowledge transfer between two different fields, using the knowledge learned in the source domain (Source Domain) DS to help the learning tasks in the target domain (Target Domain) DT. The number of training samples in the source domain is generally much larger than that in the target domain.
Classification
inductive transfer learning
A model is learned on the training data set that minimizes the expected risk (i.e., the error rate on the real data distribution).
Transductive transfer learning
Learn a model that minimizes error on a given test set
Fine-tuning is an application method of transfer learning. It usually refers to using new, task-specific data sets to perform additional training on the basis of an already trained model to improve the performance of the model on a specific task. The purpose of fine-tuning is to use the general knowledge learned by the pre-trained model on large-scale data to accelerate and optimize the learning process on specific tasks.
lifelong learning
question
Once training is completed, the model remains fixed and is no longer iteratively updated.
It is still very difficult for a model to be successful on many different tasks at the same time.
Lifelong Learning, also called Continual Learning, refers to the ability to keep learning as humans do: the experience and knowledge learned in historical tasks are used to help learn the new tasks that constantly emerge, and this experience and knowledge accumulate continuously, so that old knowledge is not forgotten because of new tasks.
In lifelong learning, it is assumed that a lifelong learning algorithm has already learned a model on the historical tasks T1, T2, ..., Tm. When a new task T(m+1) appears, the algorithm can use the knowledge learned on the past m tasks to help learn the (m+1)-th task, while accumulating knowledge across all m+1 tasks.
This setting is very similar to inductive transfer learning, but the goal of inductive transfer learning is to optimize the performance of the target task without caring about the accumulation of knowledge. The goal of lifelong learning is continuous learning and knowledge accumulation. In addition, unlike multi-task learning, lifelong learning does not involve learning on all tasks simultaneously.
meta-learning
According to the no free lunch theorem, no universal learning algorithm is effective on all tasks. Therefore, when using machine learning to implement a certain task, we usually need to treat each case on its own merits and choose the appropriate model, loss function, optimization algorithm, and hyperparameters for the specific task.
The ability to dynamically adjust one's own learning method is called meta-learning, also known as learning to learn.
Another machine learning problem related to meta-learning is few-shot learning
Two typical meta-learning methods
Optimizer-based meta-learning
The difference between different optimization algorithms lies in the different rules for updating parameters. Therefore, a natural meta-learning is to automatically learn a rule for updating parameters, that is, modeling the gradient descent process through another neural network (such as a recurrent neural network).
Model-agnostic meta-learning
It is a simple model-independent and task-independent meta-learning algorithm.