An Evolutionary Method for Training Autoencoders for Deep Learning Networks
Master's Thesis Defense
Sean Lander
Advisor: Yi Shang
University of Missouri, Department of Computer Science
University of Missouri, Informatics Institute
Agenda
- Overview
- Background and Related Work
- Methods
- Performance and Testing
- Results
- Conclusion and Future Work
Overview
Deep Learning classification/reconstruction
- Since 2006, Deep Learning Networks (DLNs) have changed the landscape of classification problems
- Strong ability to create and utilize abstract features
- Easily lends itself to GPU and distributed systems
- Does not require labeled data – VERY IMPORTANT
- Can be used for feature reduction and classification
Overview
Problem and proposed solution
- Problems with DLNs:
  - Costly to train with large data sets or high feature spaces
  - Local minima are a systemic issue with Artificial Neural Networks
  - Hyper-parameters must be hand selected
- Proposed solutions:
  - Evolutionary based approach with a local search phase
    - Increased chance of finding a global minimum
    - Optimizes structure based on abstracted features
  - Data partitions based on population size (large data only)
    - Reduced training time
    - Reduced chance of overfitting
Background
Perceptrons
- Started with the Perceptron in the 1950s
- Only capable of linearly separable classification
- Failed on XOR
Background
Artificial Neural Networks (ANNs)
- ANNs went out of favor until the Multilayer Perceptron (MLP) was introduced
- Pro: Non-linear classification
- Con: Time consuming to train
- Advance in training: Backpropagation
  - Increased training speeds
  - Limited to shallow networks
  - Error propagation diminishes as the number of layers increases
Background
Backpropagation using Gradient Descent
- Proposed in 1988, based on classification error
- Given m training samples: $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
- For each sample $(x, y)$, where $x, y \in [0,1]$, calculate its error:
  $J(W, b; x, y) = \tfrac{1}{2} \lVert h_{W,b}(x) - y \rVert^2$
- For all m training samples the total error can be calculated as (sketched below):
  $J(W, b) = \tfrac{1}{m} \sum_{i=1}^{m} J\left(W, b; x^{(i)}, y^{(i)}\right)$
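For reference, a minimal NumPy sketch of the squared-error cost above for a single-hidden-layer network; the sigmoid activations and array shapes are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network_output(X, W1, b1, W2, b2):
    """h_{W,b}(x) for a single-hidden-layer network, applied to a batch X (m x n)."""
    H = sigmoid(X @ W1 + b1)        # hidden activations
    return sigmoid(H @ W2 + b2)     # network output

def total_cost(X, Y, W1, b1, W2, b2):
    """J(W,b) = (1/m) * sum_i (1/2) * ||h_{W,b}(x_i) - y_i||^2."""
    m = X.shape[0]
    R = network_output(X, W1, b1, W2, b2)
    return np.sum(0.5 * np.sum((R - Y) ** 2, axis=1)) / m
```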
Background
Deep Learning Networks (DLNs)
- Allows for deep networks with multiple layers
- Layers are pre-trained using unlabeled data
- Layers are "stacked" and fine-tuned
- Minimizes error degradation for deep neural networks (many layers)
- Still costly to train
- Manual selection of hyper-parameters
- Local, not global, minimum
Background
Autoencoders for reconstruction
- Autoencoders can be used for feature reduction and clustering
- "Classification error" is the ability to reconstruct the sample input
- Abstracted features – the output from the hidden layer – can be used to replace raw input for other techniques (see the sketch below)
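As a rough illustration of the reconstruction view above, a minimal NumPy sketch of a one-hidden-layer autoencoder; the class layout, sigmoid activations, and the reading of N(0, 0.5) as mean 0 with standard deviation 0.5 are assumptions, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Autoencoder:
    """One-hidden-layer autoencoder: x -> h (abstracted features) -> x' (reconstruction)."""
    def __init__(self, n_visible, n_hidden):
        # weights drawn from N(0, 0.5), matching the testing setup later in the slides
        self.W1 = rng.normal(0.0, 0.5, (n_visible, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_visible))
        self.b2 = np.zeros(n_visible)

    def encode(self, X):
        return sigmoid(X @ self.W1 + self.b1)          # abstracted features h

    def reconstruct(self, X):
        return sigmoid(self.encode(X) @ self.W2 + self.b2)   # x'

    def reconstruction_error(self, X):
        return 0.5 * np.mean(np.sum((self.reconstruct(X) - X) ** 2, axis=1))

# the hidden-layer output can replace the raw input for downstream techniques
ae = Autoencoder(n_visible=13, n_hidden=32)    # e.g. a wine-sized input
X = rng.random((10, 13))
features = ae.encode(X)                        # 10 x 32 feature matrix
```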
Related Work
Evolutionary and genetic ANNs
- First use of Genetic Algorithms (GAs) in 1989
  - Two-layer ANN on a small data set
  - Tested multiple types of chromosomal encodings and mutation types
- Late 1990s and early 2000s introduced other techniques
  - Multi-level mutations and mutation priority
  - Addition of local search in each generation
  - Inclusion of hyper-parameters as part of the mutation
  - The issue of competing conventions starts to appear
    - Two ANNs produce the same results by sharing the same nodes but in a permuted order
Related Work
Hyper-parameter selection for DLNs
- The majority of the work explored newer technologies and methods such as GPU and distributed (MapReduce) training
- Improved versions of Backpropagation, such as Conjugate Gradient or Limited-memory BFGS, were tested under different conditions
- Most conclusions pointed toward manual parameter selection via trial-and-error
Method 1
Evolutionary Autoencoder (EvoAE)
- IDEA: An autoencoder's power is in its feature abstraction, the hidden-node output
- Training many AEs will produce more potential abstracted features
- The best AEs will contain the best features
- Joining these features should create a better AE (see the sketch below)
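A minimal sketch of the joining idea, assuming each autoencoder is represented by its encoder matrix W1 (one column per hidden node) and decoder matrix W2 (one row per hidden node); the single-point split and the mutation operator here are illustrative choices, not necessarily the thesis's exact operators.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(parent_a, parent_b):
    """Build an offspring autoencoder by taking hidden nodes (features) from
    both parents.  Each parent is a dict with encoder W1 (n_visible x n_hidden)
    and decoder W2 (n_hidden x n_visible)."""
    ha = parent_a["W1"].shape[1]
    hb = parent_b["W1"].shape[1]
    cut_a = rng.integers(1, ha)        # hidden nodes kept from parent A
    cut_b = rng.integers(1, hb)        # hidden nodes kept from parent B
    W1 = np.hstack([parent_a["W1"][:, :cut_a], parent_b["W1"][:, cut_b:]])
    W2 = np.vstack([parent_a["W2"][:cut_a, :], parent_b["W2"][cut_b:, :]])
    return {"W1": W1, "W2": W2}

def mutate(ae, rate=0.1, scale=0.1):
    """With probability `rate`, perturb a hidden node's incoming and outgoing weights."""
    mask = rng.random(ae["W1"].shape[1]) < rate
    ae["W1"][:, mask] += rng.normal(0.0, scale, ae["W1"][:, mask].shape)
    ae["W2"][mask, :] += rng.normal(0.0, scale, ae["W2"][mask, :].shape)
    return ae

# two parents with 4 hidden nodes each over a 13-feature input
make = lambda h: {"W1": rng.normal(0, 0.5, (13, h)), "W2": rng.normal(0, 0.5, (h, 13))}
child = mutate(crossover(make(4), make(4)))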
Method 1
Evolutionary Autoencoder (EvoAE)
[Figure: the EvoAE cycle – initialization, local search, crossover, and mutation. Each autoencoder maps input x to hidden features h and reconstruction x'; crossover mixes hidden nodes from two parents (A and B) to form offspring.]
Method 1A
Distributed learning and Mini-batches
- Training time of the generic EvoAE increases linearly with the size of the population
- ANN training time increases drastically with data size
- To combat this, mini-batches can be used, where each AE is trained against a batch and updated (see the sketch below)
- Batch size << total data size
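A sketch of one EvoAE generation under mini-batches (train on each batch, then rank; crossover and mutation would follow), assuming sigmoid autoencoders without biases and a plain squared-error backpropagation step; this illustrates the batching scheme and is not the thesis code.

```python
import numpy as np

rng = np.random.default_rng(3)

def local_search(ae, X, lr=0.1, epochs=20):
    """Backpropagation on this AE's batch (sigmoid units, squared
    reconstruction error; biases and weight decay omitted for brevity)."""
    for _ in range(epochs):
        H = 1 / (1 + np.exp(-(X @ ae["W1"])))      # hidden features
        R = 1 / (1 + np.exp(-(H @ ae["W2"])))      # reconstruction
        d2 = (R - X) * R * (1 - R)                 # output delta
        d1 = (d2 @ ae["W2"].T) * H * (1 - H)       # hidden delta
        ae["W2"] -= lr * H.T @ d2 / len(X)
        ae["W1"] -= lr * X.T @ d1 / len(X)
    return ae

def reconstruction_error(ae, X):
    H = 1 / (1 + np.exp(-(X @ ae["W1"])))
    R = 1 / (1 + np.exp(-(H @ ae["W2"])))
    return 0.5 * np.mean(np.sum((R - X) ** 2, axis=1))

def one_generation_minibatch(population, X, batch_size=100):
    """Train -> Rank: every AE is updated on every mini-batch, then ranked."""
    for batch in np.array_split(X, max(1, len(X) // batch_size)):
        population = [local_search(ae, batch) for ae in population]
    population.sort(key=lambda ae: reconstruction_error(ae, X))
    return population          # crossover and mutation follow the ranking

# tiny usage example: a population of 4 AEs on 13-dimensional data
new_ae = lambda h: {"W1": rng.normal(0, 0.5, (13, h)), "W2": rng.normal(0, 0.5, (h, 13))}
pop = one_generation_minibatch([new_ae(8) for _ in range(4)], rng.random((500, 13)))
```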
Method 1A
Distributed learning and Mini-batches
- EvoAE lends itself to distributed systems
- Data duplication and storage become an issue
[Figure: per-generation pipeline over batches 1..N – Train (forward propagation, backpropagation), Rank (calculate error, sort), GA (crossover, mutate).]
Method 2
EvoAE Evo-batches
- IDEA: When data is large, small batches can still be representative
- Prevents overfitting, as the nodes being trained are almost always introduced to new data
- Scales well with large amounts of data even when parallel training is not possible
- Works well on limited-memory systems: increasing the population size reduces the data per batch
- Quick training of large populations, roughly equivalent in cost to training a single autoencoder with traditional methods (see the sketch below)
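A sketch of the Evo-batch partitioning under the assumption that each population member trains on one partition per generation and the partitions rotate between generations; the rotation rule is an illustrative guess at the reassignment step, not the thesis's exact scheme.

```python
import numpy as np

def evo_batch_assignments(n_samples, pop_size, generation):
    """One batch per population member, rotated each generation so every AE
    keeps being introduced to new data; a full generation then costs roughly
    one pass over the data, like training a single autoencoder."""
    batches = np.array_split(np.arange(n_samples), pop_size)
    return [batches[(member + generation) % pop_size] for member in range(pop_size)]

# usage with the local_search step from the mini-batch sketch above:
# for g in range(generations):
#     for ae, rows in zip(population, evo_batch_assignments(len(X), len(population), g)):
#         local_search(ae, X[rows])
#     ...rank, crossover, mutate...
assignments = evo_batch_assignments(n_samples=24000, pop_size=30, generation=7)
print(len(assignments), len(assignments[0]))   # 30 batches of ~800 samples
```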
Method 2
EvoAE Evo-batches
[Figure: Evo-batch data flow – the original data is split into partitions (Data A–D); each partition is used for the local search of a different population member, followed by crossover and mutation before the next generation.]
Performance and Testing
Hardware and testing parameters
- Lenovo Y500 laptop
  - Intel i7 (3rd generation), 2.4 GHz
  - 12 GB RAM
- All weights randomly initialized from N(0, 0.5)
Default parameters:
  Learning Rate     0.1
  Momentum          2
  Weight Decay      0.003
  Population Size   30
  Generations       50
  Epochs/Gen        20
  Train/Validate    80/20

Per-dataset parameters:
  Parameter        Wine   Iris   Heart Disease   MNIST
  Hidden Size      32     32     12              200
  Hidden Std Dev   NULL   NULL   NULL            80
  Hidden +/-       16     16     6               NULL
  Mutation Rate    0.1    0.1    0.1             0.1
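For concreteness, a small sketch that collects the defaults above into a configuration and draws initial weights from N(0, 0.5); the dictionary layout and the reading of 0.5 as a standard deviation are assumptions.

```python
import numpy as np

# defaults from the table above; the structure of this config dict is illustrative
DEFAULTS = {
    "learning_rate": 0.1,
    "momentum": 2,
    "weight_decay": 0.003,
    "population_size": 30,
    "generations": 50,
    "epochs_per_gen": 20,
    "train_validate_split": 0.8,
    "mutation_rate": 0.1,
}

def init_weights(n_visible, n_hidden, rng=np.random.default_rng(0)):
    """All weights randomly initialized from N(0, 0.5) (0.5 taken as the std dev)."""
    return {
        "W1": rng.normal(0.0, 0.5, (n_visible, n_hidden)),
        "W2": rng.normal(0.0, 0.5, (n_hidden, n_visible)),
    }

# e.g. an MNIST-sized autoencoder with 200 hidden nodes
ae = init_weights(n_visible=784, n_hidden=200)
```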
Performance and Testing
Baseline
- The baseline is a single AE trained from 30 random initializations
- Two learning rates are used to create two baseline measurements
  - Base learning rate
  - Learning rate * 0.1
[Plots: baseline results at the base learning rate and at learning rate * 0.1.]
Performance and Testing
Data partitioning
- Three data partitioning methods were used:
  - Full data
  - Mini-batch
  - Evo-batch
[Diagrams: data layout under the Full data, Mini-batch, and Evo-batch partitioning schemes.]
Performance and Testing
Post-training configurations
- Post-training was run in the following ways:
  - Full data (All)
  - Batch data (Batch)
  - None
- All sets below use the Evo-batch configuration
[Diagrams: post-training with Full data (All), Batch data (Batch), and None.]
Results
Parameters Review
Per-dataset parameters:
  Parameter        Wine   MNIST
  Hidden Size      32     200
  Hidden Std Dev   NULL   80
  Hidden +/-       16     NULL
  Mutation Rate    0.1    0.1

Default parameters:
  Learning Rate     0.1
  Momentum          2
  Weight Decay      0.003
  Population Size   30
  Generations       50
  Epochs/Gen        20
  Train/Validate    80/20
Results
Datasets
- UCI Wine dataset
  - 178 samples
  - 13 features
  - 3 classes
- Reduced MNIST dataset
  - 6000/1000 and 24k/6k training/testing samples
  - 784 features
  - 10 classes (0-9)
Results
Small datasets - UCI Wine
  Parameter        Wine
  Hidden Size      32
  Hidden Std Dev   NULL
  Hidden +/-       16
  Mutation Rate    0.1
Results
Small datasets - UCI Wine
  Parameter        Wine
  Hidden Size      32
  Hidden Std Dev   NULL
  Hidden +/-       16
  Mutation Rate    0.1
- Best error-to-speed: Baseline 1
- Best overall error: Full data All
- Full data is fast on small-scale data
- Evo-batch and mini-batch are not good on small-scale data
Results
Small datasets – MNIST 6k/1k
  Parameter        MNIST
  Hidden Size      200
  Hidden Std Dev   80
  Hidden +/-       NULL
  Mutation Rate    0.1
Results
Small datasets – MNIST 6k/1k
  Parameter        MNIST
  Hidden Size      200
  Hidden Std Dev   80
  Hidden +/-       NULL
  Mutation Rate    0.1
- Best error-to-time: Mini-batch None
- Best overall error: Mini-batch Batch
- Full data slows exponentially on large-scale data
- Evo-batch and mini-batch stay close to baseline speed
Results
Medium datasets – MNIST 24k/6k
  Parameter        MNIST
  Hidden Size      200
  Hidden Std Dev   80
  Hidden +/-       NULL
  Mutation Rate    0.1
Results
Medium datasets – MNIST 24k/6k
  Parameter        MNIST
  Hidden Size      200
  Hidden Std Dev   80
  Hidden +/-       NULL
  Mutation Rate    0.1
- Best error-to-time: Evo-batch None
- Best overall error: Evo-batch Batch or Mini-batch Batch
- Full data was too slow to run on this dataset
- EvoAE with a population of 30 trains as quickly as a single baseline AE when using Evo-batch
Conclusions
Good for large problems
- Traditional methods are still the preferred choice for small and toy problems
- EvoAE with Evo-batch produces effective and efficient feature reduction given a large volume of data
- EvoAE is robust against poorly chosen hyper-parameters, specifically the learning rate
Future Work
- Immediate goals:
  - Transition to a distributed system, MapReduce-based or otherwise
  - Harness GPU technology for increased speeds (~50% in some cases)
- Long-term goals:
  - Open the system for use by novices and non-programmers
  - Make the system easy to use and transparent to the user, for both modification and training purposes
Thank you
Background
Backpropagation with weight decay
- We use this new cost to update weights and biases given some learning rate $\alpha$:
  $W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$
  $b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W, b)$
- The cost is prone to overfitting, so a weight decay variable $\lambda$ is added (sketched below):
  $J(W, b) = \frac{1}{m} \sum_{i=1}^{m} J\left(W, b; x^{(i)}, y^{(i)}\right) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2$
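A minimal sketch of the decayed cost and the corresponding weight update, relying on the standard result that the weight-decay term contributes $\lambda W$ to the gradient; biases are left undecayed, as in the formula above, and the function layout is an assumption.

```python
import numpy as np

def weight_decay_cost(W_list, data_cost, lam=0.003):
    """J(W,b) = (1/m) * sum_i J(W,b;x_i,y_i) + (lambda/2) * sum of squared weights."""
    return data_cost + 0.5 * lam * sum(np.sum(W ** 2) for W in W_list)

def decay_update(W, grad_data, alpha=0.1, lam=0.003):
    """Gradient step on the decayed cost: the decay term adds lambda * W to dJ/dW."""
    return W - alpha * (grad_data + lam * W)
```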
Background
Conjugate Gradient Descent
- This can become stuck in a loop, however, so we add a momentum term $\beta$:
  $\Delta W_{ij}^{(l)}(t) = \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) + \beta \, \Delta W_{ij}^{(l)}(t-1)$
  $\Delta b_{i}^{(l)}(t) = \frac{\partial}{\partial b_{i}^{(l)}} J(W, b) + \beta \, \Delta b_{i}^{(l)}(t-1)$
  $W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \, \Delta W_{ij}^{(l)}(t)$
  $b_{i}^{(l)} = b_{i}^{(l)} - \alpha \, \Delta b_{i}^{(l)}(t)$
- This adds memory to the equation, as we reuse the previous updates (sketched below)
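A minimal sketch of the momentum update above; beta = 0.9 is only an illustrative value (the defaults table lists momentum = 2), and the function signature is an assumption.

```python
import numpy as np

def momentum_step(W, velocity, grad, alpha=0.1, beta=0.9):
    """Momentum update: fold the previous update into the new one, then step
    the weights.  Returns (new_W, new_velocity)."""
    velocity = grad + beta * velocity     # Delta W(t) = dJ/dW + beta * Delta W(t-1)
    return W - alpha * velocity, velocity
```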
Background
Architecture and hyper-parameters
- Architecture and hyper-parameter selection is usually done through trial-and-error
- Manually optimized and updated by hand
- Dynamic learning rates can be implemented to correct for sub-optimal learning rate selection:
  $\alpha' = \alpha \cdot (1 + \exp(-h))$
  $W' = W - \alpha' \cdot \Delta W$
Results
Small datasets – UCI Iris
  Parameter        Iris
  Hidden Size      32
  Hidden Std Dev   NULL
  Hidden +/-       16
  Mutation Rate    0.1
- The UCI Iris dataset has 150 samples with 4 features and 3 classes
- Best error-to-speed: Baseline 1
- Best overall error: Full data None
Results
Small datasets – UCI Heart Disease
  Parameter        Heart Disease
  Hidden Size      12
  Hidden Std Dev   NULL
  Hidden +/-       6
  Mutation Rate    0.1
- The UCI Heart Disease dataset has 297 samples with 13 features and 5 classes
- Best error-to-time: Baseline 1
- Best overall error: Full data None