Tutorial Overview & Learning
COMP 4332 Tutorial 1
Feb 13
[email protected]
Course Work
• Three or four assignments (20%)
1. Progress report of the first project
2. Add cross-validation to the first project
3. 3-4 quick questions about data mining
• Two projects (60%)
1. KDD Cup 2014 - Predicting Excitement at DonorsChoose.org
2. PAKDD 2014 - ASUS Malfunctional Components Prediction
• One term paper (10%)
• One presentation (10%)
Project-oriented tutorials
• Projects and assignments count for 80% of your grade.
• You will write code in a few languages/tools.
• More importantly, you will do experiments!
• Very different from COMP 4331. Light on concepts/math.
A heavily hands-on course.
COMP 4332 = COMP 4331 + COMP 4331
A data mining project requires ...
• 1. Explore and preprocess the data.
• 2. Try algorithms (SVM, logistic regression, decision
trees, dimensionality reduction, etc.) and vary the
parameters of each algorithm.
• Labor intensive!
• Sometimes frustrating.
Repeatedly go back to step 1 to reprocess the data so it can be fed
into different tools.
• 3. Summarize findings, design new methods, and go
back to step 2.
The creative part!
1. Explore data/look at the data
• Visualization:
• 1D data summary: mean, variance, median, skewness; density
estimation (pdf), cdf; outliers, etc.
• 2D data summary: scatter plot, QQ-plot, correlation scores, etc.
• High-dimensional data summary: dimensionality reduction and plot
to 2D or 3D
• Store the data and extract the parts you need.
• Organized: SQL like queries...
• Quick and dirty: write a script for each operation...
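As a rough illustration of the 1D and 2D summaries listed above, here is a minimal sketch that uses only the Python standard library. The helper names `summarize_1d` and `pearson` and the toy numbers are made up for this example; in practice numpy/scipy offer equivalent functions.

```python
import math
import statistics


def summarize_1d(values):
    """Return mean, population variance, median and skewness of a list of numbers."""
    mean = statistics.mean(values)
    var = statistics.pvariance(values, mu=mean)
    std = math.sqrt(var)
    # Skewness as the third standardized moment (0 for symmetric data).
    if std > 0:
        skew = sum(((v - mean) / std) ** 3 for v in values) / len(values)
    else:
        skew = 0.0
    return {
        "mean": mean,
        "variance": var,
        "median": statistics.median(values),
        "skewness": skew,
    }


def pearson(xs, ys):
    """Pearson correlation score between two equally long lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy column with one large value that may be an outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = summarize_1d(amounts)
print(summary)
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A mean well above the median together with a positive skewness, as in this toy column, is a quick hint that a few large values dominate and deserve a closer look.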
2. Run experiments using tools
• Most of the time, tools are available.
• Weka, libsvm, etc.
Good news :)
• Sometimes, you need to implement a variant of existing
• A different decision tree
• A classifier that handles unbalanced data
Numerical code is generally hard to write correctly (and hard to DEBUG!).
You will do this in this course!
• Run the methods, vary their parameters, and plot the results
and trends.
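A parameter sweep of this kind might look like the sketch below. It is only an illustration: the tiny k-nearest-neighbour classifier, the 1D toy data, and the names `knn_predict` and `accuracy` are all assumptions made for this example, and a text bar stands in for a real plot.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of held-out points that are classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Hypothetical 1D data set: (feature value, class label).
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
test = [(0.5, 0), (2.9, 0), (7.5, 1), (8.5, 1)]

# Vary the parameter k and record the result of each run.
results = {k: accuracy(train, test, k) for k in (1, 3, 5, 7)}

# Crude text "plot" of the trend; in the tutorials a plotting tool would be used.
for k, acc in sorted(results.items()):
    print("k=%d accuracy=%.2f %s" % (k, acc, "#" * int(acc * 20)))
```

Keeping every run in one dictionary (or a file) makes it easy to plot the trend afterwards and to compare against later experiments.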
3. Summarize findings and design new methods
• After each iteration of steps 1 and 2, you know more about
the data; you may have new ideas that send you back to steps 1
and 2.
• But before that, first document your findings.
A cloud of tools ...
• Data preprocessing: Python, Java/C++, SQL, Excel, text
• Visualization: Excel, Matlab, R, matplotlib
• SVM: libsvm, svmlight, liblinear packages
• Logistic regression: liblinear
• Decision Trees & tree ensemble: Weka, FEST
• Matrix factorization: libfm, GraphLab
*Bolded tools are the ones we will teach in the tutorials.
Teaching all of them is impossible!
• You have to take time to read the manuals of these tools,
and sometimes their source code!
• Throughout this course, we will use Python to illustrate:
• Data preprocessing (mostly string processing)
• Algorithm implementation (numpy/scipy)
• Running experiments automatically
• Simple plotting (matplotlib)
• Sometimes, we use R’s plotting packages (core, ggplot2)
if matplotlib does not meet the requirements.
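To give a flavour of string-based preprocessing, here is a minimal sketch. The `key=value;key=value` line format, the field names `title` and `price`, and the helpers `parse_line` and `to_features` are hypothetical and not taken from either project's data.

```python
def parse_line(line):
    """Turn one raw 'key=value;key=value' line into a dict of cleaned fields."""
    record = {}
    for field in line.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        record[key.strip().lower()] = value.strip()
    return record


def to_features(record):
    """Convert cleaned strings into numeric features (a missing price becomes 0.0)."""
    price = float(record.get("price") or 0.0)
    n_words = len(record.get("title", "").split())
    return [price, float(n_words)]


# Hypothetical raw input with inconsistent case, stray spaces and a missing value.
raw_lines = [
    "Title= Books for our classroom ;Price=120.5\n",
    "title=New microscopes;price=\n",
]

features = [to_features(parse_line(line)) for line in raw_lines]
print(features)
```

Splitting the work into a parsing step and a feature step keeps each one easy to test when the raw format changes.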
Why Python
• Easy to learn and easy to use.
• A good tool for us to illustrate the three steps of doing a data mining
• A concise and powerful language.
• A glue language. Easily integrate components written in other
• Widely used in IT industries. Organizations using Python
• We will use the latest Python version in this course (Python 3.4).
Setup Python Scientific Environment
• Anaconda Scientific Python Distribution
• It includes over 195 of the most popular Python packages for
science, math, engineering, data analysis. (numpy, scipy, sklearn,
• Cross Platform
• No need to install scientific packages one by one
• Default IDE is weak. Recommended IDEs:
• Sublime Text (recommended)
• PyCharm (recommended)
• Eclipse + pydev (cross platform)
• Or simply Notepad++ editor with syntax highlighting (only in
Learn Python
• The official Python tutorial. Written for experienced
• Read it twice and try every code snippet in the tutorial.
• Code Like a Pythonista: Idiomatic Python
• Python Howto: sort, logging, functional programming, etc.
• MIT 6.00 course material.
• Liang Huang’s Python Short Course.
• numpy examples and scipy tutorial.
• Best place to ask a Python-related question:
http://stackoverflow.com/. It is better to post your Python
question on Stack Overflow than to send it to our mailing list.
Learn Python (Books)
• A Byte of Python
• Learning Python
• Python Cookbook
• Moving from Python2 to Python3
Play with Python data structures
• basic types: bool, integer, float, complex
• tuple: (x, y, ..)
• list: [x, y, ...]
• string: ‘hello’, “world”
• dictionary: { x: a, y: b, ... }
• set: set([a, b, c, d])
• iterable/sequence: a unified view for data structures
• tuple/list/dictionary/set/string are all iterable.
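The structures above can be tried out with a short sketch like the following. The variable names and values are arbitrary examples; the set is written with the `{...}` literal, which is equivalent to `set([...])` in Python 3.

```python
point = (3, 4)                      # tuple: fixed size, immutable
scores = [0.9, 0.7, 0.8]            # list: ordered, mutable
name = "hello"                      # string: immutable sequence of characters
counts = {"svm": 2, "tree": 1}      # dictionary: key -> value
tools = {"weka", "libsvm"}          # set: unique elements, no order

scores.append(0.6)                  # lists can grow
counts["svm"] += 1                  # dictionary values can be updated in place
tools.add("weka")                   # adding an existing element changes nothing

# All five are iterable, so the same loop works on each of them.
# Iterating over a dictionary yields its keys.
lengths = {}
for obj in (point, scores, name, counts, tools):
    lengths[type(obj).__name__] = len([item for item in obj])

print(lengths)
```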
• 1. Go through basic Python data structures and their
• 2. Show Python’s functions and control structures (if-then-
Project Requirement
• You should register an account as a team in the Kaggle
• Every member of a team receives the same mark.
• Send me the source code of your project, or a report if you
use some other tools, plus your final online score
([email protected]).
• Marking: source code/report (50%) + final score you get on
the website (50%)
• Deadlines: Project 1 (11:59pm, Apr 12th), Project
2 (11:59pm, May 10th)
