Choosing a Machine Learning Classifier
Published: 2019-05-25


How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.
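
Here’s a minimal sketch of that cross-validation bake-off using scikit-learn; the synthetic dataset and the particular candidate models below are just placeholders, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real classification dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```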

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.
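
To make the size effect concrete, here’s a sketch on synthetic data comparing a high-bias model (Naive Bayes) with a low-bias one (kNN) as the training set grows, using scikit-learn’s learning_curve; per the argument above, you’d expect NB to look relatively better at small n, though how the curves actually fall depends on the dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("kNN", KNeighborsClassifier())]:
    # Validation accuracy measured at increasing training-set sizes.
    sizes, _, val_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.05, 1.0, 5), cv=5)
    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{name:<12} n={n:4d}  cv accuracy={score:.3f}")
```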

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
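
To illustrate the “just counting” flavor, here’s a minimal sketch fitting MultinomialNB on raw word counts; the tiny corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: 1 = positive review, 0 = negative review.
docs = ["great movie, loved it", "terrible film, hated it",
        "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # word-count features
clf = MultinomialNB().fit(X, labels)   # fitting is essentially tallying counts

print(clf.predict(vec.transform(["loved the movie"])))  # -> [1]
```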

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
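
As a sketch of the online-update point: scikit-learn’s SGDClassifier with logistic loss is logistic regression trained by gradient descent, and its partial_fit lets you fold in new batches without refitting from scratch (the arriving batches below are simulated):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# loss="log_loss" gives logistic regression trained by SGD
# (in scikit-learn < 1.1 the same loss was spelled loss="log").
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for _ in range(10):  # simulate mini-batches of data arriving over time
    X_batch = rng.randn(32, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict_proba(rng.randn(3, 5)))  # probabilistic outputs
```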

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
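
The low/mid/high example is easy to demonstrate on synthetic data: a depth-2 decision tree carves out the mid-range with two splits, while a linear model has no single threshold to work with (a sketch, not from any real dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(500, 1))             # a single feature x
y = ((x[:, 0] > -1) & (x[:, 0] < 1)).astype(int)  # class B only in the mid-range

tree = DecisionTreeClassifier(max_depth=2).fit(x, y)
linear = LogisticRegression().fit(x, y)
print("tree accuracy:  ", tree.score(x, y))    # ~1.0: two splits isolate the middle
print("linear accuracy:", linear.score(x, y))  # much worse: no single cut works
```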

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
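
For the text case, a typical minimal setup is tf-idf features feeding a linear SVM; the four-document corpus below is made up purely for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up two-class corpus; tf-idf makes the feature space high-dimensional.
docs = ["buy cheap pills now", "meeting at noon tomorrow",
        "cheap offer, buy now", "see you at the meeting"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["cheap pills offer"]))  # -> ['spam']
```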

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.
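
In scikit-learn terms, the simplest way to “choose them all” is a voting ensemble; the component models below are arbitrary picks, not a recipe:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft",  # average each model's predicted class probabilities
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```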

Reprinted from: http://rsbci.baihongyu.com/
