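Logistic Regression
[Keywords] Logistic function, maximum likelihood estimation, gradient descent
1. How logistic regression works
The main idea behind classification with logistic regression is to fit a regression formula for the decision boundary from the existing data and then classify with it. The word "regression" here comes from best fitting: the goal is to find the best-fitting set of parameters.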
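Logistic regression works along the same lines as linear regression. The process can be summarized as follows:
(1) Find a suitable prediction function (the h function) that predicts the class of an input.
(2) Construct a Cost function (loss function) that measures how far the prediction h is from the training label, and combine the costs over all training samples into J(θ).
(3) The smaller J(θ) is, the more accurate the prediction function h, so this step finds the minimum of J(θ). There are several ways to minimize a function; logistic regression can be implemented with gradient descent.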
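1) Constructing the prediction function
Despite the word "regression" in its name, logistic regression is actually a classification method, used here for binary classification (the output takes only two values). The first step is to find a prediction function h. Its output has to be mapped to one of two classes, so the logistic function (also called the sigmoid function) is used. It squashes any real input into the interval (0, 1), and has the form:
$$g(z) = \frac{1}{1 + e^{-z}}$$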
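To make the shape of the sigmoid concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `predict_label` and the 0.5 threshold are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z) so exp() stays small.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn the sigmoid output into a class label (the 0.5 threshold is an assumption)."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))          # 0.5, the midpoint of the curve
print(predict_label(3.0))    # 1
print(predict_label(-3.0))   # 0
```

The output can be read as the probability of the positive class, which is why thresholding it gives a two-class decision.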
2) Constructing the loss function
The Cost function and J(θ) are derived from maximum likelihood estimation.
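The derivation itself is not spelled out here, so the following is a sketch under the usual assumptions: labels y ∈ {0, 1}, the prediction h_θ(x) = g(θᵀx) read as P(y = 1 | x; θ), and m independent training samples (x^(i), y^(i)). This notation is assumed for illustration.

```latex
\begin{aligned}
P(y \mid x;\theta) &= h_\theta(x)^{\,y}\,\bigl(1 - h_\theta(x)\bigr)^{1-y}, \qquad y \in \{0, 1\} \\
L(\theta) &= \prod_{i=1}^{m} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}}\,\bigl(1 - h_\theta(x^{(i)})\bigr)^{1-y^{(i)}} \\
\ell(\theta) &= \log L(\theta) = \sum_{i=1}^{m} \Bigl[\, y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1-y^{(i)}\bigr) \log\bigl(1-h_\theta(x^{(i)})\bigr) \Bigr] \\
J(\theta) &= -\frac{1}{m}\,\ell(\theta)
\end{aligned}
```

Maximizing the log-likelihood ℓ(θ) is therefore the same as minimizing J(θ). The per-sample cost is −log h_θ(x) when y = 1 and −log(1 − h_θ(x)) when y = 0.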
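3) Minimizing J(θ) with gradient descent
J(θ) can be minimized with gradient descent, which repeatedly moves θ a small step against the gradient of J(θ) until it converges.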
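A minimal NumPy sketch of this update follows. It assumes the usual cross-entropy form of J(θ) averaged over m samples, for which the update is θ_j := θ_j − α · (1/m) · Σ (h_θ(x^(i)) − y^(i)) · x_j^(i). The function names, the learning rate α = 0.1, the iteration count, and the toy data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Clip to keep np.exp from overflowing on extreme inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic_gd(X, y, alpha=0.1, n_iter=2000):
    """Batch gradient descent for binary logistic regression.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    alpha (learning rate) and n_iter are illustrative choices.
    Returns theta of shape (n + 1,), with theta[0] as the intercept.
    """
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]          # prepend a column of ones for the intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)        # current predictions h_theta(x)
        grad = Xb.T @ (h - y) / m      # gradient of J(theta)
        theta -= alpha * grad          # step against the gradient
    return theta

def predict(theta, X):
    Xb = np.c_[np.ones(X.shape[0]), X]
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)

# Toy one-dimensional data: small x belongs to class 0, large x to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic_gd(X, y, alpha=0.1, n_iter=5000)
pred = predict(theta, X)
print(theta)
print(pred)
```

In practice a library solver such as the ones in scikit-learn is preferable; this loop only shows what one gradient step does.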
2. Hands-on
sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
solver options "lbfgs", "sag" or "newton-cg": suited to large datasets and multiclass problems
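1) Classifying the handwritten digits dataset
Two methods are used: KNN and logistic regression.

from sklearn.datasets import load_digits

# Load the 8x8 handwritten digits dataset; evaluating `digits` in a notebook prints its contents
digits = load_digits()
digits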
{'DESCR': "Optical Recognition of Handwritten Digits Data Set\n===================================================\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 5620\n :Number of Attributes: 64\n :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n :Missing Attribute Values: None\n :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n Graduate Studies in Science and Engineering, Bogazici University.\n - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n Linear dimensionalityreduction using relevance weighted LDA. School of\n Electrical and Electronic Engineering Nanyang Technological University.\n 2005.\n - Claudio Gentile. 
A New Approximate Maximal Margin Classification\n Algorithm. NIPS. 2000.\n",
'target': array([0, 1, 2, ..., 8, 9, 8]),
'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
train = digits.data
target = digits.target
images = digits.images

import matplotlib.pyplot as plt
%matplotlib inline

# Each sample is a flattened 8x8 image; reshape it back for display
plt.imshow(train[0].reshape(8, 8))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train, target)

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(C=0.1)

logistic.fit(X_train, y_train)

y_ = logistic.predict(X_test)

logistic.score(X_test, y_test)
# Show the first 100 test images with their true (T) and predicted (P) labels
plt.figure(figsize=(10, 16))
for i in range(100):
    axes = plt.subplot(10, 10, i + 1)
    data = X_test[i].reshape(8, 8)
    plt.imshow(data, cmap='gray')
    t = y_test[i]
    p = y_[i]
    title = 'T:' + str(t) + '\nP:' + str(p)
    axes.set_title(title)
    axes.axis('off')
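This section's heading mentions KNN as well as logistic regression, but only the logistic regression run is shown. A minimal sketch of the matching comparison follows. The default KNeighborsClassifier settings, the fixed random_state=0, and max_iter=5000 (to give the solver room to converge) are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# KNN with default settings (assumed; the original shows no KNN parameters here)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_score = knn.score(X_test, y_test)

# Same C as in the walkthrough; max_iter raised as an assumption
logistic = LogisticRegression(C=0.1, max_iter=5000)
logistic.fit(X_train, y_train)
logistic_score = logistic.score(X_test, y_test)

print('knn score is %f' % knn_score)
print('logistic score is %f' % logistic_score)
```

Both classifiers are scored on the same held-out split, so the two accuracies are directly comparable.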
2) Classifying a dataset generated with make_blobs
Import the packages and use datasets.make_blobs to create a set of points.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import pandas as pd

# 150 two-dimensional points around three cluster centers
train, target = make_blobs(n_samples=150, n_features=2, centers=[[1, 4], [3, 2], [5, 6]])

plt.scatter(train[:, 0], train[:, 1], c=target)

logistic = LogisticRegression()
knnclf = KNeighborsClassifier()

logistic.fit(train, target)
knnclf.fit(train, target)
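# Grid bounds (assumed here to be the range of the data)
xmin, xmax = train[:, 0].min(), train[:, 0].max()
ymin, ymax = train[:, 1].min(), train[:, 1].max()
x = np.linspace(xmin, xmax, 200)
y = np.linspace(ymin, ymax, 200)

xx, yy = np.meshgrid(x, y)

# One (x, y) row per grid point
X_test = np.c_[xx.ravel(), yy.ravel()]
X_test.shape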
(40000, 2)
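The 40000 rows come from pairing each of the 200 x values with each of the 200 y values. A small sketch with a 3 × 2 grid (the values are arbitrary) shows how np.meshgrid and np.c_ build these pairs:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # 3 values along the x axis
y = np.array([10.0, 20.0])      # 2 values along the y axis

xx, yy = np.meshgrid(x, y)      # both have shape (len(y), len(x)) = (2, 3)
grid = np.c_[xx.ravel(), yy.ravel()]  # one (x, y) row per grid point

print(grid.shape)  # (6, 2)
print(grid[:3])    # first grid row: y fixed at 10, x running over 0, 1, 2
```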
%time y1_ = logistic.predict(X_test)
Wall time: 3.97 ms
%time y2_ = knnclf.predict(X_test)
Wall time: 87.2 ms
from matplotlib.colors import ListedColormap

colormap = ListedColormap(['#aa00ff', '#00aaff', '#aaffff'])

def draw_classifier_bounds(X_train, y_train, X_test, y_test):
    # Color the grid points by predicted class, then overlay the training points
    plt.figure(figsize=(10, 8))
    axes = plt.subplot(111)
    axes.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=colormap)
    axes.scatter(X_train[:, 0], X_train[:, 1], c=y_train)

# Decision regions of logistic regression
draw_classifier_bounds(train, target, X_test, y1_)

# Decision regions of KNN
draw_classifier_bounds(train, target, X_test, y2_)
samples = pd.read_csv('../data/adults.txt')

# Features: race, occupation and weekly working hours; target: sex
train = samples[['race', 'occupation', 'hours_per_week']].copy()
target = samples['sex']
array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
'Other'], dtype=object)
# Map each race category to an integer code
race_dic = {
    'White': 0,
    'Black': 1,
    'Asian-Pac-Islander': 2,
    'Amer-Indian-Eskimo': 3,
    'Other': 4
}

train['race'] = train['race'].map(race_dic)

# Encode each occupation as its index in the array of unique values
unique_arr = train['occupation'].unique()
def transform_occ(x):
    return np.argwhere(x == unique_arr)[0, 0]

train['occupation'] = train['occupation'].map(transform_occ)
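The np.argwhere mapping numbers each category by the position of its first appearance. As a side note, pandas.factorize produces the same codes in one call. A small sketch on made-up occupation values (the sample data is hypothetical) compares the two:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the adults dataset
occupation = pd.Series(['Sales', 'Tech-support', 'Sales', 'Craft-repair', 'Tech-support'])

# Approach used above: index of each value in the array of unique values
unique_arr = np.asarray(occupation.unique(), dtype=object)
manual_codes = occupation.map(lambda x: np.argwhere(x == unique_arr)[0, 0])

# pandas.factorize numbers categories in order of first appearance
codes, uniques = pd.factorize(occupation)

print(manual_codes.tolist())  # [0, 1, 0, 2, 1]
print(codes.tolist())         # [0, 1, 0, 2, 1]
```

Either way the integers are arbitrary labels, so a linear model such as logistic regression will treat them as ordered quantities. That is worth keeping in mind when reading the scores.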
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.2, random_state=1)

logistic = LogisticRegression(C=100)
knnclf = KNeighborsClassifier(n_neighbors=9)

logistic.fit(X_train, y_train)
knnclf.fit(X_train, y_train)

y1_ = logistic.predict(X_test)
y2_ = knnclf.predict(X_test)

print('logistic score is %f' % logistic.score(X_test, y_test))
print('knnclf score is %f' % knnclf.score(X_test, y_test))
logistic score is 0.681406
knnclf score is 0.714417
# Use every column except sex as a feature
train = samples.drop('sex', axis=1).copy()
target = samples.sex

# Integer-encode every text (object-dtype) column by order of first appearance
columns = train.columns[train.dtypes == object]
for column in columns:
    unique_arr = train[column].unique()
    def transform_obj(x):
        return np.argwhere(x == unique_arr)[0, 0]
    train[column] = train[column].map(transform_obj)
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.2, random_state=1)

logistic = LogisticRegression(C=0.01)
knnclf = KNeighborsClassifier(n_neighbors=5)

logistic.fit(X_train, y_train)
knnclf.fit(X_train, y_train)

y1_ = logistic.predict(X_test)
y2_ = knnclf.predict(X_test)

print('logistic score is %f' % logistic.score(X_test, y_test))
print('knnclf score is %f' % knnclf.score(X_test, y_test))
logistic score is 0.668356
knnclf score is 0.667895