K-Nearest Neighbors Algorithm (KNN)
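0. Introduction: how to classify movies
As everyone knows, movies can be classified by genre, but how is a genre itself defined? Who decides which genre a movie belongs to? In other words, what features do movies of the same genre share? These questions must be considered when classifying movies. No filmmaker would say that their movie resembles an earlier one, yet we know that each movie may well be close in style to other movies of the same genre. So what common features make action movies so similar to one another and so clearly different from romance movies? Action movies can contain kissing scenes and romance movies can contain fight scenes, so we cannot decide a movie's genre simply by whether fights or kisses occur. However, romance movies contain more kissing scenes and action movies contain more fight scenes, so the number of times such scenes appear in a movie can be used to classify it.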
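1. How the k-nearest neighbors algorithm works
Put simply, the k-nearest neighbors algorithm classifies a sample by measuring the distances between feature values.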
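Working principle
There is a sample data set, also called the training set, in which every sample has a label; that is, we know which class each sample belongs to. When new, unlabeled data is input, each feature of the new data is compared with the corresponding feature of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we consider only the K most similar samples in the data set, which is where the K in K-nearest neighbors comes from; K is usually an integer no greater than 20. Finally, the class that occurs most often among these K most similar samples is chosen as the class of the new data.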
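The steps described here (measure distances, keep the K closest samples, take a majority vote) can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it. The function name `knn_predict` is hypothetical, and the reading of the two columns as fight scenes and kiss scenes is an assumption; the numbers are the ones used in the small example later in this post, with English labels.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one sample by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those k neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Assumed column meaning: [fight scenes, kiss scenes] per film
X_train = np.array([[19, 1], [2, 18], [25, 1], [24, 3], [3, 17]])
y_train = np.array(["action", "romance", "action", "action", "romance"])

print(knn_predict(X_train, y_train, np.array([5, 10]), k=3))   # romance
print(knn_predict(X_train, y_train, np.array([13, 10]), k=3))  # action
```

With an odd K and two classes the vote cannot tie; for other settings a tie-breaking rule would be needed, which this sketch leaves out.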
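Now we have the distance between the unknown movie and every movie in the sample set. Sorting these distances in increasing order gives the K nearest movies. Assuming k=3, the three nearest movies are, in order, California Man, He's Not Really into Dudes, and Beautiful Woman. The k-nearest neighbors algorithm decides the genre of the unknown movie from the genres of these three nearest movies; all three are romance movies, so we conclude that the unknown movie is a romance movie.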
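Euclidean distance
Euclidean distance is the most common distance metric; it measures the absolute distance between points in a multidimensional space. For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ the formula is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$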
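As a small worked example of the Euclidean distance, the snippet below computes the distance between the points (13, 10) and (19, 1), first term by term and then with NumPy's vector norm. The two points are chosen only for illustration.

```python
import numpy as np

a = np.array([13, 10])
b = np.array([19, 1])

# Term by term: sqrt((13-19)^2 + (10-1)^2) = sqrt(36 + 81) = sqrt(117)
d_manual = np.sqrt(((a - b) ** 2).sum())
# Same value via the built-in vector norm
d_norm = np.linalg.norm(a - b)

print(d_manual, d_norm)  # both approximately 10.817
```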
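0) A minimal example
A typical case would be predicting gender from height, weight, and shoe size. The code below instead uses a small movie data set with two features per film and the labels 动作 (action) and 爱情 (romance).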
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

knnclf = KNeighborsClassifier(n_neighbors=3)

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

X_train = np.array([[19, 1], [2, 18], [25, 1], [24, 3], [3, 17]])
# Labels: '动作' = action, '爱情' = romance
y_train = np.array(['动作', '爱情', '动作', '动作', '爱情'])
display(X_train, y_train)
array([[19, 1],
[ 2, 18],
[25, 1],
[24, 3],
[ 3, 17]])
array(['动作', '爱情', '动作', '动作', '爱情'], dtype='<U2')
knnclf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
X_test = np.array([[13, 10], [5, 10]])
knnclf.predict(X_test)
array(['动作', '爱情'], dtype='<U2')
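To see why the classifier produced these two labels, it can help to look at the neighbors that voted. The sketch below rebuilds the same small data set and uses the estimator's `kneighbors` method, which returns distances and training-set indices for each query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[19, 1], [2, 18], [25, 1], [24, 3], [3, 17]])
y_train = np.array(['动作', '爱情', '动作', '动作', '爱情'])  # action / romance
X_test = np.array([[13, 10], [5, 10]])

knnclf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# For each query row: distances to, and training-set indices of,
# its n_neighbors closest samples (nearest first)
distances, indices = knnclf.kneighbors(X_test)
for query, dist, idx in zip(X_test, distances, indices):
    print(query, y_train[idx], np.round(dist, 2))
```

For the first query two of the three neighbors are 动作 (action), and for the second two of the three are 爱情 (romance), which matches the predictions.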
1) Classification
Import the packages: the KNN machine learning algorithm and the iris data set
from sklearn.datasets import load_iris

# Load the data set; the later cells read from this `iris` object
iris = load_iris()
iris
DESCR:
Iris Plants Database

Data Set Characteristics:
  Number of Instances: 150 (50 in each of three classes)
  Number of Attributes: 4 numeric, predictive attributes and the class
  Attribute Information: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm
  Classes: Iris-Setosa, Iris-Versicolour, Iris-Virginica
  Summary Statistics:
                  Min   Max   Mean   SD    Class Correlation
    sepal length: 4.3   7.9   5.84   0.83   0.7826
    sepal width:  2.0   4.4   3.05   0.43  -0.4194
    petal length: 1.0   6.9   3.76   1.76   0.9490 (high!)
    petal width:  0.1   2.5   1.20   0.76   0.9565 (high!)
  Missing Attribute Values: None
  Class Distribution: 33.3% for each of 3 classes.
  Creator: R.A. Fisher
  Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
  Date: July, 1988

This is a copy of UCI ML iris datasets. http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher. This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

References:
  - Fisher, R.A. "The use of multiple measurements in taxonomic problems". Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.
  - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II conceptual clustering system finds 3 classes in the data.
  - Many, many more ...

feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target_names: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
data = iris.data
target = iris.target
target_names = iris.target_names
feature_names = iris.feature_names

features = DataFrame(data=data, columns=feature_names)
features.head()
# Standard deviation of each of the four feature columns
features.iloc[:, 0].std()
features.iloc[:, 2].std()
features.iloc[:, 1].std()
features.iloc[:, 3].std()
# Use petal length and petal width (columns 2 and 3) as the two features
X_train = features.iloc[:130, 2:4]
y_train = target[:130]
# The remaining 20 rows form the test set
X_test = features.iloc[130:, 2:4]
y_test = target[130:]

display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(130, 2)
(20, 2)
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
import matplotlib.pyplot as plt
%matplotlib inline

samples = features.iloc[:, 2:4]
plt.scatter(samples.iloc[:, 0], samples.iloc[:, 1], c=target)

knnclf = KNeighborsClassifier(n_neighbors=5)

knnclf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
y_ = knnclf.predict(X_test)
# True labels (first array below) followed by the predictions (second array)
display(y_test, y_)
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
array([2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
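`load_iris` returns the samples ordered by class, so taking the last 20 rows as the test set leaves only class 2 (virginica) in it, as the all-2 array of true labels shows. A hedged alternative, not part of the original walkthrough, is to shuffle and stratify the split with `train_test_split`; the seed `random_state=0` and the names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width
y = iris.target

# Shuffle before splitting and keep the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(sorted(set(y_te.tolist())))   # classes present in the test set
print(clf.score(X_te, y_te))        # mean accuracy on the held-out samples
```

With this split all three classes appear in the test set, so the accuracy says something about every class rather than only virginica.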
# Build a 100 x 100 grid covering the range of both features
xmin, xmax = samples.iloc[:, 0].min(), samples.iloc[:, 0].max()
ymin, ymax = samples.iloc[:, 1].min(), samples.iloc[:, 1].max()
x = np.linspace(xmin, xmax, 100)
y = np.linspace(ymin, ymax, 100)
xx, yy = np.meshgrid(x, y)
X_test = np.c_[xx.ravel(), yy.ravel()]

y_ = knnclf.predict(X_test)

from matplotlib.colors import ListedColormap
cmap = ListedColormap(['#aa00ff', '#00aaff', '#ffaa00'])

# Color every grid point by its predicted class, then overlay the real samples
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_, cmap=cmap)
plt.scatter(samples.iloc[:, 0], samples.iloc[:, 1], c=target)
<matplotlib.collections.PathCollection at 0x189b2990>
2) Regression
Regression is used to predict trends
from sklearn.neighbors import KNeighborsRegressor

x = np.linspace(-np.pi, np.pi, 40)
y = np.sin(x)
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x19be92b0>
noise = np.random.random(size=20) - 0.5
noise
array([-0.01419036, -0.05222776, 0.4114977 , -0.48535771, 0.4725629 ,
0.49193969, -0.4352523 , -0.48704335, 0.39377464, -0.32509247,
0.09969959, -0.10353899, 0.35402717, 0.09005099, -0.32349592,
-0.41517568, 0.13719123, 0.40893228, 0.25830619, 0.00900481])
<matplotlib.collections.PathCollection at 0x18594890>
# scikit-learn expects a 2-D feature array, hence the reshape
X_train = x.reshape(-1, 1)
y_train = y

knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict on a denser grid to draw the fitted curve
X_test = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
y_ = knn.predict(X_test)

plt.plot(X_test, y_, color='green')
plt.scatter(X_train, y_train, color='orange')
<matplotlib.collections.PathCollection at 0x19b872b0>
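The choice of `n_neighbors` controls how smooth the regression curve is. The sketch below compares a few values on a noisy sine curve. How the noise is applied here (added to every second point) and the fixed seed are assumptions for illustration, not necessarily what the cells above did.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)          # fixed seed, an arbitrary choice
x = np.linspace(-np.pi, np.pi, 40)
y = np.sin(x)
y[::2] += rng.random_sample(20) - 0.5   # assumption: noise on every second point

X_train = x.reshape(-1, 1)
X_grid = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
y_true = np.sin(X_grid).ravel()

errors = {}
for k in (1, 7, 40):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y)
    y_pred = knn.predict(X_grid)
    # Mean squared error against the noise-free sine curve
    errors[k] = float(np.mean((y_pred - y_true) ** 2))
    print(k, round(errors[k], 4))
```

With k=1 the prediction copies the nearest training value, noise included. With k=40 every prediction averages all 40 training points, so the curve collapses to a flat line at the mean of `y`. Intermediate values trade off between the two.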
Exercise
Human activity recognition: walking, walking upstairs, walking downstairs, sitting, standing, and lying down
X_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')

display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(7352, 561)
(2947, 561)
DataFrame(X_train).head()

Series(y_train).unique()
array([5, 4, 6, 1, 3, 2], dtype=int64)
label = {
    1: 'WALKING',
    2: 'WALKING UPSTAIRS',
    3: 'WALKING DOWNSTAIRS',
    4: 'SITTING',
    5: 'STANDING',
    6: 'LAYING',
}

knnclf = KNeighborsClassifier(n_neighbors=9)

knnclf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=9, p=2,
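y_ = knnclf.predict(X_test[:1000])
# Fraction of correct predictions on the first 1000 test samples
(y_ == y_test[:1000]).sum() / y_.size

# score() computes the same accuracy directly, here on the first 500 test samples
knnclf.score(X_test[:500], y_test[:500])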
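A single accuracy number does not show which activities get confused with each other. One way to look closer is a confusion matrix. The sketch below is self-contained, so it uses a synthetic six-class data set from `make_classification` as a stand-in; it does not load the real `x_train.npy` files, and all sizes and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the activity data (NOT the real .npy files):
# 6 classes, 30 features, labels shifted to 1..6 to mimic the exercise
X, y = make_classification(n_samples=1200, n_features=30, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
y = y + 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred, labels=[1, 2, 3, 4, 5, 6])
print(cm)
# The diagonal holds the correct predictions, so this equals clf.score(X_te, y_te)
print(np.trace(cm) / cm.sum())
```

On the real data the same two calls, `knnclf.predict` followed by `confusion_matrix(y_test, y_)`, would show, for example, how often SITTING is predicted for STANDING samples.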
plt.figure(figsize=(16, 8))
colors = ['red', 'yellow', 'blue', 'green', 'cyan', 'purple']
for i in range(1, 7):
    plt.subplot(2, 3, i)
    title = label[i]
    # Plot the feature values of the first training sample of class i
    Series(X_train[y_train == i][0]).plot(color=colors[i - 1], title=title)