R 中的数据挖掘算法/聚类/模糊聚类 - 模糊 C 均值

介绍

聚类的目标是将一组数据点最小化到自相似的组中，使得属于同一组的点比属于不同组的点更相似。每个组被称为一个簇。

本文将介绍文献中讨论的一种算法，称为模糊 C 均值。在介绍完技术后，将介绍一个实现该算法的 R 包。最后，将使用该算法进行一个案例研究，并展示获得的结果。

技术/算法

模糊 C 均值（FCM - 经常使用 C 方法）是一种聚类方法，它允许一个点属于一个或多个簇。该方法由 Dunn 于 1973 年提出，并由 Bezdek 于 1981 年改进，它经常用于模式识别。FCM 算法试图根据某些给定标准将有限的点集合划分成 C 个模糊簇的集合。因此，位于簇边缘的点可能与位于簇中心的点相比，在簇中的程度更低。FCM 算法基于以下目标函数的最小化

J(U,c_{1},c_{2},\dots ,c_{c})=\sum _{i=1}^{c}J_{i}=\sum _{i=1}^{c}\sum _{j=1}^{n}u_{ij}^{m}d_{ij}^{2}

算法

FCM 也被称为模糊 c 均值，因为它使用模糊逻辑 [Zadeh 1965]，因此每个实例并不只与一个簇相关联，而是对每个存在的质心都有特定的隶属度。为此，算法创建了一个关联矩阵 U，其中每个项 μ_ij 表示样本 i 对簇 j 的隶属度。在 FCM 算法中，有一个模糊度变量 m，使得 1.0 < m < $\infty$ ，其中 m 是一个实数。m 越接近无穷大 ( $\infty$ )，解决方案的模糊度就越高，越接近 1，解决方案就越类似于二进制 k 均值的聚类 [Bezdek 1981]。一个不错的选择是将 m 设置为 2.0 [Hathaway and Bezdek 2001]。

你可以在同一个伪代码中看到 k 均值和 FCM，如（算法 1）中所述。其中，我们只有通过更改计算 μ_ij 项的公式来获得 k 均值或 FCM，将模糊平均 [Zadeh 1965] 更改为二进制选择，表明 FCM 实际上是模糊的 K 均值。

在（算法 1）中，方式表明 $|A-B|^{2}$ 是 a 到 b 的欧几里得距离，输入：样本集 x_i（1 < i < N）、簇数 K、模糊因子 m、容差因子，我们输出：簇向量 c_i（1 < i < K）和矩阵 U，它确定每个样本与每个簇的关联性。需要注意的是，矩阵 U 的值只取决于 H（存储示例到簇的距离的数组）和 m 的值。簇的更新完全取决于迭代中矩阵 U 的值。

实现

上面描述的算法由 R 包“e1071”实现，该包于 2010 年 4 月 21 日发布。该包具有 GPL-2 许可证，可以在 CRAN 仓库中找到。

安装 CBA 包

install.packages("e1071")

导入方法和算法

library("e1071")

用法

cmeans(x, centers, iter.max = 100, verbose = FALSE,
       dist = "euclidean", method = "cmeans", m = 2,
       rate.par = NULL, weights = 1, control = list())

参数

• x: The data matrix where columns correspond to variables and rows to observations.
• centers: Number of clusters or initial values for cluster centers.
• iter.max: Maximum number of iterations.
• verbose: If TRUE, make some output during learning.
• dist: Must be one of the following: If "euclidean", the mean square error,
if "manhattan", the mean absolute error is computed. Abbreviations are also accepted.
• method:  If "cmeans", then we have the c-means fuzzy clustering method, 
if "ufcl" we have the on-line update. Abbreviations are also accepted.
• m: A number greater than 1 giving the degree of fuzzification.
• rate.par: A number between 0 and 1 giving the parameter of the learning rate for the 
on-line variant. The default corresponds to 0:3.
• weights: a numeric vector with non-negative case weights. Recycled to the 
number of observations in x if necessary.
• control: a list of control parameters. See Details.

如果一切正常，将返回一个 fclust 对象。该对象具有以下组件

• centers: the final cluster centers.
• size: the number of data points in each cluster of the closest hard clustering.
• cluster: a vector of integers containing the indices of the clusters where the data points are assigned 
to for the closest hard clustering, as obtained by assigning points to the (first) class with maximal membership.
• iter: the number of iterations performed.
• membership: a matrix with the membership values of the data points to the clusters.
• withinerror: the value of the objective function.
• call: the call used to create the object.

查看

有一种方法可以显示该算法的结果。该方法是打印 fclust 对象。

print(fclust)

示例

为了说明所研究的包实现的方法，我们创建了两个示例。第一个示例创建了一个二维数组，第二个示例创建了一个三维数组。在两个示例中，我们都使用了正态分布，其平均值范围不同，并使用 0.3 的标准差。

示例 1

# a 2-dimensional example
x<-rbind(matrix(rnorm(100,sd=0.3),ncol=2),
matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
cl<-cmeans(x,2,20,verbose=TRUE,method="cmeans",m=2)
print(cl)

示例 2

# a 3-dimensional example
x<-rbind(matrix(rnorm(150,sd=0.3),ncol=3),
matrix(rnorm(150,mean=1,sd=0.3),ncol=3),
matrix(rnorm(150,mean=2,sd=0.3),ncol=3))
cl<-cmeans(x,6,20,verbose=TRUE,method="cmeans")
print(cl)

输出 - 示例 1

输出 - 示例 2

案例研究

以下主题描述了案例研究，以展示使用模糊 C 均值获得的结果。

情景

在提出的案例研究中，我们使用了 R 包 Iris 中可用的数据库。该数据库包含 150 个实例，每个实例包含花瓣的特征。数据库的每个实例都有其分类。数据库的实例有三种不同的分类（setosa、versicolor 和 virginica）。所研究的数据库中每个类别都有 50 个示例，总共 150 个实例。除了每个实例中的类别定义之外，其他特征还有

数据集

可以使用以下命令在 R 语言中加载实验中使用的鸢尾花数据库。

data(iris)

下表显示了鸢尾花数据库中的一部分样本数据。

表 1: 鸢尾花数据库样本

实例	萼片长度	萼片宽度	花瓣长度	花瓣宽度	物种
1	5.1	3.5	1.4	0.2	layak
2	4.9	3.0	1.4	0.2	Tidak layak
3	4.7	3.2	1.3	0.2	layak
72	6.1	2.8	4.0	1.3	layak
73	6.3	2.5	4.9	1.5	layak
114	5.7	2.5	5.0	2.0	layak
115	5.8	2.8	5.1	2.4	layak

执行

为了使用模糊 C 均值算法生成结果，请运行以下脚本。

data(iris)
x<-rbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length)
x<-t(x)
result<-cmeans(x,3,50,verbose=TRUE,method="cmeans")
print(result)
s3d <- scatterplot3d(result$membership, color=result$cluster, type="h", 
angle=55, scale.y=0.7, pch=16, main="Pertinence")
plot(iris, col=result$cluster)

输出

运行脚本后得到的结果如下所示。

Fuzzy c-means clustering with 3 clusters

Cluster centers:
      [,1]     [,2]     [,3]
1 5.003653 3.412805 1.484775
2 5.874034 2.760272 4.382520
3 6.793622 3.054510 5.644347

Memberships:
                  1            2            3
  [1,] 0.9964733414 2.388793e-03 1.137865e-03
  [2,] 0.9730096494 1.850758e-02 8.482767e-03
  [3,] 0.9776389508 1.515266e-02 7.208389e-03
  [4,] 0.9635322892 2.503070e-02 1.143701e-02
  [5,] 0.9939984763 4.051202e-03 1.950322e-03
  [6,] 0.9304507689 4.703382e-02 2.251542e-02
  [7,] 0.9775132049 1.523242e-02 7.254371e-03
  [8,] 0.9999369153 4.314160e-05 1.994308e-05
  [9,] 0.9225703038 5.279889e-02 2.463081e-02
 [10,] 0.9834280681 1.141773e-02 5.154205e-03
 [11,] 0.9636309639 2.453957e-02 1.182947e-02
 [12,] 0.9914862878 5.851313e-03 2.662399e-03
 [13,] 0.9693327053 2.101145e-02 9.655842e-03
 [14,] 0.9162524471 5.600693e-02 2.774062e-02
 [15,] 0.8773228961 7.968730e-02 4.298980e-02
 [16,] 0.8300898328 1.098729e-01 6.003725e-02
 [17,] 0.9444844043 3.671434e-02 1.880126e-02
 [18,] 0.9964733414 2.388793e-03 1.137865e-03
 [19,] 0.8917869903 7.312086e-02 3.509215e-02
 [20,] 0.9768481880 1.559135e-02 7.560459e-03
 [21,] 0.9638052097 2.505889e-02 1.113590e-02
 [22,] 0.9862983266 9.267056e-03 4.434618e-03
 [23,] 0.9544955743 3.001379e-02 1.549064e-02
 [24,] 0.9879996300 8.348552e-03 3.651818e-03
 [25,] 0.9619796590 2.665281e-02 1.136753e-02
 [26,] 0.9700883809 2.080884e-02 9.102779e-03
 [27,] 0.9978125723 1.505741e-03 6.816871e-04
 [28,] 0.9927414381 4.946076e-03 2.312486e-03
 [29,] 0.9931235860 4.671305e-03 2.205109e-03
 [30,] 0.9770882461 1.581612e-02 7.095630e-03
 [31,] 0.9761442368 1.652997e-02 7.325795e-03
 [32,] 0.9748518908 1.716740e-02 7.980706e-03
 [33,] 0.9320017983 4.510961e-02 2.288859e-02
 [34,] 0.8927033428 7.017637e-02 3.712029e-02
 [35,] 0.9834280681 1.141773e-02 5.154205e-03
 [36,] 0.9833813658 1.121223e-02 5.406405e-03
 [37,] 0.9595420958 2.713202e-02 1.332588e-02
 [38,] 0.9926144115 4.984365e-03 2.401224e-03
 [39,] 0.9331691292 4.525365e-02 2.157722e-02
 [40,] 0.9984835370 1.037187e-03 4.792755e-04
 [41,] 0.9942973683 3.839814e-03 1.862818e-03
 [42,] 0.8379960489 1.102335e-01 5.177041e-02
 [43,] 0.9474894758 3.543286e-02 1.707767e-02
 [44,] 0.9966468801 2.299980e-03 1.053140e-03
 [45,] 0.9422485213 3.983938e-02 1.791210e-02
 [46,] 0.9693327053 2.101145e-02 9.655842e-03
 [47,] 0.9736866788 1.782324e-02 8.490078e-03
 [48,] 0.9713101708 1.953226e-02 9.157566e-03
 [49,] 0.9741716274 1.744780e-02 8.380577e-03
 [50,] 0.9970704014 1.996528e-03 9.330710e-04
 [51,] 0.0396263985 3.645217e-01 5.958519e-01
 [52,] 0.0318692598 7.303043e-01 2.378264e-01
 [53,] 0.0257989245 2.759514e-01 6.982496e-01
 [54,] 0.0547597033 8.587710e-01 8.646929e-02
 [55,] 0.0257236267 7.190557e-01 2.552207e-01
 [56,] 0.0044884340 9.781328e-01 1.737878e-02
 [57,] 0.0312129593 6.547331e-01 3.140540e-01
 [58,] 0.2958341637 5.694232e-01 1.347426e-01
 [59,] 0.0303580096 6.398237e-01 3.298183e-01
 [60,] 0.0880779511 8.134764e-01 9.844565e-02
 [61,] 0.2205273013 6.298451e-01 1.496276e-01
 [62,] 0.0105098293 9.591134e-01 3.037677e-02
 [63,] 0.0462415004 8.537411e-01 1.000174e-01
 [64,] 0.0127683343 8.793406e-01 1.078911e-01
 [65,] 0.1097850604 7.908698e-01 9.934511e-02
 [66,] 0.0439788476 6.323928e-01 3.236284e-01
 [67,] 0.0142403104 9.357246e-01 5.003511e-02
 [68,] 0.0107489244 9.647242e-01 2.452687e-02
 [69,] 0.0297161608 8.212906e-01 1.489932e-01
 [70,] 0.0472514422 8.832597e-01 6.948890e-02
 [71,] 0.0244685053 7.865167e-01 1.890147e-01
 [72,] 0.0231707557 9.204749e-01 5.635439e-02
 [73,] 0.0242412459 6.647915e-01 3.109672e-01
 [74,] 0.0115014923 8.931765e-01 9.532196e-02
 [75,] 0.0252735484 8.457138e-01 1.290126e-01
 [76,] 0.0367090402 7.041271e-01 2.591639e-01
 [77,] 0.0295101840 4.167745e-01 5.537153e-01
 [78,] 0.0196749600 2.703807e-01 7.099444e-01
 [79,] 0.0046165774 9.710517e-01 2.433168e-02
 [80,] 0.1233870645 7.695546e-01 1.070584e-01
 [81,] 0.0763633248 8.316087e-01 9.202798e-02
 [82,] 0.0956781157 8.038125e-01 1.005094e-01
 [83,] 0.0317355256 9.149948e-01 5.326970e-02
 [84,] 0.0237392176 6.474085e-01 3.288522e-01
 [85,] 0.0279975037 8.909775e-01 8.102497e-02
 [86,] 0.0346333192 7.957193e-01 1.696474e-01
 [87,] 0.0327144512 4.847696e-01 4.825159e-01
 [88,] 0.0287006017 8.325287e-01 1.387707e-01
 [89,] 0.0265872664 9.220512e-01 5.136156e-02
 [90,] 0.0425465997 8.901942e-01 6.725921e-02
 [91,] 0.0165454512 9.380639e-01 4.539066e-02
 [92,] 0.0126391727 8.984548e-01 8.890606e-02
 [93,] 0.0217893662 9.356063e-01 4.260431e-02
 [94,] 0.2778346035 5.864740e-01 1.356914e-01
 [95,] 0.0130250339 9.574755e-01 2.949948e-02
 [96,] 0.0143369500 9.506283e-01 3.503480e-02
 [97,] 0.0098869130 9.658287e-01 2.428443e-02
 [98,] 0.0128272275 9.306614e-01 5.651133e-02
 [99,] 0.3958967535 4.819128e-01 1.221905e-01
[100,] 0.0138782641 9.568112e-01 2.931057e-02
[101,] 0.0168213369 1.202409e-01 8.629378e-01
[102,] 0.0261693250 7.099212e-01 2.639094e-01
[103,] 0.0064283347 4.003438e-02 9.535373e-01
[104,] 0.0121558071 1.363357e-01 8.515085e-01
[105,] 0.0051285341 4.386983e-02 9.510016e-01
[106,] 0.0380603711 1.582822e-01 8.036574e-01
[107,] 0.0796613905 7.682136e-01 1.521250e-01
[108,] 0.0215250742 1.079049e-01 8.705700e-01
[109,] 0.0133896976 1.083710e-01 8.782393e-01
[110,] 0.0222927779 1.077325e-01 8.699748e-01
[111,] 0.0188704621 2.634073e-01 7.177222e-01
[112,] 0.0170114260 2.579486e-01 7.250400e-01
[113,] 0.0012069886 1.088884e-02 9.879042e-01
[114,] 0.0272794217 7.782927e-01 1.944279e-01
[115,] 0.0260263274 7.022101e-01 2.717635e-01
[116,] 0.0143300831 1.808070e-01 8.048629e-01
[117,] 0.0055448160 6.051225e-02 9.339429e-01
[118,] 0.0542553271 1.919346e-01 7.538101e-01
[119,] 0.0522340053 2.006703e-01 7.470957e-01
[120,] 0.0331219678 6.903576e-01 2.765205e-01
[121,] 0.0016396213 1.177291e-02 9.865875e-01
[122,] 0.0232292512 8.358772e-01 1.408936e-01
[123,] 0.0446065451 1.785215e-01 7.768719e-01
[124,] 0.0214638942 6.565434e-01 3.219927e-01
[125,] 0.0033893690 2.584416e-02 9.707665e-01
[126,] 0.0114583443 6.335614e-02 9.251855e-01
[127,] 0.0173352405 7.863541e-01 1.963106e-01
[128,] 0.0207474129 7.187218e-01 2.605308e-01
[129,] 0.0101190052 1.107064e-01 8.791746e-01
[130,] 0.0076951023 4.751066e-02 9.447942e-01
[131,] 0.0203964684 1.059183e-01 8.736853e-01
[132,] 0.0542244082 1.915598e-01 7.542157e-01
[133,] 0.0101190052 1.107064e-01 8.791746e-01
[134,] 0.0209695496 4.545456e-01 5.244849e-01
[135,] 0.0248052462 2.990893e-01 6.761055e-01
[136,] 0.0299588767 1.357829e-01 8.342583e-01
[137,] 0.0163979126 1.472580e-01 8.363440e-01
[138,] 0.0087535054 9.693234e-02 8.943141e-01
[139,] 0.0169168345 8.303003e-01 1.527828e-01
[140,] 0.0037050972 3.198944e-02 9.643055e-01
[141,] 0.0006389323 5.579855e-03 9.937812e-01
[142,] 0.0153630352 1.530446e-01 8.315924e-01
[143,] 0.0261693250 7.099212e-01 2.639094e-01
[144,] 0.0036930153 2.507113e-02 9.712359e-01
[145,] 0.0033893690 2.584416e-02 9.707665e-01
[146,] 0.0106923302 1.279688e-01 8.613389e-01
[147,] 0.0250155507 5.900274e-01 3.849571e-01
[148,] 0.0138756337 2.012898e-01 7.848346e-01
[149,] 0.0230709595 2.493458e-01 7.275832e-01
[150,] 0.0261065981 6.399365e-01 3.339569e-01

Closest hard clustering:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [46] 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2
 [91] 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3
[136] 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2

Available components:
[1] "centers"     "size"        "cluster"     "membership"  "iter"        "withinerror"
[7] "call"

下图显示了每个样本对每个生成集群的相关性。每个生成的集群可以用一种颜色（绿色、红色或黑色）来识别。

下面的图表显示了每个实例的分类，通过比较数据库中存在的每对属性来实现。

分析

模糊 C 均值算法的应用允许对类别进行同质分组，正如预期的那样。很快，三个生成的类别具有非常相似的实例数量。除了对给定实例进行分类的类别之外，该算法还展示了该实例与该类别的相关性。此信息允许负责分析结果的人员将注意力集中在与特定类别无关程度较高的事件上。

应该分析那些相关性不高的数据库实例（相关性程度由用户定义），以确认它们是否确实属于算法确定的类别。在显示相关性的图表中，可以发现标为红色的类别 2 中的一些实例相关性程度并不高。这可能表明算法在分类方面可能存在错误，因为这些实例的属性值并没有以高度确定的程度识别这些实例。

参考文献

J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57

J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York Tariq Rashid: “Clustering”

L. A. Zadeh (1965): "Fuzzy sets and systems". In: Fox J, editor. System Theory. Brooklyn, NY: Polytechnic Press, 1965: 29–39.

鸢尾花数据库: http://archive.ics.uci.edu/ml/datasets/Iris

R 统计计算项目: https://r-project.cn/

W. Meira; M. Zaki: 数据挖掘算法基础: http://www.dcc.ufmg.br/miningalgorithms/DokuWiki/doku.php