R语言中的数据挖掘算法/频繁模式挖掘/Eclat算法

简介

Eclat算法

Eclat算法用于执行项集挖掘。项集挖掘使我们能够在数据中找到频繁模式，例如，如果消费者购买牛奶，他也会购买面包。这种类型的模式称为关联规则，并用于许多应用领域。

Eclat算法的基本思想是使用tid集交集来计算候选项集的支持度，避免生成前缀树中不存在的子集。它最初由Zaki、Parthasarathy等人提出。

算法

Eclat算法是递归定义的。初始调用使用所有具有其tid集的单个项目。在每次递归调用中，函数IntersectTidsets验证每个项集-tid集对 $\left\langle {X,t(X)}\right\rangle$ 与所有其他对 $\left\langle {Y,t(Y)}\right\rangle$ 以生成新的候选 $N_{XY}$ 。如果新的候选频繁，则将其添加到集合 $P_{X}$ 中。然后，递归地，它在 $X$ 分支中查找所有频繁项集。该算法以DFS方式搜索以找到所有频繁集。

实现

Eclat算法可以在R系统的arule包中找到。

   * R package: arules
   * Method: eclat
   * Documentation : arules package

用法

eclat(data, parameter = NULL, control = NULL)

参数

data: transactions类对象或任何可以强制转换为transactions的数据结构（例如，二进制矩阵、data.frame）。

parameter: ECparameter类对象或命名列表（默认值为：support 0.1和maxlen 5）

control: ECcontrol类对象或命名列表，用于算法控制。

返回值

返回itemsets类对象

示例

data("Adult")

## Mine itemsets with minimum support of 0.1.
itemsets <- eclat(Adult, parameter = list(supp = 0.1, maxlen = 15))

可视化

arules包为项集实现了某些可视化方法，这些方法是Eclat算法的返回类型。以下是一些示例

示例1

data("Adult") 

## Mine frequent itemsets with Eclat.
fsets <- eclat(Adult, parameter = list(supp = 0.5))

## Display the 5 itemsets with the highest support.
fsets.top5 <- sort(fsets)
inspect(fsets.top5)

## Get the itemsets as a list
as(items(fsets.top5), "list")

## Get the itemsets as a binary matrix
as(items(fsets.top5), "matrix")

## Get the itemsets as a sparse matrix, a ngCMatrix from package Matrix.
## Warning: for efficiency reasons, the ngCMatrix you get is transposed
as(items(fsets.top5), "ngCMatrix")

示例2

## Create transaction data set.
data <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("b","e"),
  c("b","c","e"),
  c("a","d","e"),
  c("a","c"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)

t <- as(data, "transactions")

## Mine itemsets with tidLists.
f <- eclat(data, parameter = list(support = 0, tidLists = TRUE))

## Get dimensions of the tidLists.
dim(tidLists(f))

## Coerce tidLists to list.
as(tidLists(f), "list")

## Inspect visually.
image(tidLists(f))

##Show the Frequent itemsets and respectives supports
inspect(f)

示例2的结果

#	项目	支持度
1	{b, c, e}	0.1
2	{a, b, c}	0.1
3	{a, c}	0.2
4	{b, c}	0.2
5	{c, e}	0.2
6	{a, b,d, e}	0.1
7	{a, d, e}	0.2
8	{b, d, e}	0.1
9	{a, b, d}	0.3
10	{a, d}	0.4
11	{b, d}	0.3
12	{d, e}	0.2
13	{a, b, e}	0.1
14	{a, e}	0.2
15	{b, e}	0.3
16	{a, b}	0.5
17	{a}	0.7
18	{b}	0.7
19	{e}	0.5
20	{d}	0.4
21	{c}	0.4

用例

为了查看Eclat算法使用的一些实际示例，我们将使用来自northwind数据库的一些数据。northwind数据库可免费下载，并表示企业数据。在本示例中，我们将使用数据库中的订单明细表。订单明细表用于将订单与产品相关联（在多对多关系中）。Eclat算法将用于从此数据中查找频繁模式，以查看是否存在一起购买的任何产品。

场景

给定来自northwind数据库订单明细表的数据，查找所有支持度=0.1且长度至少为2的频繁项集。

输入数据

订单明细表包含以下字段

ID: 主键
订单ID: 来自Orders表的外部键
产品ID: 来自Products表的外部键
数量: 购买的数量
折扣: 提供的折扣
单价: 产品的单价

要使用数据，需要进行一些预处理。该表可能包含许多属于同一订单的行，因此该表被转换为以下方式：一个订单的所有行在新的表中仅成为一行，其中包含属于该订单的产品ID。字段ID、订单ID、数量、折扣和单价被丢弃。数据保存在名为northwind-orders.txt的txt文件中。该文件以一种方式编写，以便在R中作为列表对象加载。

实现

要运行示例，需要在R中加载arules包。

首先，数据在R中加载到列表对象中

## 1041 transactions is loaded, the listing below was shortened
## some duplicates transactions was introduced to produce some results for the algorithm
data = list(
 c("2","3","4","6","7","8","10","12","13","14","16","20","23","32","39","41","46","52","55","60","64","66","73","75","77"),
 c("11","42","72"),
 c("14","51"),
 c("41","51","65"),
 c("22","57","65"),
 ...)

其次，使用Eclat算法。

itemsets <- eclat(data, parameter = list(support = 0.1, minlen=2, tidLists = TRUE, target="frequent itemsets"))

parameter specification:
 tidLists support minlen maxlen            target   ext
     TRUE     0.1      2      5 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

eclat - find frequent item sets with the eclat algorithm
version 2.6 (2004.08.16)         (c) 2002-2004   Christian Borgelt
create itemset ... 
set transactions ...[78 item(s), 1041 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating bit matrix ... [3 row(s), 1041 column(s)] done [0.00s].
writing  ... [4 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].

输出数据

itemsets对象保存Eclat算法执行的输出。如上所示，生成了4个集合。要查看结果，可以使用

inspect(itemsets)

  items   support
1 {11,           
   42,           
   72}  0.1940442
2 {42,           
   72}  0.1940442
3 {11,           
   42}  0.1940442
4 {11,           
   72}  0.1959654

分析

如上所示，eclat算法的结果有4个频繁项集。这个输出是由数据中多次重复事务{11, 42, 72}引起的。结果表明，元组{11,42,72}、{42,72}和{11,42}的支持度为19.40%；元组{11,72}的支持度为19.60%。

产品ID 11、42和72分别代表产品Queso Cabrales、Singaporean Hokkien Fried Mee和Mozzarella di Giovanni。因此，eclat算法的输出表明，购买这些商品一起的购物模式非常频繁。

变体

PPV、PrePost和FIN算法

这三种算法由Deng等人提出^[1] ^[2] ^[3]，并基于三种新颖的数据结构，分别称为节点列表^[1]、N列表^[2]和节点集^[3]，以方便频繁项集的挖掘过程。它们是FP树中节点的集合，每个节点都使用先序遍历和后序遍历进行编码。与节点列表相比，N列表和节点集效率更高。这导致PrePost^[2]和FIN^[3]的效率高于PPV^[1]。更多细节请参见^[1] ^[2] ^[3]。

参考文献

Hahsler, M.；Buchta, C.；Gruen, B. & Hornik, K. arules：挖掘关联规则和频繁项集。2009年。
Hahsler, M.；Gruen, B. & Hornik, K. arules -- 挖掘关联规则和频繁项集的计算环境。《统计软件杂志》，2005年，14卷，1-25页。
Mohammed Javeed Zaki，Srinivasan Parthasarathy，Wei Li。用于并行关联挖掘的局部算法。SPAA 1997：321-330。
Mohammed Javeed Zaki，Srinivasan Parthasarathy，Mitsunori Ogihara，Wei Li。快速发现关联规则的新算法。KDD 1997：283-286。
Mohammed Javeed Zaki，Srinivasan Parthasarathy，Mitsunori Ogihara，Wei Li。关联规则发现的并行算法。《数据挖掘与知识发现》，1卷（4期）：343-373（1997）。
Christian Borgelt（2003）Apriori和Eclat的有效实现。频繁项集挖掘实现研讨会（FIMI 2003，美国佛罗里达州梅尔本）。

↑ ^a ^b ^c ^d Deng，Z. & Wang，Z. 用于挖掘频繁模式的一种新的快速垂直方法[1]。《国际计算智能系统杂志》，3卷（6期）：733-744，2010年。
↑ ^a ^b ^c ^d Deng，Z.；Wang，Z. & Jiang，J. 使用N列表快速挖掘频繁项集的一种新算法[2]。《中国科学信息科学》，55卷（9期）：2008-2030，2012年。
↑ ^a ^b ^c ^d Deng，Z. & Lv，S. 使用节点集快速挖掘频繁项集[3]。《应用专家系统》，41卷（10期）：4505–4512，2014年。

[Deng2010-1] Deng，Z. & Wang，Z. 用于挖掘频繁模式的一种新的快速垂直方法[1]。《国际计算智能系统杂志》，3卷（6期）：733-744，2010年。

[Deng2012-2] Deng，Z.；Wang，Z. & Jiang，J. 使用N列表快速挖掘频繁项集的一种新算法[2]。《中国科学信息科学》，55卷（9期）：2008-2030，2012年。

[Deng2014-3] Deng，Z. & Lv，S. 使用节点集快速挖掘频繁项集[3]。《应用专家系统》，41卷（10期）：4505–4512，2014年。

[1]

[2]

[3]