R 编程/描述性统计

在本节中，我们将介绍描述性统计，即一套用于描述和探索数据的工具。这主要包括单变量和双变量统计工具。

通用函数

我们介绍一些用于描述数据集的函数。

names()给出每个变量的名称
str()给出数据集的结构
summary()给出数据中每个变量的均值、中位数、最小值、最大值、第一和第三四分位数。

> summary(mydat)

describe()(Hmisc 包) 比summary()

> library("Hmisc")
> describe(mydat)

contents()(Hmisc 包)
dims()在 Zelig 包中提供更多详细信息。
descr()在 descr 包中给出连续变量的最小值、最大值、均值和四分位数，因子频率表和字符向量的长度。
whatis()(YaleToolkit) 对数据集进行良好描述。
detail()在 SciencesPo 包中提供了大量用于连续变量的统计信息、用于因子的频率表和用于字符向量的长度。
describe()在 psych 包中也提供了汇总统计信息。

> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11

单变量分析

连续变量

矩

mean()计算平均值
方差 var().
标准差sd().
偏度skewness()(fUtilities, moment 或 e1071)
峰度 kurtosis()(fUtilities, moment 或 e1071)
所有矩 moment()(moment) 和all.moments()(moment).

> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # mean, variance, skewness and kurtosis
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the skewness
[1] 0.0626506

顺序统计

范围、最小值和最大值 range()返回向量的范围（向量的最小值和最大值），min()最小值和max()最大值。
IQR()计算四分位间距。median()计算中位数和mad()中位数绝对偏差。
quantile(), hdquantile()在 Hmisc 包中和kuantile()在 quantreg 包中计算连续向量的样本分位数。kuantile()当样本量很大时可能效率更高。

> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"

不平等指数

基尼系数 Gini()(ineq) 和gini()(reldist).
ineq()(ineq) 给出所有不平等指数。

> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694

集中指数

> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646

贫困指数

> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index

绘制分布图

我们可以使用箱线图绘制分布图（boxplot()），直方图（hist()），核密度估计器（plot()使用density()）或经验累积分布函数（plot()使用ecdf()）。有关直方图和核密度估计器的更多信息，请参见非参数部分。qqnorm()生成一个正态 QQ 图，并且qqline()向 QQ 图添加一条穿过第一和第三四分位数的直线。

箱线图是最小值、第一四分位数、中位数、第三四分位数和最大值的图形表示。
stripchart()和stem()也可使用。

> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # it does not do the plot but adds a line to existing one

拟合优度检验

Kolmogorov Smirnov 检验

KS 检验是一种单样本拟合优度检验。检验统计量只是经验累积分布函数和理论累积分布函数之间差异的绝对值的最大值。KSd()(sfsmisc) 给出 KS 统计量的临界值。例如，我们从 Beta(2,2) 分布中抽取一个样本，并检验它是否符合 Beta(2,2) Beta(1,1) 和均匀分布。

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if its fit a uniform distribution (in fact the beta(1,1) is a uniform distribution)

有些检验特定于正态分布。Lillie 检验是当参数未知时 KS 检验的扩展。在 nortest 包中使用以下命令实现它：lillie.test()。shapiro.test()实现了Shapiro Wilk 正态性检验

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)

         Lilliefors (Kolmogorov-Smirnov) normality test

data:  x 
D = 0.0955, p-value = 0.9982*
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9916, p-value = 0.7902

Andersen Darling 检验

> library("nortest")
> ad.test(x)

	Anderson-Darling normality test

data:  x 
A = 0.2541, p-value = 0.7247

另请参见 ADGofTest 包以获取此检验的另一个版本^[1]。

Shapiro-Francia 正态性检验

> sf.test(x)

	Shapiro-Francia normality test

data:  x 
W = 0.9866, p-value = 0.9953

Pearson 卡方正态性检验

> library("nortest")
> pearson.test(x)

	Pearson chi-square normality test

data:  x 
P = 0.8, p-value = 0.8495

Cramer-von Mises 正态性检验

> cvm.test(x)

	Cramer-von Mises normality test

data:  x 
W = 0.0182, p-value = 0.9756

Jarque-Bera 检验

> jarque.bera.test(x)

	Jarque Bera Test

data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

离散变量

我们使用以下命令生成离散变量：sample()，并使用以下命令对其进行制表：table()。我们可以使用饼图（pie()），条形图（barplot()或barchart()(lattice)) 或点图（dotchart()或dotplot()(lattice)) 进行绘制。

freq()(descr) 打印频率、百分比并生成条形图。它支持权重。

> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100

多变量分析

连续变量

协方差 cov()
Pearson 线性相关 cor().
Pearson 相关检验cor.test()执行检验。
Spearman 等级相关
- cor()使用method = "spearman".
- spearman()(Hmisc)
Spearman 等级相关检验
- spearman2()(Hmisc)
- spearman.test()(Hmisc)
- spearman.test()(pspearman 包) 执行 Spearman 等级相关检验，对于 n <= 22 使用预先计算的精确零分布。
Kendall 相关 cor()使用method = "kendall"。另请参见 Kendall 包。

> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")

离散变量

table(), xtabs()和prop.table()用于列联表。ftable()(stats 包) 用于扁平（嵌套）表。
assocplot()和mosaicplot()用于列联表的图形显示。
CrossTable()(descr) 类似于 SAS Proc Freq。它返回一个列联表，其中包含卡方检验和费舍尔独立性检验。
my.table.NA()和my.table.margin()(cwhmisc)
chisq.detail()(TeachingDemos)

离散变量和连续变量

bystats()Hmisc 包中的按类别统计
summaryBy()(doBy)
多个箱线图 plot()或boxplot()

> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)

两个样本均值的相等性t.test()和wilcox.test()，方差相等性var.test()，两个分布的相等性ks.test().

N <- 100
x <- sample(0:1,N, replace = T) 
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)

参考文献

↑ Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. https://cran.r-project.cn/package=ADGofTest

上一页：图形

索引

下一页：线性模型

[1] Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. https://cran.r-project.cn/package=ADGofTest

[1]