Data Science: An Introduction/Thinking Like a Data Engineer

Chapter 06: Thinking Like a Data Engineer

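Chapter Summary

When data scientists think like data engineers, they think in terms of tables. Their tasks are to define the rows, columns, and cells of tables; to relate tables to one another; and to build systems that ingest, store, and retrieve tables.

(The discussion that follows still needs more on thinking in tables (rows and columns) and on how multiple tables relate to one another (schemas). It could briefly introduce normal forms and indexes, and touch on different data management schemes such as flat CSV files, relational database management systems, and NoSQL.)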

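Discussion

Data engineering is the data part of data science. According to Wikipedia, data engineering involves acquiring, ingesting, transforming, storing, and retrieving data. It is closely related to data collection, information engineering, knowledge engineering, information management, and knowledge management.

Data engineering begins with an understanding of the general nature of the problem to be solved. A data acquisition and management plan must be drawn up that specifies where the data come from (RSS feeds, sensor networks, existing data repositories), the format of the incoming data (text, numbers, images, video), and how the data will be stored and retrieved (file system, database management system). Raw data are "dirty": they contain records that do not conform to the agreed data definitions. For example, in one hospital data set, several boys aged 7 to 11 were recorded as having given birth.^[1] Clearly, these data contain errors. Part of the data acquisition and management plan is deciding how to deal with dirty data (keep it, delete it, or infer a correction).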
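The choice between keeping, deleting, and correcting dirty records can be sketched in R. This is only an illustration: the patients data frame, its column names, and the rule that a birth should not be recorded for a male patient are all assumptions made up for the example, not part of the hospital data set cited above.

```r
# Hypothetical patient records (all values invented for illustration)
patients <- data.frame(
  id         = 1:5,
  sex        = c("M", "F", "M", "F", "M"),
  age        = c(9, 31, 45, 27, 10),
  gave_birth = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Assumed data definition: a birth cannot be recorded for a male patient
violates <- patients$gave_birth & patients$sex == "M"

# Option 1 (keep): retain every record but flag the suspect ones
patients$suspect <- violates

# Option 2 (delete): drop the records that break the rule
clean <- patients[!violates, ]

# Show the records that need a decision
patients[violates, ]
```

Flagging rather than deleting keeps the evidence available if the team later decides that a correction can be inferred.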

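Most of the time, raw data are not in the format that the analysis tools require. In practice, each tool expects to see data in its own particular way. One task of data engineering is therefore to transform the data so that the analysis tools the data science team will use can consume them. For example, a team might receive egg-laying data in the following format, with each observation on its own row: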

Chicken	Day	Eggs
A	1	3
A	2	4
A	3	2
B	1	1
B	2	0
B	3	2

However, the analysis the team wants to run requires all the observations about each chicken to be in a single row, as follows:

Chicken	Day 1	Day 2	Day 3
A	3	4	2
B	1	0	2

Good data engineering requires both the ability to manipulate data and an understanding of the analytic purpose the data will serve.
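One possible way to turn the one-row-per-observation table into the one-row-per-chicken table is base R's reshape() function. The sketch below is one approach among several; the object names eggs_long and eggs_wide and the column names Chicken, Day, and Eggs are chosen here for illustration.

```r
# The egg-laying data with one observation per row
eggs_long <- data.frame(
  Chicken = c("A", "A", "A", "B", "B", "B"),
  Day     = c(1, 2, 3, 1, 2, 3),
  Eggs    = c(3, 4, 2, 1, 0, 2)
)

# Spread the Eggs values across one column per day, one row per chicken
eggs_wide <- reshape(
  eggs_long,
  idvar     = "Chicken",  # identifies the row in the wide table
  timevar   = "Day",      # its values become part of the new column names
  v.names   = "Eggs",     # the measurement being spread out
  direction = "wide",
  sep       = "_"
)

# Columns are now Chicken, Eggs_1, Eggs_2, Eggs_3
eggs_wide
```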

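In the egg-laying example above, the first table is in normalized form, which lends itself to further analysis, while the second table presents the data in a format that users find easy to read. Such a format usually makes implicit assumptions about the questions that will be asked of the data, such as "What is the trend in a chicken's egg production over time?" Other questions, such as "On how many occasions did a chicken lay no eggs?", are easier to answer with the data in normalized form.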
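As a small illustration of why the normalized form suits the second question, the count of days without eggs takes a single line in R once every observation sits on its own row. The object and column names below are chosen for this sketch.

```r
# The egg-laying data in normalized form, one observation per row
eggs_long <- data.frame(
  Chicken = c("A", "A", "A", "B", "B", "B"),
  Day     = c(1, 2, 3, 1, 2, 3),
  Eggs    = c(3, 4, 2, 1, 0, 2)
)

# For each chicken, count the days on which no egg was laid
zero_days <- tapply(eggs_long$Eggs == 0, eggs_long$Chicken, sum)
zero_days
```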

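The source for an analysis is often the output of another system. For example, an egg-laying database might store its data internally in the three-column format but export reports in the "many columns" format. One of the data engineer's tasks is to transform the captured data, which may include re-normalizing data taken from output reports.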
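Re-normalizing a report can be sketched with the same reshape() function run in the other direction. The report data frame and its column names Day_1 to Day_3 are assumptions standing in for whatever an exporting system actually produces.

```r
# A hypothetical exported report with one column per day
report <- data.frame(
  Chicken = c("A", "B"),
  Day_1   = c(3, 1),
  Day_2   = c(4, 0),
  Day_3   = c(2, 2)
)

# Stack the day columns back into one observation per row
eggs_norm <- reshape(
  report,
  direction = "long",
  idvar     = "Chicken",
  varying   = c("Day_1", "Day_2", "Day_3"),  # columns to stack
  v.names   = "Eggs",                        # name of the stacked column
  timevar   = "Day",                         # new column recording the day
  times     = 1:3
)

# Sort by chicken and day, and drop the generated row names
eggs_norm <- eggs_norm[order(eggs_norm$Chicken, eggs_norm$Day), ]
rownames(eggs_norm) <- NULL
eggs_norm
```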

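Wikipedia defines database normalization as the process of organizing the fields and tables of a relational database to minimize redundancy and dependency, usually by splitting large tables into smaller (and less redundant) tables and defining relationships between them. The main goals of normalization are to:

- avoid update and deletion anomalies
- minimize redesign when the database structure is extended
- support general-purpose queries, including future queries not anticipated at design time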

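Suppose the egg-laying data are extended to store the age and color of each chicken. This could be represented in a single table, as follows:

Chicken	Age	Color	Day	Eggs
A	2	Brown	1	3
A	2	Brown	2	4
A	2	Brown	3	2
B	1	White	1	1
B	1	White	2	0
B	1	White	3	2

This table now contains redundant information, because the age and color of each chicken are stored three times. If we were to store hundreds of days of data for each chicken, this would become inefficient. Moreover, if chicken B turns 2 years old, we would have to update records 4, 5, and 6 together to change the age. The normalized solution is to create a separate "chicken" table for the facts about each chicken, linked to the egg-laying table by a unique identifier, or key.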
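The split into two tables can be sketched in R with unique() and merge(). The object names eggs_flat, chickens, laying, and rejoined are chosen for this illustration; a real database would do the same job with SQL tables and joins.

```r
# The single, redundant table
eggs_flat <- data.frame(
  Chicken = c("A", "A", "A", "B", "B", "B"),
  Age     = c(2, 2, 2, 1, 1, 1),
  Color   = c("Brown", "Brown", "Brown", "White", "White", "White"),
  Day     = c(1, 2, 3, 1, 2, 3),
  Eggs    = c(3, 4, 2, 1, 0, 2)
)

# One row per chicken: facts about the chicken itself
chickens <- unique(eggs_flat[, c("Chicken", "Age", "Color")])

# One row per observation: Chicken is the key linking back to chickens
laying <- eggs_flat[, c("Chicken", "Day", "Eggs")]

# A birthday now changes a single row instead of three
chickens$Age[chickens$Chicken == "B"] <- 2

# Rebuild the flat view whenever an analysis needs it
rejoined <- merge(chickens, laying, by = "Chicken")
rejoined
```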

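Wikipedia defines a primary key as the unique identifier of a record in a table of a relational database. Some data sets have a natural unique key (for example, employee_id in an employee table). In other cases a unique key has to be generated by the system, either with an internal "auto-increment" counter or by combining several attributes into one unique key (for example, Chicken_Day in the example above). Other tables can cross-reference this table by using its primary key. For example, a "project" table could contain a column holding the employee_id of each team member associated with the project. This "cross-reference" column is called a foreign key.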
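R data frames do not enforce keys the way a database management system does, but the two ideas can be checked by hand. The sketch below builds the Chicken_Day composite key with paste() and then tests uniqueness and cross-references; the object names laying, chickens, key_is_unique, and fk_ok are assumptions made for this example.

```r
# Egg-laying observations and a separate table of chickens
laying <- data.frame(
  Chicken = c("A", "A", "A", "B", "B", "B"),
  Day     = c(1, 2, 3, 1, 2, 3),
  Eggs    = c(3, 4, 2, 1, 0, 2)
)
chickens <- data.frame(Chicken = c("A", "B"), Age = c(2, 1))

# Composite key built by combining two attributes
laying$Chicken_Day <- paste(laying$Chicken, laying$Day, sep = "_")

# A primary key must be unique and must not be missing
key_is_unique <- anyDuplicated(laying$Chicken_Day) == 0 &&
  !anyNA(laying$Chicken_Day)

# Foreign key check: every Chicken in laying must exist in chickens
fk_ok <- all(laying$Chicken %in% chickens$Chicken)

key_is_unique
fk_ok
```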

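Entity-relationship diagrams (also called logical data models) are used to design relational databases and are a good way to understand the structure in a data set. The three building blocks of an entity-relationship model are entities, attributes, and relationships. An entity is a discrete, identifiable "thing", either a physical object such as a car (or a chicken) or a concept such as a bank transaction or a phone call. Each entity can be physically represented as a table in which each column is an attribute of the entity (for example employee_id, forename, surname, date of joining). A relationship is a verb that connects two or more entities. For example, a chicken "lays" eggs, or an employee "belongs to" a department. Importantly, a relationship also has a cardinality, which can be "one-to-one", "many-to-one", "one-to-many", or "many-to-many". For example, a chicken can lay many eggs, but each egg is laid by only one chicken, so the "lays" relationship is one-to-many. A many-to-many relationship usually indicates that the design needs further elaboration. For example, the "teaches" relationship between university lecturers and students is many-to-many, and entities such as classes and dates need to be introduced to understand the relationship fully. An example entity-relationship diagram is shown below.

Example entity-relationship diagram showing the relationships between students and lecturers.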
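One way to picture how a many-to-many relationship gets elaborated is to express it as tables. In the sketch below, a classes table and an enrollment table sit between lecturers and students, so that each link is one-to-many. Every table, column name, and value is hypothetical and is not taken from the diagram.

```r
# Hypothetical entities
lecturers <- data.frame(lecturer_id = c("T1", "T2"), surname = c("Lee", "Ortiz"))
students  <- data.frame(student_id = c("S1", "S2", "S3"),
                        surname = c("Khan", "Moreau", "Silva"))

# Each class is taught by one lecturer (many classes to one lecturer)
classes <- data.frame(class_id = c("C1", "C2"), lecturer_id = c("T1", "T2"))

# Each enrollment row links one student to one class
enrollment <- data.frame(
  class_id   = c("C1", "C1", "C2", "C2"),
  student_id = c("S1", "S2", "S2", "S3")
)

# Follow the keys from enrollment through classes to recover "teaches"
teaches <- merge(enrollment, classes, by = "class_id")

# Which lecturers teach student S2?
unique(teaches$lecturer_id[teaches$student_id == "S2"])
```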

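More advanced data engineering also requires knowledge of computer programming and the Structured Query Language (SQL), as well as relational and non-relational database management systems. For the purposes of this book, we will use the R programming language for simple data engineering tasks.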

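Assignment/Exercise

This assignment is about reading data sets into R data frames. Form groups of 3 or 4 students. Every student must complete every part of the exercise. The purpose of working in teams is to help one another understand what is going on. Some of the assignment takes trial and error. Different students will make different attempts and different mistakes, so all students will learn from each other's trials and errors.

Part 1 of 3: Create 4 variables in R, each with 12 observations.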

#Create data frame
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Create four variables with 12 observations each
#Weather data for Sterling, VA from http://www.weather.com/weather/wxclimatology/monthly/USVA0735
#Retrieved 30 August 2012
#Average Temperature (Fahrenheit) 
#Average Precipitation (inches)
Num <- 1:12
Month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
AveTemp <-c(32, 35, 43, 53, 62, 72, 76, 75, 67, 55, 46, 36)
AvePrcp <-c(2.85, 2.86, 3.60, 3.62, 4.72, 3.92, 3.70, 3.49, 4.00, 3.59, 3.58, 3.09)

#List out the objects that exist in the R environment
ls()

#Verify each variable
Num
Month
AveTemp
AvePrcp

#Link these four variables together into a dataset where each of the 12 observations correspond to each other
#Give the dataset a name (Wthr) using the dataframe command

Wthr <- data.frame(Num, Month, AveTemp, AvePrcp)

#List out the objects that exist in the R environment
ls()

#Notice that the 4 variables are still part of the R environment in addition to the dataframe
#The variables are now also part of the data frame 
#Verify the contents of the dataset
Wthr

#Verify the formats within the data frame using the "structure" (str) command
str(Wthr)

#Notice that as part of the data frame the variables have a dollar sign ($) as a prefix
#Compare the Month variable inside and outside the data frame
str(Month)
str(Wthr$Month)

#Whoops! What happened? Depending on your version of R, Month may now be a factor variable.
#Before R 4.0.0, data.frame() converted character variables to factors by default (stringsAsFactors=TRUE);
#from R 4.0.0 on, Month stays a character variable and levels() below returns NULL.
#We call the values of a factor variable "levels"
#Factor variables are nominal variables, which means the default is that order does not matter, which is called an "unordered" factor. 
#Therefore R does two things as a default:
#  1) R sorts the levels in alphabetical order
#  2) R stores each value as the integer position of its level in that sorted list, in this case 5, 4, 8, 1, 9, etc.
#For this particular problem the order of the months does matter.
#We can force the order of the levels by using the factor() function with the levels argument
#(Add ordered=TRUE if you want what R formally calls an "ordered" factor)
levels(Wthr$Month)
Wthr$Month <- factor(Wthr$Month, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

#Note we could have also specified, levels=Month, can you explain why?
#Verify that the factor levels are now ordered properly, with the assigned integers in order
levels(Wthr$Month)
str(Wthr$Month)
Wthr

#We can now remove the redundant variables from the R workspace
rm("AvePrcp", "AveTemp", "Month", "Num")
ls()

#The dataframe is the only object left
#Now let's do some plots
plot(x=Wthr$Month, y=Wthr$AveTemp)
lines(Wthr$Month,fitted(loess(Wthr$AveTemp~Wthr$Num)))
plot(x=Wthr$Month, y=Wthr$AvePrcp)
plot(x=Wthr$AveTemp, y=Wthr$AvePrcp, pch=16, cex=1.5)
abline(lm(Wthr$AvePrcp~Wthr$AveTemp))

Part 2 of 3: Load an example data set into a data frame.

#Put Example Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#31 August 2012

#Remove Objects in workspace
rm(list=ls())

#Find out the available datasets
data()

#Pick a dataset and get the help file
?state

#Load the dataset into the R workspace
data(state)

#Find out what got loaded
ls()

#Examine the objects that were loaded
str(state.abb)
str(state.area)
str(state.x77)

#Notice that the last object was not a simple variable with a single set of observations, but
#it is a matrix that is 50 rows long and 8 columns wide
#Inspect a summary of these data
summary(state.abb)
summary(state.x77)

#Print out the contents of these objects
state.abb
state.x77

#Now let's put these objects into a data frame called "state" and inspect it
state <- data.frame(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()
str(state)

#Remove the old objects, now that we have put the data set into a data frame
rm(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()

#Print out the data frame
state

#Examine the relationships among the variables using table() and plot(), then
#Try about 10 different variations on both the table() and the plot() functions
table(state$state.region,state$state.division)
plot(state$Illiteracy,state$Murder)

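Part 3 of 3: Import an external data set.

Find Fisher's Iris data set in Wikipedia.
Copy the data table and paste it into a Microsoft Excel, Apple Numbers, or Google Docs spreadsheet.
Save the data set to your desktop in comma-separated values (CSV) format, with the file name "iris.csv".
Read the data set into R.
Inspect the data to make sure they are all there, then use the summary(), table(), and plot() functions to look at the data.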

#Read External Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Find out what our default working directory is
getwd()

#Set your working directory to the "desktop" and verify
#You will need to use your own directory structure
setwd("/Users/Calvin/Desktop/")
getwd()

#Read the iris.csv into a dataframe -- and verify
#  The first line of the file should be the variable names, hence header=TRUE
#  Tell R that the separator is a comma
#  If there are other lines on top of the variable names, then you will need to skip them
iris <- read.table("iris.csv", header=TRUE, sep=",", skip=0)
str(iris)
iris

#You should have gotten 150 observations on 5 variables
#Explore the data using summary(), table(), and plot()
summary(iris)
table(iris$Species)
plot(iris$Sepal.length,iris$Sepal.width)

#Create a character variable to match a color to the factor variable Species
#Note how the R code implements the following English statement
#  If the variable "iris$Species" has the value "I. setosa" then set the "iris$plotcolor" variable to the value "blue"
iris$plotcolor <- as.character("black")
iris$plotcolor [iris$Species == "I. setosa"] <- "blue"
iris$plotcolor [iris$Species == "I. versicolor"] <- "green"
iris$plotcolor [iris$Species == "I. virginica"] <- "red"

plot(
   main="Plot of Sepal Size for Three Iris Species",
   x=iris$Sepal.width, xlim=c(1,5), xlab="Sepal Width",
   y=iris$Sepal.length, ylim=c(3,8), ylab="Sepal Length",
   pch=16,
   col=iris$plotcolor
  )
legend(1.5, 3.5,"Setosa=Blue, Versicolor=Green, Virginica=Red")

#Now, plot the Petal Length and Width
#Compare Sepal Width with Petal Width
#Compare Sepal Length with Petal Length

References

[1] "Heritage Provider Network Health Prize". Heritage Provider Network, Inc. Retrieved 13 July 2012.

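Copyright Notice

You are free:

to Share - to copy, distribute, display, and perform the work (pages from this wiki).
to Remix - to adapt or make derivative works.

Under the following conditions:

Attribution - You must attribute this work. You may not suggest in any way that you or your use of the work is endorsed.
Share Alike - If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or a similar license to this one.
Waiver - Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain - Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights - In no way are any of the following rights affected by the license:

Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author's moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice - For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the following web page.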

http://creativecommons.org/licenses/by-nc-sa/3.0/
