可计算性和复杂性/形式语言/乔姆斯基层次结构/正则语言
正则语言是乔姆斯基层次结构中最受限制且最简单的语言。此类语言通常像数学集合一样描述,用大括号描述。任何匹配该描述的单词都是该语言的一部分,而任何不匹配该描述的单词都不是该语言的一部分。
非正式地,正则语言类是指可以用其字母表的字符以及连接、括号、∪(或)运算符和*运算符描述的语言。例如,,包含任意数量的a和b的语言,其中每个a后面都跟着一个b。
需要注意的是,任何有限语言,即包含一些有限的单词序列的语言,可以写成。因此,每种有限语言都是正则的。然而,反过来却不是真的。前面给出的示例语言,,包含无限个单词,并且是正则的。
正则语言的另一种等效定义是那些可以通过正则文法生成的语言。正则文法是一组规则,从初始符号开始,通过替换规则将“非终结符”(不在语言字母表中的符号)替换为“终结符”(在语言字母表中的符号),这些规则将每个非终结符替换为一个终结符,一个终结符后跟一个非终结符,或ε(空字符串)。例如,语言可以通过以下文法描述(令 S 为起始符号)
- S -> ε
- S -> aB
- S -> b
- S -> bA
- A -> aB
- A -> b
- A -> bA
- B -> bA
- B -> b
注意,通过对要应用的规则进行不同的选择,该文法可以从“S”的起始字符串生成空字符串或包含a和b的字符串,其中每个a后面都跟着一个b。这与匹配上面集合符号的单词完全相同,使得两种方法生成的语言相同。
有限自动机是机器,包含有限个状态,并且仅根据当前状态和字符输入在这些状态之间转换。当输入都被消耗后,机器会根据机器的最终状态“接受”或“拒绝”输入字符串。这些机器恰好对应于正则语言,任何有限自动机接受的字符串集都是正则语言,并且每个正则语言都有一个机器接受该语言中所有且仅有的字符串。因此,我们说正则语言类等效于有限自动机识别的语言类。
有限自动机有两种类型,确定性和非确定性。确定性有限自动机 (DFA) 针对输入中的每个字符执行一次转换,并且从每个状态到字母表中的每个字符都包含一个转换。
非确定性有限自动机 (NFA) 每次转换可以消耗一个字符或不消耗字符,不需要从每个状态对每个字符执行转换,并且对于特定输入和特定状态,可以有多个可能的转换。这些属性允许 NFA 包含计算分支。如果由字符串生成的任何计算分支接受,则机器接受该字符串,并且仅当没有分支接受时才会拒绝。由于 DFA 比 NFA 更严格,因此每个 DFA 也是 NFA,并且任何 NFA 都可以转换为 DFA,因此所有 NFA 和 DFA 识别的语言集是相同的,并且这两个机器是等效的。
对于两种类型,自动机都可以通过其状态集、字母表、转换集、起始状态和接受状态来完全指定。
并非所有语言都是正则的。以语言 为例,其单词包含一定数量的 a 接着相同数量的 b。有限自动机唯一的记忆就是其当前状态,因此它只能计数不超过其状态数的 a 数量。由于该语言包含所有此类单词,因此单词可以拥有任意数量的 a,所以对于任何机器而言,都必须存在一些字符串,其 a 的数量无法存储。如果无法存储 a 的数量,则无法比较 a 和 b 的数量,因此无法验证任何字符串是否属于该语言。因此,不存在识别该语言的有限自动机,因此它不能是正则语言。
以下代码是 Perl 中的示例 DFA 模拟器。给定机器的描述和输入字符串,它模拟机器处理输入字符串,并显示机器是否接受。
语法为:progname.pl DFAFile inputFile,其中 DFAFile 是包含 DFA 指令的文本文件,inputFile 是包含输入字符串的文本文件。一些示例输入(包括一组 DFA 指令)可以在 示例 DFA 输入 中找到。
#!usr/bin/perl
use Text::ParseWords;
use strict;
use warnings;
our (%states, %accepts, %alphabet, @rules, @string);
# Grabs the filenames for the machine and the word to be run on it.
my $dfaFile = $ARGV[0];
my $input = $ARGV[1];
# Uses subroutines to parse and verify the data in the input files.
# The data is stored in the global hashes and arrays, with the exception of the start state, which is provided here.
my $state = readDFA($dfaFile);
readInput($input);
# The newstate is a temporary storage point for when a new state is calculated.
my $newstate;
# For each symbol in the input word, the next state is calculated.
for (0..@string-1)
{
# The input word is printed, with the next symbol highlighted.
print "@string[0..$_-1] <".$string[$_]."> @string[$_+1..@string-1]\n";
# A new state is calculated.
$newstate = $rules[$states{$state}][$alphabet{$string[$_]}];
# The state transition is printed.
print "State: ".$state." -> ".$newstate."\n\n";
# The state changes to the new state.
$state = $newstate;
}
# When the input is exhausted, the machine is in its final state.
print "Final state is ".$state."\n";
# If that state is in the accept states list, the machine accepts the input.
if (defined($accepts{$state})) { print "The machine accepts the string.\n"; }
# If not, the machine rejects.
else { print "The machine does not accept the string.\n" }
###################################################
sub readDFA
# This subroutine reads the machine data from the specified file into variables (mostly hashes).
{
open(INFILE, shift) or die "Can't open ".$dfaFile.": $!";
# This block reads the list of states from the machine file.
# Discards the section header,
<INFILE>;
my $line = <INFILE>;
chomp($line);
my @words = &parse_line('\s+', 0, $line);
my $counter = 0;
for (@words)
{
# assigns each state name a number by creating a hash,
$states{$_} = $counter;
$counter++;
}
# and records the number of states.
my $stateCount = $counter;
# This block reads the start state from the machine file.
# Discards the header,
<INFILE>;
my $startState = <INFILE>;
# takes the whole line as the start state,
chomp($startState);
# and makes sure that the start state is defined in the list of states.
defined($states{$startState}) or die "The start state $startState isn't a state!";
# This block reads the list of accepting states from the machine file.
# Discards the header,
<INFILE>;
$line = <INFILE>;
chomp($line);
# breaks up the line into state names,
@words = &parse_line('\s+', 0, $line);
for (@words)
{
# checks to make sure that the accept states are defined states,
defined($states{$_}) or die "The start state $_ isn't a state!";
# and defines those names in a new hash. The use of a hash makes it easier to determine later if a specific state name accepts or not.
$accepts{$_} = 1;
}
# This block reads the list of symbols in the alphabet from the machine file.
# Discards the header,
<INFILE>;
$line = <INFILE>;
chomp($line);
# breaks up the line into alphabet symbols (note that the symbols can be of arbitrary length),
@words = &parse_line('\s+', 0, $line);
$counter = 0;
for (@words)
{
# assigns each symbol a number, and creates a hash to reference them,
$alphabet{$_} = $counter;
$counter++;
}
# and then records the number of symbols in the alphabet.
my $alphabetCount = $counter;
# This block reads the state transition rules from the machine file.
# Discards the header,
<INFILE>;
while(<INFILE>)
{
# breaks each rule into start state, input symbol, and end state,
@words = &parse_line('\s+', 0, $_);
# checks that all three pieces are defined in the state and alphabet hashes,
defined($alphabet{$words[1]}) or die "$words[1] isn't defined in the alphabet!";
defined($states{$words[0]}) or die "$words[0] isn't a defined state!";
defined($states{$words[2]}) or die "$words[2] isn't a defined state!";
# then uses the numbers assigned to the start state and the input symbol as indexes in a 2-dimensional array whose value is the name of the end state.
$rules[$states{$words[0]}][$alphabet{$words[1]}] = $words[2];
}
# This last block makes sure the set of rules is comprehensive.
for my $i (0..$stateCount-1)
{
defined($rules[$i]) or die "Rules are incomplete.";
for my $j (0..$alphabetCount-1)
{
defined($rules[$i][$j]) or die "Rules are incomplete.";
}
}
# Reading complete, the subroutine closes the file and returns the name of the start state.
close INFILE;
return $startState;
}
sub readInput
# This subroutine reads the input string from the specified file into an array of symbols.
{
open(INFILE, shift) or die "Can't open ".$input.": $!";
# The first line of the file is read as the input, with symbols delimited by spaces.
my $line = <INFILE>."";
chomp($line);
@string = &parse_line('\s+', 0, $line);
# This makes sure every symbol in the input string was defined in the machine's alphabet.
for (@string)
{ defined($alphabet{$_}) or die "$_ in $input isn't in this machine's alphabet!"; }
close INFILE;
}