应用程序/伪随机重复子字符串的 Visual Basic
外观
此页面描述了一些适用于 VBA 的 Rnd() 函数的问题。特别是它说明了当 Randomize() 函数错误地放置在同一个循环 *内部* 而不是 *之前* 时,可能会出现重复的子字符串。
- Rnd() 函数是伪随机的,而不是真正随机的。真正的随机性很少见,一个著名的例子是可以通过白噪声获得的数据序列。白噪声,就像来自太阳的无线电噪声,或者可能是无线电或其他电子设备中的无意噪声,具有相当均匀的频率分布,并且可以用来产生随机分布的数据;也称为 *线性概率分布*,因为它们的频率分布是平行于水平轴的直线。
- 伪随机性 可以通过反馈算法获得,其中函数的一系列输出值被反馈并有助于生成输出流的下一部分。这些被称为伪随机数生成器 (PRNG)。这样的过程虽然复杂,但仍然是确定性的,完全基于它的起始值。这样的生成器,根据其设计,可以产生长序列的值,所有这些值都是唯一的,然后整个流最终会重复自身。
- 如果生成足够长的值集,PRNG 输出流总是会重复自身。VBA 中的 *Rnd* 函数可以在任何一个数字重复之前生成最多 16,777,216 个数字的序列,此时整个序列本身会重复。这在大多数情况下足够了。微软将 Rnd() 函数描述为属于称为线性同余生成器 (LCG) 的 PRNG 集,尽管不清楚该算法是否已被修改。
- Rnd 函数不适合大型表格或加密使用,而 VBA 本身在很大程度上是不安全的。对于给定的起始值,生成器总是会产生相同的序列。显然,如果流的任何一部分已知,这将允许预测序列中的其他值,这种情况对于加密使用是不安全的。也许令人惊讶的是,大量使用随机值的建模方法需要比 *Rnd()* 生成的更长的唯一序列。
- 微软 Rnd() 函数的精确编码不可用,他们对此的描述资料非常简略。我最近尝试在 VBA 代码中实现假设的算法失败了,因为发生了溢出,因此那些打算在 VBA 中研究此类生成器的人需要使用其他算法。也许研究 Wichmann-Hill (1982) CLCG 算法是一个更好的选择,该算法可以在 VBA 中实现。另一个页面提供了 Wichmann-Hill (1982) 算法的 VBA 实现(由其他人完成),以及一些更简单的生成器示例。
- 一个设计良好的 PRNG 流由唯一数字组成,但这只适用于设计者从零到一 [0,1] 范围内的未过滤数字集。一旦我们开始从流中获取一些值,而忽略其他值,例如为了生成自定义输出,新流将呈现出不同的特征。从自然序列中挑选元素和将大型集合映射到非常小的集合的组合会造成损害。在观察新集合时,循环重复点的字符计数会缩短,并且整个集合中重复子字符串的数量会增加。
- 下面的代码列表允许使用预设过滤器设置检查 Rnd() 流中的子字符串,例如;大写字母、小写字母、整数等,此外,还包括一个基于哈希的类似生成器,供那些希望比较它的人使用。
- 重复子字符串过程相当慢,因为它依赖于重复的位置。最坏的情况是*没有找到重复*,其中循环的数量在*(0.5*n)^2* 处达到最大值,即测试字符串中字符数的一半的平方。当然,最小的循环数只有 *1*,当一个简单的字符串重复时,例如;abcabc。显然,字符串长度增加十倍可能会使运行时间增加一百倍。(大约 2 秒内 1000 个字符,4 秒内 2000 个字符,200 秒内 10000 个字符,到目前为止,是最好的时间!)。
- 编码布局也会影响重复子字符串的长度。读者可以比较将 *Randomize* 函数放置在随机数循环 *外部*,然后放置在循环 *内部* 的效果,同时只输出大写字母。(见 *MakeLongRndStr* 中的代码)。在过去,当放置在内部时,重复字符串会大大恶化。此处列出的用于测试 Rnd() 的代码(使用 1000 个大写字母样本,不使用 DoEvents,并且 Randomize 错误地放置在循环 *内部*),对于本文作者来说,将返回长达 400 个字符的重复子字符串。循环中的代码行增加,影响每个循环迭代的运行时间(?)也会影响任何子字符串的长度。
Option Explicit
Sub TestRndForRepeats()
'run this to make a pseudo random string
'and to test it for longest repeated substring
Dim strRnd As String, sOut As String
Dim nOut As Long, nLen As Long
strRnd = MakeLongRndStr(1000)
MsgBox strRnd,, "Long string..."
sOut = LongestRepeatSubstring(strRnd, nOut)
MsgBox "Repeated substring found is : " & _
vbCrLf & sOut & vbCrLf & _
"Number of these found : " & nOut & vbCrLf & _
"Length of each : " & Len(sOut),, "Repeat substring..."
End Sub
Sub TestHashForRepeats()
'run this to make a long hash-based output
'and to test it for longest repeated substring
Dim sOut As String, sHash As String, nOut As Long
sHash = LongHash("String to hash", 1000)
MsgBox "The following sha256-based hash has " & _
Len(sHash) & " characters." & _
vbCrLf & vbCrLf & sHash,, "Long hash..."
sOut = LongestRepeatSubstring(sHash, nOut)
MsgBox "Repeated substring found is : " & _
vbCrLf & sOut & vbCrLf & _
"Number of these found : " & nOut & vbCrLf & _
"Length of each : " & Len(sOut),, "Repeat substring..."
End Sub
Function MakeLongRndStr(nNumChr As Long) As String
'Makes a long capital letters string using rnd VBA function
Dim n As Long, sChr As String, nAsc As Long
Dim nSamp As Long, sRec As String
'========================================================================
' Notes and Conclusions:
' The VBA function rnd is UNSUITED to generation of long random strings.
' Both length and number of repeats increases rapidly near 256 charas.
' Reasonable results can be obtained by keeping below 128 characters.
' For longer strings, consider hash-based methods of generation.
'========================================================================
'Randomize 'right place
Do Until n >= nNumChr
'DoEvents
Randomize 'wrong place
nSamp = Int((122 - 48 + 1) * Rnd + 48) 'range includes all charas
sChr = Chr(nSamp)
'cherry-picks 10, 26, 36, 52, or 62 from a set of 75
Select Case nSamp 'chara filter
Case 65 To 90 'upper case letters
sRec = sRec & sChr
Case 48 To 57 'integers
'sRec = sRec & sChr
Case 97 To 122 'lower case letters
'sRec = sRec & sChr
Case Else
'disregard
End Select
n = Len(sRec)
Loop
'MsgBox sAccum
MakeLongRndStr = Left$(sRec, nNumChr)
End Function
Function LongHash(sIn As String, nReq As Long, Optional sSeed As String = "") As String
'makes a long sha256 hash - length specified by user
'Parameters: sIn; the string to hash
'nReq; the length of output needed
'sSeed; optional added string modifier
Dim n As Long, m As Long, c As Long, nAsc As Integer, sChr As String
Dim sF As String, sHash As String, sRec As String, sAccum As String
Do Until m >= nReq
DoEvents
n = n + 1 'increment
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
'you set your own cycle increment here
sF = sIn & sSeed & sAccum & (7 * n * m / 3)
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
'get a single hash of sF
sHash = HashSHA256(sF)
'filter output for chara type
For c = 1 To Len(sHash)
sChr = Mid$(sHash, c, 1)
nAsc = Asc(sChr)
'cherry-picks 10, 26, 36 ,52, or 62 from a set of 64
Select Case nAsc 'chara filter
Case 65 To 90 'upper case letters
sRec = sRec & sChr
Case 48 To 57 'integers
'sRec = sRec & sChr
Case 97 To 122 'lower case letters
'sRec = sRec & sChr
Case Else
'disregard
End Select
Next c
'accumulate
sAccum = sAccum & sRec
m = Len(sAccum)
sRec = "" 'delete line at your peril!
Loop
LongHash = Left$(sAccum, nReq)
End Function
Function HashSHA256(sIn As String) As String
'Set a reference to mscorlib 4.0 64-bit
'HASHES sIn string using SHA2-256 algorithm
'NOTE
'total 88 output text charas of base 64
'Standard empty string input gives : 47DEQpj8HBSa+/...etc,
Dim oT As Object, oSHA256 As Object
Dim TextToHash() As Byte, bytes() As Byte
Set oT = CreateObject("System.Text.UTF8Encoding")
Set oSHA256 = CreateObject("System.Security.Cryptography.SHA256Managed")
TextToHash = oT.GetBytes_4(sIn)
bytes = oSHA256.ComputeHash_2((TextToHash))
HashSHA256 = ConvB64(bytes)
Set oT = Nothing
Set oSHA256 = Nothing
End Function
Function ConvB64(vIn As Variant) As Variant
'used to produce a base-64 output
'Set a reference to mscorlib 4.0 64-bit
Dim oD As Object
Set oD = CreateObject("MSXML2.DOMDocument")
With oD
.LoadXML "<root />"
.DocumentElement.DataType = "bin.base64"
.DocumentElement.nodeTypedValue = vIn
End With
ConvB64 = Replace(oD.DocumentElement.Text, vbLf, "")
Set oD = Nothing
End Function
Function LongestRepeatSubstring(sIn As String, Optional nSS As Long) As String
'finds longest repeated non-overlapping substring (in function name) and number of repeats (in nSS)
'greatest number cycles = (0.5*n)^2 for when "none found", eg; abcdef (9)
'shortest number cycles = 1 for one simple duplicated string; eg abcabc
Dim s1 As String, s2 As String, X As Long
Dim sPrev As String, nPrev As Long, nLPrev As Long
Dim nL As Long, nTrial As Long, nPos As Long, vAr As Variant
nL = Len(sIn)
For nTrial = Int(nL / 2) To 1 Step -1
DoEvents
For nPos = 1 To (nL - (2 * nTrial) + 1)
X = 0
s1 = Mid(sIn, nPos, nTrial)
s2 = Right(sIn, (nL - nPos - nTrial + 1))
vAr = Split(s2, s1)
X = UBound(vAr) - LBound(vAr)
If X > 0 Then
If nPrev < X Then
sPrev = s1
nPrev = X
End If
End If
Next nPos
If nPrev <> 0 Then
LongestRepeatSubstring = sPrev
nSS = nPrev
Exit Function
End If
Next nTrial
End Function