XML - Managing Data Exchange/XMLWebAudio

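Software already exists that can convert any text file into an audio file. The conversion can be done with software supplied with the Windows or Mac operating systems, or with very inexpensive standalone software such as TextAloud. TextAloud lets the user change the voice, the reading speed, and other features, and a free version can be found online. These systems can modify the voice in a number of ways to suit the user's personal preferences. They do not, however, make files available over the Internet for users to search and listen to.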

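Potential

With the right combination of XML technology, mobile communication services, and existing software and hardware, the concept of Internet radio could be extended to far more content than it carries today. Most Internet radio exists in the form of music files and broadcast programme content. The selection could be extended to include any existing text file, including news reports, government documents, educational material, and official records of all kinds. A business example is a salesperson who, on the way to a sales call, learns the details of the customer's purchase history by listening to the file in the car. Another example involves existing language-conversion software, which could let people in distant countries listen to and learn about technology being developed elsewhere.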

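Requirements

Three areas of technology must work together for this process to function:

1. XML technology must include an agreed set of XML tags for transferring files between content generators/distributors and users.
2. Mobile communication services must be able to deliver the data to the end user's system in a usable format.
3. Hardware and software must be able to consume the documents that are sent and play them for the user. This includes further development of voice-processing browsers.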

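The second and third requirements are beyond the scope of this chapter on XML, but relevant work is under way. The W3C (World Wide Web Consortium) is currently running the Mobile Web Initiative, which will set standards for software vendors, content providers, hardware (handset) manufacturers, browser developers, and mobile service operators. One proposal under consideration is a maximum page weight of 10K (a typical magazine article fits within this limit). The availability and form of advertising are still under discussion. The delivery protocol is expected to be HTTP. Connections to mobile devices may be slow, but audio files do not need to be streamed. Vendors currently participating include Nokia, Ericsson, HP, France Telecom, and Opera.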
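As a rough illustration of the proposed page-weight limit, the sketch below checks whether a document fits within 10K once it is encoded for delivery. The function name `within_page_weight`, the reading of "10K" as 10 × 1024 bytes, and the choice of UTF-8 are assumptions made for this example; the text above does not specify them.

```python
# Assumption: "10K" is read as 10 * 1024 bytes of UTF-8 encoded content.
PAGE_WEIGHT_LIMIT = 10 * 1024

def within_page_weight(document: str, limit_bytes: int = PAGE_WEIGHT_LIMIT) -> bool:
    """Return True if the encoded document is no larger than the limit."""
    return len(document.encode("utf-8")) <= limit_bytes

article = "<speak>" + "A short news item. " * 50 + "</speak>"
print(within_page_weight(article))
```

Measuring the encoded bytes rather than the character count matters for text with non-ASCII characters, where one character may occupy several bytes.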

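The first requirement would involve a set of XML tags that all generators of text-file content (such as news organizations, governments, educational institutions, and keepers of official records) could use to produce their content files. Their content could then be accessed, stored in searchable databases, and downloaded and played at any time from anywhere that supports a mobile browser device.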

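Existing Tag Set

A set of XML tags called SSML (Speech Synthesis Markup Language) already exists. It gives control over enough aspects of speech generation that users can produce and manipulate a personalized voice. Text-to-speech systems use these tags to take a text file and generate audible speech from it.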

Document structure, text processing, and pronunciation elements and attributes

speak - root Element

xml:lang - Attribute

                Language (indicates the natural language of the file, such as “en-US”); it is
                preferable to indicate this only on the voice element, so as to avoid
                changes of voice in the midst of a voice file.

xml:base - Attribute

                base URI Attribute (optional)

Example

<speak version="1.0"
        xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US">
 ... the body ...
</speak>
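The root element shown above can also be produced programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the helper name `build_speak` is hypothetical, and the schema-location attributes are left out for brevity.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def build_speak(text, lang="en-US"):
    """Return an SSML document string with a single <speak> root element."""
    # Serialize the SSML namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    speak.text = text
    return ET.tostring(speak, encoding="unicode")

print(build_speak("Hello, listener."))
```

Building the document through an XML library rather than string concatenation means special characters in the text, such as `&` or `<`, are escaped automatically.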

lexicon - Element

       references an external pronunciation lexicon document (an empty element)

meta - Element

       (an empty element); includes a string that contains some information about the
       ensuing data; through its “http-equiv” attribute it can supply HTTP header
       information for a file that doesn’t have header fields generated by the
       originating server.

metadata - Element

       can provide broader information about data as it accesses a metadata schema.

p - Element

       text structure, represents a paragraph. It can only contain the following elements:   
       audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.

s - Element

       text structure; represents a sentence. It can only contain the following
       elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
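The containment rules for `p` and `s` can be checked mechanically. The sketch below encodes the two lists given above and reports any direct child that is not permitted; the function and constant names are hypothetical, and only `p` and `s` elements are handled.

```python
import xml.etree.ElementTree as ET

# Child elements listed above for <s>; <p> additionally allows <s>.
SENTENCE_CHILDREN = {"audio", "break", "emphasis", "mark", "phoneme",
                     "prosody", "say-as", "sub", "voice"}
ALLOWED_CHILDREN = {"p": SENTENCE_CHILDREN | {"s"}, "s": SENTENCE_CHILDREN}

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]

def disallowed_children(element):
    """Return names of direct children not permitted inside a <p> or <s>."""
    # Raises KeyError for any element other than <p> or <s>.
    allowed = ALLOWED_CHILDREN[local_name(element.tag)]
    return [local_name(child.tag) for child in element
            if local_name(child.tag) not in allowed]

paragraph = ET.fromstring("<p><s>First sentence.</s><break/><s>Second.</s></p>")
print(disallowed_children(paragraph))  # []

sentence = ET.fromstring("<s>A <p>paragraph</p> inside a sentence.</s>")
print(disallowed_children(sentence))   # ['p']
```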

say-as - Element

       available attributes: interpret-as, format, and detail, with interpret-as being
       the only required one. The tag set may only contain text to be rendered by a voice
       synthesizer. This tag helps a browser to know more about the manner in which the
       enclosed text is to be voiced.

format - Attribute

               this attribute gives additional hints as to the rendering of voiced text.

detail - Attribute

               this attribute is for indicating the level of detail to be applied to voiced
               text. An example would be a special form of emphasis such as the reading of
               computer code in a block of text.
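To make the attribute rules for `say-as` concrete, here is a small sketch that builds the element and enforces that `interpret-as` is present. The helper name `say_as` is hypothetical, and the values `"date"` and `"ymd"` are illustrative assumptions; the text above does not define which values a processor accepts.

```python
import xml.etree.ElementTree as ET

def say_as(text, interpret_as, fmt=None, detail=None):
    """Build a <say-as> element; interpret-as is the only required attribute."""
    if not interpret_as:
        raise ValueError("say-as requires an interpret-as attribute")
    element = ET.Element("say-as", {"interpret-as": interpret_as})
    if fmt is not None:
        element.set("format", fmt)
    if detail is not None:
        element.set("detail", detail)
    # The element may contain only text, so no child elements are added.
    element.text = text
    return element

spoken_date = say_as("2006-03-01", "date", fmt="ymd")
print(ET.tostring(spoken_date, encoding="unicode"))
```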

phoneme - Element

       a pronunciation indicator for the text-to-speech engine. The engine does not render
       the contents of the tag, thus the tag can be empty. The attributes of the tag provide
       what the engine will use to help with language-specific pronunciation factors.
       However, any text between the tag set will be rendered on screen in a visual browser
       for hearing-impaired users. This tag can only contain text, no elements.
       alphabet - Attribute
               optional for phoneme; specifies the phonetic alphabet in use (for example “ipa”).
       ph - Attribute
               required for phoneme; specifies the phonetic string to be pronounced.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US">
 <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme>
</speak>
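A consumer of such a document needs to separate the written text from the pronunciation hints. The sketch below, assuming Python's standard `xml.etree.ElementTree`, collects the visible text, alphabet, and `ph` string of every `phoneme` element. The function name `pronunciations` and the sample document are assumptions modeled on the example above.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"ssml": "http://www.w3.org/2001/10/synthesis"}

SAMPLE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
</speak>"""

def pronunciations(document):
    """Return (written text, alphabet, ph) for each <phoneme> element."""
    root = ET.fromstring(document)
    return [
        ((element.text or "").strip(), element.get("alphabet"), element.get("ph"))
        for element in root.iterfind(".//ssml:phoneme", NAMESPACES)
    ]

for written, alphabet, ph in pronunciations(SAMPLE):
    print(written, alphabet, ph)
```

The written text is what a visual browser would display, while the `ph` string is what the speech engine would actually pronounce.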

sub - Element

       an element used to specify within its “alias” attribute the pronounced version of some 
       written text that is between the tag set. Example:

_AARP
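The substitution behaviour of `sub` can be sketched as a small function that returns the text as it would be spoken, replacing the written content of each `sub` element with its `alias`. The function name `spoken_text` is hypothetical, and the sample sentence (an abbreviation expanded to "World Wide Web Consortium") is an illustrative assumption, not the original example from this page.

```python
import xml.etree.ElementTree as ET

def spoken_text(element):
    """Return the text to be voiced, using alias in place of <sub> content."""
    if element.tag == "sub":
        # The written text inside <sub> is ignored in favour of the alias.
        return element.get("alias", "")
    parts = [element.text or ""]
    for child in element:
        parts.append(spoken_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

sentence = ET.fromstring(
    '<s>Join the <sub alias="World Wide Web Consortium">W3C</sub> today.</s>'
)
print(spoken_text(sentence))  # Join the World Wide Web Consortium today.
```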

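Prosody and style - Prosody covers aspects such as tone, intonation, conversational rhythm, pitch, loudness, sound duration, and chunking (units of words, not necessarily sentences).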

voice - Element

       indicates the type of voice to use. All the attributes are optional; however, not
       indicating any attributes at all is considered an error. The “xml:lang” attribute
       takes precedence; all other attributes are equal.

xml:lang - Attribute

                for voice element, indicates the language for the voice.

gender - Attribute
age - Attribute
variant - Attribute
name - Attribute

Example

<voice gender="male">Show me a person without a goal</voice>

 <voice gender="male" variant="2">
 and I'll show you a stock clerk.
 </voice>
 <voice name="James">Show me a stock clerk with a goal and I'll show you someone who will change the world.</voice>
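The rule that a `voice` element must carry at least one attribute can be expressed as a short check. This is a sketch under the assumption that the language attribute appears as `xml:lang`; the function name `check_voice` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
VOICE_ATTRIBUTES = {XML_LANG, "gender", "age", "variant", "name"}

def check_voice(element):
    """Return the voice attributes that are set; raise if none are present."""
    present = VOICE_ATTRIBUTES & set(element.attrib)
    if not present:
        raise ValueError("a voice element with no attributes is an error")
    return sorted(present)

voice = ET.fromstring(
    '<voice gender="male" variant="2">and I\'ll show you a stock clerk.</voice>'
)
print(check_voice(voice))  # ['gender', 'variant']
```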

emphasis - Element

       contains text to be emphasized by the speech processor (with stress or intensity). It  
       has one attribute:

level - Attribute

                indicating the degree of emphasis.

Example

Geniuses themselves don't talk about the gift of genius, they just talk about

<emphasis level="strong"> hard work and long hours. </emphasis>

The “emphasis” element can contain text and the following elements:

audio - Element
desc - Element (only inside “audio”)

                if the content is not speech then the “desc” tag should be used to describe
                the content. This description can be used in a text output for the hearing
                impaired.

break - Element
emphasis - Element
mark - Element
phoneme - Element
prosody - Element
say-as - Element
sub - Element
voice - Element

break - Element

       wherever the element is used between words it indicates a pause in the reading of the  
       text; attributes are: “strength” with values of: none (meaning no pause even if the 
       system would normally put one there), x-weak, weak, medium, strong, x-strong; “time” 
       with values of either milliseconds: 250ms or seconds: 2s.
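A processor has to turn the `time` value into a pause length. The sketch below converts the two forms mentioned above (`250ms` and `2s`) into milliseconds and lists the `strength` values for validation. The function name `break_time_ms` is hypothetical, and accepting decimal values such as `1.5s` is an assumption of this example.

```python
import re

# The strength values listed in the description of the break element.
BREAK_STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")

def break_time_ms(value):
    """Convert a break time such as '250ms' or '2s' into milliseconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s)", value.strip())
    if match is None:
        raise ValueError(f"not a valid break time: {value!r}")
    number, unit = match.groups()
    return float(number) * (1000 if unit == "s" else 1)

print(break_time_ms("250ms"))  # 250.0
print(break_time_ms("2s"))     # 2000.0
```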

prosody - Element

       controls the pitch, speaking rate and volume of a generated voice. Attributes   
       are optional but it is considered an error if no attributes are set. 
       pitch - Attribute
       contour - Attribute
       range - Attribute
       rate - Attribute
       duration - Attribute
       volume - Attribute
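The constraint that `prosody` needs at least one of its six attributes can be enforced when the element is built. The following is a minimal sketch; the helper name `prosody` is hypothetical, and the attribute values `"slow"` and `"loud"` are illustrative.

```python
import xml.etree.ElementTree as ET

PROSODY_ATTRIBUTES = ("pitch", "contour", "range", "rate", "duration", "volume")

def prosody(text, **attributes):
    """Build a <prosody> element, requiring at least one known attribute."""
    unknown = set(attributes) - set(PROSODY_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown prosody attributes: {sorted(unknown)}")
    if not attributes:
        raise ValueError("prosody with no attributes set is an error")
    element = ET.Element("prosody", attributes)
    element.text = text
    return element

slow_and_loud = prosody("Please listen carefully.", rate="slow", volume="loud")
print(ET.tostring(slow_and_loud, encoding="unicode"))
```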

Other elements allow audio files to be inserted alongside generated speech content.

audio - Element

       may be empty but if it contains anything it should be the text that the speech 
       generator could convert to a voice in place of the audio file.

Example

 <audio src="JCPennyQuote.au">Every business is built on friendship.</audio>
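The fallback role of the text inside `audio` can be illustrated by mapping each audio source to the text a speech generator could voice if the file is unavailable. This is a sketch only; the function name `audio_fallbacks` is hypothetical and the input is assumed to use unqualified (namespace-free) tag names.

```python
import xml.etree.ElementTree as ET

def audio_fallbacks(root):
    """Map each audio src to the fallback text contained in the element."""
    return {
        element.get("src"): (element.text or "").strip()
        for element in root.iter("audio")
    }

fragment = ET.fromstring(
    '<p><audio src="JCPennyQuote.au">Every business is built on friendship.</audio></p>'
)
print(audio_fallbacks(fragment))
```

An empty string as the fallback indicates an `audio` element with no alternative text.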

mark - Element

       an empty tag that places a named marker into the content. When the processor  
       reaches a “mark” element one of two things happens. One, the processor is provided 
       with the info to retrieve the desired position in the content, two, an event is issued 
       that includes the content at the desired position. It has one attribute which is:
       name - Attribute
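One way to picture what a named marker gives a processor is to compute the text that precedes it. The sketch below walks the document in order and stops at the requested `mark`; the function name `text_before_mark` and the sample sentence are assumptions for illustration, and real processors may expose mark positions differently.

```python
import xml.etree.ElementTree as ET

def text_before_mark(root, name):
    """Return all text that precedes the named <mark> in document order."""
    parts = []

    def walk(element):
        if element.tag == "mark" and element.get("name") == name:
            return True  # reached the marker; stop collecting text
        parts.append(element.text or "")
        for child in element:
            if walk(child):
                return True
            parts.append(child.tail or "")
        return False

    if not walk(root):
        raise KeyError(name)
    return "".join(parts)

speech = ET.fromstring(
    '<speak>Go from <mark name="here"/>here, to <mark name="there"/>there!</speak>'
)
print(repr(text_before_mark(speech, "there")))  # 'Go from here, to '
```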

desc - Element

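Future Potential of XML Web Audio

Additional tags could be introduced to carry the date, the file title, the author, the source language, and other metadata about the file. Extending the existing tag set would allow files to be stored in a database and searched in several ways. The tags would hold data related to the actual text/audio file that is very valuable to potential users. Users could search by a file's date of origin, its country of origin, and its subject or title.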
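To suggest how such metadata could support searching, the sketch below models a catalogue record and a simple filter. Everything here is hypothetical: the `AudioDocument` fields, the `search` function, and the sample records are invented for illustration and do not correspond to any existing tag set or data.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AudioDocument:
    """Hypothetical metadata record for one text/audio file."""
    title: str
    author: str
    published: date
    country: str
    source_language: str

def search(catalog, *, country=None, title_contains=None, published_after=None):
    """Return the records that match every criterion that was supplied."""
    results = []
    for doc in catalog:
        if country is not None and doc.country != country:
            continue
        if title_contains is not None and title_contains.lower() not in doc.title.lower():
            continue
        if published_after is not None and doc.published <= published_after:
            continue
        results.append(doc)
    return results

# Invented sample records, used only to exercise the filter.
CATALOG = [
    AudioDocument("Budget Report", "Treasury Office", date(2006, 3, 1), "US", "en"),
    AudioDocument("Rapport annuel", "Bureau central", date(2005, 6, 15), "FR", "fr"),
]

print([doc.title for doc in search(CATALOG, country="FR")])
```

In practice the same fields could be stored as indexed columns in a database; the in-memory list is used here only to keep the example self-contained.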

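Conclusion

Using SSML (an XML-based markup language), audio files can be generated from any text file, such as news reports, government documents, educational material, or official records. This content can be delivered over mobile communication services and the Web, and the files can be played on mobile browser devices. The result would be an Internet radio market larger than the existing one, which is limited to music and programme content. It would give mobile users on-demand access to a large number of information sources, opening up many uses.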