Llaisdy

Trefnydd Usage Scenarios

Author: Ivan A. Uemlianin
Contact: ivan@llaisdy.com

IMPORTANT NOTE

This document refers to an earlier version of the toolkit. Some of the details have changed, but the general patterns of use are similar. I shall update this document as soon as possible. In the meantime, the Quick Start Guide is accurate and up-to-date.

Introduction

Usage scenarios "describe typical uses of the system as narratives." (Gottesdiener, 2002: 42). They provide a basis from which to abstract the design.

Recording and transcribing data

In this scenario, the user (U) starts from scratch with no speech or text data. The first step is to collect some data which can be input into FieldAssistant (FA). Starting from scratch involves collecting and processing a lot of data, so U is probably a group or team of some kind.

The manual contains some pointers on what kind of data to collect: for example, for a desktop dictation application, read and spontaneous indoor speech could be collected; for personal use, texts can be collected from anywhere, but for community or commercial use there are copyright issues; the texts collected should be as varied as possible, to cover all the sounds of the language; and so on. The manual contains some exercises for beginner users.

U collects some texts (enough for around 10 hours of speech) and uses FA to build a TextCorpus (see Using FA to build a GrammarModelObject below). The corpus diagnostics gives feedback on weak areas of the corpus, so that U can collect more texts. Once finalised, the corpus produces a blank pronunciation dictionary, which should be completed (see Using PDBuilder to build a Pronunciation Dictionary below). Once the PD has been completed, the corpus generates transcription templates (orthographic and phonemic) for each text. These will facilitate transcription of the recordings.

U must now collect recordings of people reading these texts. Depending on the proposed application, at least 100 hours of speech should be collected from as many speakers as possible.

As data can be recorded in all manner of ways, FA does not include an application to record speech data.

Once the data has been collected, U can proceed to Using FA to build a SpeechModelObject below.

Note on collecting spontaneous speech: For many applications, spontaneous speech and/or speech in noisy situations must be collected. By definition, spontaneous speech does not have a pre-existing text. However, an orthographic transcription can be written down afterwards using Recorder/Transcriber (#todo). Once a number of texts have been processed in this way, a TextCorpus can be built as above.

Using FA to build a SpeechModelObject

The user (U) has speech data: audio files and (orthographic, untimed) transcriptions. For each audio file there is a text file containing a transcription of the speech in that file.

U uses FieldAssistant (FA) to preprocess his data into a SpeechModelObject. He then uses EngineBuilder (EB) to build an ASR Speech Engine (AE) for his chosen (sub-)language.

Assume U has raw data in a directory called rawDataDir (e.g., in his home directory or on the desktop), and that FA and EB are installed on his system.

Fire up the FA GUI, Start Project, etc.

U starts the FA GUI (e.g., by choosing FieldAssistant from their 'Start' menu, by clicking on the FA icon on the desktop, or by typing fieldAssistant at a terminal) and chooses File | New | Project from the menu bar.

This calls up a dialog which asks U for various details about the project, e.g., the name of the project, the location/path where project files should be saved, and so on.

Import training data

U clicks on File | Import | Corpus and is prompted to choose the directory containing the raw data.

FA copies the data into the project space (as identified in the New | Project dialog), building internal dataObjects as it goes. As this may take some time, FA displays a Progress Bar.

The main project dataObject is the SpeechCluster (SC), an association between speech audio, transcription and other metadata.

#todo: notes on SpeechCluster from CB; link to Bolzano paper
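As a rough illustration (all names here are hypothetical sketches, not the settled design), an SC and the stem-based file association described under Edit corpus below might look like this in Python:

    from dataclasses import dataclass, field
    from pathlib import Path

    @dataclass
    class SpeechCluster:
        # Hypothetical sketch only; the real SC structure is still to be specified.
        stem: str                    # shared filename stem, e.g. "data_1234"
        audio: Path                  # e.g. data_1234.wav
        transcriptions: list = field(default_factory=list)  # e.g. data_1234.TextGrid
        metadata: dict = field(default_factory=dict)        # speaker, date, ...

    def group_by_stem(raw_data_dir):
        # Associate each audio file with same-stem files; anything left over is a
        # 'singleton' to be reported by Corpus Diagnostics.
        files = [p for p in Path(raw_data_dir).iterdir() if p.is_file()]
        clusters = {p.stem: SpeechCluster(p.stem, p)
                    for p in files if p.suffix.lower() == ".wav"}
        singletons = []
        for p in files:
            if p.suffix.lower() == ".wav":
                continue
            if p.stem in clusters:
                clusters[p.stem].transcriptions.append(p)
            else:
                singletons.append(p)      # transcription with no matching audio
        for stem in [s for s, sc in clusters.items() if not sc.transcriptions]:
            singletons.append(clusters.pop(stem).audio)  # audio with no transcription
        return list(clusters.values()), singletons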

Once imported, the corpus shows up in the Browser Pane, along with any corpus-level metadata that can be derived during the import.

Check corpus

U now views the corpus diagnostics to see if anything needs adjusting.

The FA GUI is essentially a browser/viewer similar to Windows Explorer. The left Browser Pane allows U to browse through the tree of data entities in the Project. The right View/Edit Pane displays the entity selected, as well as any dialogs or results of processing.

U chooses Corpus/Metadata/Diagnostics in the Browser Pane, and the View/Edit Pane displays the diagnostics. Corpus Diagnostics shows things like:

  • which source files were unreadable
  • suspiciously low frequency phones/words
  • etc

Edit corpus

Here I list some possible issues as reported by Corpus Diagnostics, and how FA supports addressing them.

  • Corpus Diagnostics reports any files with unrecognised formats. For each of these a 'Specify File Format' dialog can be called up, which has sets of questions for audio and transcription files.
    • FA will support the .wav format natively. Several other formats will be supported via the FOSS package sox. There may be other, as yet unidentified, audio file formats which need to be supported. I hope to uncover these through continued requirements gathering, but (a) there is always the possibility of unsupported formats cropping up, and (b) FA should support raw (i.e., headerless) audio data. For raw and/or unsupported audio file types the following characteristics are listed, and must be specified by the user (they can be left blank, in which case FA must determine defaults): sample size, sample rate, encoding, header size (see the sketch after this list).
    • Specifying transcription formats is more involved than specifying audio formats, as U must provide a grammar (e.g., in a BNF-style format). The Manual provides (a) tutorials on simple grammar design for transcription files, and (b) overviews of supported transcription formats (e.g., esps, TextGrid) in case U is able to convert transcriptions to these formats. The 'Specify File Format' dialog for transcription files simply asks for the location of the grammar file for each kind of unsupported transcription file (kinds identified by file extension).
  • If there is an audio file and at least one other file with the same stem, FA associates the files together into an SC. Otherwise files must be associated manually.
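Returning to raw audio: a minimal sketch of how FA might read a headerless file using the characteristics specified in the dialog. The defaults below (16-bit little-endian signed PCM at 16 kHz, no header) are assumptions for illustration, not settled values:

    import struct
    import subprocess

    def read_raw_audio(path, sample_size=2, sample_rate=16000,
                       encoding="h", byte_order="<", header_size=0):
        # encoding is a struct format character; "h" = 16-bit signed integer,
        # matching sample_size=2. All four characteristics should come from the
        # 'Specify File Format' dialog (or FA-determined defaults).
        with open(path, "rb") as f:
            data = f.read()[header_size:]
        n = len(data) // sample_size
        samples = struct.unpack(byte_order + encoding * n, data[:n * sample_size])
        return samples, sample_rate

    # For formats sox understands, conversion to .wav is a one-liner:
    #     subprocess.run(["sox", "input.au", "output.wav"], check=True)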

Corpus Diagnostics reports a list of 'singleton' files, i.e., files which were not associated with any others during the import. Often this is due to typos in file naming (e.g., data_1234.wav and data1234.TextGrid); a text file may not have been recorded, or the recording misplaced; a set of recordings may not have been transcribed; etc. Missing files will have to be retrieved or produced. Where files merely have different stems, an SC can be created manually and added to the Corpus.

U clicks on File | New | SC. This calls up a dialog for initialisation options (e.g., name). Files to be associated into the SC can be dragged-and-dropped from the Corpus Diagnostics Singleton files report. Once U has finished entering the SC details, he clicks on 'Ok' and the SC is added to the Corpus (and the files are removed from the singleton files list).

  • Corpus Diagnostics reports on subsets of the Lexicon (words and/or phones, depending on the level of transcription provided) which have suspiciously low frequency (i.e., any outliers at the low end of the frequency distribution; see the sketch at the end of this section). The idea here is that some of these labels may be typos which should be corrected.

    U clicks on 'recieved' in the Suspicious Labels list. This pings the View/Edit Pane to that word's entry in the Lexicon. For each label in the Corpus, the Lexicon gives a summary of the information available: at a minimum this includes which SCs the label occurs in, but may also include, for example, pronunciation(s).

    The Lexicon shows that 'recieved' occurs in seven SCs. 'recieved' is clearly a typo for 'received', so U right-clicks on the label itself and is shown an option dialog. U chooses 'Relabel' and corrects the spelling. This causes FA to relabel 'recieved' to 'received' in the seven SCs implicated and to rejig the Lexicon and the Suspicious Labels list accordingly.

    While doing this U notices that the Lexicon lists two pronunciations for 'record' - /r eh1 k ao d/ and /r ih k ao1 d/. U reflects that this is probably 'record' the noun and 'record' the verb. U right-clicks anywhere in the lexicon entry for 'record' and from the option dialog chooses 'Split'. This launches a Split Label dialog.

    The Split Label Dialog is a double-paned GUI. The left pane shows all the tokens of the chosen entry (orderable by any field); the right pane lists the new labels. U clicks inside the right pane and enters 'record_N' for the first label. U sorts the left pane by pronunciation and drags into 'record_N' all the entries with pronunciation /r eh1 k ao d/. U clicks again inside the right pane and enters 'record_V' for a second label. U drags into 'record_V' all the entries in the left pane with pronunciation /r ih k ao1 d/. Finally U clicks 'Ok'. This causes FA to relabel the SCs implicated and to rejig the Lexicon and the Suspicious Labels list accordingly.

  • During Import Corpus, any SCs which are not segmented (i.e., which do not have boundary timings for their labels) will be auto-segmented by FA. Once the whole Corpus has been segmented, Corpus/Metadata/Phoneset and Corpus/Metadata/Lexicon will show durations for all labels (including median, mean and variance). Corpus/Metadata/Diagnostics shows any labels which have suspicious distributions.

n.b.: Auto-segmentation can (probably) only be performed if there is either phonological transcription or a pronunciation dictionary.

Clicking on a label in the suspicious labels list pings the View/Edit Pane to that label's entry in the Phoneset or Lexicon (as appropriate).
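To make the 'suspicious' checks concrete, here is a minimal sketch of both diagnostics (the thresholds are arbitrary assumptions; the real cut-offs will need experimentation):

    from collections import Counter
    import statistics

    def low_frequency_labels(tokens, max_count=2):
        # Flag labels at the low end of the frequency distribution, e.g. 'recieved'.
        return sorted(w for w, n in Counter(tokens).items() if n <= max_count)

    def suspicious_durations(durations, z=3.0):
        # durations: {label: [seconds, ...]} from auto-segmentation.
        # Flag labels whose mean duration is a distributional outlier.
        means = {lab: statistics.mean(ds) for lab, ds in durations.items()}
        mu = statistics.mean(means.values())
        sigma = statistics.pstdev(means.values())
        return sorted(lab for lab, m in means.items()
                      if sigma and abs(m - mu) > z * sigma)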

Configure the SpeechModel

U chooses Model | Configure in the menu bar. A series of dialog boxes in the View/Edit Pane guide him through the configuration decisions in setting up the SpeechModel. Defaults are given. These decisions are to do with things like API coverage, target platform and so on.

Validate the Configuration

U clicks on Model | Validate Configuration. FA runs checks to determine whether the current corpus provided can be used to build the model as configured. A report is provided in the View/Edit Pane. The report includes hints/options on how to expand the corpus if necessary.

Build the SpeechModel

U clicks on Model | Build. FA is ready to build the model. Because building a model is likely to take a long time (maybe a few days!) FA offers several options about how the building is scheduled:

  • switch off everything else. FA takes over the machine. Fastest but machine is otherwise unusable.
  • background run. FA runs in the background. Takes a long time but machine is still usable. The machine will probably be a lot slower than usual.
  • scheduled run. FA only runs between hours specified by U (e.g., overnight), saving state between sessions. It should be possible to reboot/turn off the machine between sessions (a sketch of this option follows).
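A minimal sketch of the scheduled-run option (the step and checkpoint machinery is hypothetical; the real build pipeline is not specified here):

    import datetime
    import pickle
    from pathlib import Path

    CHECKPOINT = Path("buildState.pickle")

    def scheduled_build(steps, start_hour=22, end_hour=6):
        # Run resumable build steps only inside the allowed window (here
        # 22:00-06:00, spanning midnight), checkpointing after each step so the
        # machine can be rebooted or turned off between sessions.
        done = pickle.loads(CHECKPOINT.read_bytes()) if CHECKPOINT.exists() else 0
        for i, step in enumerate(steps[done:], start=done):
            hour = datetime.datetime.now().hour
            if not (hour >= start_hour or hour < end_hour):
                return                  # outside the window; resume next session
            step()                      # one unit of work, e.g. one training pass
            CHECKPOINT.write_bytes(pickle.dumps(i + 1))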

n.b.: By default FA::BuildModel incorporates tests (see next step). An option allows this to be switched off, as it will add significantly to the build time.

FA::BuildModel produces comprehensive logs which can be viewed by clicking on Model/Diagnostics in the Browser pane (or viewed in a web browser? logs will be in xml or html). The logs show any problems and offer hints as to how they can be fixed for a rebuild.

n.b.: the log shows the location of the final model once built. The model (if all is well with diagnostics and tests) can be taken away and used independently of FA.

Test the SpeechModel

FA uses the model to perform recognition on the corpus data. As this will take some time a Progress Bar is displayed. Test results are displayed in the View/Edit Pane by choosing Model/Test Results in the Browser Pane.

As with Validate the Configuration, the test results include hints about how the model could be improved. For example, the model might get particularly poor results for certain phones, certain contexts, or for (fe)male speakers. FA might suggest collecting more of these kinds of data.

Using PDBuilder to build a Pronunciation Dictionary

U has an orthographic transcription of their data, but no phonemic transcription and no complete pronunciation dictionary (PD). U selects File | New | PD (or File | Open | PD to update an already existing PD) and FA launches PDBuilder.

PDBuilder helps U create a PD without having to enter a pronunciation for every word manually. U can define grapheme sets, create grapheme-to-phoneme rules (i.e., pronunciation rules), manually edit exceptions, and so on. PDBuilder is a dual-pane GUI similar to FA, with a browser pane and a view/edit pane. The browser pane lists the components of the PD: Phones, Graphemes, G2P Rules, the Lexicon itself, and Metadata.

When PDBuilder is launched from FA it automatically imports the wordlist from FA. All single character graphemes are then listed in the Graphemes section of the browser pane. All common grapheme combinations are listed in G2P Rules (with blank right-hand-sides). The process of building a PD goes as follows:

Unless the default phone set (the International Phonetic Alphabet) is used, U must enter the phones manually. Phonesets can also be exported and imported (File | Import | Phoneset). To add a new phone, U right-clicks on Phones in the browser pane; this launches an option dialog which includes 'New Phone'. U selects this: the view/edit pane then shows a blank Phone form with spaces for the phone label and features (e.g., in IPA: place, manner, voicing and so on). PDBuilder checks (and warns) if more than one Phone has the same label or feature set.
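A minimal sketch of the Phone form and the duplicate check (the feature names here are illustrative only):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Phone:
        label: str          # e.g. "ngh"
        features: tuple     # e.g. ("nasal", "velar", "voiceless")

    def check_phoneset(phones):
        # Warn on duplicate labels or duplicate feature sets, as PDBuilder should.
        by_label, by_feats = {}, {}
        for p in phones:
            if p.label in by_label:
                print(f"warning: duplicate label {p.label!r}")
            if p.features in by_feats:
                print(f"warning: {p.label!r} has the same features as {by_feats[p.features]!r}")
            by_label[p.label] = p
            by_feats[p.features] = p.label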

U similarly right-clicks on Browser/Graphemes to add or delete graphemes. For example, for Welsh U will want to add graphemes 'ch', 'dd', 'ng', 'ngh', etc. PDBuilder will warn if deleting a grapheme leaves parts of words uncovered. Note that, with multi-character graphemes it is possible for graphemes to overlap (e.g., in Welsh 'g', 'ng', and 'ngh'). Therefore, each Grapheme has a priority level attribute to set order of preference (default is length in characters). U can also add/delete grapheme sets (e.g., consonants) which will aid in the writing of G2P rules.
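The priority mechanism amounts to a greedy longest-match split. A minimal sketch (the function name and inventory are illustrative):

    def split_graphemes(word, inventory):
        # Greedy left-to-right split, trying higher-priority graphemes first
        # (default priority = length in characters, so 'ngh' beats 'ng' beats 'g').
        ordered = sorted(inventory, key=len, reverse=True)
        out, i = [], 0
        while i < len(word):
            for g in ordered:
                if word.startswith(g, i):
                    out.append(g)
                    i += len(g)
                    break
            else:
                raise ValueError(f"no grapheme covers {word[i:]!r}")  # PDBuilder warns
        return out

    # split_graphemes("angharad", ["a", "d", "g", "h", "n", "r", "ng", "ngh"])
    #   -> ['a', 'ngh', 'a', 'r', 'a', 'd']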

To write a new G2P rule, U right-clicks on Browser/G2P Rules and selects New G2P Rule. A blank G2P Rules form appears in the view/edit pane. This has spaces for the following:

Name:
An (optional) name for the rule (e.g., "Word-final aCe -> eiC").
Left Hand Side:
A space-separated sequence of graphemes or grapheme sets (e.g., for English, "a Consonant e #").
Right Hand Side:
A space-separated sequence of phones or grapheme sets (e.g., for English, "ei Consonant #").
Overrides:
A list of rule names or numbers with overlapping LHSs which the current rule overrides. When an LHS is specified, PDBuilder will list all overlapping rules here.
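As a simplified sketch of how such rules might be applied (grapheme sets and the Overrides field are not modelled here; the longest LHS wins, mirroring the default priority ordering):

    def apply_g2p(graphemes, rules):
        # graphemes: list including '#' word boundaries;
        # rules: {LHS tuple of graphemes: RHS tuple of phones}.
        ordered = sorted(rules, key=len, reverse=True)
        phones, i = [], 0
        while i < len(graphemes):
            for lhs in ordered:
                if tuple(graphemes[i:i + len(lhs)]) == lhs:
                    phones.extend(rules[lhs])
                    i += len(lhs)
                    break
            else:
                phones.append("?")   # uncovered grapheme: the entry is not 'Complete'
                i += 1
        return phones

    # apply_g2p(["m", "a", "k", "e", "#"],
    #           {("a", "k", "e", "#"): ("ei", "k", "#"), ("m",): ("m",)})
    #   -> ['m', 'ei', 'k', '#']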

Once some G2P rules have been entered, the resulting pronunciations will be displayed in the lexicon. Clicking on a word in Browser/Lexicon will display that word's entry in the view/edit pane. The word form has the following spaces (apart from orthography these are generated automatically but can be edited manually):

Orthography:
This shows the spelling as imported from the source text (e.g., Welsh "angharad").
Grapheme Sequence:
This shows a space-separated sequence of graphemes (e.g., "a ngh a r a d").
Phone Sequence:
This shows a space-separated sequence of phones (e.g., "a ngh a r a d").
G2P Rule Sequence:
This lists the rules which have been used to generate the phone sequence from the grapheme sequence.
Exceptional:
This shows (for each of the sequences above) whether the sequence has been edited manually or has been generated completely automatically.
Complete:
This shows (with a big red cross or a big green tick) whether all the graphemes have been accounted for.

U can build a PD iteratively by writing some G2P Rules, viewing the results and either manually adjusting exceptions, rewriting rules or writing more rules.

The Metadata section of Browser contains diagnostics such as suspiciously underused phones or G2P Rules.

Using FA to build a GrammarModelObject

User (U) has textual data. This can be any supported text format (sgml/xml including html, Word, OOo, pdf, ...?).

U uses FieldAssistant (FA) to preprocess his data into a GrammarModelObject. He then uses EngineBuilder (EB) to build an ASR Speech Engine (AE) for his chosen (sub-)language.

Assume U has raw data in a directory called rawDataDir (e.g., in his home directory or on the desktop), and that FA and EB are installed on his system.

Fire up the FA GUI, Start Project, etc.

U starts the FA GUI (e.g., by choosing FieldAssistant from their 'Start' menu, by clicking on the FA icon on the desktop, or by typing fieldAssistant at a terminal) and chooses File | New | Project from the menu bar.

This calls up a dialog which asks U for various details about the project, e.g., the name of the project, the location/path where project files should be saved, and so on.

Import data

U clicks on File | Import | Corpus and is prompted to choose the directory containing the raw data.

FA copies the data into the project space (as identified in the New | Project dialog), building internal dataObjects as it goes. As this may take some time, FA displays a Progress Bar.

The main project dataObject is the TextComplex.
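As a rough illustration (the class body is a guess at the minimum; the real TextComplex will handle many source formats, not just plain text):

    from collections import Counter
    from pathlib import Path
    import re

    class TextComplex:
        def __init__(self, raw_data_dir):
            # Assumes plain .txt files only, for illustration.
            self.texts = {p.name: p.read_text(encoding="utf-8")
                          for p in Path(raw_data_dir).glob("*.txt")}
            tokens = re.findall(r"\w+", " ".join(self.texts.values()).lower())
            self.wordlist = Counter(tokens)  # feeds the low-frequency diagnostics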

Once imported, the corpus shows up in the Browser Pane, along with any corpus-level metadata that can be derived during the import.

Check corpus

U now views the corpus diagnostics to see if anything needs adjusting.

The FA GUI is essentially a browser/viewer similar to Windows Explorer. The left Browser Pane allows U to browse through the tree of data entities in the Project. The right View/Edit Pane displays the entity selected, as well as any dialogs or results of processing.

U chooses Corpus/Metadata/Diagnostics in the Browser Pane, and the View/Edit Pane displays the diagnostics. Corpus Diagnostics shows things like:

  • which source files were unreadable
  • suspiciously low frequency words
  • etc

Edit corpus

### a few things need changing ... # todo

Configure the GrammarModel

U chooses Model | Configure in the menu bar. A series of dialog boxes in the View/Edit Pane guide him through the configuration decisions in setting up the GrammarModel. Defaults are given. These decisions are to do with things like API coverage, target platform and so on.

Validate the Model Configuration

U clicks on Model | Validate Configuration. FA runs checks to determine whether the current corpus can be used to build the model as configured. A report is provided in the view/edit pane. The report includes hints/options on how to expand the corpus if necessary.

Build the GrammarModel

U clicks on Model | Build. FA is ready to build the model. Because building a model is likely to take a long time FA offers several options about how the building is scheduled:

  • switch off everything else. FA takes over the machine. Fastest but machine is otherwise unusable.
  • background run. FA runs in the background. Takes a long time but machine is still usable. The machine will probably be a lot slower than usual.
  • scheduled run. FA only runs between hours specified by U (e.g., overnight), saving state between sessions. It should be possible to reboot/turn off the machine between sessions.

n.b.: By default FA::BuildModel incorporates tests (see next step). An option allows this to be switched off, as it will add significantly to the build time.

FA::BuildModel produces comprehensive logs which can be viewed by clicking on Model/Diagnostics in the Browser pane (or viewed in a web browser? logs will be in xml or html). The logs show any problems and offer hints as to how they can be fixed for a rebuild.

n.b.: the log shows the location of the final model once built. The model (if all is well with diagnostics and tests) can be taken away and used independently of FA.

Test the GrammarModel

FA uses the model to perform recognition on the corpus data. As this will take some time a Progress Bar is displayed. Test results are displayed in the View/Edit Pane by choosing Model/Test Results in the Browser Pane.

As with Validate the Configuration, the test results include hints about how the model could be improved. For example, the model might get particularly poor results for certain words, or certain contexts. FA might suggest collecting more of these kinds of data.

Using EB to build an ASR Speech Engine

Assume U has an SMO and a GMO in a directory called modelDir (e.g., in his home directory or on the desktop), and that FA and EB are installed on his system.

U uses EngineBuilder (EB) to build an ASR Speech Engine (AE) for his chosen (sub-)language.

Fire up the EB GUI, Start Project, etc.

U starts the EB GUI (e.g., by choosing EngineBuilder from their 'Start' menu, by clicking on the EB icon on the desktop, or by typing engineBuilder at a terminal) and chooses File | New | Project from the menu bar.

This calls up a dialog which asks U for various details about the project, e.g., the name of the project, the location/path where project files should be saved, and so on.

Import Models

U clicks on File | Import | Models and is prompted to choose the directory containing the SMO(s) and GMO(s).

Once imported, the models show up in the EB::GUI::BrowsePane, along with metadata derived during the import.

Check models

U now views the diagnostics to see if anything needs adjusting.

The EB GUI is essentially a browser/viewer similar to FA. U chooses SpeechModel/Metadata/Diagnostics (or GrammarModel/Metadata/Diagnostics) in the Browser Pane, and the View/Edit Pane displays the diagnostics. Diagnostics shows things like:

  • weaknesses in the model
  • ...?

Edit models

### a few things need changing ... # todo

Configure the Engine

U chooses Engine | Configure. A series of dialog boxes in the view/edit pane guide him through the configuration decisions in setting up the ASR SpeechEngine. Defaults are given.

Validate the Engine Configuration

U clicks on Engine | Validate Configuration. EB runs checks to determine whether the current models provided can be used to build the engine as configured. A report is provided in the view/edit pane. The report includes hints/options on how to change the models if necessary.

Build the Engine

U clicks on Engine | Build. EB is ready to build the engine. Because building an engine is likely to take a long time (maybe a few days!) EB offers several options about how the building is scheduled:

  • switch off everything else. EB takes over the machine. Fastest, but the machine is otherwise unusable (unless building is stopped).
  • background run. EB runs in the background. Takes a long time but the machine is still usable, though probably a lot slower than usual.
  • scheduled run. EB only runs between hours specified by U (e.g., overnight), saving state between sessions. It should be possible to reboot/turn off the machine between sessions.

n.b.: By default EB::BuildEngine incorporates tests (see next step). An option allows this to be switched off, as it will add significantly to the build time.

EB::BuildEngine produces comprehensive logs which can be viewed by clicking on Engine/Diagnostics in the Browser pane (or viewed in a web browser? logs will be in xml or html). The logs show any problems and offer hints as to how they can be fixed for a rebuild.

n.b.: the log shows the location of the final engine once built. The engine (if all is well with diagnostics and tests) can be taken away and used independently of EB.

Test the Engine

EB uses the engine to perform recognition on corpus data, if available; it also runs various engine/API-specific tasks as specified in the configuration (e.g., out-of-vocabulary behaviours). This will take some time, so U is warned (alert box, progress bar, estimated time left, etc.). Test results are viewable from Browser::Model/Test Results.

References

Gottesdiener, E. (2002). Requirements by Collaboration. Addison-Wesley.