Week 4
This week, Dr. Anderson and I delved deeper into the Kaldi documentation to better understand how to format the metadata files for the CORAAL audio data so that they mimic the way it is done for datasets already included in the Kaldi toolkit. We also realized that each audio file is actually a full interview, so after untarring the files I will have to segment each one into utterance-level audio files. Dr. Anderson helped me get started on a Python script to do this (a rough sketch of the idea is included at the end of this post).

One more important point is that she also helped me understand the ordering of the utt2spk (utterance to speaker) file, which maps utterance IDs to speaker IDs. Each speaker has many utterances, but an utterance has only one speaker. The pertinent detail is that the utt2spk file must be ordered consistently with the spk2utt file, but the toolkit already includes a script to handle this, so I will not have to write my own script to sort utt2spk (a small example of the format is sketched below as well).

One frustration this week was deciding which script should create the metadata files; I have decided to follow the getdata and run design of the other data directories. Another was understanding how all the scripts fit together and execute one after the other. Dr. Anderson has given me metadata files from the voxforge directory as well as a log of the scripts running so I can better understand the process. In these directories, the user executes a getdata.sh script to fetch the needed data and then a run.sh script to move the data and create the metadata files. These files describe speakers, contain transcriptions, and contain features to use in training models.

Currently, my getdata.sh, run.sh, and Python segmentation script are written, and I am testing each of them. I am also continuing to look at other scripts to better develop the ones in the CORAAL directory, which is where all of my work lives.
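As a note to myself on the utt2spk layout: it is a plain text file with one "<utterance-id> <speaker-id>" pair per line, and the usual Kaldi convention is to prefix each utterance ID with its speaker ID so that sorting by utterance ID also groups utterances by speaker. The sketch below is only an illustration with made-up speaker names and times, not the real CORAAL IDs; the comments mention the toolkit helpers (utils/utt2spk_to_spk2utt.pl and utils/fix_data_dir.sh) that take care of deriving spk2utt and sorting for me.

```python
import os

# Sketch: writing a Kaldi-style utt2spk file.
# Speaker IDs and segment times here are made up for illustration.
segments = {
    "DCB_se1_ag1_f_01": [(4.25, 9.80), (12.10, 15.35)],
    "DCB_se1_ag1_m_02": [(0.50, 3.90)],
}

os.makedirs("data/train", exist_ok=True)
with open("data/train/utt2spk", "w") as f:
    for spk, times in sorted(segments.items()):
        for start, end in times:
            # Prefixing the utterance ID with the speaker ID means that
            # sorting by utterance ID also groups utterances by speaker,
            # which is the ordering Kaldi expects.
            utt = f"{spk}-{int(start * 1000):07d}-{int(end * 1000):07d}"
            f.write(f"{utt} {spk}\n")

# spk2utt can then be generated with the toolkit's own helper, e.g.
#   utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
# and utils/fix_data_dir.sh data/train will sort and validate the directory.
```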
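For the segmentation script, the rough idea is to read each interview WAV, look up the start and end time of each utterance, and write each slice out as its own file. The sketch below is not my actual script: it assumes the soundfile library is available, and the file path and (utterance-id, start, end) list are hypothetical stand-ins for whatever the CORAAL transcripts actually provide.

```python
import os
import soundfile as sf  # assumed third-party dependency for this sketch

interview_wav = "raw/DCB_interview_01.wav"   # hypothetical path
out_dir = "data/train/wavs"
os.makedirs(out_dir, exist_ok=True)

# (utterance_id, start_seconds, end_seconds) -- illustrative values only;
# in the real script these would come from the CORAAL transcript files.
utterances = [
    ("DCB_spk01-0004250-0009800", 4.25, 9.80),
    ("DCB_spk01-0012100-0015350", 12.10, 15.35),
]

sr = sf.info(interview_wav).samplerate
for utt_id, start, end in utterances:
    # Read only the frames for this utterance rather than the whole interview.
    audio, _ = sf.read(interview_wav,
                       start=int(start * sr),
                       stop=int(end * sr))
    sf.write(os.path.join(out_dir, f"{utt_id}.wav"), audio, sr)
```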