Week 10

For this week, I began collecting data from the HistoryMakers website. I am creating transcripts in the CORAAL format with timestamps. One thing I am wondering about is that the utterances in the HistoryMakers interviews are much longer than the CORAAL ones. Dr. Anderson and I do not think this will matter, but it would definitely be more work to parse the utterances into smaller chunks since I am creating the transcripts manually right now. The end goal is to run some of the HistoryMakers data through my scripts, so I will also need to download the audio, which my scripts already take care of.
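
Since the transcripts carry start and end times, one way to spot candidates for chunking is a quick awk pass. This is just a sketch: the tab-separated column layout (Line, Spkr, StTime, Content, EnTime), the 30-second threshold, and the file name are all assumptions.

```bash
# Flag utterances longer than 30 seconds in a CORAAL-style transcript;
# the column layout, threshold, and file name are assumptions:
awk -F'\t' 'NR > 1 && ($5 - $3) > 30 {
  printf "long utterance (%.1fs): line %s, speaker %s\n", $5 - $3, $1, $2
}' transcript.txt
```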

Week 9

This week, nothing much of note occurred. I have begun modifying my CORAAL scripts to work on data from the HistoryMakers website as a proof of concept. Currently, I am gathering transcript material from the website and preformatting it to match the style of processing in my getdata and run scripts. This differs from CORAAL, whose transcripts already clearly label who the subject is, which makes it easier to extract utterances from the CORAAL data.
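
As a rough illustration of the preformatting step, here is a sketch that turns "SPEAKER: text" lines into tab-separated speaker/content pairs; the input format and file names are assumptions about what a pasted HistoryMakers transcript looks like.

```bash
# Turn "SPEAKER: text" lines into tab-separated speaker/content pairs
# (the input format is an assumption about the pasted transcript):
awk 'NF && /: / {
  spkr = $0
  sub(/:.*/, "", spkr)        # keep everything before the first colon
  sub(/^[^:]*: */, "")        # drop the speaker label from the text
  printf "%s\t%s\n", spkr, $0
}' historymakers_raw.txt > historymakers_coraal_style.txt
```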

Week 8

This week, I solved the text file issue after experimenting with a few text substitution tools in Bash, such as awk and sed. I ended up choosing awk, which I had started with in the first place; it uses regular expressions to select the words or patterns to target for a given action. Besides that, there are no other issues to report.
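
As an illustration, a substitution of the kind described might look like the following; the assumption that descriptors appear in parentheses, and the file names, are mine.

```bash
# Strip parenthesized descriptors like "(pause)" and collapse the
# leftover whitespace (descriptor pattern and file names assumed):
awk '{ gsub(/\([^)]*\)/, ""); gsub(/  +/, " "); print }' \
    data/train/text.raw > data/train/text
```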

Week 7

For this week, the only issue I have had is that the code in my run script that creates a text file is not working as intended. When creating the text file for the training data, non-conversational descriptors such as pauses are included. On top of that, the text file only has one line from the transcript of each interview, repeated over and over again for every utterance from that interview. Besides this bug, all the code I have written is working as intended. Next, I plan on creating a complete model from the data I have right now. After that, I will use data from the HistoryMakers website to create a model as a way to test my code on real-world data.
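
For reference, each line of Kaldi's text file should pair one utterance ID with that utterance's own transcription, so a quick way to expose the repeated-line bug is to count distinct transcriptions in the generated file (the path is an assumption).

```bash
# Each line of Kaldi's text file should be "<utterance-id> <transcript>".
# Counting distinct transcriptions exposes the repeated-line bug:
cut -d' ' -f2- data/train/text | sort | uniq -c | sort -rn | head
```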

Week 6

This week, I started adding scripts from the voxforge example directory to my run.sh script in CORAAL. These scripts mainly have the job of creating the components of a model and then training it. In the process, several files are created for the model to refer to, such as files listing the phones, or sounds. There have been no major issues with the code this week.
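
For context, the stages being copied over look roughly like the following, in the style of the standard Kaldi recipes; the exact paths and options here are assumptions, not the final run.sh.

```bash
# Rough shape of the voxforge-style stages (paths/options assumed):
utils/prepare_lang.sh data/local/dict "!SIL" data/local/lang data/lang
steps/make_mfcc.sh --nj 4 data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono
```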

Week 5

This week, I managed to get my scripts for getting data and running meta procedures working to the point where Kaldi was able to extract CMVN stats and MFCC features from the data. These are descriptors that are important in preparing and training a model. With the data and metadata files formatted correctly and in the right locations, it will also be possible to extract the multiple “phone” files related to the sounds in the CORAAL audio. The biggest frustration this week was figuring out how exactly to sort my files. It was easy enough to call Bash’s sorting function, but Bash seems to have no built-in way to sort a file and write the result back over the original. It was also a challenge figuring out which script to do this in so as to be most efficient, and whether it would make sense to create temporary files to help in overwriting and copying the sorted data. I decided against temporary files.
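
One detail worth noting: sort can safely overwrite its own input with the -o flag, since it reads everything before writing, which sidesteps the temporary-file question entirely (the path below is just an example).

```bash
# sort can overwrite its own input with -o (it reads all input first):
sort -o data/train/utt2spk data/train/utt2spk

# The temporary-file alternative, for comparison:
sort data/train/utt2spk > utt2spk.tmp && mv utt2spk.tmp data/train/utt2spk
```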

Week 4

This week, Dr. Anderson and I delved deeper into the Kaldi documentation to better understand how to format the metadata files for the CORAAL audio data so as to mimic how it is done for the datasets already included in the Kaldi toolkit. We also realized that each audio file is actually a full interview, so after untarring the files I will have to segment each one into per-utterance audio files. Dr. Anderson helped me get started on a Python script to do this.

One more important point is that she also helped me understand the ordering of the utt2spk (utterance to speaker) file, which maps utterance IDs to speaker IDs. Each speaker has many utterances, but an utterance has only one speaker. The pertinent detail is that the utt2spk file must be ordered consistently with the spk2utt file, but there is a script already included in the toolkit to handle this for me, so I will not have to write my own script to sort the utt2spk file.

One frustration this week was deciding which script should create the metadata files; I am going with the getdata and run design used by the other data directories. Another was understanding how all the scripts fit together and execute one after the other. Dr. Anderson has given me metadata files from the voxforge directory as well as a log of the scripts running so I can better understand the process. In these directories, the user executes a getdata.sh script to fetch the needed data, then a run.sh script to move the data and create the metadata files. These files describe speakers, contain transcriptions, and contain features to use in training models. Currently, my getdata.sh, run.sh, and Python segmentation scripts are written and I am simply testing each. I am also continuing to look at other scripts to better develop the ones in the CORAAL directory, which is where all of my work is.
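
For reference, here is the shape of these files and the toolkit script that derives one from the other; the IDs shown are made-up examples in the CORAAL style.

```bash
# utt2spk maps each utterance ID to its speaker ID, one pair per line,
# sorted; e.g. (made-up IDs in the CORAAL style):
#   ATL_se0_ag1_f_01_utt0001 ATL_se0_ag1_f_01
#   ATL_se0_ag1_f_01_utt0002 ATL_se0_ag1_f_01
# The toolkit derives the inverse spk2utt mapping for you:
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
```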

Week 3

So far this week, I have been working on automating the creation of metadata files for the audio data we will be working with. For the CORAAL data we have used so far, creating metadata files should in theory be easier, since CORAAL has a standard and fairly straightforward file naming convention. To make the scripts more generalizable, I have decided to use the CORAAL naming convention as the standard for any files the user may choose to work with. I spoke with my graduate student supervisor, and we agree that a format where the original file name is preserved but has a “CORAAL-like” ID attached would be good. So far, I have three scripts: one to download the data, one to create the metadata files, and one to sort the metadata files. I still need to modify the first script so that it can rename the files it downloads using information the user wants reflected in the titles, such as socio-economic grouping, gender, and age group. The sorting script is currently in its second version. A broad issue has been designing the scripts to be general enough while accepting that we cannot predict every use case, so there must be some rules for standardizing the file names. Hopefully, once all three scripts have undergone the necessary modifications, they can be dropped into any new speech directory, and the user will only need to feed the download and sorting scripts a few pieces of information along the way.
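
As a sketch of what the renaming might look like, here is a hypothetical helper; the ID layout, field values, and file names are all assumptions rather than the final convention.

```bash
# Hypothetical helper: build a CORAAL-like ID from user-supplied
# fields (component, socio-economic group, age group, gender,
# speaker number) and attach it to the original file name.
make_coraal_id() {
  local component=$1 se=$2 ag=$3 gender=$4 spk=$5
  printf '%s_se%s_ag%s_%s_%s' "$component" "$se" "$ag" "$gender" "$spk"
}

orig="interview_042.wav"                 # example downloaded file
id=$(make_coraal_id VLD 1 2 f 03)        # -> VLD_se1_ag2_f_03
echo "would rename: $orig -> ${id}_${orig}"
# mv "$orig" "${id}_${orig}"
```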

Week 2

For this week, I began and finished implementing the first Bash script to automate the training and testing of the chatbot AI. The logic of the script is fairly straightforward: it randomly divides the data from a folder, in this case the yesno folder, into training and testing datasets, with 80% of the original data going to training and the other 20% to testing. The main issues I had were with the syntax of the portions of the code that copy the data and create a folder for each audio file in the test and train folders. Another issue is that my main machine does not have much disk space left, so I cannot download extremely large datasets, but the graduate student working with me has assured me that this probably will not be a problem, and there are other machines in the lab I can use. These preliminary steps have familiarized me more with the data in the Kaldi workspace as well as the other prep scripts there. I have also started on our second main task, which is similar to the first but uses the larger Valdosta (VLD) dataset. I need to automate the downloading, extraction, and partitioning of this data. Finally, I need to create three metadata files from a provided metadata text file. I may need to read through this file at some point and figure out a concise way to sift through the metadata within.
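
A minimal sketch of this kind of random split looks like the following; the folder layout and file names are assumptions, and it omits the per-file folder creation the real script does.

```bash
# Simplified random 80/20 train/test split (folder names assumed):
mkdir -p train test
ls yesno/*.wav | shuf > all_files.txt          # randomize file order
total=$(wc -l < all_files.txt)
ntrain=$(( total * 80 / 100 ))                 # 80% of files to training
head -n "$ntrain" all_files.txt | xargs -I{} cp {} train/
tail -n +"$(( ntrain + 1 ))" all_files.txt | xargs -I{} cp {} test/
```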

Week 1

For the first week, I got acquainted with the research space, the people I will be working with, and the project itself. The end goal is to create a community asset mapping system that can be accessed via an automated speech recognition (ASR) system. Such systems must be trained to recognize certain types of speech and are therefore prone to bias. My side of the project deals more explicitly with curbing this bias. Mainstream ASRs tested in Pickens County, Alabama by my mentor and her graduate student have an especially hard time recognizing the speech of African Americans living in senior care centers in that area. We are looking to design a more speech-inclusive system. The toolkit we will be using is called Kaldi. We will be using Docker to create a virtual environment for Kaldi, so I have been studying the Kaldi documentation as well as audio repositories such as CORAAL that can supply Kaldi with sound files. Learning to use Docker has been fairly stress-free. One issue that has since been resolved was a Node version issue that was keeping me from creating an image.
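
As a sketch of the setup, pulling a prebuilt Kaldi image and opening a shell in it looks like this; treat the image name as an assumption.

```bash
# Pull and enter a prebuilt Kaldi image (image name is an assumption):
docker pull kaldiasr/kaldi:latest
docker run -it kaldiasr/kaldi:latest bash
```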
