Unlike in alphabetic languages, in Chinese the gap between being able to speak and being able to read is vast. To help learners bridge this gap, I wrote Yin, an iOS app that annotates Chinese web pages with pinyin. The initial version was released in 2012 and is no longer available, since I haven't had the time to port it to newer versions of iOS. I did, however, learn a lot in building it.
I wanted the user interface to be as unobtrusive as possible and to fit naturally into the web page. Luckily, HTML has a little-known tag called ruby that does most of the heavy lifting in the UI. For example, annotating the word for tomorrow can look like this in HTML:
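A minimal example using the standard ruby, rt, and rp elements (rp supplies fallback parentheses for browsers that don't render ruby annotations):

```html
<ruby>
  明<rp>(</rp><rt>míng</rt><rp>)</rp>
  天<rp>(</rp><rt>tiān</rt><rp>)</rp>
</ruby>
```

Each rt element holds the pinyin for the character before it, and the browser draws it in small type above that character.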
If you’ve ever seen a Chinese children’s book, it renders very similarly: 明天, with the pinyin in small type above each character.
Well, the UI was easier than expected thanks to web standards, but the harder problem was picking the right pronunciations to use. Any Chinese reader can tell you that a number of characters have multiple pronunciations. Most of these, however, can be disambiguated by looking at the term the character appears in. A handful are read differently depending on the grammatical construct they serve, but I wanted to go for the biggest gains with the lowest effort first.
For a canonical list of all possible pronunciations, I found CC-CEDICT to be the most comprehensive reference. Since I only wanted a subset of the information in the dictionary, I wrote a set of scripts to serve as a data pipeline.
I identified which characters had only one pronunciation, recorded that single pronunciation, and discarded all entries that contained characters that did not need to be disambiguated.
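This first pipeline stage can be sketched roughly as follows. The CC-CEDICT line format (`TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss/`) is real; the function name and the decision to key on simplified characters are my assumptions for illustration:

```python
import re
from collections import defaultdict

# Each CC-CEDICT entry looks like: 明天 明天 [ming2 tian1] /tomorrow/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\]")

def single_readings(lines):
    """Collect readings seen for each single character, then keep only
    the characters that have exactly one reading."""
    readings = defaultdict(set)  # character -> set of pinyin readings
    for line in lines:
        if line.startswith("#"):  # skip header comments
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        _trad, simp, pinyin = m.groups()
        if len(simp) == 1:  # single-character entry
            readings[simp].add(pinyin.lower())
    return {ch: rs.pop() for ch, rs in readings.items() if len(rs) == 1}
```

Running this over the dictionary yields the unambiguous characters; everything else falls through to the disambiguation step below.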
For characters with multiple pronunciations, I saved all phrases that contained each offending character. The first version of the app looked at each character one by one; if it had multiple readings, it would look at the character before and after it (what I would later learn is called a “bigram” in NLP). If either pair matched a known term, it would take that pronunciation. I didn’t do any tie-breaking here, since Chinese terms often come in pairs of characters.
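The first version's lookup might be sketched like this. The dictionaries here are stand-ins for the real tables from the pipeline, and the function name is hypothetical:

```python
# Unambiguous characters from the dictionary pass, and two-character
# terms mapping to the reading of each character within them.
SINGLE = {"天": "tian1"}
PHRASES = {"明天": ["ming2", "tian1"], "明白": ["ming2", "bai2"]}

def annotate(sentence):
    """Pick a reading for each character, using the bigram with the
    previous character and then the one with the next character."""
    out = []
    for i, ch in enumerate(sentence):
        if ch in SINGLE:
            out.append(SINGLE[ch])
            continue
        prev_bi = sentence[i - 1 : i + 1] if i > 0 else ""
        next_bi = sentence[i : i + 2]
        if prev_bi in PHRASES:
            out.append(PHRASES[prev_bi][1])   # ch is the second character
        elif next_bi in PHRASES:
            out.append(PHRASES[next_bi][0])   # ch is the first character
        else:
            out.append(None)  # no match; leave unannotated
    return out
```

Note that whichever bigram matches first wins, which is the "no tie-breaking" behavior described above.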
A few years later, after taking an NLP class, I realized I could do better: I wanted to improve the reading selection by training my basic model on real data. While there isn’t much parallel text in the way of Chinese to pinyin, I did manage to find a decent corpus through translations of the Bible from WordProject.
I was uncomfortably familiar with cleaning data by this point, and there was a lot of data to clean. I used my dictionary pronunciations from before to validate the alignment of characters in the Chinese version of the Bible against the pinyin version. Many characters were missing, and, unsurprisingly for a corpus of this size, there were many typos. Still, I’m grateful to have the data at all.
Using this as the training set, I collected frequencies for each bigram with multiple pronunciations and updated the code to prefer readings that occurred more frequently.
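The training pass can be sketched as counting, for each bigram context, how often a character takes each reading, and keeping the winner. The corpus shape (aligned character/syllable sentence pairs) and function name are my assumptions:

```python
from collections import Counter, defaultdict

def train(corpus):
    """corpus: iterable of (characters, syllables) aligned pairs.
    Returns a map from (bigram, position-of-character-in-bigram)
    to its most frequent reading."""
    counts = defaultdict(Counter)
    for chars, syllables in corpus:
        for i in range(len(chars)):
            if i + 1 < len(chars):  # ch is first in the bigram to its right
                counts[(chars[i : i + 2], 0)][syllables[i]] += 1
            if i > 0:               # ch is second in the bigram to its left
                counts[(chars[i - 1 : i + 1], 1)][syllables[i]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
```

At annotation time, a character like 长 can then resolve to zhǎng in 长大 but cháng in 很长, because each context has its own frequency count.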
With this corpus, I could go further and let the model account for grammatical structures; given my other responsibilities, however, that has yet to happen.
The iOS app itself was very simple, as it only needed to parse sentences and look up each character in the saved database of pronunciations. It ended up being a single UIWebView with buttons to refresh and annotate.