Adventures in Vowel Harmony 2003

Research Diary, Summer 2003

Emily Thomforde '04
Swarthmore College
ethomfo1@swarthmore.edu

"'The language' is a statistical abstraction." (Steels 1999)

27 July 2003

I'm working today and taking off tomorrow. My goals are to make progress on the paper and to have something of a presentation to make to David on Tuesday. I also have to make sure that all my files are in order, that I'm confident in the VHFC results, and that I can explain why they're better than the old ones.

I made the VHFC results chart more readable. I'm now putting together a summary for David of why his prediction for final vowels didn't pan out.

Note to self: Are Hungarian_new and Hungarian_anna different corpora?

An S-curve is the product of a positive feedback loop. One exists in the harmonisation process; it is self-reinforcing because the probability of harmonising depends on lexical harmony, which in turn is increased by harmonisation. Maybe the reason I'm not getting smooth downward curves is that the process of deharmonisation is not in a positive feedback loop. It depends on the product of harmony and homophony, which does not uniformly decrease with deharmonisation. If I were to make it depend instead on the amount of lexical disharmony, that would create the self-reinforcement I'm looking for: P(deharm) = 1 - [self evalHarmonyOfLexicon].
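To check the intuition that making P(deharm) depend on lexical disharmony gives a downward S-curve, here is a minimal mean-field sketch in Python. It is not the Swarm code: the function name, the `rate` scaling factor, and the starting harmony are all assumptions made for illustration, and the real simulation is stochastic rather than deterministic.

```python
def deharmonisation_curve(h0=0.99, rate=0.2, steps=60):
    """Track the harmonic fraction h of the lexicon over time.

    Assumption: at each step every harmonic word deharmonises with
    probability rate * (1 - h), i.e. proportional to lexical disharmony.
    `rate` is a hypothetical scaling factor, not a simulation parameter.
    """
    h = h0
    curve = [h]
    for _ in range(steps):
        p_deharm = rate * (1.0 - h)  # grows as the lexicon becomes disharmonic
        h = h - p_deharm * h         # expected share of harmonic words lost
        curve.append(h)
    return curve


if __name__ == "__main__":
    for step, h in enumerate(deharmonisation_curve()):
        print(step, round(h, 3))
```

Under these assumptions the decline starts slowly, is steepest in the middle, and flattens out near zero, which is the self-reinforcing shape described above.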

Daily total hours: 5

_________________________________________________________

29 July 2003

I'm trying to figure out if changing to positive feedback made any difference in my results. It seems that I don't get the curve I want all the time. Maybe this is because I still haven't found those golden parameters, or maybe it's a feature of the system. It's all got to do with the random numbers generated by the program at runtime; maybe that's analogous to language dynamics and is evidence for why not all language families have harmony turned on.

On a slightly different note, what is the difference between language evolution and language change? I have previously remarked that evolution attempts to build language from scratch, while change starts with some initial conditions. We're doing change, but maybe we should be doing evolution. The question would then be what rules are needed for a harmonic language to develop, not what rules are needed for a language to develop harmony. This would solve the problem of Ho. Agents could build a lexicon in Steels's paradigm, but in a noisy channel. Costs associated with imitating words are lowered if the vowels have some co-occurrence restrictions. This would work the same with consonants. Homophony could incur some cost, as well. The population would have to find a middle ground.

Daily total hours: 6

______________________________________________________

30 July 2003

I spent the morning running the simulation and setting up a file system to keep track of the results. Baseline figures can be found here.

The parameters to be varied are: maximumProbabilityHarmony, thresholdToZeroHarmony, probMisspeak, and something ruling the homophony threshold. These need to be renamed. We're freezing the bugDensity at 0.5 and the world size at 10x10.

The most daunting task will be to find a representative sample of the variable permutations. I'll be setting up a structure for those after lunch.
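One way to get a representative sample would be to enumerate the full grid of parameter combinations and draw a reproducible random subset. The Python sketch below assumes this approach; the candidate values are placeholders, and `homophonyThreshold` and `worldSize` are hypothetical names (the homophony parameter has not been named yet).

```python
import itertools
import random

# Placeholder candidate values, not the values used in the actual runs.
GRID = {
    "maximumProbabilityHarmony": [0.25, 0.5, 0.75, 1.0],
    "thresholdToZeroHarmony": [0.1, 0.3, 0.5],
    "probMisspeak": [0.0, 0.01, 0.05],
    "homophonyThreshold": [0.1, 0.3, 0.5],  # hypothetical name
}

# Parameters frozen for every run, as stated above.
FIXED = {"bugDensity": 0.5, "worldSize": (10, 10)}


def sample_runs(grid, fixed, n, seed=0):
    """Return n distinct parameter settings drawn from the full grid."""
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    chosen = rng.sample(combos, min(n, len(combos)))
    return [dict(zip(names, combo), **fixed) for combo in chosen]


if __name__ == "__main__":
    for run in sample_runs(GRID, FIXED, 5):
        print(run)
```

Sampling without replacement keeps the runs distinct, and the seed makes it possible to recover which settings belong to which archived result.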

**********************

VHFC and Neutral Vowels:

The previous incarnation of the VHFC calculated the harmony threshold with the following formula:

P(harm)=P(back word)+P(front word)

This was insufficient for a system with neutral vowels because, for one, it does not include words containing only neutral vowels. So, in order to calculate the harmony threshold accurately, the formula is now as follows:

P(harm)=P(back/neutral word) + P(front/neutral word) + P(neutral word)

This is realised in the program thus:

P(harm) = P(f&1)*P((f|n)&2)^(avg-1)
        + P(b&1)*P((b|n)&2)^(avg-1)
        + P(n&1)*P((n|f)&2)^(avg-1)
        + P(n&1)*P((n|b)&2)^(avg-1)
        - P(n&1)*P(n&2)^(avg-1)

This is based on the original method for calculating domain harmony for one class:

P(x word) = P(x&1)*P(x&2)^(avg-1)

(The probability of getting an x vowel in the first syllable, multiplied by the probability of getting an x vowel in subsequent syllables raised to the power of the average number of subsequent syllables.)

The new formula is then:

The probability of getting a word with a front vowel in the first syllable and front or neutral vowels in all other syllables
PLUS
The probability of getting a word with a back vowel in the first syllable and back or neutral vowels in all other syllables
PLUS
The probability of getting a word with a neutral vowel in the first syllable and front or neutral vowels in all other syllables
PLUS
The probability of getting a word with a neutral vowel in the first syllable and back or neutral vowels in all other syllables
MINUS
The probability of getting a word with all neutral vowels (because the previous two terms each include it, so it would otherwise be counted twice).
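The five terms above can be sketched in Python as follows. This is an illustration, not the VHFC source: the function names and the dictionary layout are assumptions, and P((f|n)&2) is taken to be P(f&2) + P(n&2) on the assumption that the three vowel classes are disjoint.

```python
def p_word(p_first, p_rest, avg_syllables):
    """P(x&1) * P(x&2)^(avg-1) for one class of word."""
    return p_first * p_rest ** (avg_syllables - 1)


def expected_harmony(first, rest, avg_syllables):
    """Expected harmony with neutral vowels.

    first, rest: dicts mapping 'f', 'b', 'n' to the probability of that
    class in the first syllable and in subsequent syllables respectively.
    """
    k = avg_syllables
    front = p_word(first["f"], rest["f"] + rest["n"], k)
    back = p_word(first["b"], rest["b"] + rest["n"], k)
    neutral_front = p_word(first["n"], rest["n"] + rest["f"], k)
    neutral_back = p_word(first["n"], rest["n"] + rest["b"], k)
    all_neutral = p_word(first["n"], rest["n"], k)  # counted twice above
    return front + back + neutral_front + neutral_back - all_neutral


if __name__ == "__main__":
    uniform = {"f": 1 / 3, "b": 1 / 3, "n": 1 / 3}
    print(expected_harmony(uniform, uniform, 2))
```

As a sanity check, with equal class probabilities and two-syllable words, seven of the nine possible class sequences (ff, fn, bb, bn, nf, nb, nn) are harmonic, and the formula gives 7/9. With no neutral vowels it reduces to the earlier two-term formula.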

In addition, the algorithm for measuring actual harmony had to be adjusted. Previously, the program looked at the class of the first syllable in the word; if any subsequent syllable was in the other class, the word was disharmonic. This does not work for neutral vowels, which are either in both classes or in a class of their own. Now, whether or not the neutral vowel option is triggered, the VHFC calculates domain harmony as follows:

Go through the word until you find a non-neutral vowel.
If the word is exhausted before you find one, the word is harmonic.
Otherwise, if you find a non-neutral vowel, check its class.
Assign the opposite non-neutral class to be the "wrong class."
Go through the rest of the word until you find a "wrong class" vowel.
If the word is exhausted before you find one, the word is harmonic.
Otherwise, if you find a "wrong class" vowel, the word is disharmonic.
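The steps above can be sketched as a single pass over the word. This Python version is only an illustration of the procedure, not the VHFC code; it assumes each word has already been reduced to a sequence of vowel classes, written here as 'f' (front), 'b' (back), and 'n' (neutral).

```python
def is_harmonic(vowel_classes):
    """Return True if the word is harmonic under the neutral-vowel rule.

    vowel_classes: a sequence of 'f', 'b', or 'n', one per syllable.
    """
    opposite = {"f": "b", "b": "f"}
    wrong_class = None  # unknown until the first non-neutral vowel is seen
    for cls in vowel_classes:
        if cls == "n":
            continue  # neutral vowels never decide or violate harmony
        if wrong_class is None:
            wrong_class = opposite[cls]  # first non-neutral vowel sets the class
        elif cls == wrong_class:
            return False  # a "wrong class" vowel makes the word disharmonic
    return True  # word exhausted without a violation


if __name__ == "__main__":
    for word in ["nnn", "nfn", "bnb", "nfb", "fnb"]:
        print(word, is_harmonic(word))
```

A word made up entirely of neutral vowels falls through the loop and counts as harmonic, matching the second step of the procedure.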

**********************

Answers to Wednesday's Questions

Persians get their lexica from the data file 'lexiconPersian.dat', which is preprocessed at runtime to contain only the five Persian vowels.

There is no mishearing built into the harmonisation algorithm. All mutation is in the harmonic direction and is based on lexical harmony levels.

Vowel mergers take place only in Persian agents. Because we are working with a single language population, there is currently no vowel merging.

Daily total hours: 7.5

____________________________________________________

31 July 2003

The agenda for today includes writing up a summary of language interaction and planning out the simulation runs. The spreadsheet is currently under construction but can be found here.

I managed to put together a pretty concise summary of language interaction, called "Lifestyles of the Distributed and Autonomous," in the Swarm Primer page. Also, there are several simulation runs archived in the spreadsheet under the "Simulation" link.

Daily total hours: 6

_______________________________________________

1 August 2003

Today I'm looking into constructing a Hungarian corpus. The best site I've found so far is the MTI. I'll have a lot of formatting to do. It will take some time in the library to fix all the compatibility issues, but that probably won't be until Monday.

Daily total hours: 5.5

Weekly total hours: 30