Matlab SPRIT (a prototype)


Matlab SPRIT is a program (about 1000 lines of matlab code) that

It's a prototype (written in matlab) of a program being written in Java. It's hoped that the full version will be available (free) in summer 2007.

Features Identified

Fixed Forms currently identified are

More General Forms identified are

Statistics gathered include


Single Poems

The following sample output was produced when the program was given 14 poems:
*** Sonnet ***
It's a Shakespearian sonnet
Beat regularity=0.47857
Per-stanza endstops: 0.50 
*** Haiku ***
It's a haiku
*** Limerick ***
It's a limerick
*** Villanelle ***
It's a villanelle
Per-stanza endstops: 0.75 0.50 0.50 0.67 0.50 0.88
*** Word Square ***
It's a word square
*** Syllable Square ***
It's a syllable square
*** Dante ***
It's terza rima
Per-stanza endstops: 0.50 0.00 0.50 0.50 0.33
*** Box ***
Except for the final stanza, it's a boxed poem
*** Tyger ***
It's a regular rhyme
Per-stanza endstops: 0.38 0.75 0.88 0.75 0.75 0.38 
*** Wheelbarrow ***
It's a word-stanza poem
*** Rondeau ***
It's a rondeau

*** Hymn ***
It's a syllabic poem
Per-stanza endstops: 0.69 0.62 0.69 0.50 0.69 0.50 0.69 

*** Mona Lisa ***
No Form!
*** Sestina ***
It's a sestina
Per-stanza endstops: 0.50 0.50 0.25 0.42 0.50 0.25 0.67 

Graphical output is still under development - it's not clear how useful it is. Color-coding aims to point out the repeated blocks of lines, the rhyme pattern, and the beats. The length of the rows shows the number of syllables. The picture below illustrates "All Things Bright and Beautiful" and a sonnet.
[Figures: "All Things Bright and Beautiful"; a sonnet]

Groups of Poems

Several poems can be analysed together. Their characteristics can be averaged, or trends can be studied.
Below are graphs for groups of poems showing a) "stanza-length" "Summary" - the overall distribution of stanza-lengths, and b) "stanza-length" "Trend" - how the average stanza length changed from poem to poem
[Figures: stanza-length Summary; stanza-length Trend]
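
The two group views described above can be sketched in a few lines. This is only an illustration in Python, not the Matlab prototype's code; it assumes stanzas are separated by blank lines, and the function names (`stanza_lengths`, `summary`, `trend`) are made up for the example.

```python
from collections import Counter

def stanza_lengths(poem_text):
    """Number of non-blank lines in each stanza (assumes blank-line separators)."""
    lengths = []
    count = 0
    for line in poem_text.splitlines():
        if line.strip():
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths

def summary(poems):
    """Overall distribution: stanza length -> number of stanzas, over all poems."""
    return Counter(n for poem in poems for n in stanza_lengths(poem))

def trend(poems):
    """Average stanza length of each poem, in the order the poems were given."""
    result = []
    for poem in poems:
        lengths = stanza_lengths(poem)
        result.append(sum(lengths) / len(lengths) if lengths else 0.0)
    return result

# Placeholder "poems": only the line and stanza layout matters here.
poems = [
    "a\nb\n\nc\nd\n",                 # two 2-line stanzas
    "a\nb\nc\n\nd\ne\nf\ng\n\nh\n",   # stanzas of 3, 4 and 1 lines
]
print(summary(poems))
print(trend(poems))
```

Plotting `summary` as a bar chart and `trend` as a line would give pictures of the kind described.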


Statistical analysis of texts pre-dates computers. Initially work was mostly related to word-frequency analysis. More recently, stylistic analysis has developed, and has helped with forgery detection. The work here combines ideas from several fields


As many features as possible are codified into lists of numbers. For example, William Carlos Williams' The Red Wheelbarrow can be codified as follows etc. An advantage of working this way is that the same fuzzy matching algorithms can be used on different features - the code to see if all stanzas have the same rhyme pattern is very similar to the code that checks if all stanzas have the same syllable-per-line pattern.
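
The original listing for The Red Wheelbarrow is not reproduced here, so the following is only one possible codification, sketched in Python rather than Matlab. It counts words per line for each stanza, then applies a single comparison function to that feature and to a hand-coded rhyme pattern. The names `words_per_line` and `all_same`, and the sample rhyme numbers, are assumptions made for the illustration; the comparison is exact, not fuzzy.

```python
# Stanzas of "The Red Wheelbarrow", one inner list per stanza.
RED_WHEELBARROW = [
    ["so much depends", "upon"],
    ["a red wheel", "barrow"],
    ["glazed with rain", "water"],
    ["beside the white", "chickens"],
]

def words_per_line(stanza):
    """Codify a stanza as a list of word counts, one number per line."""
    return [len(line.split()) for line in stanza]

def all_same(patterns):
    """True if every codified stanza has the same pattern (exact match only)."""
    return all(p == patterns[0] for p in patterns[1:])

word_patterns = [words_per_line(s) for s in RED_WHEELBARROW]
print(word_patterns)            # [[3, 1], [3, 1], [3, 1], [3, 1]]
print(all_same(word_patterns))  # True

# The same check works unchanged on another feature, for example a rhyme
# pattern written as numbers (hand-coded here: 1 = first rhyme sound, etc.).
rhyme_patterns = [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 2, 1]]
print(all_same(rhyme_patterns))  # False
```

Because every feature ends up as a list of numbers, swapping the exact comparison for a tolerant one would change only `all_same`.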

The final program supports plug-ins so that extra feature-detection can be added by users.

Java Implementation


XML is a format for text files that lets structured information be stored. Routines exist in many languages to extract the information from such files. Rather than write plug-ins, users can write programs that analyse information in the XML files. This offers more flexibility than plug-ins, but these programs won't be able to access the core's utility functions.
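
As an illustration of the kind of user program meant here, the sketch below reads per-line features from an XML file using Python's standard `xml.etree.ElementTree` module. The XML layout shown (`poem`, `stanza` and `line` elements with `syllables` and `rhyme` attributes) is purely hypothetical; the source does not describe the actual file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents: the element and attribute names are assumptions.
SAMPLE = """<poem title="Example">
  <stanza>
    <line syllables="8" rhyme="a"/>
    <line syllables="6" rhyme="b"/>
  </stanza>
  <stanza>
    <line syllables="8" rhyme="a"/>
    <line syllables="6" rhyme="b"/>
  </stanza>
</poem>"""

def syllable_patterns(xml_text):
    """Return one list of syllable counts per stanza."""
    root = ET.fromstring(xml_text)
    return [
        [int(line.get("syllables")) for line in stanza.findall("line")]
        for stanza in root.findall("stanza")
    ]

def rhyme_schemes(xml_text):
    """Return one rhyme-letter string per stanza, e.g. 'ab'."""
    root = ET.fromstring(xml_text)
    return [
        "".join(line.get("rhyme") for line in stanza.findall("line"))
        for stanza in root.findall("stanza")
    ]

print(syllable_patterns(SAMPLE))  # [[8, 6], [8, 6]]
print(rhyme_schemes(SAMPLE))      # ['ab', 'ab']
```

A program like this needs nothing from the core beyond the file itself, which is the flexibility (and the limitation) described above.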

Text-to-phoneme (TTP) conversion

Some of the processing done by the core is straightforward (line-counting, etc) but it also performs text-to-phoneme conversion (phonemes are units of sound) so that syllables can be counted and rhyme analysis can be done. This introduces several difficulties

Beat Analysis

Beat analysis is even more subjective than TTP conversion.

Fuzzy Matching

This feature will be especially important when the text-to-phoneme translator is used, because it's bound to get some things 'wrong'. Besides, poets like bending the rules - repeated lines in villanelles often don't match exactly, and in Blake's "Tyger, Tyger" the first and last verses differ by only one word.

Our Fuzzy Matching routine lets us choose the tolerance. Here are examples of its output when matching the phrase "once upon a time"

Phrase                Tolerance (%)   Output
once upon a time      20              1
once upon a tim       20              0.9375
once upon a           20              0
once upon a           40              0.7500
upon a time           20              0
upon a time           40              0.6875
once upon the time    40              0.8333

An output of 1 means an exact match. An output of 0 means that there was no match within the requested tolerance. The routine works just as well when asked to compare two lists of numbers. The program uses this routine in several places to offer the user configurable tolerances when making comparisons.
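
The source does not say how the routine computes its score. The Python sketch below is one guess: it takes the edit (Levenshtein) distance between the two sequences, divides by the longer length, and returns 0 when that fraction exceeds the tolerance. It reproduces the "once upon a tim", "upon a time" and "once upon the time" rows of the table, but gives 0.6875 rather than 0.7500 for "once upon a" at 40%, so it is clearly not identical to the author's routine. The function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(target, candidate, tolerance_percent):
    """Return a score in (0, 1], or 0.0 if the mismatch exceeds the tolerance."""
    longest = max(len(target), len(candidate))
    if longest == 0:
        return 1.0  # two empty sequences match exactly
    distance = edit_distance(target, candidate)
    # Integer comparison avoids floating-point surprises at the threshold.
    if distance * 100 > tolerance_percent * longest:
        return 0.0
    return 1.0 - distance / longest

target = "once upon a time"
print(fuzzy_match(target, "once upon a time", 20))  # 1.0
print(fuzzy_match(target, "once upon a tim", 20))   # 0.9375
print(fuzzy_match(target, "upon a time", 20))       # 0.0
print(fuzzy_match(target, "upon a time", 40))       # 0.6875

# The same function accepts lists of numbers, e.g. syllables-per-line patterns.
print(fuzzy_match([3, 1, 3, 1], [3, 1, 3, 2], 40))  # 0.75
```

Because the function only relies on `len` and element equality, it can compare phoneme lists, rhyme patterns or syllable counts without modification.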

Further Work

See Also

23rd May 2007
Tim Love