Theories and Models of Human Rhythmic Perception
When listening to a piece of music, the listener builds a conceptual structure to represent the rhythmic relationships between notes. The listener must decide where to place bar lines, how to group notes together, and what durational value to assign to each note. All of this is successfully accomplished despite the fact that, in an actual performance, all of the notes are more or less out of time.1 How exactly is this accomplished? Research into the perception of musical rhythm involves the construction of cognitive theories and of computer programs that simulate and test those theories.

Much of this research is based on Lerdahl and Jackendoff's A Generative Theory of Tonal Music. The book provided a basis for research in rhythmic perception, primarily because it postulated what was perhaps the first cohesive perceptual definition of rhythm and clarified the elements of musical rhythmic structure. To start with, the authors distinguish between grouping and meter. Grouping is the manner in which music is segmented at different hierarchical levels, from small groups of notes up to the form of the entire work. Meter is the regular alternation of strong and weak elements in music. Grouping organizes periods of time; meter is concerned with specific points in time. Lerdahl and Jackendoff clearly establish grouping and meter as two distinct entities, and they state that the most stable arrangement occurs when hierarchical groups and meter are in congruence. Going further, the authors establish the Grouping Well-Formedness Rules, formal conditions for establishing hierarchical grouping structures, and define the Grouping Preference Rules, which describe the conditions that determine which of the many possible hierarchical groupings of a passage of music are actually likely to be perceived.
The Grouping Preference Rules are formalizations of basic Gestalt principles and abstract aesthetic ideals.2 The deeper implications of Lerdahl and Jackendoff's theories about rhythmic structure are quite important. They describe the metrical structure of a piece of music as being like a grid. One of the central tasks of a human listening to music, then, is to parse the perceived rhythmic patterns to retrieve their metrical structure, to, in effect, make the perceived rhythms fit neatly on the grid.3 While Lerdahl and Jackendoff provide no empirical justification for this view of metrical structure, it is supported by the psychological phenomenon known as categorical perception, in which "listeners assign the continuously variable durations of expressive performance to a relatively small number of rhythmic categories."4 Indeed, music seems to consist of two interacting time scales: the discrete time intervals of a metrical structure, and the continuous time scale of tempo changes, expressive timing, and temporal "noise" arising from limitations of the human motor system during performance.5 Categorical perception, which also shapes our perception of other aspects of music, such as pitch, appears to be automatic and obligatory.6

The process of assigning a single rhythmic value to a range of different note durations is called quantization, and it is at the heart of all attempts to model human rhythmic perception. Several different methods for accomplishing this have been developed. One could, for instance, simply round durational values to the nearest point on a fixed time grid. This approach is used by most commercial sequencers and notation software, and it often yields results that are quite unmusical. Tempo tracking improves on the fixed-grid method: an "error signal" between the onset times of the notes and the time grid is used to adjust the grid to better fit the input notes.
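As a concrete illustration, the two baseline approaches can be sketched as follows. This is a hypothetical simplification: the grid size, the gain, and the re-anchoring scheme are assumptions for illustration, not parameters of any published system.

```python
# Hypothetical sketches of two baseline quantizers: fixed-grid rounding
# (as in most commercial sequencers) and a simple tempo tracker that
# adjusts the grid period using the onset-time error signal.

def quantize_fixed_grid(onsets, grid=0.25):
    """Round each onset time (in seconds) to the nearest fixed grid point."""
    return [round(t / grid) * grid for t in onsets]

def quantize_tempo_tracking(onsets, grid=0.25, gain=0.5):
    """Adjust the grid period as notes arrive, using the error between
    each onset and its nearest grid point (an illustrative scheme)."""
    quantized, period, origin = [], grid, 0.0
    for t in onsets:
        steps = round((t - origin) / period)
        expected = origin + steps * period   # nearest grid point
        quantized.append(expected)
        if steps > 0:
            error = t - expected             # drift of the performance
            period += gain * error / steps   # stretch or shrink the grid
            origin = t                       # re-anchor at this onset
    return quantized
```

On the input `[0.0, 0.26, 0.49]`, fixed-grid rounding with a 0.25 s grid yields `[0.0, 0.25, 0.5]`; the tempo tracker instead adjusts its grid period slightly after each onset rather than forcing every note onto a rigid grid.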
Tempo-tracking systems perform better than simple rounding systems, but they still produce frequent errors. Another approach is rule-based quantization: such systems use knowledge about preferred durational ratios to optimize the quantization of the input data, and even to construct higher-order metric groupings. Perhaps the best example of these systems, the Longuet-Higgins and Lee Musical Parser, is described in detail below. A final approach is connectionist quantization, which is capable of robustness, flexibility, and context-sensitivity that far exceed those of rule-based systems.7 The rule-based nature of Lerdahl and Jackendoff's theories is, however, nearly incompatible with connectionist systems. The use of such systems represents more than a change in approach; it represents a change in the conception of the nature of rhythmic perception. "The departure from reliance on the explicit mental representation of rules is central, and thus the conception of cognition is fundamentally different."8 Desain and Honing's connectionist quantizer, arguably the best-documented system of this type, is described in detail below.

Longuet-Higgins and Lee developed a series of "rhythm parsers"9 that are closely based on Lerdahl and Jackendoff's theories about meter and rhythm: "The metre [sic.] of a melody, as indicated by its time signature, should be regarded as a generative grammar, and . . . rhythm is one of the tree structures which that grammar generates."10 Their systems are also loosely based on an examination of Simon's LISTENER rhythmic parsing system, developed in 1968. LISTENER divides music into "phrases" of equal length and combines them to produce larger metrical units. It is successful in a very limited sense, but, contrary to Lerdahl and Jackendoff's separation of grouping and meter, the program assumes that metrical divisions and other groupings will never cross or overlap.
This leaves LISTENER unable to deal with syncopations or phrases that start on an upbeat. The problem, postulate Longuet-Higgins and Lee, is that LISTENER examines a series of notes as a whole and operates under the assumption that each note in the sequence can be given equal consideration as a possible downbeat. A better system would work left to right, so that early notes (which generally conform to the meter) can establish a hypothesis of the meter; this hypothesis can then be challenged by, but not necessarily overthrown by, later evidence to the contrary, such as syncopation.11

Longuet-Higgins and Lee's model works using this "principle of consistency": after hearing two notes, the model estimates where the next notes will fall, and the estimate is revised in light of its verification or rejection. In this way, "no event can call in question the key or metre [sic.] of a melody until a sufficient framework has been established for such a challenge to be obvious to the listener."12 In Longuet-Higgins and Lee's model, therefore, meter is created from the bottom up over time, but it quickly becomes a top-down structure that guides the model's perception of new input.13 Once a meter has been established, the model attempts to move up the metrical hierarchy by combining the established metrical units into a single larger unit. It accomplishes this by using a CONFLATE operation: the length of the beat is effectively doubled (since two conforming metrical units are of the same length), and a new beat hypothesis is generated. Other Lerdahl and Jackendoff-style rules are used to modify and further expand this new beat hypothesis. For example, relatively long notes tend to be heard as downbeats because they are more salient than short notes; the UPDATE operation moves the hypothesized downbeat so that more long notes fall on the downbeat.
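A hypothetical sketch of two of these mechanisms, the consistency-driven beat tracking and the CONFLATE step, is given below. The tolerance, correction factor, conformity check, and the roughly 2.5-second unit limit are illustrative assumptions, not values from Longuet-Higgins and Lee's published parser.

```python
# Sketch 1: the "principle of consistency". A beat hypothesis formed
# from the first two notes is only gently revised by confirming onsets;
# a contradictory onset (e.g. a syncopation) cannot overthrow it.
def track_beat(onsets, tolerance=0.15, correction=0.25):
    beat = onsets[1] - onsets[0]             # initial hypothesis
    for t in onsets[2:]:
        n = round((t - onsets[0]) / beat)    # nearest beat number
        error = t - (onsets[0] + n * beat)
        if n > 0 and abs(error) <= tolerance * beat:
            beat += correction * error / n   # small confirming revision
        # otherwise the established framework persists unchanged
    return beat

# Sketch 2: CONFLATE. The metrical unit is doubled while onsets still
# conform to the larger unit; expansion stops at an upper limit
# (roughly the length of a measure, here 2.5 seconds).
UNIT_LIMIT = 2.5

def conforms(onsets, unit, tol=0.05):
    """Crude check: is there an onset near every multiple of `unit`?"""
    n_beats = int(max(onsets) / unit) + 1
    return all(any(abs(t - k * unit) <= tol for t in onsets)
               for k in range(n_beats))

def climb_hierarchy(onsets, beat):
    unit = beat
    while unit * 2 <= UNIT_LIMIT and conforms(onsets, unit * 2):
        unit *= 2   # merge two conforming units into one
    return unit
```

Given onsets every quarter second, for example, `climb_hierarchy` doubles its unit from 0.25 s up to 2.0 s and then stops at the unit limit, while a syncopated onset fed to `track_beat` simply fails the tolerance test and leaves the beat hypothesis intact.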
Longuet-Higgins and Lee postulate that a sequence of notes will be perceived with the metric scheme that produces as few syncopations as possible, especially across bar lines;14 the model therefore uses a STRETCH operation that lengthens the hypothesized metrical unit so that it can include new long notes in a single metrical unit.15 By applying these operations recursively to the proposed metrical unit, the model moves up the metrical hierarchy, producing proposed metric units of greater and greater length. According to Lerdahl and Jackendoff's theory, this process could continue until the largest hypothesized metrical unit is that of the entire piece of music. Longuet-Higgins and Lee, however, recognize that higher-order groupings do not necessarily conform to the same rules as lower-order metrical structures. For example, two sections of a piece might be of different lengths, while measures will, in the music considered by Longuet-Higgins and Lee, always be of the same length. An upper limit is therefore placed on the size of the hypothesized metrical unit, and the CONFLATE and STRETCH operations are disabled once the hypothesized unit reaches this limit (generally 2-3 seconds, about the length of an average musical measure). The hypothesis will still be verified, but it will no longer be expanded.16

The Longuet-Higgins and Lee model of rhythmic perception has been expanded and improved by other researchers in several important ways. Rosenthal, Goto, and Muraoka suggest a way of making the model more accurate when faced with music that is rhythmically ambiguous: "At any point in the rhythm-tracking process, several interpretations may appear plausible; only further on in the processing does the correct interpretation become clear.
One way of managing this situation is to maintain a number of hypotheses, which are periodically ranked and selected."17 Different hypotheses are created by multiple agents, each using a different strategy. The hypotheses are periodically ranked according to Lerdahl and Jackendoff-like rules: strong beats, for example, are more likely to fall on long notes or chords, and beat patterns should coincide with melodic patterns. Hypotheses that follow these rules are kept and ranked; poor hypotheses are discarded. In this way, only some of the hypotheses need to be correct at any given time for the model to track the metrical structure correctly.18 The Longuet-Higgins and Lee model could be made more accurate by adding the ability to manage multiple hypotheses simultaneously.

Another possible improvement to the Longuet-Higgins and Lee model is suggested by the work of Richard Parncutt. The Parncutt model also works with multiple beat hypotheses: different possible beat tempos are maintained and compared. Each hypothesis is accompanied by a probability, an estimate (based once again on Lerdahl and Jackendoff-like rules) of how likely that hypothesis is to be perceived by a listener. The novelty of Parncutt's approach is that the measurements involved, including pulse salience and the relative strength of metrical accents, are all continuous variables rather than integers. Comparing the perceptual salience of different hypotheses is a simple matter of comparing their numerical magnitudes. Combining two congruent hypotheses into a single hypothesis is accomplished by overlapping them and adding their variables together.
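A sketch of this idea follows, with invented salience numbers and a simplified congruence test; it is not Parncutt's actual formulation.

```python
# Illustrative sketch of beat hypotheses carrying continuous salience
# values: comparison is purely numeric, and overlapping congruent pulse
# trains add their salience, so beats shared by several metrical levels
# (downbeats) come out strongest. All values here are invented.

from dataclasses import dataclass

@dataclass
class PulseHypothesis:
    period: float    # seconds between hypothesized beats
    phase: float     # time of the first beat
    salience: float  # continuous estimate of perceptual salience

def best_hypothesis(hypotheses):
    """Comparing salience is simply comparing numerical magnitude."""
    return max(hypotheses, key=lambda h: h.salience)

def meter_salience(hypotheses, t, tol=1e-6):
    """Total salience at time t: pulse trains with a beat at t overlap,
    and their salience values add together."""
    total = 0.0
    for h in hypotheses:
        k = round((t - h.phase) / h.period)
        if abs(h.phase + k * h.period - t) <= tol:
            total += h.salience
    return total
```

With a half-second pulse train of salience 0.4 and a one-second train of salience 0.6, time points shared by both trains accumulate salience 1.0, while the in-between beats receive only 0.4.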
In this way, individual beat hypotheses give rise to a percept of meter, in which the strong beats and weak beats have correspondingly strong and weak variables attached to them.19 This technique could be used to help the Longuet-Higgins and Lee model manage multiple hypotheses.

Many recent models of human rhythmic perception have taken an approach entirely different from the rule-based approach represented by the Longuet-Higgins and Lee model. The reasons for this shift are numerous, but they fall more or less into two large categories. First, rule-based methods have some inherent flaws (discussed in more detail shortly), and a change in approach is required to overcome these limitations. Second, many researchers feel that the processes involved in rhythmic perception are more immediate and automatic than models such as the Longuet-Higgins and Lee model suggest. For these reasons, many of the newer models of human rhythmic perception take a new approach, one grounded in connectionist design philosophy. Connectionist systems "consist of a large number of simple elements, or cells, each of which has its own activation level. These cells are interconnected in a network, the connections serving to excite or inhibit others."20 In connectionist systems, all knowledge is represented implicitly: the behavior of the system depends not on rules, but on the manner in which the simple cells are interconnected.19

Desain and Honing have developed a model of human rhythmic perception based on a specific class of connectionist systems, a design known as an interactive activation and constraint-satisfaction network. Unlike other types of connectionist systems, the behavior of the cells in this type of network is hard-wired; no learning takes place. Rather, the cells converge toward an equilibrium.
Desain and Honing's model contains a number of basic cells, each of which holds an inter-onset interval (the time elapsed between the onset of a note and the onset of the previous note). Each pair of basic cells is connected by an interaction cell, which attempts, with each new input note, to "steer" the two basic cells closer to being integer multiples of each other. Sum cells sum the activity of the two basic cells they are connected to, producing more complex integer ratios; interaction cells steer the sum cells in the same manner in which they steer the basic cells. When the value of a sum cell changes, the basic cells to which it is attached change proportionally, bringing all the cells closer to integer ratios of one another. Equilibrium is reached when none of the cells change much from input event to input event.20

By "clamping" the network (freezing the values of all but one cell in order to view the activity of a single cell from event to event), the behavior of the system can be analyzed. When examined in this way, the system can be seen to output "energy curves" or "curves of expectancy" that closely represent the metrical structure of the piece of music being analyzed. Peaks in expectancy exist around the simple integer ratios; the peaks are higher around strong beats (a ratio of 1:4, for example, in a 4/4 meter) and lower around weak beats. The expectancy curves can be read as representing the meter and internal rhythm of the music that was input. "The resulting pattern of beats is an emergent property of these expectancies; no symbolic notion of meter or isochrony is modeled."23 An examination of the expectancy curves that the system outputs illustrates the system's power and accuracy. First, the width of each peak in the expectancy curve is the "cache range": any note durations that fall within this range are "caught" and steered toward the correct integer ratio.
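A drastically reduced sketch of the steering idea behind a single interaction cell is shown below; the steering strength and the decay applied to complex ratios are illustrative assumptions, not Desain and Honing's published parameters.

```python
# Toy version of one interaction-cell update in the spirit of Desain
# and Honing's network: two inter-onset intervals are "steered" toward
# the nearest integer ratio of each other, with complex ratios
# attracting more weakly than simple ones.

def interact(a, b, strength=0.5, decay=0.5):
    """Nudge the interval pair (a, b) toward an integer ratio."""
    lo, hi = min(a, b), max(a, b)
    n = max(1, round(hi / lo))           # nearest integer ratio hi:lo
    pull = strength * decay ** (n - 1)   # 2:1 pulls harder than 8:1
    hi += pull * (lo * n - hi)           # steer hi toward lo * n
    return (lo, hi) if a <= b else (hi, lo)
```

Repeatedly applying `interact` to the pair (1.0, 2.1) converges on the 2:1 equilibrium (1.0, 2.0); an 8:1 pair would be pulled far more weakly, reflecting the model's preference for simple ratios.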
The system therefore effectively "carves up" rhythmic space into compartments, each of which contains all the values that will be steered toward a particular integer value. The width of the peaks can be controlled using the system's "peak" parameter. Next, empirical observation (as well as the work of theorists such as Lerdahl and Jackendoff) indicates that humans are more likely to perceive small ratios (such as 2:1) than large ones (such as 8:1). The Desain and Honing model takes this phenomenon into account by decreasing the effect that high ratios have on the system as a whole; the amount by which their influence is decreased is controlled by the system's "decay" parameter.24

The Desain and Honing model is quite powerful, and it correctly handles situations that rule-based models cannot. The input of a few notes will, regardless of tempo or expressive timing deviations, create a projection of the metrical scheme that is refined with each new input event. The system is highly context-dependent, so syncopations are easily handled. A concept of meter emerges out of the system's global behavior: a symbolic representation of meter is created by a system that works on the sub-symbolic level.25

The task of modeling human rhythmic perception has been approached in a variety of ways, but modeling systems generally fall into two basic categories: rule-based ("good old-fashioned artificial intelligence") systems and connectionist (distributed or neural network) systems. These classes of modeling systems are exemplified by the rule-based Longuet-Higgins and Lee Musical Parser and the Desain and Honing connectionist quantizer. The Longuet-Higgins and Lee system can be characterized as reliant on symbolic representations; centralized; built on a generate-and-test search methodology; and knowledge-based, in that it tries to create the specific types of hierarchies that have been programmed into it.
The system is organized around a directed search for a symbolic representation of meter. The Desain and Honing system, on the other hand, can be characterized as numerical (rather than symbolic), distributed (rather than centralized), heterarchical (rather than hierarchical), and knowledge-free. It is characterized by an automatic movement toward sub-symbolic equilibrium that results in a pattern of output that can be interpreted as a symbolic representation of meter.26

Each of these systems has its strengths and weaknesses. Rule-based systems still generally have problems with irregular metrical structures such as syncopations and upbeats. Such systems can be programmed with knowledge from other domains (such as melody and harmony) to improve their performance, but this decreases their flexibility. When rule-based systems are faced with input that does not conform to their expectations, they break down rapidly; faced with rhythmic ambiguity, they will generally not produce a "best guess" answer but no answer at all. Provided that the input largely conforms to the system's expectations, however, rule-based systems still manage to describe the metrical structure of a piece of music accurately. "The rule-based models, even though they are simple and ignore effects like tempo, melody, and harmony. . . behave surprisingly well."27 Connectionist models, such as that of Desain and Honing, are by contrast very flexible, and they degrade gracefully when faced with unfamiliar input. They are, however, computationally slower than rule-based models, and they tend to have problems with diverse rhythms (such as a triplet followed by a quintuplet) and with sudden, obvious changes of meter. Despite these limitations, even rough implementations of connectionist designs produce behavior that is remarkably accurate.28

Researchers are admittedly very far from a complete understanding of how humans perceive rhythm.
Models such as those outlined above are powerful tools, allowing researchers to test their theories in quantifiable ways. Newer models seem to be getting closer to capturing the processes involved in rhythmic perception,29 and it is the hope of researchers that, through the development of new models and the refinement of old ones, we will eventually reach an understanding of the way in which we perceive rhythm, rhythmic structure, and music in general.
Notes
Bibliography