Phonetics is a branch of linguistics that studies how humans produce and perceive sounds or, in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. The field of phonetics is traditionally divided into three sub-disciplines: articulatory phonetics, acoustic phonetics, and auditory phonetics. Traditionally, the minimal linguistic unit of phonetics is the phone, a speech sound in a language, which differs from the phonological unit of the phoneme; the phoneme is an abstract categorization of phones and is also defined as the smallest unit that distinguishes meaning between sounds in any given language.
Phonetics deals with two aspects of human speech: production (the ways humans make sounds) and perception (the way speech is understood). The communicative modality of a language describes the method by which the language is produced and perceived. Languages with oral-aural modalities such as English produce speech orally and perceive speech aurally (using the ears). Sign languages, such as Australian Sign Language (Auslan) and American Sign Language (ASL), have a manual-visual modality, producing speech manually (using the hands) and perceiving speech visually. ASL and some other sign languages have in addition a manual-manual dialect for use in tactile signing by deafblind speakers, where signs are produced with the hands and perceived with the hands as well.
The Sanskrit study of phonetics is called Shiksha, which the 1st-millennium BCE Taittiriya Upanishad defines as follows:
Om! We will explain the Shiksha.
Sounds and accentuation, Quantity (of vowels) and the expression (of consonants),
Balancing (Saman) and connection (of sounds), So much about the study of Shiksha. || 1 |
Taittiriya Upanishad 1.2, Shikshavalli, translated by Paul Deussen
Before the widespread availability of audio recording equipment, phoneticians relied heavily on a tradition of practical phonetics to ensure that transcriptions and findings were consistent across phoneticians. This training involved both ear training (the recognition of speech sounds) and production training (the ability to produce sounds). Phoneticians were expected to learn to recognize by ear the various sounds of the International Phonetic Alphabet, and the IPA still tests and certifies speakers on their ability to accurately produce the phonetic patterns of English (though it has discontinued this practice for other languages). As a revision of his visible speech method, Melville Bell developed a description of vowels by height and backness resulting in 9 cardinal vowels. As part of their training in practical phonetics, phoneticians were expected to learn to produce these cardinal vowels to anchor their perception and transcription of these phones during fieldwork. This approach was critiqued by Peter Ladefoged in the 1960s on the basis of experimental evidence: he found that cardinal vowels were auditory rather than articulatory targets, challenging the claim that they represented articulatory anchors by which phoneticians could judge other articulations.
After an utterance has been planned, it then goes through phonological encoding. In this stage of language production, the mental representation of the words is assigned its phonological content as a sequence of phonemes to be produced. The phonemes are specified for articulatory features which denote particular goals such as closed lips or the tongue in a particular location. These phonemes are then coordinated into a sequence of muscle commands that can be sent to the muscles, and when these commands are executed properly the intended sounds are produced. Thus the process of production from message to sound can be summarized as the following sequence:
Sounds are partly categorized by the location of a constriction as well as the part of the body doing the constricting. For example, in English the words fought and thought are a minimal pair differing only in the organ making the constriction rather than the location of the constriction. The "f" in fought is a labiodental articulation made with the bottom lip against the teeth. The "th" in thought is a linguodental articulation made with the tongue against the teeth. Constrictions made by the lips are called labial while those made with the tongue are called lingual.
Constrictions made with the tongue can be made in several parts of the vocal tract, broadly classified into coronal, dorsal and radical places of articulation. Coronal articulations are made with the front of the tongue, dorsal articulations are made with the back of the tongue, and radical articulations are made in the pharynx. These divisions are not sufficient for distinguishing and describing all speech sounds. For example, in English the sounds [s] and [ʃ] are both coronal, but they are produced in different places of the mouth. To account for this, more detailed places of articulation are needed based upon the area of the mouth in which the constriction occurs.
Labiodental consonants are made by the lower lip rising to the upper teeth. Labiodental consonants are most often fricatives, while labiodental nasals are also typologically common. There is debate as to whether true labiodental plosives occur in any natural language, though a number of languages are reported to have labiodental plosives, including Zulu, Tonga, and Shubi.
Crosslinguistically, dental consonants and alveolar consonants are frequently contrasted, leading to a number of generalizations of crosslinguistic patterns. The different places of articulation tend to also be contrasted in the part of the tongue used to produce them: most languages with dental stops have laminal dentals, while languages with alveolar stops usually have apical stops. Languages rarely have two consonants in the same place with a contrast in laminality, though Taa (ǃXóõ) is a counterexample to this pattern. If a language has only one of a dental stop or an alveolar stop, it will usually be laminal if it is a dental stop, and the stop will usually be apical if it is an alveolar stop, though for example Temne and Bulgarian do not follow this pattern. If a language has both an apical and laminal stop, then the laminal stop is more likely to be affricated, as in Isoko, though Dahalo shows the opposite pattern with alveolar stops being more affricated.
Retroflex consonants have several different definitions depending on whether the position of the tongue or the position on the roof of the mouth is given prominence. In general, they represent a group of articulations in which the tip of the tongue is curled upwards to some degree. In this way, retroflex articulations can occur in several different locations on the roof of the mouth including alveolar, post-alveolar, and palatal regions. If the underside of the tongue tip makes contact with the roof of the mouth, it is sub-apical though apical post-alveolar sounds are also described as retroflex. Typical examples of sub-apical retroflex stops are commonly found in Dravidian languages, and in some languages indigenous to the southwest United States the contrastive difference between dental and alveolar stops is a slight retroflexion of the alveolar stop. Acoustically, retroflexion tends to affect the higher formants.
Articulations taking place just behind the alveolar ridge, known as post-alveolar consonants, have been referred to using a number of different terms. Apical post-alveolar consonants are often called retroflex, while laminal articulations are sometimes called palato-alveolar; in the Australianist literature, these laminal stops are often described as 'palatal' though they are produced further forward than the palate region typically described as palatal. Because of individual anatomical variation, the precise articulation of palato-alveolar stops (and coronals in general) can vary widely within a speech community.
Radical consonants either use the root of the tongue or the epiglottis during production and are produced very far back in the vocal tract. Pharyngeal consonants are made by retracting the root of the tongue far enough to almost touch the wall of the pharynx. Due to production difficulties, only fricatives and approximants can be produced this way. Epiglottal consonants are made with the epiglottis and the back wall of the pharynx. Epiglottal stops have been recorded in Dahalo. Voiced epiglottal consonants are not deemed possible due to the cavity between the glottis and epiglottis being too small to permit voicing.
Glottal consonants are those produced using the vocal folds in the larynx. Because the vocal folds are the source of phonation and below the oro-nasal vocal tract, a number of glottal consonants are impossible, such as a voiced glottal stop. Three glottal consonants are possible, a voiceless glottal stop and two glottal fricatives, and all are attested in natural languages. Glottal stops, produced by closing the vocal folds, are notably common in the world's languages. While many languages use them to demarcate phrase boundaries, some languages like Arabic and Huautla Mazatec have them as contrastive phonemes. Additionally, glottal stops can be realized as laryngealization of the following vowel in Huautla Mazatec. Glottal stops, especially between vowels, usually do not form a complete closure. True glottal stops normally occur only when they are geminated.
In addition to correctly positioning the vocal folds, there must also be air flowing across them or they will not vibrate. The difference in pressure across the glottis required for voicing is estimated at 1–2 cm H2O (roughly 98–196 pascals). The pressure differential can fall below levels required for phonation either because of an increase in pressure above the glottis (supraglottal pressure) or a decrease in pressure below the glottis (subglottal pressure). The subglottal pressure is maintained by the respiratory muscles. Supraglottal pressure, with no constrictions or articulations, is equal to about atmospheric pressure. However, because articulations, especially consonants, represent constrictions of the airflow, the pressure in the cavity behind those constrictions can increase, resulting in a higher supraglottal pressure.
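As a rough illustration of the arithmetic above, the following Python sketch converts pressures from cm H2O to pascals and checks whether the pressure drop across the glottis is large enough for voicing. The function names, the default threshold of 2 cm H2O (the upper end of the estimate quoted above), and the example pressures in the usage note are assumptions made for illustration, not measured values.

```python
# Conventional conversion factor: 1 cm H2O = 98.0665 Pa.
PA_PER_CM_H2O = 98.0665


def cm_h2o_to_pa(pressure_cm_h2o: float) -> float:
    """Convert a pressure from centimetres of water to pascals."""
    return pressure_cm_h2o * PA_PER_CM_H2O


def transglottal_pressure(subglottal_cm: float, supraglottal_cm: float) -> float:
    """Pressure difference across the glottis, in cm H2O.

    Both inputs are gauge pressures (relative to atmospheric pressure), so an
    unconstricted vocal tract has a supraglottal pressure of about 0.
    """
    return subglottal_cm - supraglottal_cm


def can_sustain_voicing(subglottal_cm: float,
                        supraglottal_cm: float,
                        threshold_cm: float = 2.0) -> bool:
    """True if the pressure drop meets the (assumed) voicing threshold."""
    return transglottal_pressure(subglottal_cm, supraglottal_cm) >= threshold_cm
```

With a hypothetical subglottal pressure of 8 cm H2O, `can_sustain_voicing(8.0, 0.0)` holds for an open vocal tract, while `can_sustain_voicing(8.0, 7.0)` fails once pressure has built up behind a closure, which is the mechanism the paragraph describes.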
Straight-line movements have been used to argue that articulations are planned in extrinsic rather than intrinsic space, though extrinsic coordinate systems also include acoustic coordinate spaces, not just physical coordinate spaces. Models that assume movements are planned in extrinsic space run into an inverse problem of explaining the muscle and joint locations which produce the observed path or acoustic signal. The arm, for example, has seven degrees of freedom and 22 muscles, so multiple different joint and muscle configurations can lead to the same final position. For models of planning in extrinsic acoustic space, the same one-to-many mapping problem applies as well, with no unique mapping from physical or acoustic targets to the muscle movements required to achieve them. Concerns about the inverse problem may be exaggerated, however, as speech is a highly learned skill using neurological structures which evolved for the purpose.
The equilibrium-point model proposes a resolution to the inverse problem by arguing that movement targets are represented as the position of the muscle pairs acting on a joint. Importantly, muscles are modeled as springs, and the target is the equilibrium point for the modeled spring-mass system. By using springs, the equilibrium-point model can easily account for compensation and response when movements are disrupted. It is considered a coordinate model because it assumes that these muscle positions are represented as points in space, equilibrium points, where the spring-like action of the muscles converges.
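The spring analogy can be made concrete with a small Python sketch. It is a deliberately simplified, one-dimensional toy and not a published articulatory model: two opposing muscles are treated as linear springs, and the stiffness values, rest positions, mass, and damping are all hypothetical parameters chosen for illustration.

```python
def equilibrium_point(k_agonist: float, rest_agonist: float,
                      k_antagonist: float, rest_antagonist: float) -> float:
    """Position where the two spring forces cancel.

    Solves -k1*(x - r1) - k2*(x - r2) = 0 for x.
    """
    total_stiffness = k_agonist + k_antagonist
    return (k_agonist * rest_agonist + k_antagonist * rest_antagonist) / total_stiffness


def simulate(start: float,
             k_agonist: float, rest_agonist: float,
             k_antagonist: float, rest_antagonist: float,
             mass: float = 1.0, damping: float = 2.0,
             dt: float = 0.001, steps: int = 20000) -> float:
    """Integrate a damped mass pulled by both springs; return the final position."""
    position, velocity = start, 0.0
    for _ in range(steps):
        force = (-k_agonist * (position - rest_agonist)
                 - k_antagonist * (position - rest_antagonist)
                 - damping * velocity)
        # Semi-implicit Euler step: update velocity first, then position.
        velocity += (force / mass) * dt
        position += velocity * dt
    return position
```

Because the target is defined by the springs rather than by a trajectory, the simulated articulator settles at the same equilibrium from any starting position. For example, `simulate(8.0, 3.0, 0.0, 1.0, 10.0)` and `simulate(-4.0, 3.0, 0.0, 1.0, 10.0)` both converge on `equilibrium_point(3.0, 0.0, 1.0, 10.0)`, which mirrors the compensation after disruption described above.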
Gestural approaches to speech production propose that articulations are represented as movement patterns rather than particular coordinates to hit. The minimal unit is a gesture that represents a group of "functionally equivalent articulatory movement patterns that are actively controlled with reference to a given speech-relevant goal (e.g., a bilabial closure)." These groups represent coordinative structures or "synergies" which view movements not as individual muscle movements but as task-dependent groupings of muscles which work together as a single unit. This reduces the degrees of freedom in articulation planning, a problem especially in intrinsic coordinate models, and allows for any movement that achieves the speech goal rather than encoding the particular movements in the abstract representation. Coarticulation is well described by gestural models, as the articulations at faster speech rates can be explained as composites of the independent gestures at slower speech rates.
Speech sounds are created by the modification of an airstream which results in a sound wave. The modification is done by the articulators, with different places and manners of articulation producing different acoustic results. Because the posture of the vocal tract, not just the position of the tongue, can affect the resulting sound, the manner of articulation is important for describing the speech sound. The words tack and sack both begin with alveolar sounds in English, but differ in how far the tongue is from the alveolar ridge. This difference has large effects on the air stream and thus the sound that is produced. Similarly, the direction and source of the airstream can affect the sound. The most common airstream mechanism is pulmonic (using the lungs), but the glottis and tongue can also be used to produce airstreams.
Phonation is controlled by the muscles of the larynx, and languages make use of more acoustic detail than binary voicing. During phonation, the vocal folds vibrate at a certain rate. This vibration results in a periodic acoustic waveform comprising a fundamental frequency and its harmonics. The fundamental frequency of the acoustic wave can be controlled by adjusting the muscles of the larynx, and listeners perceive this fundamental frequency as pitch. Languages use pitch manipulation to convey lexical information in tonal languages, and many languages use pitch to mark prosodic or pragmatic information.
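The relationship between a fundamental frequency and its harmonics can be sketched in a few lines of Python. The harmonics sit at integer multiples of the fundamental, so the summed waveform repeats once per fundamental period. The 1/n amplitude roll-off used here is an assumption chosen only to give a plausible-looking source signal; it is not a model of any real glottal waveform.

```python
import math


def harmonic_frequencies(f0_hz: float, count: int) -> list:
    """Frequencies of the first `count` harmonics (the fundamental is n = 1)."""
    return [f0_hz * n for n in range(1, count + 1)]


def periodic_wave(t: float, f0_hz: float, count: int = 10) -> float:
    """Sum of `count` sine harmonics at time t (seconds).

    Amplitudes fall off as 1/n, an illustrative assumption.
    """
    return sum(math.sin(2 * math.pi * n * f0_hz * t) / n
               for n in range(1, count + 1))
```

For a hypothetical fundamental of 120 Hz, `harmonic_frequencies(120.0, 3)` gives 120, 240 and 360 Hz, and `periodic_wave` takes the same value at any time `t` and at `t + 1/120`, which is the periodicity listeners perceive as pitch.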
For the vocal folds to vibrate, they must be in the proper position and there must be air flowing through the glottis. Phonation types are modeled on a continuum of glottal states from completely open (voiceless) to completely closed (glottal stop). The optimal position for vibration, and the phonation type most used in speech, modal voice, exists in the middle of these two extremes. If the glottis is slightly wider, breathy voice occurs, while bringing the vocal folds closer together results in creaky voice.
The normal phonation pattern used in typical speech is modal voice, where the vocal folds are held close together with moderate tension. The vocal folds vibrate as a single unit periodically and efficiently with a full glottal closure and no aspiration. If they are pulled farther apart, they do not vibrate and so produce voiceless phones. If they are held firmly together they produce a glottal stop.
If the vocal folds are held slightly further apart than in modal voicing, they produce phonation types like breathy voice (or murmur) and whispery voice. The tension across the vocal ligaments (vocal cords) is less than in modal voicing allowing for air to flow more freely. Both breathy voice and whispery voice exist on a continuum loosely characterized as going from the more periodic waveform of breathy voice to the more noisy waveform of whispery voice. Acoustically, both tend to dampen the first formant with whispery voice showing more extreme deviations.
Holding the vocal folds more tightly together results in a creaky voice. The tension across the vocal folds is less than in modal voice, but they are held tightly together resulting in only the ligaments of the vocal folds vibrating. The pulses are highly irregular, with low pitch and frequency amplitude.
Some languages do not maintain a voicing distinction for some consonants, but all languages use voicing to some degree. For example, no language is known to have a phonemic voicing contrast for vowels with all known vowels canonically voiced. Other positions of the glottis, such as breathy and creaky voice, are used in a number of languages, like Jalapa Mazatec, to contrast phonemes while in other languages, like English, they exist allophonically.
There are several ways to determine if a segment is voiced or not, the simplest being to feel the larynx during speech and note when vibrations are felt. More precise measurements can be obtained through acoustic analysis of a spectrogram or spectral slice. In a spectrographic analysis, voiced segments show a voicing bar, a region of high acoustic energy in the low frequencies. In examining a spectral slice (the acoustic spectrum at a given point in time), a model of the vowel pronounced is used to reverse the filtering of the mouth, producing the spectrum of the glottis. A computational model of the unfiltered glottal signal is then fitted to the inverse-filtered acoustic signal to determine the characteristics of the glottis. Visual analysis is also available using specialized medical equipment such as ultrasound and endoscopy.
Vowel height traditionally refers to the highest point of the tongue during articulation. The height parameter is divided into four primary levels: high (close), close-mid, open-mid, and low (open). Vowels whose height is in the middle are referred to as mid. Slightly opened close vowels and slightly closed open vowels are referred to as near-close and near-open respectively. The lowest vowels are articulated not just with a lowered tongue, but also by lowering the jaw.
While the IPA implies that there are seven levels of vowel height, it is unlikely that a given language can minimally contrast all seven levels. Chomsky and Halle suggest that there are only three levels, although four levels of vowel height seem to be needed to describe Danish, and it is possible that some languages might even need five.
Vowel backness is divided into three levels: front, central and back. Languages usually do not minimally contrast more than two levels of vowel backness. Some languages claimed to have a three-way backness distinction include Nimboran and Norwegian.
In most languages, the lips during vowel production can be classified as either rounded or unrounded (spread), although other types of lip positions, such as compression and protrusion, have been described. Lip position is correlated with height and backness: front and low vowels tend to be unrounded whereas back and high vowels are usually rounded. Paired vowels on the IPA chart have the spread vowel on the left and the rounded vowel on the right.
Together with the universal vowel features described above, some languages have additional features such as nasality, length and different types of phonation such as voiceless or creaky voice. Sometimes more specialized tongue gestures such as rhoticity, advanced tongue root, pharyngealization, stridency and frication are required to describe a certain vowel.
Stops (also referred to as plosives) are consonants where the airstream is completely obstructed. Pressure builds up in the mouth during the stricture, which is then released as a small burst of sound when the articulators move apart. The velum is raised so that air cannot flow through the nasal cavity. If the velum is lowered and allows for air to flow through the nose, the result is a nasal stop. However, phoneticians almost always refer to nasal stops as just "nasals". Affricates are a sequence of a stop followed by a fricative in the same place.
Fricatives are consonants where the airstream is made turbulent by partially, but not completely, obstructing part of the vocal tract. Sibilants are a special type of fricative where the turbulent airstream is directed towards the teeth, creating a high-pitched hissing sound.
Nasals (sometimes referred to as nasal stops) are consonants in which there is a closure in the oral cavity and the velum is lowered, allowing air to flow through the nose.
In an approximant, the articulators come close together, but not to such an extent that a turbulent airstream results.
Laterals are consonants in which the airstream is obstructed along the center of the vocal tract, allowing the airstream to flow freely on one or both sides. Laterals have also been defined as consonants in which the tongue is contracted in such a way that the airstream is greater around the sides than over the center of the tongue. The first definition does not allow for air to flow over the tongue.
Trills are consonants in which the tongue or lips are set in motion by the airstream. The stricture is formed in such a way that the airstream causes a repeating pattern of opening and closing of the soft articulator(s). Apical trills typically consist of two or three periods of vibration.
Taps and flaps are single, rapid, usually apical gestures where the tongue is thrown against the roof of the mouth, comparable to a very rapid stop. These terms are sometimes used interchangeably, but some phoneticians make a distinction. In a tap, the tongue contacts the roof in a single motion, whereas in a flap the tongue moves tangentially to the roof of the mouth, striking it in passing.
During a glottalic airstream mechanism, the glottis is closed, trapping a body of air. This allows for the remaining air in the vocal tract to be moved separately. An upward movement of the closed glottis will move this air out, resulting in an ejective consonant. Alternatively, the glottis can lower, sucking more air into the mouth, which results in an implosive consonant.
Clicks are stops in which tongue movement causes air to be sucked into the mouth; this is referred to as a velaric airstream. During the click, the air becomes rarefied between two articulatory closures, producing a loud 'click' sound when the anterior closure is released. The release of the anterior closure is referred to as the click influx. The release of the posterior closure, which can be velar or uvular, is the click efflux. Clicks are used in several African language families, such as the Khoisan and Bantu languages.
The lungs are used to maintain two kinds of pressure simultaneously to produce and modify phonation. To produce phonation at all, the lungs must maintain a pressure of 3–5 cm H2O higher than the pressure above the glottis. However, small and fast adjustments are made to the subglottal pressure to modify speech for suprasegmental features like stress. A number of thoracic muscles are used to make these adjustments. Because the lungs and thorax stretch during inhalation, the elastic forces of the lungs alone can produce pressure differentials sufficient for phonation at lung volumes above 50 percent of vital capacity. Above 50 percent of vital capacity, the respiratory muscles are used to "check" the elastic forces of the thorax to maintain a stable pressure differential. Below that volume, they are used to increase the subglottal pressure by actively exhaling air.
During speech, the respiratory cycle is modified to accommodate both linguistic and biological needs. Exhalation, usually about 60 percent of the respiratory cycle at rest, is increased to about 90 percent of the respiratory cycle. Because metabolic needs are relatively stable, the total volume of air moved in most cases of speech remains about the same as in quiet tidal breathing. Increases in speech intensity of 18 dB (a loud conversation) have relatively little impact on the volume of air moved. Because their respiratory systems are not as developed as those of adults, children tend to use a larger proportion of their vital capacity compared to adults, with more deep inhalations.
While listeners can use a variety of information to segment the speech signal, the relationship between acoustic signal and category perception is not a perfect mapping. Because of coarticulation, noisy environments, and individual differences, there is a high degree of acoustic variability within categories. Listeners are nevertheless able to reliably perceive categories despite this variability in acoustic instantiation, which is known as the problem of perceptual invariance. To do this, listeners rapidly accommodate to new speakers and will shift their boundaries between categories to match the acoustic distinctions their conversational partner is making.
The differential vibration of the basilar membrane causes the hair cells within the organ of Corti to move. This causes depolarization of the hair cells and ultimately a conversion of the acoustic signal into a neuronal signal. While the hair cells do not produce action potentials themselves, they release neurotransmitter at synapses with the fibers of the auditory nerve, which does produce action potentials. In this way, the patterns of oscillations on the basilar membrane are converted to spatiotemporal patterns of firings which transmit information about the sound to the brainstem.
Successor theories of speech perception place the focus on acoustic cues to sound categories and can be grouped into two broad categories: abstractionist theories and episodic theories. In abstractionist theories, speech perception involves identifying an idealized lexical object based on a signal reduced to its necessary components and normalizing the signal to counteract speaker variability. Episodic theories such as exemplar theory argue that speech perception involves accessing detailed memories (i.e., episodic memories) of previously heard tokens. The problem of perceptual invariance is explained by episodic theories as an issue of familiarity: normalization is a byproduct of exposure to more variable distributions rather than a discrete process as abstractionist theories claim.
The mismatch between acoustic analyses and what the listener hears is especially noticeable in speech sounds that have a lot of high-frequency energy, such as certain fricatives. To reconcile this mismatch, functional models of the auditory system have been developed.
Consonants are speech sounds that are articulated with a complete or partial closure of the vocal tract. They are generally produced by the modification of an airstream exhaled from the lungs. The respiratory organs used to create and modify airflow are divided into three regions: the vocal tract (supralaryngeal), the larynx, and the subglottal system. The airstream can be either egressive (out of the vocal tract) or ingressive (into the vocal tract). In pulmonic sounds, the airstream is produced by the lungs in the subglottal system and passes through the larynx and vocal tract. Glottalic sounds use an airstream created by movements of the larynx without airflow from the lungs. Click consonants are articulated through the rarefaction of air using the tongue, followed by releasing the forward closure of the tongue.
Vowels are syllabic speech sounds that are pronounced without any obstruction in the vocal tract. Unlike consonants, which usually have definite places of articulation, vowels are defined in relation to a set of reference vowels called cardinal vowels. Three properties are needed to define vowels: tongue height, tongue backness, and lip roundedness. Vowels that are articulated with a stable quality are called monophthongs; a combination of two separate vowels in the same syllable is a diphthong. In the IPA, the vowels are represented on a trapezoid shape representing the human mouth: the vertical axis represents the mouth from floor to roof and the horizontal axis represents the front-back dimension.
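The three defining properties lend themselves to a simple record type. The Python sketch below is illustrative only: the `Vowel` class, the `VOWELS` list, and `find_vowel` are hypothetical names, and the inventory is a small hand-picked sample of IPA vowels rather than a full chart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vowel:
    symbol: str
    height: str      # e.g. "close", "close-mid", "open"
    backness: str    # "front", "central" or "back"
    rounded: bool

    def describe(self) -> str:
        """Conventional three-part label: height, backness, rounding."""
        rounding = "rounded" if self.rounded else "unrounded"
        return f"{self.height} {self.backness} {rounding}"


# A small sample of IPA vowels, not a complete inventory.
VOWELS = [
    Vowel("i", "close", "front", False),
    Vowel("y", "close", "front", True),
    Vowel("u", "close", "back", True),
    Vowel("e", "close-mid", "front", False),
    Vowel("o", "close-mid", "back", True),
    Vowel("a", "open", "front", False),
    Vowel("ɑ", "open", "back", False),
]


def find_vowel(height: str, backness: str, rounded: bool) -> Optional[str]:
    """Return the symbol matching all three properties, or None if absent."""
    for vowel in VOWELS:
        if (vowel.height, vowel.backness, vowel.rounded) == (height, backness, rounded):
            return vowel.symbol
    return None
```

Here `VOWELS[0].describe()` yields the label "close front unrounded", and `find_vowel("close", "front", True)` picks out the rounded counterpart of that vowel, showing that rounding alone can separate two otherwise identical entries.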
While no sign language has a standardized writing system, linguists have developed their own notation systems that describe the handshape, location and movement. The Hamburg Notation System (HamNoSys) is similar to the IPA in that it allows for varying levels of detail. Some notation systems such as KOMVA and the Stokoe notation were designed for use in dictionaries; they also make use of alphabetic letters in the local language for handshapes whereas HamNoSys represents the handshape directly. SignWriting aims to be an easy-to-learn writing system for sign languages, although it has not been officially adopted by any deaf community yet.
Unlike spoken languages, sign languages have two identical articulators: the hands. Signers may use whichever hand they prefer with no disruption in communication. Due to universal neurological limitations, two-handed signs generally have the same kind of articulation in both hands; this is referred to as the Symmetry Condition. The second universal constraint is the Dominance Condition, which holds that when two handshapes are involved, one hand will remain stationary and have a more limited set of handshapes compared to the dominant, moving hand. Additionally, it is common for one hand in a two-handed sign to be dropped during informal conversations, a process referred to as weak drop. Just as with words in spoken languages, coarticulation may cause signs to influence each other's form. Examples include the handshapes of neighboring signs becoming more similar to each other (assimilation) or weak drop (an instance of deletion).