Wordlist Austronesian Comparative Dictionary

CLDF Metadata: cldf-metadata.json

Sources: sources.bib

property value
dc:conformsTo CLDF Wordlist
dc:license https://creativecommons.org/licenses/by/4.0/
dcat:accessURL https://github.com/lexibank/acd
prov:wasDerivedFrom
  1. lexibank/acd v1.2-33-gbc277b3
  2. Glottolog v5.1
  3. Concepticon v3.2.0
  4. CLTS v2.3.0
prov:wasGeneratedBy
  1. lingpy-rcParams: lingpy-rcParams.json
  2. python: 3.12.3
  3. python-packages: requirements.txt
rdf:ID acd
rdf:type http://www.w3.org/ns/dcat#Distribution
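
The row counts reported for each table can be cross-checked against the metadata file. The following is a minimal sketch, assuming `cldf-metadata.json` follows the usual CSVW layout with a top-level `tables` list whose entries carry `url` and `dc:extent`. The inline JSON is a hypothetical excerpt standing in for the parsed file, and `table_extents` is an illustrative helper, not part of any library.

```python
import json


def table_extents(metadata):
    """Map each table's file name to its declared row count (dc:extent)."""
    return {
        table["url"]: table["dc:extent"]
        for table in metadata.get("tables", [])
        if "dc:extent" in table
    }


# Hypothetical excerpt; in practice one would load the real file with
# json.load(open("cldf-metadata.json", encoding="utf-8")).
sample_metadata = json.loads("""
{
  "tables": [
    {"url": "forms.csv", "dc:extent": 146733},
    {"url": "languages.csv", "dc:extent": 1064}
  ]
}
""")

extents = table_extents(sample_metadata)
print(extents)
```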

Table forms.csv

property value
dc:conformsTo CLDF FormTable
dc:extent 146733

Columns

Name/Property Datatype Description
ID string Primary key
Local_ID string
Language_ID string References languages.csv::ID
Parameter_ID string References parameters.csv::ID
Value string
Form string
Segments list of string (separated by space)
Comment string
Source list of string (separated by ;) References sources.bib::BibTeX-key
Cognacy string
Loan boolean
Graphemes string
Profile string
Description string Description of the meaning of the word (possibly in language-specific terms).
Sic boolean For a form that differs from the expected reflex in some way this flag asserts that a copying mistake has not occurred.
Doubt boolean In particular reconstructions, i.e. proto-forms in etymological dictionaries, are often marked as being somewhat doubtful (typically displayed as proto-form prefixed with a '?' or similar).
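
The list-valued and boolean columns above arrive as plain strings when the file is read with a generic CSV reader. The sketch below decodes them with the standard library only. It assumes booleans are serialized as `true`/`false` (the CSVW default) and that empty cells mean "no value"; the sample row and its identifiers are hypothetical, and `parse_form` is an illustrative helper.

```python
import csv
import io


def parse_bool(value):
    """Convert a CSVW-style boolean cell to True/False, or None when empty."""
    if value == "":
        return None
    return value == "true"


def parse_form(row):
    """Return a copy of a forms.csv row with list and boolean columns decoded."""
    parsed = dict(row)
    parsed["Segments"] = row["Segments"].split()  # space-separated list
    parsed["Source"] = [key for key in row["Source"].split(";") if key]
    for column in ("Loan", "Sic", "Doubt"):
        parsed[column] = parse_bool(row[column])
    return parsed


# Hypothetical row for illustration; IDs and sources are not from the dataset.
sample = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form,Segments,Source,Loan,Sic,Doubt\n"
    "f1,lang1,param1,kumi,k u m i,src1;src2,false,,true\n"
)
forms = [parse_form(row) for row in csv.DictReader(sample)]
print(forms[0]["Segments"], forms[0]["Source"])
```

A CLDF-aware reader would apply these conversions from the metadata automatically; the manual version is shown only to make the column formats explicit.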

Table languages.csv

property value
dc:conformsTo CLDF LanguageTable
dc:extent 1064

Columns

Name/Property Datatype Description
ID string Primary key
Name string
Glottocode string
Glottolog_Name string
ISO639P3code string
Macroarea string
Latitude decimal
≥ -90
≤ 90
Longitude decimal
≥ -180
≤ 180
Family string
Abbr string Abbreviation for the (proto-)language name.
Group string
Regex: `PAN
Form.
Source list of string (separated by ;) Etymological (or comparative) dictionaries typically compare lexical data from many source dictionaries.
References sources.bib::BibTeX-key
Is_Proto boolean Specifies whether a language is a proto-language (and thus its forms reconstructed proto-forms).
Description string For proto-languages that correspond to ACD reconstruction levels, a description of their extent is provided.
Dialect_Of string References languages.csv::ID
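
The coordinate bounds and the `Is_Proto` flag lend themselves to simple filters, for example to separate mappable languages from reconstructed proto-languages. This is a minimal sketch over rows read as string dictionaries; it assumes booleans are serialized as `true`/`false`, and all sample rows are hypothetical.

```python
def has_valid_coordinates(language):
    """Check the documented bounds: latitude in [-90, 90], longitude in [-180, 180]."""
    lat, lon = language.get("Latitude"), language.get("Longitude")
    if lat in ("", None) or lon in ("", None):
        return False
    return -90 <= float(lat) <= 90 and -180 <= float(lon) <= 180


def proto_languages(languages):
    """Return the IDs of all rows flagged as proto-languages."""
    return [lang["ID"] for lang in languages if lang.get("Is_Proto") == "true"]


# Hypothetical rows for illustration only.
languages = [
    {"ID": "proto-x", "Latitude": "", "Longitude": "", "Is_Proto": "true"},
    {"ID": "lang-y", "Latitude": "-17.8", "Longitude": "178.0", "Is_Proto": "false"},
    {"ID": "lang-z", "Latitude": "95.0", "Longitude": "10.0", "Is_Proto": "false"},
]

mappable = [lang["ID"] for lang in languages if has_valid_coordinates(lang)]
protos = proto_languages(languages)
print(mappable, protos)
```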

Table parameters.csv

property value
dc:conformsTo CLDF ParameterTable
dc:extent 86502

Columns

Name/Property Datatype Description
ID string Primary key
Name string
Concepticon_ID string
Concepticon_Gloss string

Table cognates.csv

property value
dc:conformsTo CLDF CognateTable
dc:extent 121682

Columns

Name/Property Datatype Description
ID string Primary key
Form_ID string References forms.csv::ID
Form string
Cognateset_ID string References cognatesets.csv::ID
Doubt boolean
Cognate_Detection_Method string
Source list of string (separated by ;) References sources.bib::BibTeX-key
Alignment list of string (separated by space)
Alignment_Method string
Alignment_Source string
Metathesis boolean Flag indicating that a process of metathesis is assumed, explaining the apparent irregularity of a cognate.
Assimilation boolean Flag indicating that a process of assimilation is assumed, explaining the apparent irregularity of a cognate.
Doublet_Comment string A comment about the doublet status of the reconstruction.
Doublet_Set string Identifier of a set of variants that are independently supported by the comparative evidence. Doubletting that cannot be traced in any clear way to borrowing is extremely common in AN languages (Blust 2011), and an effort has been made to cross-reference doublets in the ACD wherever possible.
Disjunct_Comment string A comment about the disjunct status of the reconstruction.
Disjunct_Set string Identifier of a set of variants that are supported only by allowing the overlap of cognate sets; i.e. only one reconstruction in a set of disjuncts can be consistent with the evidence, but it is unclear which one. A distinction is drawn between doublets (variants that are independently supported by the comparative evidence) and “disjuncts” (variants that are supported only by allowing the overlap of cognate sets). To illustrate, both Tagalog gumí ‘beard’ and Malay kumis ‘moustache’ show regular correspondences with Fijian kumi ‘the chin or beard’, but they do not correspond regularly with one another. Based on this evidence, it is impossible to posit doublets, since unambiguous support for both variants is lacking. However, since the Tagalog and Malay forms can each be compared with Fijian kumi, two comparisons can be proposed that overlap by including the Fijian form in both (like all Oceanic languages, Fijian has merged PMP *k and *g; in addition, it has lost final consonants). The result is a pair of PMP disjuncts *gumi (based on Tagalog and Fijian) and *kumis (based on Malay and Fijian), either or both of which could be used to justify an independent doublet if additional comparative support is found.
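
The disjunct illustration can be expressed in terms of the cognate table: two cognate sets overlap because the same Fijian form is a member of both. The sketch below groups cognate judgements by `Cognateset_ID` and reports forms that occur in more than one set. The forms are the ones cited in the description above, but all identifiers are hypothetical and the helper names are illustrative.

```python
from collections import Counter, defaultdict


def reflexes_by_cognateset(cognates, forms):
    """Group (language, form) pairs by the cognate set they are assigned to."""
    forms_by_id = {form["ID"]: form for form in forms}
    groups = defaultdict(list)
    for cognate in cognates:
        form = forms_by_id.get(cognate["Form_ID"])
        if form is not None:
            groups[cognate["Cognateset_ID"]].append(
                (form["Language_ID"], form["Form"])
            )
    return dict(groups)


def forms_in_multiple_sets(cognates):
    """Return the IDs of forms that are members of more than one cognate set."""
    counts = Counter(cognate["Form_ID"] for cognate in cognates)
    return {form_id for form_id, n in counts.items() if n > 1}


# Hypothetical identifiers; forms taken from the disjunct example.
forms = [
    {"ID": "tgl-1", "Language_ID": "tagalog", "Form": "gumí"},
    {"ID": "msa-1", "Language_ID": "malay", "Form": "kumis"},
    {"ID": "fij-1", "Language_ID": "fijian", "Form": "kumi"},
]
cognates = [
    {"ID": "c1", "Form_ID": "tgl-1", "Cognateset_ID": "set-gumi"},
    {"ID": "c2", "Form_ID": "fij-1", "Cognateset_ID": "set-gumi"},
    {"ID": "c3", "Form_ID": "msa-1", "Cognateset_ID": "set-kumis"},
    {"ID": "c4", "Form_ID": "fij-1", "Cognateset_ID": "set-kumis"},
]

groups = reflexes_by_cognateset(cognates, forms)
shared = forms_in_multiple_sets(cognates)
print(groups, shared)
```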

Table cognatesets.csv

Cognate sets are comparisons with regular sound correspondences and close semantics. If there are additional forms that are strikingly similar but irregular, or that show strong semantic divergence, these are added in a note. Every attempt is made to keep the comparison proper free from problems.

Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are listed as cognate sets, too. They are marked with a true value for the 'Is_Root' property of the linked, reconstructed form.

The roots listed here thus amount to a continuation of the data set presented in Blust 1988.

property value
dc:conformsTo CLDF CognatesetTable
dc:extent 10857

Columns

Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Description string
Source list of string (separated by ;) References sources.bib::BibTeX-key
Name string A recognizable label for the cognateset, typically the reconstructed proto-form and the reconstructed meaning.
Form_ID string Links to the reconstructed proto-form in FormTable.
References forms.csv::ID
Comment string
Doubt boolean Flag indicating (un)certainty of the reconstruction.
Etymon_ID string References etyma.csv::ID
Is_Main_Entry boolean
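
Two of the constraints above can be checked mechanically: the `ID` pattern, and the link from a cognate set to its reconstructed proto-form via `Form_ID`. A minimal sketch follows, using only the regular expression given in the column description. The sample rows are hypothetical, and whether `Form` values carry a leading asterisk is not assumed here.

```python
import re

# Pattern copied from the ID column description.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_\-]+")


def is_valid_id(identifier):
    """True if the whole identifier matches the documented pattern."""
    return ID_PATTERN.fullmatch(identifier) is not None


def proto_form_for(cognateset, forms_by_id):
    """Follow Form_ID to forms.csv and return the reconstructed form, if any."""
    form = forms_by_id.get(cognateset.get("Form_ID"))
    return form["Form"] if form else None


# Hypothetical rows for illustration only.
forms_by_id = {"pmp-gumi": {"ID": "pmp-gumi", "Form": "gumi"}}
linked = {"ID": "set-gumi", "Form_ID": "pmp-gumi"}
unlinked = {"ID": "set-other", "Form_ID": ""}

print(is_valid_id(linked["ID"]), proto_form_for(linked, forms_by_id))
```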

Table cf.csv

The ACD includes five additional categories of groups of forms, called 'near cognates', 'noise', 'roots', 'loans' and 'also'. These are marked with respective values in the 'Category' column.

'Near cognates' are forms that are strikingly similar but irregular, and which cannot be included in a note to an established reconstruction. Stated differently, these are forms that appear to be historically related, but do not yet permit a reconstruction.

The 'noise' (in the information-theoretic sense of meaningless data that can be confused with a true signal) category lists chance resemblances. Given the number of languages being compared and the number of forms in many of the sources, forms that resemble one another in shape and meaning by chance will not be uncommon, and the decision as to whether a comparison that appears good is a product of chance must be based on criteria such as

  • how general the semantic category of the form is (e.g. phonologically corresponding forms meaning ‘cut’ are less diagnostic of relationship than phonologically corresponding forms for particular types of cutting),
  • how richly attested the form is (if it is found in just two witnesses the likelihood that it is a product of chance is greatly increased),
  • whether there is already a well-established reconstruction for the same meaning.

Thus, the search process that results in valid cognate sets inevitably turns up other material that is superficially appealing, but is questionable for various reasons. To simply dispose of this ‘information refuse’ would be unwise for two reasons. First, further searching might show that some of these questionable comparisons are more strongly supported than it initially appeared. Second, even if the material is not upgraded through further comparative work it is always possible that some future researcher with different standards of evaluation will stumble upon some of these comparisons and claim that they are valid, but were overlooked in the ACD. By including a module on ‘Noise’ we can show that we have considered and rejected various possibilities that might be entertained by others.

Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are included in the 'roots' category. The roots listed here thus amount to a continuation of the data set presented in Blust 1988.

Roots are not listed as regular cognate sets, because the reconstructions are not explicitly assigned to a proto-language.

Loanwords are a perennial problem in historical linguistics. When they involve morphemes that are borrowed between related languages they can provoke questions about the regularity of sound correspondences. When they involve morphemes that are borrowed between unrelated languages they can give rise to invalid reconstructions. Dempwolff (1934-38) included a number of known loanwords among his 2,216 ‘Proto-Austronesian’ reconstructions in order to show that sound correspondences are often regular even with loanwords that are borrowed relatively early, but he marked these with an ‘x’, as with *xbazu ‘shirt’, which he knew to be a Persian loanword in many of the languages of western Indonesia, and (via Malay) in some of the languages of the Philippines. However, he overlooked a number of cases, such as *nanas ‘pineapple’ (an Amazonian cultigen that was introduced to insular Southeast Asia by the Portuguese). Since widely distributed loanwords can easily be confused with native forms it is useful to include them in the dictionary.

A fairly careful (but inevitably imperfect) attempt has been made to identify and document loanwords with a distribution sufficient to justify a reconstruction on one of the nine levels of the ACD, if treated erroneously as native. While this has been done wherever the possibility of confusion with native forms seemed real, there is no reason to include obvious loans that would never be mistaken for native forms.

This issue is especially evident in the Philippines, where hundreds of Spanish loanwords from the colonial period, which began late in the 16th century, are scattered from at least Ilokano in northern Luzon to the Bisayan languages of the central Philippines and some of the languages of Mindanao (such as Subanon). Comparisons like Ilokano kamarón ‘prawn’, Cebuano kamarún ‘dish of shrimps, split and dipped in eggs, optionally mixed with ground meat’ < Spanish camarón ‘shrimp’, or Ilokano kalábus ‘jail, prison’, Cebuano kalabús, kalabúsu ‘jail; to land in prison, in jail’ < Spanish calabozo ‘dungeon’ seem inappropriate for inclusion in LOANS, but introduced plants have generally been admitted. Some of these, such as ‘tomato’, may be widely known as New World plants that were introduced to the Philippines by the Spanish, but others, such as ‘chayote’, may be less familiar. As already noted, Dempwolff (1938) posited ‘Uraustronesisch’ *nanas and *kenas as doublets for ‘pineapple’, completely overlooking the fact that this is an Amazonian plant that could hardly have been present in the Austronesian world before the advent of the colonial period. This example shows that errors in the semantic domain of plant names can sometimes escape detection by scholars who are otherwise known for their careful, meticulous work, and for this reason all borrowed cognate sets involving plant names are documented as loanwords to avoid any possible misinterpretation.

The last category, 'also', groups forms related to a particular cognate set. These forms typically show some kind of irregularity with respect to the proposed reconstruction, but provide context to evaluate the validity of the cognate set.

property value
dc:extent 2364

Columns

Name/Property Datatype Description
ID string Primary key
Name string The title of a table of related forms; typically hints at the type of relation between the forms or between the group of forms and an etymon.
Description string
Category string An optional category for groups of forms such as "loans".
Comment string
Cognateset_ID string Links to an etymon, if the group of lexemes is related to one.
References cognatesets.csv::ID
Dempwolff_Etymology string A corresponding (unsupported) reconstruction posited in Dempwolff 1938.

Membership of forms in a "cf" group is mediated through this association table unless more meaningful alternatives are available, like BorrowingTable for loans.

property value
dc:extent 7344

Columns

Name/Property Datatype Description
ID string Primary key
Cfset_ID string References cf.csv::ID
Form_ID string References forms.csv::ID
Comment string
Source list of string (separated by ;) References sources.bib::BibTeX-key
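
Resolving the members of a "cf" group therefore takes two lookups: from the association table to cf.csv for the category, and to forms.csv for the form itself. The sketch below shows that join over in-memory rows. The `Category` strings used here ("noise", "loans") are assumptions about how the values are spelled in the data, and all rows and identifiers are hypothetical.

```python
from collections import defaultdict


def forms_by_category(cf_sets, cf_items, forms):
    """Collect the forms of every cf group, keyed by the group's Category."""
    category_of = {cf_set["ID"]: cf_set["Category"] for cf_set in cf_sets}
    form_of = {form["ID"]: form["Form"] for form in forms}
    result = defaultdict(list)
    for item in cf_items:
        category = category_of.get(item["Cfset_ID"])
        form = form_of.get(item["Form_ID"])
        if category is not None and form is not None:
            result[category].append(form)
    return dict(result)


# Hypothetical rows for illustration only.
cf_sets = [
    {"ID": "cf1", "Name": "group one", "Category": "noise"},
    {"ID": "cf2", "Name": "group two", "Category": "loans"},
]
cf_items = [
    {"ID": "i1", "Cfset_ID": "cf1", "Form_ID": "f1"},
    {"ID": "i2", "Cfset_ID": "cf1", "Form_ID": "f2"},
    {"ID": "i3", "Cfset_ID": "cf2", "Form_ID": "f3"},
]
forms = [
    {"ID": "f1", "Form": "form-a"},
    {"ID": "f2", "Form": "form-b"},
    {"ID": "f3", "Form": "form-c"},
]

members = forms_by_category(cf_sets, cf_items, forms)
print(members)
```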

Table borrowings.csv

property value
dc:conformsTo CLDF BorrowingTable
dc:extent 7348

Columns

Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Target_Form_ID string References the loanword, i.e. the form as borrowed into the target language
References forms.csv::ID
Source_Form_ID string References the source word of a borrowing
References forms.csv::ID
Comment string
Source list of string (separated by ;) References sources.bib::BibTeX-key
Cfset_ID string Link to a set description.
References cf.csv::ID
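
Each borrowing row links a loanword to its source word through two references into forms.csv. A minimal sketch of resolving both sides follows; it assumes `Source_Form_ID` may be empty when the donor form is not recorded, and all rows and identifiers are hypothetical.

```python
def borrowing_pairs(borrowings, forms):
    """Return (source form, target form) tuples; the source may be None."""
    form_of = {form["ID"]: form["Form"] for form in forms}
    pairs = []
    for borrowing in borrowings:
        target = form_of.get(borrowing["Target_Form_ID"])
        if target is None:
            continue  # skip rows whose loanword cannot be resolved
        source = form_of.get(borrowing.get("Source_Form_ID") or "")
        pairs.append((source, target))
    return pairs


# Hypothetical rows for illustration only.
forms = [
    {"ID": "f-donor", "Form": "donor-word"},
    {"ID": "f-loan1", "Form": "loan-one"},
    {"ID": "f-loan2", "Form": "loan-two"},
]
borrowings = [
    {"ID": "b1", "Target_Form_ID": "f-loan1", "Source_Form_ID": "f-donor"},
    {"ID": "b2", "Target_Form_ID": "f-loan2", "Source_Form_ID": ""},
]

pairs = borrowing_pairs(borrowings, forms)
print(pairs)
```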

Table etyma.csv

property value
dc:extent 8161

Columns

Name/Property Datatype Description
ID string A numeric identifier for the etymon. For etyma present in the legacy online version of ACD this number will match the cognate set number assigned then.
Primary key
Name string The core reconstruction uniting the cognate sets of the etymon.
Initial string
Valid choices:
a b c C d e g h i j k l m n N ñ ŋ o p q r R s S t u w y z
Description string The reconstructed meaning of the etymon.
Comment string Some notes are several lines, while others are a page or more. Notes are used for a variety of purposes. Among the most common are to report other forms that show a likely historical connection with those cited in the main comparison, but which exhibit irregularities other than the usual sporadic assimilation or metathesis, and so raise more serious questions about comparability, as in entry (2) above; to discuss details of the reconstructed gloss; and to note the occurrence of monosyllabic “roots” or submorphemic sound-meaning correlations in reconstructed morphemes.
Source list of string (separated by ;) Sources mentioned in the comment describing the etymon.
References sources.bib::BibTeX-key
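
The `Initial` column is restricted to the case-sensitive set of symbols listed above, which makes it usable as an index for browsing etyma. The sketch below validates the value and groups etyma by it. The sample etyma and their numeric IDs are hypothetical.

```python
from collections import defaultdict

# Valid choices copied from the Initial column description (case-sensitive).
VALID_INITIALS = "a b c C d e g h i j k l m n N ñ ŋ o p q r R s S t u w y z".split()


def group_by_initial(etyma):
    """Group etymon names by Initial, rejecting values outside the valid set."""
    groups = defaultdict(list)
    for etymon in etyma:
        initial = etymon["Initial"]
        if initial not in VALID_INITIALS:
            raise ValueError(f"unexpected Initial {initial!r} for etymon {etymon['ID']}")
        groups[initial].append(etymon["Name"])
    return dict(groups)


# Hypothetical rows for illustration only.
etyma = [
    {"ID": "1", "Name": "*gumi", "Initial": "g"},
    {"ID": "2", "Name": "*kumis", "Initial": "k"},
    {"ID": "3", "Name": "*kumi", "Initial": "k"},
]

index = group_by_initial(etyma)
print(index)
```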

Table trees.csv

property value
dc:conformsTo CLDF TreeTable
dc:extent 1

Columns

Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Name string Name of the tree as used in the tree file, i.e. the tree label in a Nexus file or the 1-based index of the tree in a Newick file
Description string Describe the method that was used to create the tree, etc.
Tree_Is_Rooted boolean
Valid choices:
Yes No
Whether the tree is rooted (Yes) or unrooted (No) (or no info is available (null))
Tree_Type string
Valid choices:
summary sample
Whether the tree is a summary (or consensus) tree, i.e. can be analysed in isolation, or whether it is a sample, resulting from a method that creates multiple trees
Tree_Branch_Length_Unit string
Valid choices:
change substitutions years centuries millennia
The unit used to measure evolutionary time in phylogenetic trees.
Media_ID string References a file containing a Newick representation of the tree, labeled with identifiers as described in the LanguageTable (the Media_Type column of this table should provide enough information to choose the appropriate tool to read the Newick)
References media.csv::ID
Source list of string (separated by ;) References sources.bib::BibTeX-key
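
Unlike the other boolean columns, `Tree_Is_Rooted` is written as `Yes`/`No`, and the enumerated columns accept only the listed choices. A minimal sketch of decoding and checking a tree row follows, assuming empty cells stand for null; the sample row is hypothetical.

```python
TREE_TYPES = {"summary", "sample"}
BRANCH_LENGTH_UNITS = {"change", "substitutions", "years", "centuries", "millennia"}


def parse_rooted(value):
    """Map 'Yes'/'No' to True/False and an empty cell to None."""
    if value in ("", None):
        return None
    if value not in ("Yes", "No"):
        raise ValueError(f"unexpected Tree_Is_Rooted value: {value!r}")
    return value == "Yes"


def check_tree(row):
    """Return a list of problems found in the enumerated columns of a tree row."""
    problems = []
    if row.get("Tree_Type") and row["Tree_Type"] not in TREE_TYPES:
        problems.append("Tree_Type")
    unit = row.get("Tree_Branch_Length_Unit")
    if unit and unit not in BRANCH_LENGTH_UNITS:
        problems.append("Tree_Branch_Length_Unit")
    return problems


# Hypothetical row for illustration only.
tree = {
    "ID": "tree1",
    "Tree_Is_Rooted": "Yes",
    "Tree_Type": "summary",
    "Tree_Branch_Length_Unit": "",
}

rooted = parse_rooted(tree["Tree_Is_Rooted"])
problems = check_tree(tree)
print(rooted, problems)
```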

Table media.csv

property value
dc:conformsTo CLDF MediaTable
dc:extent 1

Columns

Name/Property Datatype Description
ID string
Regex: [a-zA-Z0-9_\-]+
Primary key
Name string
Description string
Media_Type string
Regex: [^/]+/.+
Download_URL anyURI
Path_In_Zip string
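
The `Media_Type` pattern only requires a `type/subtype` shape, so it can be checked with the regular expression given above. The sketch below also picks a location for a media item, preferring `Download_URL` over `Path_In_Zip`; that preference order is an assumption made for illustration, and the sample row and URL are hypothetical.

```python
import re

# Pattern copied from the Media_Type column description.
MEDIA_TYPE_PATTERN = re.compile(r"[^/]+/.+")


def is_valid_media_type(value):
    """True if the whole value has the documented type/subtype shape."""
    return MEDIA_TYPE_PATTERN.fullmatch(value) is not None


def media_location(row):
    """Return the first non-empty location of a media row, or None."""
    return row.get("Download_URL") or row.get("Path_In_Zip") or None


# Hypothetical row for illustration only.
media = {
    "ID": "m1",
    "Media_Type": "text/plain",
    "Download_URL": "",
    "Path_In_Zip": "tree.nwk",
}

print(is_valid_media_type(media["Media_Type"]), media_location(media))
```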