CLDF Metadata: cldf-metadata.json
Sources: sources.bib
property | value |
---|---|
dc:conformsTo | CLDF Wordlist |
dc:license | https://creativecommons.org/licenses/by/4.0/ |
dcat:accessURL | https://github.com/lexibank/acd |
prov:wasDerivedFrom | |
prov:wasGeneratedBy | |
rdf:ID | acd |
rdf:type | http://www.w3.org/ns/dcat#Distribution |
Table forms.csv
property | value |
---|---|
dc:conformsTo | CLDF FormTable |
dc:extent | 146733 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Local_ID | string | |
Language_ID | string | References languages.csv::ID |
Parameter_ID | string | References parameters.csv::ID |
Value | string | |
Form | string | |
Segments | list of string (separated by a space) | |
Comment | string | |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
Cognacy | string | |
Loan | boolean | |
Graphemes | string | |
Profile | string | |
Description | string | Description of the meaning of the word (possibly in language-specific terms). |
Sic | boolean | For a form that differs from the expected reflex in some way this flag asserts that a copying mistake has not occurred. |
Doubt | boolean | In particular reconstructions, i.e. proto-forms in etymological dictionaries, are often marked as being somewhat doubtful (typically displayed as proto-form prefixed with a '?' or similar). |
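The list-valued and boolean columns of forms.csv need a little post-processing after a plain CSV read. The following is a minimal sketch using only the Python standard library. The sample rows are hypothetical, and the `true`/`false` cell literals are an assumption about how booleans are serialized in this dataset.

```python
import csv
import io

# Hypothetical sample rows; the real data lives in forms.csv.
SAMPLE_FORMS = """\
ID,Language_ID,Parameter_ID,Value,Form,Segments,Source,Loan,Doubt
f1,lang1,p1,kumi,kumi,k u m i,src1;src2,false,false
f2,lang2,p2,?gumi,gumi,g u m i,,false,true
"""


def parse_bool(cell):
    """Map a CSV cell to True/False, or None if empty (assumes 'true'/'false')."""
    return {"true": True, "false": False}.get(cell.strip().lower())


def split_list(cell, sep):
    """Split a list-valued cell on its separator, dropping empty items."""
    return [item.strip() for item in cell.split(sep) if item.strip()]


def read_forms(handle):
    forms = []
    for row in csv.DictReader(handle):
        row["Segments"] = split_list(row["Segments"], " ")  # space-separated
        row["Source"] = split_list(row["Source"], ";")      # semicolon-separated
        row["Loan"] = parse_bool(row["Loan"])
        row["Doubt"] = parse_bool(row["Doubt"])
        forms.append(row)
    return forms


forms = read_forms(io.StringIO(SAMPLE_FORMS))
print(forms[0]["Segments"], forms[0]["Source"], forms[1]["Doubt"])
```

To read the actual file, pass `open("forms.csv", encoding="utf-8", newline="")` to `read_forms` instead of the in-memory sample.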
Table languages.csv
property | value |
---|---|
dc:conformsTo | CLDF LanguageTable |
dc:extent | 1064 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Name | string | |
Glottocode | string | |
Glottolog_Name | string | |
ISO639P3code | string | |
Macroarea | string | |
Latitude | decimal ≥ -90 ≤ 90 | |
Longitude | decimal ≥ -180 ≤ 180 | |
Family | string | |
Abbr | string | Abbreviation for the (proto-)language name. |
Group |
string Regex: `PAN |
Form. |
Source | list of string (separated by ; ) | Etymological (or comparative) dictionaries typically compare lexical data from many source dictionaries. References sources.bib::BibTeX-key |
Is_Proto | boolean | Specifies whether a language is a proto-language (and thus its forms reconstructed proto-forms). |
Description | string | For proto-languages that correspond to ACD reconstruction levels, a description of their extent is provided. |
Dialect_Of | string | References languages.csv::ID |
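The constraints in the table above (coordinate ranges, the self-reference in Dialect_Of) can be checked with a few lines of Python. This is a sketch under stated assumptions: the rows are hypothetical, empty cells are treated as "no value", and booleans are assumed to be stored as `true`/`false`.

```python
from decimal import Decimal

# Hypothetical rows; only the column names and ranges come from the table above.
languages = [
    {"ID": "l1", "Latitude": "", "Longitude": "", "Is_Proto": "true", "Dialect_Of": ""},
    {"ID": "l2", "Latitude": "-8.5", "Longitude": "120.25", "Is_Proto": "false", "Dialect_Of": ""},
    {"ID": "l3", "Latitude": "-8.6", "Longitude": "181", "Is_Proto": "false", "Dialect_Of": "l2"},
    {"ID": "l4", "Latitude": "1.0", "Longitude": "100.0", "Is_Proto": "false", "Dialect_Of": "missing"},
]


def in_range(cell, low, high):
    """Empty cells pass; otherwise the decimal value must lie in [low, high]."""
    return cell == "" or low <= Decimal(cell) <= high


def check_languages(rows):
    """Return a dict mapping language ID to a list of constraint violations."""
    known_ids = {row["ID"] for row in rows}
    problems = {}
    for row in rows:
        found = []
        if not in_range(row["Latitude"], -90, 90):
            found.append("Latitude out of range")
        if not in_range(row["Longitude"], -180, 180):
            found.append("Longitude out of range")
        if row["Dialect_Of"] and row["Dialect_Of"] not in known_ids:
            found.append("Dialect_Of has no matching ID")
        if found:
            problems[row["ID"]] = found
    return problems


problems = check_languages(languages)
proto_ids = [row["ID"] for row in languages if row["Is_Proto"] == "true"]
print(problems, proto_ids)
```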
Table parameters.csv
property | value |
---|---|
dc:conformsTo | CLDF ParameterTable |
dc:extent | 86502 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Name | string | |
Concepticon_ID | string | |
Concepticon_Gloss | string | |
Table cognates.csv
property | value |
---|---|
dc:conformsTo | CLDF CognateTable |
dc:extent | 121682 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Form_ID | string | References forms.csv::ID |
Form | string | |
Cognateset_ID | string | References cognatesets.csv::ID |
Doubt | boolean | |
Cognate_Detection_Method | string | |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
Alignment | list of string (separated by a space) | |
Alignment_Method | string | |
Alignment_Source | string | |
Metathesis | boolean | Flag indicating that a process of metathesis is assumed, explaining the apparent irregularity of a cognate. |
Assimilation | boolean | Flag indicating that a process of assimilation is assumed, explaining the apparent irregularity of a cognate. |
Doublet_Comment | string | A comment about the doublet status of the reconstruction. |
Doublet_Set | string | Identifier of a set of variants that are independently supported by the comparative evidence. Doubletting that cannot be traced in any clear way to borrowing is extremely common in AN languages (Blust 2011), and an effort has been made to cross-reference doublets in the ACD wherever possible. |
Disjunct_Comment | string | A comment about the disjunct status of the reconstruction. |
Disjunct_Set | string | Identifier of a set of variants that are supported only by allowing the overlap of cognate sets; i.e. only one reconstruction in a set of disjuncts can be consistent with the evidence, but it is unclear which one. A distinction is drawn between doublets (variants that are independently supported by the comparative evidence) and “disjuncts” (variants that are supported only by allowing the overlap of cognate sets). To illustrate, both Tagalog gumí ‘beard’ and Malay kumis ‘moustache’ show regular correspondences with Fijian kumi ‘the chin or beard’, but they do not correspond regularly with one another. Based on this evidence, it is impossible to posit doublets, since unambiguous support for both variants is lacking. However, since the Tagalog and Malay forms can each be compared with Fijian kumi, two comparisons can be proposed that overlap by including the Fijian form in both (like all Oceanic languages, Fijian has merged PMP *k and *g; in addition, it has lost final consonants). The result is a pair of PMP disjuncts *gumi (based on Tagalog and Fijian) and *kumis (based on Malay and Fijian), either or both of which could be used to justify an independent doublet if additional comparative support is found. |
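The overlap that defines disjuncts can be made concrete by looking for forms that belong to more than one cognate set. The sketch below uses the Tagalog, Malay and Fijian words cited above; all IDs are hypothetical, and the approach (grouping cognates.csv rows by Form_ID) is one possible way to find overlaps, not necessarily how the ACD itself derives them.

```python
from collections import defaultdict

# Hypothetical IDs; each row mimics the Form_ID and Cognateset_ID columns.
cognates = [
    {"ID": "c1", "Form_ID": "tagalog-gumi", "Cognateset_ID": "set-gumi"},
    {"ID": "c2", "Form_ID": "fijian-kumi", "Cognateset_ID": "set-gumi"},
    {"ID": "c3", "Form_ID": "malay-kumis", "Cognateset_ID": "set-kumis"},
    {"ID": "c4", "Form_ID": "fijian-kumi", "Cognateset_ID": "set-kumis"},
]


def sets_by_form(rows):
    """Map each Form_ID to the set of cognate sets it is a member of."""
    membership = defaultdict(set)
    for row in rows:
        membership[row["Form_ID"]].add(row["Cognateset_ID"])
    return membership


def overlapping_forms(rows):
    """Return the forms shared by two or more cognate sets."""
    return {
        form_id: sorted(set_ids)
        for form_id, set_ids in sets_by_form(rows).items()
        if len(set_ids) > 1
    }


overlap = overlapping_forms(cognates)
print(overlap)
```

Here only the Fijian form is shared, which mirrors the situation described for *gumi and *kumis.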
Table cognatesets.csv
Comparisons with regular sound correspondences and close semantics. If there are additional forms that are strikingly similar but irregular, or that show strong semantic divergence, these are added in a note. Every attempt is made to keep the comparison proper free from problems.
Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are listed as cognate sets, too. They are marked with a true value for the 'Is_Root' property of the linked, reconstructed form.
The roots listed here thus amount to a continuation of the data set presented in Blust 1988.
property | value |
---|---|
dc:conformsTo | CLDF CognatesetTable |
dc:extent | 10857 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: [a-zA-Z0-9_\-]+ | Primary key |
Description | string | |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
Name | string | A recognizable label for the cognateset, typically the reconstructed proto-form and the reconstructed meaning. |
Form_ID | string | Links to the reconstructed proto-form in FormTable. References forms.csv::ID |
Comment | string | |
Doubt | boolean | Flag indicating (un)certainty of the reconstruction. |
Etymon_ID | string | References etyma.csv::ID |
Is_Main_Entry | boolean | |
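Each cognate set points at its reconstructed proto-form through Form_ID and at its etymon through Etymon_ID. A minimal sketch of following both links is shown here. The rows are hypothetical, and reading Is_Main_Entry as the flag that singles out one cognate set per etymon is an assumption, since the column has no description.

```python
# Hypothetical rows; only the column names come from the tables above.
forms = {
    "pf1": {"Language_ID": "proto1", "Form": "*form-a"},
    "pf2": {"Language_ID": "proto2", "Form": "*form-b"},
}
cognatesets = [
    {"ID": "cs1", "Form_ID": "pf1", "Etymon_ID": "e1", "Is_Main_Entry": "true"},
    {"ID": "cs2", "Form_ID": "pf2", "Etymon_ID": "e1", "Is_Main_Entry": "false"},
    {"ID": "cs3", "Form_ID": "", "Etymon_ID": "e2", "Is_Main_Entry": "true"},
]


def protoform(cognateset, forms_by_id):
    """Return the reconstructed form text of a cognate set, or None if unlinked."""
    form = forms_by_id.get(cognateset["Form_ID"])
    return form["Form"] if form else None


def main_entry_by_etymon(rows):
    """Map each Etymon_ID to its main-entry cognate set (assumes 'true'/'false' cells)."""
    return {row["Etymon_ID"]: row["ID"] for row in rows if row["Is_Main_Entry"] == "true"}


main_entries = main_entry_by_etymon(cognatesets)
protoforms = [protoform(row, forms) for row in cognatesets]
print(main_entries, protoforms)
```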
Table cf.csv
The ACD includes five additional categories of groups of forms, called 'near cognates', 'noise', 'roots', 'loans' and 'also'. These are marked with respective values in the 'Category' column.
'Near cognates' are forms that are strikingly similar but irregular, and which cannot be included in a note to an established reconstruction. Stated differently, these are forms that appear to be historically related, but do not yet permit a reconstruction.
The 'noise' (in the information-theoretic sense of meaningless data that can be confused with a true signal) category lists chance resemblances. Given the number of languages being compared and the number of forms in many of the sources, forms that resemble one another in shape and meaning by chance will not be uncommon, and the decision as to whether a comparison that appears good is a product of chance must be based on criteria such as
- how general the semantic category of the form is (e.g. phonologically corresponding forms meaning ‘cut’ are less diagnostic of relationship than phonologically corresponding forms for particular types of cutting),
- how richly attested the form is (if it is found in just two witnesses the likelihood that it is a product of chance is greatly increased),
- whether there is already a well-established reconstruction for the same meaning.
Thus, the search process that results in valid cognate sets inevitably turns up other material that is superficially appealing, but is questionable for various reasons. To simply dispose of this ‘information refuse’ would be unwise for two reasons. First, further searching might show that some of these questionable comparisons are more strongly supported than it initially appeared. Second, even if the material is not upgraded through further comparative work it is always possible that some future researcher with different standards of evaluation will stumble upon some of these comparisons and claim that they are valid, but were overlooked in the ACD. By including a module on ‘Noise’ we can show that we have considered and rejected various possibilities that might be entertained by others.
Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are included in the 'roots' category. The roots listed here thus amount to a continuation of the data set presented in Blust 1988.
Roots are not listed as regular cognate sets, because the reconstructions are not explicitly assigned to a proto-language.
Loanwords are a perennial problem in historical linguistics. When they involve morphemes that are borrowed between related languages they can provoke questions about the regularity of sound correspondences. When they involve morphemes that are borrowed between unrelated languages they can give rise to invalid reconstructions. Dempwolff (1934-38) included a number of known loanwords among his 2,216 ‘Proto-Austronesian’ reconstructions in order to show that sound correspondences are often regular even with loanwords that are borrowed relatively early, but he marked these with an ‘x’, as with *xbazu ‘shirt’, which he knew to be a Persian loanword in many of the languages of western Indonesia, and (via Malay) in some of the languages of the Philippines. However, he overlooked a number of cases, such as *nanas ‘pineapple’ (an Amazonian cultigen that was introduced to insular Southeast Asia by the Portuguese). Since widely distributed loanwords can easily be confused with native forms it is useful to include them in the dictionary.
A fairly careful (but inevitably imperfect) attempt has been made to identify and document loanwords with a distribution sufficient to justify a reconstruction on one of the nine levels of the ACD, if treated erroneously as native. While this has been done wherever the possibility of confusion with native forms seemed real, there is no reason to include obvious loans that would never be mistaken for native forms.
This issue is especially evident in the Philippines, where hundreds of Spanish loanwords from the colonial period, which began late in the 16th century, are scattered from at least Ilokano in northern Luzon to the Bisayan languages of the central Philippines and some of the languages of Mindanao (such as Subanon). Comparisons like Ilokano kamarón ‘prawn’, Cebuano kamarún ‘dish of shrimps, split and dipped in eggs, optionally mixed with ground meat’ < Spanish camarón ‘shrimp’, or Ilokano kalábus ‘jail, prison’, Cebuano kalabús, kalabúsu ‘jail; to land in prison, in jail’ < Spanish calabozo ‘dungeon’ seem inappropriate for inclusion in LOANS, but introduced plants have generally been admitted. Some of these, such as ‘tomato’, may be widely known as New World plants that were introduced to the Philippines by the Spanish, but others, such as ‘chayote’, may be less familiar. As already noted, Dempwolff (1938) posited ‘Uraustronesisch’ *nanas and *kenas as doublets for ‘pineapple’, completely overlooking the fact that this is an Amazonian plant that could hardly have been present in the Austronesian world before the advent of the colonial period. This example shows that errors in the semantic domain of plant names can sometimes escape detection by scholars who are otherwise known for their careful, meticulous work, and for this reason all borrowed cognate sets involving plant names are documented as loanwords to avoid any possible misinterpretation.
The last category, 'also', groups forms related to a particular cognate set. These forms typically show some kind of irregularity with respect to the proposed reconstruction, but provide context to evaluate the validity of the cognate set.
property | value |
---|---|
dc:extent | 2364 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Name | string | The title of a table of related forms; typically hints at the type of relation between the forms or between the group of forms and an etymon. |
Description | string | |
Category | string | An optional category for groups of forms such as "loans". |
Comment | string | |
Cognateset_ID | string | Links to an etymon, if the group of lexemes is related to one. References cognatesets.csv::ID |
Dempwolff_Etymology | string | A corresponding (unsupported) reconstruction posited in Dempwolff 1938. |
Table cfitems.csv
Membership of forms in a "cf" group is mediated through this association table unless more meaningful alternatives are available, like BorrowingTable for loans.
property | value |
---|---|
dc:extent | 7344 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | Primary key |
Cfset_ID | string | References cf.csv::ID |
Form_ID | string | References forms.csv::ID |
Comment | string | |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
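Because cfitems.csv is an association table, collecting the forms of a given category takes one join from cfitems.csv to cf.csv. The sketch below illustrates that join with hypothetical rows. The literal Category values used here (`noise`, `near cognates`) are assumptions based on the category names in the cf.csv description and may differ from the strings actually stored in the column.

```python
from collections import defaultdict

# Hypothetical rows keyed by cf.csv::ID; Category strings are assumed.
cf = {
    "cf1": {"Name": "group one", "Category": "noise"},
    "cf2": {"Name": "group two", "Category": "near cognates"},
}
cfitems = [
    {"ID": "i1", "Cfset_ID": "cf1", "Form_ID": "f1"},
    {"ID": "i2", "Cfset_ID": "cf1", "Form_ID": "f2"},
    {"ID": "i3", "Cfset_ID": "cf2", "Form_ID": "f3"},
]


def forms_by_category(cf_by_id, items):
    """Group Form_IDs by the Category of the cf group they belong to."""
    grouped = defaultdict(list)
    for item in items:
        category = cf_by_id[item["Cfset_ID"]]["Category"]
        grouped[category].append(item["Form_ID"])
    return dict(grouped)


grouped_forms = forms_by_category(cf, cfitems)
print(grouped_forms)
```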
Table borrowings.csv
property | value |
---|---|
dc:conformsTo | CLDF BorrowingTable |
dc:extent | 7348 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: [a-zA-Z0-9_\-]+ | Primary key |
Target_Form_ID | string | References the loanword, i.e. the form as borrowed into the target language. References forms.csv::ID |
Source_Form_ID | string | References the source word of a borrowing. References forms.csv::ID |
Comment | string | |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
Cfset_ID | string | Link to a set description. References cf.csv::ID |
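A borrowing row is only meaningful once both form references are resolved. The following sketch pairs each loanword with its source word, if one is recorded. All rows and IDs are hypothetical, and treating an empty Source_Form_ID as "source unknown" is an assumption.

```python
# Hypothetical rows keyed by forms.csv::ID.
forms = {
    "f-src": {"Language_ID": "donor", "Form": "source-word"},
    "f-tgt1": {"Language_ID": "recipient1", "Form": "loan-one"},
    "f-tgt2": {"Language_ID": "recipient2", "Form": "loan-two"},
}
borrowings = [
    {"ID": "b1", "Target_Form_ID": "f-tgt1", "Source_Form_ID": "f-src", "Cfset_ID": "cf9"},
    {"ID": "b2", "Target_Form_ID": "f-tgt2", "Source_Form_ID": "", "Cfset_ID": "cf9"},
]


def resolve_borrowing(row, forms_by_id):
    """Return (source form text or None, target form text) for one borrowing."""
    target = forms_by_id[row["Target_Form_ID"]]["Form"]
    source_row = forms_by_id.get(row["Source_Form_ID"])
    source = source_row["Form"] if source_row else None
    return source, target


resolved = [resolve_borrowing(row, forms) for row in borrowings]
print(resolved)
```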
Table etyma.csv
property | value |
---|---|
dc:extent | 8161 |
Name/Property | Datatype | Description |
---|---|---|
ID | string | A numeric identifier for the etymon. For etyma present in the legacy online version of ACD this number will match the cognate set number assigned then. Primary key |
Name | string | The core reconstruction uniting the cognate sets of the etymon. |
Initial | string Valid choices: a b c C d e g h i j k l m n N ñ ŋ o p q r R s S t u w y z | |
Description | string | The reconstructed meaning of the etymon. |
Comment | string | Some notes are several lines, while others are a page or more. Notes are used for a variety of purposes. Among the most common are to report other forms that show a likely historical connection with those cited in the main comparison, but which exhibit irregularities other than the usual sporadic assimilation or metathesis, and so raise more serious questions about comparability, as in entry (2) above; to discuss details of the reconstructed gloss; and to note the occurrence of monosyllabic “roots” or submorphemic sound-meaning correlations in reconstructed morphemes. |
Source | list of string (separated by ; ) | Sources mentioned in the comment describing the etymon. References sources.bib::BibTeX-key |
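The valid choices for Initial are case-sensitive: `c` and `C`, `n` and `N`, `r` and `R`, and `s` and `S` are distinct values, so the column should not be lowercased before comparison. A small validation sketch follows. The etyma rows are hypothetical, and allowing an empty Initial cell is an assumption.

```python
# The valid choices as listed for the Initial column of etyma.csv.
VALID_INITIALS = "a b c C d e g h i j k l m n N ñ ŋ o p q r R s S t u w y z".split()

# Hypothetical rows for illustration.
etyma = [
    {"ID": "1", "Initial": "C"},
    {"ID": "2", "Initial": "ŋ"},
    {"ID": "3", "Initial": "A"},  # not a valid choice: only lowercase 'a' is listed
    {"ID": "4", "Initial": ""},   # empty cell, treated as "no value" here
]


def invalid_initials(rows):
    """Return IDs of etyma whose non-empty Initial is not a valid choice."""
    return [
        row["ID"]
        for row in rows
        if row["Initial"] and row["Initial"] not in VALID_INITIALS
    ]


bad_ids = invalid_initials(etyma)
print(bad_ids)
```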
Table trees.csv
property | value |
---|---|
dc:conformsTo | CLDF TreeTable |
dc:extent | 1 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: [a-zA-Z0-9_\-]+ | Primary key |
Name | string | Name of tree as used in the tree file, i.e. the tree label in a Nexus file or the 1-based index of the tree in a Newick file |
Description | string | Describe the method that was used to create the tree, etc. |
Tree_Is_Rooted | boolean Valid choices: Yes No | Whether the tree is rooted (Yes) or unrooted (No) (or no info is available (null)) |
Tree_Type | string Valid choices: summary sample | Whether the tree is a summary (or consensus) tree, i.e. can be analysed in isolation, or whether it is a sample, resulting from a method that creates multiple trees |
Tree_Branch_Length_Unit | string Valid choices: change substitutions years centuries millennia | The unit used to measure evolutionary time in phylogenetic trees. |
Media_ID | string | References a file containing a Newick representation of the tree, labeled with identifiers as described in the LanguageTable (the Media_Type column of this table should provide enough information to choose the appropriate tool to read the Newick). References media.csv::ID |
Source | list of string (separated by ; ) | References sources.bib::BibTeX-key |
Table media.csv
property | value |
---|---|
dc:conformsTo | CLDF MediaTable |
dc:extent | 1 |
Name/Property | Datatype | Description |
---|---|---|
ID | string Regex: [a-zA-Z0-9_\-]+ | Primary key |
Name | string | |
Description | string | |
Media_Type | string Regex: [^/]+/.+ | |
Download_URL | anyURI | |
Path_In_Zip | string | |
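The tree in trees.csv is located through its Media_ID, and Tree_Is_Rooted uses `Yes`/`No` rather than `true`/`false`. The sketch below resolves a tree row to its media row and checks Media_Type against the regex given above. The rows, the `text/x-nh` media type and the path are hypothetical placeholders, not values taken from the dataset.

```python
import re

MEDIA_TYPE_PATTERN = re.compile(r"[^/]+/.+")  # regex listed for Media_Type

# Hypothetical rows; values are placeholders.
trees = [
    {"ID": "t1", "Name": "1", "Tree_Is_Rooted": "Yes", "Tree_Type": "summary", "Media_ID": "m1"},
]
media = {
    "m1": {"Name": "tree file", "Media_Type": "text/x-nh", "Download_URL": "", "Path_In_Zip": "tree.nwk"},
}


def parse_rooted(cell):
    """Map 'Yes'/'No' to True/False; anything else (e.g. empty) means no info."""
    return {"Yes": True, "No": False}.get(cell)


def tree_media(tree, media_by_id):
    """Return the media row for a tree, checking that its Media_Type is well-formed."""
    row = media_by_id[tree["Media_ID"]]
    if not MEDIA_TYPE_PATTERN.fullmatch(row["Media_Type"]):
        raise ValueError(f"malformed Media_Type: {row['Media_Type']!r}")
    return row


tree = trees[0]
rooted = parse_rooted(tree["Tree_Is_Rooted"])
tree_file = tree_media(tree, media)
print(rooted, tree_file["Media_Type"], tree_file["Path_In_Zip"])
```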