Wordlist Austronesian Comparative Dictionary

property	value
dc:conformsTo	CLDF Wordlist
dc:license	https://creativecommons.org/licenses/by/4.0/
dcat:accessURL	https://github.com/lexibank/acd
prov:wasDerivedFrom	lexibank/acd v1.2-33-gbc277b3; Glottolog v5.1; Concepticon v3.2.0; CLTS v2.3.0
prov:wasGeneratedBy	lingpy-rcParams: lingpy-rcParams.json; python: 3.12.3; python-packages: requirements.txt
rdf:ID	acd
rdf:type	http://www.w3.org/ns/dcat#Distribution

Table forms.csv

property	value
dc:conformsTo	CLDF FormTable
dc:extent	146733

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Local_ID	`string`
Language_ID	`string`	References languages.csv::ID
Parameter_ID	`string`	References parameters.csv::ID
Value	`string`
Form	`string`
Segments	list of `string` (separated by ` `)
Comment	`string`
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key
`Cognacy`	`string`
`Loan`	`boolean`
`Graphemes`	`string`
`Profile`	`string`
Description	`string`	Description of the meaning of the word (possibly in language-specific terms).
`Sic`	`boolean`	For a form that differs from the expected reflex in some way this flag asserts that a copying mistake has not occurred.
`Doubt`	`boolean`	Reconstructions in particular, i.e. proto-forms in etymological dictionaries, are often marked as being somewhat doubtful (typically displayed as the proto-form prefixed with a '?' or similar).
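The list-valued and boolean columns above need a small amount of post-processing when the file is read with a plain CSV reader. The following is a minimal sketch using only the Python standard library. The two sample rows, their IDs and their values are invented for illustration, and the `"true"`/`"false"` encoding of booleans is an assumption that should be checked against cldf-metadata.json.

```python
import csv
import io

# Hypothetical two-row excerpt of forms.csv; IDs and values are invented.
SAMPLE_FORMS = (
    "ID,Language_ID,Parameter_ID,Value,Form,Segments,Source,Loan\n"
    "f1,lang1,param1,mata,mata,m a t a,src1;src2,false\n"
    "f2,lang2,param1,maca,maca,m a c a,,true\n"
)


def parse_form(row):
    """Convert the list-valued and boolean columns of one forms.csv row."""
    parsed = dict(row)
    # Segments is a space-separated list of strings.
    parsed["Segments"] = row["Segments"].split()
    # Source is a semicolon-separated list of BibTeX keys (possibly empty).
    parsed["Source"] = [key for key in row["Source"].split(";") if key]
    # Assumption: booleans are serialized as "true"/"false".
    parsed["Loan"] = row["Loan"] == "true"
    return parsed


forms = [parse_form(row) for row in csv.DictReader(io.StringIO(SAMPLE_FORMS))]
```

To run this against the real data, replace `io.StringIO(SAMPLE_FORMS)` with an open handle on forms.csv (opened with `newline=""` and `encoding="utf-8"`).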

Table languages.csv

property	value
dc:conformsTo	CLDF LanguageTable
dc:extent	1064

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Name	`string`
Glottocode	`string`
`Glottolog_Name`	`string`
ISO639P3code	`string`
Macroarea	`string`
Latitude	`decimal` ≥ -90 ≤ 90
Longitude	`decimal` ≥ -180 ≤ 180
`Family`	`string`
`Abbr`	`string`	Abbreviation for the (proto-)language name.
`Group`	`string` Regex: `PAN	Form.
Source	list of `string` (separated by `;`)	Etymological (or comparative) dictionaries typically compare lexical data from many source dictionaries. References sources.bib::BibTeX-key
`Is_Proto`	`boolean`	Specifies whether a language is a proto-language (and thus its forms reconstructed proto-forms).
Description	`string`	For proto-languages that correspond to ACD reconstruction levels, a description of their extent is provided.
`Dialect_Of`	`string`	References languages.csv::ID
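Two columns of this table are useful for filtering: `Is_Proto` separates reconstruction levels from attested languages, and `Dialect_Of` points back into the same table. The sketch below shows one way to use both. The sample rows are invented, and the `"true"`/`"false"` boolean encoding is an assumption.

```python
import csv
import io

# Hypothetical excerpt of languages.csv; IDs and names are invented.
SAMPLE_LANGUAGES = (
    "ID,Name,Is_Proto,Dialect_Of\n"
    "p1,Proto-Example,true,\n"
    "l1,Example,false,\n"
    "l2,Example (Coastal),false,l1\n"
)

# Index the table by its primary key.
languages = {
    row["ID"]: row for row in csv.DictReader(io.StringIO(SAMPLE_LANGUAGES))
}

# IDs of proto-languages (assumes booleans are written as "true"/"false").
proto_ids = sorted(
    lid for lid, row in languages.items() if row["Is_Proto"] == "true"
)

# Map each dialect ID to the name of the language it is a dialect of,
# following the self-reference Dialect_Of -> languages.csv::ID.
dialect_parent = {
    lid: languages[row["Dialect_Of"]]["Name"]
    for lid, row in languages.items()
    if row["Dialect_Of"]
}
```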

Table parameters.csv

property	value
dc:conformsTo	CLDF ParameterTable
dc:extent	86502

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Name	`string`
Concepticon_ID	`string`
`Concepticon_Gloss`	`string`

Table cognates.csv

property	value
dc:conformsTo	CLDF CognateTable
dc:extent	121682

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Form_ID	`string`	References forms.csv::ID
Form	`string`
Cognateset_ID	`string`	References cognatesets.csv::ID
`Doubt`	`boolean`
`Cognate_Detection_Method`	`string`
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key
Alignment	list of `string` (separated by ` `)
`Alignment_Method`	`string`
`Alignment_Source`	`string`
`Metathesis`	`boolean`	Flag indicating that a process of metathesis is assumed, explaining the apparent irregularity of a cognate.
`Assimilation`	`boolean`	Flag indicating that a process of assimilation is assumed, explaining the apparent irregularity of a cognate.
`Doublet_Comment`	`string`	A comment about the doublet status of the reconstruction.
`Doublet_Set`	`string`	Identifier of a set of variants that are independently supported by the comparative evidence. Doubletting that cannot be traced in any clear way to borrowing is extremely common in AN languages (Blust 2011), and an effort has been made to cross-reference doublets in the ACD wherever possible.
`Disjunct_Comment`	`string`	A comment about the disjunct status of the reconstruction.
`Disjunct_Set`	`string`	Identifier of a set of variants that are supported only by allowing the overlap of cognate sets; i.e. only one reconstruction in a set of disjuncts can be consistent with the evidence, but it is unclear which one. A distinction is drawn between doublets (variants that are independently supported by the comparative evidence), and “disjuncts” (variants that are supported only by allowing the overlap of cognate sets). To illustrate, both Tagalog gumí ‘beard’ and Malay kumis ‘moustache’ show regular correspondences with Fijian kumi ‘the chin or beard’, but they do not correspond regularly with one another. Based on this evidence, it is impossible to posit doublets, since unambiguous support for both variants is lacking. However, since the Tagalog and Malay forms can each be compared with Fijian kumi, two comparisons can be proposed that overlap by including the Fijian form in both (like all Oceanic languages, Fijian has merged PMP k and g; in addition, it has lost final consonants). The result is a pair of PMP disjuncts gumi (based on Tagalog and Fijian) and kumis (based on Malay and Fijian), either or both of which could be used to justify an independent doublet if additional comparative support is found.

Table cognatesets.csv

Comparisons with regular sound correspondences and close semantics. If there are additional forms that are strikingly similar but irregular, or that show strong semantic divergence, these are added in a note. Every attempt is made to keep the comparison proper free from problems.

Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are listed as cognate sets, too. They are marked with a true value for the 'Is_Root' property of the linked, reconstructed form.

The roots listed here thus amount to a continuation of the data set presented in Blust 1988.

property	value
dc:conformsTo	CLDF CognatesetTable
dc:extent	10857

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Description	`string`
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key
Name	`string`	A recognizable label for the cognateset, typically the reconstructed proto-form and the reconstructed meaning.
Form_ID	`string`	Links to the reconstructed proto-form in FormTable. References forms.csv::ID
Comment	`string`
`Doubt`	`boolean`	Flag indicating (un)certainty of the reconstruction.
`Etymon_ID`	`string`	References etyma.csv::ID
`Is_Main_Entry`	`boolean`
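The references described so far form a chain: a row in cognates.csv links a form to a cognate set, and a cognate set links to an etymon through `Etymon_ID`. The following sketch walks that chain with the standard library only. All IDs and names in the sample data are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpts; all IDs and names are invented.
SAMPLE_COGNATES = (
    "ID,Form_ID,Cognateset_ID\n"
    "c1,f1,cs1\n"
    "c2,f2,cs1\n"
    "c3,f3,cs2\n"
)
SAMPLE_COGNATESETS = (
    "ID,Name,Form_ID,Etymon_ID\n"
    "cs1,Set one,f9,1\n"
    "cs2,Set two,f10,1\n"
)


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


# Cognateset_ID -> list of member Form_IDs (cognates.csv).
members = defaultdict(list)
for cognate in read(SAMPLE_COGNATES):
    members[cognate["Cognateset_ID"]].append(cognate["Form_ID"])

# Etymon_ID -> {Cognateset_ID: member Form_IDs} (cognatesets.csv).
by_etymon = defaultdict(dict)
for cognateset in read(SAMPLE_COGNATESETS):
    by_etymon[cognateset["Etymon_ID"]][cognateset["ID"]] = members[cognateset["ID"]]
```

Note that `Form_ID` in cognatesets.csv identifies the reconstructed proto-form, whereas `Form_ID` in cognates.csv identifies a member of the set; the sketch uses only the latter.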

Table cf.csv

The ACD includes five additional categories of groups of forms, called 'near cognates', 'noise', 'roots', 'loans' and 'also'. These are marked with respective values in the 'Category' column.

'Near cognates' are forms that are strikingly similar but irregular, and which cannot be included in a note to an established reconstruction. Stated differently, these are forms that appear to be historically related, but do not yet permit a reconstruction.

The 'noise' (in the information-theoretic sense of meaningless data that can be confused with a true signal) category lists chance resemblances. Given the number of languages being compared and the number of forms in many of the sources, forms that resemble one another in shape and meaning by chance will not be uncommon, and the decision as to whether a comparison that appears good is a product of chance must be based on criteria such as

how general the semantic category of the form is (e.g. phonologically corresponding forms meaning ‘cut’ are less diagnostic of relationship than phonologically corresponding forms for particular types of cutting),
how richly attested the form is (if it is found in just two witnesses the likelihood that it is a product of chance is greatly increased),
whether there is already a well-established reconstruction for the same meaning.

Thus, the search process that results in valid cognate sets inevitably turns up other material that is superficially appealing, but is questionable for various reasons. To simply dispose of this ‘information refuse’ would be unwise for two reasons. First, further searching might show that some of these questionable comparisons are more strongly supported than it initially appeared. Second, even if the material is not upgraded through further comparative work it is always possible that some future researcher with different standards of evaluation will stumble upon some of these comparisons and claim that they are valid, but were overlooked in the ACD. By including a module on ‘Noise’ we can show that we have considered and rejected various possibilities that might be entertained by others.

Because many reconstructed morphemes contain smaller submorphemic sound-meaning associations of the type that Brandstetter (1916) called ‘roots’ (Wurzeln), these elements are included in the 'roots' category. The roots listed here thus amount to a continuation of the data set presented in Blust 1988.

Roots are not listed as regular cognate sets, because the reconstructions are not explicitly assigned to a proto-language.

Loanwords are a perennial problem in historical linguistics. When they involve morphemes that are borrowed between related languages they can provoke questions about the regularity of sound correspondences. When they involve morphemes that are borrowed between unrelated languages they can give rise to invalid reconstructions. Dempwolff (1934-38) included a number of known loanwords among his 2,216 ‘Proto-Austronesian’ reconstructions in order to show that sound correspondences are often regular even with loanwords that are borrowed relatively early, but he marked these with an ‘x’, as with *xbazu ‘shirt’, which he knew to be a Persian loanword in many of the languages of western Indonesia, and (via Malay) in some of the languages of the Philippines. However, he overlooked a number of cases, such as *nanas ‘pineapple’ (an Amazonian cultigen that was introduced to insular Southeast Asia by the Portuguese). Since widely distributed loanwords can easily be confused with native forms it is useful to include them in the dictionary.

A fairly careful (but inevitably imperfect) attempt has been made to identify and document loanwords with a distribution sufficient to justify a reconstruction on one of the nine levels of the ACD, if treated erroneously as native. While this has been done wherever the possibility of confusion with native forms seemed real, there is no reason to include obvious loans that would never be mistaken for native forms.

This issue is especially evident in the Philippines, where hundreds of Spanish loanwords from the colonial period that began late in the 16th century are scattered from at least Ilokano in northern Luzon to the Bisayan languages of the central Philippines and some of the languages of Mindanao (such as Subanon). Comparisons like Ilokano kamarón ‘prawn’, Cebuano kamarún ‘dish of shrimps, split and dipped in eggs, optionally mixed with ground meat’ < Spanish camarón ‘shrimp’, or Ilokano kalábus ‘jail, prison’, Cebuano kalabús, kalabúsu ‘jail; to land in prison, in jail’ < Spanish calabozo ‘dungeon’ seem inappropriate for inclusion in LOANS, but introduced plants have generally been admitted. Some of these, such as ‘tomato’, may be widely known as New World plants that were introduced to the Philippines by the Spanish, but others, such as ‘chayote’, may be less familiar. As already noted, Dempwolff (1938) posited ‘Uraustronesisch’ *nanas and *kenas as doublets for ‘pineapple’, completely overlooking the fact that this is an Amazonian plant that could hardly have been present in the Austronesian world before the advent of the colonial period. This example shows that errors in the semantic domain of plant names can sometimes escape detection by scholars who are otherwise known for their careful, meticulous work, and for this reason all borrowed cognate sets involving plant names are documented as loanwords to avoid any possible misinterpretation.

The last category, 'also', groups forms related to a particular cognate set. These forms typically show some kind of irregularity with respect to the proposed reconstruction, but provide context to evaluate the validity of the cognate set.

property	value
dc:extent	2364

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
Name	`string`	The title of a table of related forms; typically hints at the type of relation between the forms or between the group of forms and an etymon.
Description	`string`
`Category`	`string`	An optional category for groups of forms such as "loans".
Comment	`string`
Cognateset_ID	`string`	Links to an etymon, if the group of lexemes is related to one. References cognatesets.csv::ID
`Dempwolff_Etymology`	`string`	A corresponding (unsupported) reconstruction posited in Dempwolff 1938.

Table cfitems.csv

Membership of forms in a "cf" group is mediated through this association table unless more meaningful alternatives are available, like BorrowingTable for loans.

property	value
dc:extent	7344

Columns

Name/Property	Datatype	Description
ID	`string`	Primary key
`Cfset_ID`	`string`	References cf.csv::ID
Form_ID	`string`	References forms.csv::ID
Comment	`string`
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key
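Because cfitems.csv is a pure association table, collecting the forms of a given category takes one lookup into cf.csv per item. The sketch below groups form IDs by the `Category` of their group. The sample rows are invented, and the literal category strings (`noise`, `also`) are assumptions based on the names used in the prose above; the values actually stored in the `Category` column may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpts; IDs, names and category strings are invented.
SAMPLE_CF = (
    "ID,Name,Category\n"
    "cf1,Example chance resemblance,noise\n"
    "cf2,Example irregular forms,also\n"
)
SAMPLE_CFITEMS = (
    "ID,Cfset_ID,Form_ID\n"
    "i1,cf1,f1\n"
    "i2,cf1,f2\n"
    "i3,cf2,f3\n"
)


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


# Index cf.csv by primary key so each item can look up its group.
cf_groups = {row["ID"]: row for row in read(SAMPLE_CF)}

# Category -> Form_IDs, resolved through cfitems.csv::Cfset_ID.
forms_by_category = defaultdict(list)
for item in read(SAMPLE_CFITEMS):
    category = cf_groups[item["Cfset_ID"]]["Category"]
    forms_by_category[category].append(item["Form_ID"])
```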

Table borrowings.csv

property	value
dc:conformsTo	CLDF BorrowingTable
dc:extent	7348

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Target_Form_ID	`string`	References the loanword, i.e. the form as borrowed into the target language References forms.csv::ID
Source_Form_ID	`string`	References the source word of a borrowing References forms.csv::ID
Comment	`string`
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key
`Cfset_ID`	`string`	Link to a set description. References cf.csv::ID
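Each borrowing row points at two rows of forms.csv, the loanword and its source word. A minimal sketch of resolving both references follows. The forms and IDs are placeholders invented for illustration, and the handling of an empty `Source_Form_ID` is an assumption, since the schema above does not state whether the column is required.

```python
import csv
import io

# Hypothetical excerpts; IDs and forms are placeholders.
SAMPLE_FORMS = (
    "ID,Language_ID,Form\n"
    "f1,lang1,abak\n"
    "f2,lang2,abaco\n"
)
SAMPLE_BORROWINGS = (
    "ID,Target_Form_ID,Source_Form_ID,Cfset_ID\n"
    "b1,f1,f2,cf1\n"
    "b2,f1,,cf1\n"
)


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


forms = {row["ID"]: row for row in read(SAMPLE_FORMS)}

# (loanword, source word) pairs; the source is None when no source form
# is recorded (assumption: Source_Form_ID may be empty).
pairs = []
for borrowing in read(SAMPLE_BORROWINGS):
    target = forms[borrowing["Target_Form_ID"]]["Form"]
    source_row = forms.get(borrowing["Source_Form_ID"])
    pairs.append((target, source_row["Form"] if source_row else None))
```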

Table etyma.csv

property	value
dc:extent	8161

Columns

Name/Property	Datatype	Description
ID	`string`	A numeric identifier for the etymon. For etyma present in the legacy online version of ACD this number will match the cognate set number assigned then. Primary key
Name	`string`	The core reconstruction uniting the cognate sets of the etymon.
`Initial`	`string` Valid choices: `a` `b` `c` `C` `d` `e` `g` `h` `i` `j` `k` `l` `m` `n` `N` `ñ` `ŋ` `o` `p` `q` `r` `R` `s` `S` `t` `u` `w` `y` `z`
Description	`string`	The reconstructed meaning of the etymon.
Comment	`string`	Some notes are several lines, while others are a page or more. Notes are used for a variety of purposes. Among the most common are to report other forms that show a likely historical connection with those cited in the main comparison, but which exhibit irregularities other than the usual sporadic assimilation or metathesis, and so raise more serious questions about comparability, as in entry (2) above; to discuss details of the reconstructed gloss; and to note the occurrence of monosyllabic “roots” or submorphemic sound-meaning correlations in reconstructed morphemes.
Source	list of `string` (separated by `;`)	Sources mentioned in the comment describing the etymon. References sources.bib::BibTeX-key

Table trees.csv

property	value
dc:conformsTo	CLDF TreeTable
dc:extent	1

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Name	`string`	Name of tree as used in the tree file, i.e. the tree label in a Nexus file or the 1-based index of the tree in a newick file
Description	`string`	Describe the method that was used to create the tree, etc.
Tree_Is_Rooted	`boolean` Valid choices: `Yes` `No`	Whether the tree is rooted (Yes) or unrooted (No) (or no info is available (null))
Tree_Type	`string` Valid choices: `summary` `sample`	Whether the tree is a summary (or consensus) tree, i.e. can be analysed in isolation, or whether it is a sample, resulting from a method that creates multiple trees
Tree_Branch_Length_Unit	`string` Valid choices: `change` `substitutions` `years` `centuries` `millennia`	The unit used to measure evolutionary time in phylogenetic trees.
Media_ID	`string`	References a file containing a Newick representation of the tree, labeled with identifiers as described in the LanguageTable (the Media_Type column of this table should provide enough information to choose the appropriate tool to read the newick) References media.csv::ID
Source	list of `string` (separated by `;`)	References sources.bib::BibTeX-key

Table media.csv

property	value
dc:conformsTo	CLDF MediaTable
dc:extent	1

Columns

Name/Property	Datatype	Description
ID	`string` Regex: `[a-zA-Z0-9_\-]+`	Primary key
Name	`string`
Description	`string`
Media_Type	`string` Regex: `[^/]+/.+`
Download_URL	`anyURI`
Path_In_Zip	`string`
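A tree row carries only metadata; the tree itself lives in the file described by the media row that `Media_ID` references. The sketch below resolves that reference and maps `Tree_Is_Rooted` to a Python value. The sample rows, the media type `text/x-nh` and the URL are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical excerpts; IDs, media type and URL are invented.
SAMPLE_TREES = (
    "ID,Name,Tree_Is_Rooted,Tree_Type,Media_ID\n"
    "t1,1,Yes,summary,m1\n"
)
SAMPLE_MEDIA = (
    "ID,Name,Media_Type,Download_URL\n"
    "m1,example tree,text/x-nh,file:///example/tree.nwk\n"
)


def read(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


# Tree_Is_Rooted uses the choices Yes/No; an empty cell means "unknown".
ROOTED = {"Yes": True, "No": False}

media = {row["ID"]: row for row in read(SAMPLE_MEDIA)}

trees = []
for tree in read(SAMPLE_TREES):
    trees.append(
        {
            "id": tree["ID"],
            "rooted": ROOTED.get(tree["Tree_Is_Rooted"]),  # None if unknown
            "media_type": media[tree["Media_ID"]]["Media_Type"],
            "url": media[tree["Media_ID"]]["Download_URL"],
        }
    )
```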
