Module:data consistency check/doc

This is the documentation page for Module:data consistency check.

This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.

Output

Module:etymology languages/data

Bashkardi language (bsg-bas) has a canonical name that is not unique; it is also used by the code bsg.
Rudbari language (rdb-rud) has a canonical name that is not unique; it is also used by the code rdb.
Chali language (tks-cal) has a canonical name that is not unique; it is also used by the code tgf.

Module:families/data

Middle Iranian family (ira-mid) has no child families or languages.
Old Iranian family (ira-old) has no child families or languages.

Module:scripts/data

Blissymbols script (Blis) is not used by any language and has no characters listed for auto-detection.
Cypro-Minoan script (Cpmn) is not used by any language.
Hieratic script (Egyh) is not used by any language and has no characters listed for auto-detection.
Elymaic script (Elym) is not used by any language.
Hiragana script (Hira) is not used by any language.
Nyiakeng Puachue Hmong script (Hmnp) is not used by any language.
Kana script (Hrkt) is not used by any language.
Image-rendered script (Imag) is not used by any language and has no characters listed for auto-detection.
International Phonetic Alphabet script (Ipach) is not used by any language and has no characters listed for auto-detection.
Kpelle script (Kpel) is not used by any language and has no characters listed for auto-detection.
Loma script (Loma) is not used by any language and has no characters listed for auto-detection.
Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
Musical notation script (Music) is not used by any language.
Nag Mundari script (Nagm) is not used by any language.
Unspecified script (None) is not used by any language and has no characters listed for auto-detection.
Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
Rumi numerals script (Rumin) is not used by any language.
flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
Vithkuqi script (Vith) is not used by any language.
Woleai script (Wole) is not used by any language and has no characters listed for auto-detection.
Yezidi script (Yezi) is not used by any language.
mathematical notation script (Zmth) is not used by any language.
symbol script (Zsym) is not used by any language.
undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
The data key sort_by_scraping for Japanese script (Jpan) is invalid.

Checks performed

For multiple data modules:

Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
Each name in the list of other names must appear only once.
otherNames, if present, must be an array.
A Wikidata item ID must be either a positive integer or a string starting with Q and ending with decimal digits.
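
The Wikidata rule above can be sketched as a small validator. This is a Python illustration only, not the module's own Lua code; the function name is_valid_wikidata_id is hypothetical, and reading "starting with Q and ending with decimal digits" as "Q followed by nothing but digits" is an assumption.

```python
import re

def is_valid_wikidata_id(value):
    """Return True for a positive integer or a string such as 'Q123'.

    Assumption: a string ID is 'Q' followed only by decimal digits.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # 'Q', then one or more ASCII digits, and nothing else.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False
```

Under these assumptions, 42 and "Q42" are accepted, while 0, "Q" and "Q12a" are rejected.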

The following must be true of the data used by Module:languages:

Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
If field 2 is not nil, it must be a valid Wikidata item ID.
If field 3 or family is given and not nil, it must be a valid family code.
If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
If family is given, it must be a valid family code.
If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
If sort_key is given, it may be either a string or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
If entry_name or sort_key is given, the from array must be at least as long as the to array.
If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
If link_tr is present, it must be true.
No data keys other than these may be present: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
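
A few of the rules above (allowed keys, the relative length of from and to, override_translit and link_tr) can be sketched as follows. This is a Python illustration under stated assumptions, not the module's Lua implementation: a language entry is modelled as a dict, check_language_entry is a hypothetical name, and the message wording is invented.

```python
# Keys permitted in a language entry, as listed in the documentation.
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}

def check_language_entry(code, data):
    """Return a list of problem messages for one language entry (a dict)."""
    problems = []
    # Any key outside the documented list is reported.
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"{code}: invalid data key {key!r}")
    # When replacement arrays are given, 'from' must not be shorter than 'to'.
    for field in ("entry_name", "sort_key"):
        table = data.get(field)
        if isinstance(table, dict) and "from" in table and "to" in table:
            if len(table["from"]) < len(table["to"]):
                problems.append(f"{code}: {field}.from is shorter than {field}.to")
    # override_translit only makes sense with a transliteration module.
    if data.get("override_translit") and not data.get("translit"):
        problems.append(f"{code}: override_translit is set but translit is not")
    # link_tr, if present, must be exactly true.
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{code}: link_tr must be true")
    return problems
```

For a made-up entry such as {1: "Example", "override_translit": True}, the sketch reports a single problem about the missing translit.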

Checks not performed:

If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).

These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.

Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
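
The "all the codes and names, and no more" requirement amounts to a two-way comparison. The Python sketch below assumes that the first lookup table maps codes to names and the second maps names to codes; that layout, the function name compare_lookup_tables and the message wording are assumptions made for illustration.

```python
def compare_lookup_tables(languages, code_to_name, name_to_code):
    """Report entries that are missing from, or superfluous in, the lookup tables."""
    problems = []
    # What the lookup tables ought to contain, derived from the language data.
    expected = {code: data[1] for code, data in languages.items()}
    expected_names = set(expected.values())
    for code, name in expected.items():
        if code_to_name.get(code) != name:
            problems.append(f"code {code} is missing or wrong in the code-to-name table")
        if name_to_code.get(name) != code:
            problems.append(f"name {name} is missing or wrong in the name-to-code table")
    # "And no more": anything not backed by the language data is reported too.
    for code in code_to_name:
        if code not in expected:
            problems.append(f"code {code} is not found in the language data")
    for name in name_to_code:
        if name not in expected_names:
            problems.append(f"name {name} is not found in the language data")
    return problems
```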

The following must be true of the data used by Module:etymology languages:

canonicalName must be given.
parent must be given and must be a valid language, family or etymology-only language code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
No data keys other than these may be present: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
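
The canonicalName and parent rules can be sketched like this. It is a Python illustration, not the module's Lua code; all three tables are modelled as dicts keyed by code, and check_etymology_languages is a hypothetical name.

```python
def check_etymology_languages(etym_langs, languages, families):
    """Return problem messages for etymology-only language entries."""
    problems = []
    for code, data in etym_langs.items():
        if not data.get("canonicalName"):
            problems.append(f"{code}: canonicalName is missing")
        parent = data.get("parent")
        if parent is None:
            problems.append(f"{code}: parent is missing")
        # The parent may be a language, a family or another etymology-only language.
        elif (parent not in languages and parent not in families
              and parent not in etym_langs):
            problems.append(f"{code}: parent {parent!r} is not a valid code")
    return problems
```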

Codes in Module:families data must:

Have canonicalName, which must not be the same as the canonical name of another family.
Have a valid family code in family, if family is given.
Have at least one language or subfamily belonging to it.
Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
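
The "at least one language or subfamily" rule is what produces the "has no child families or languages" messages in the output. A minimal Python sketch, assuming that a language names its family in field 3 or family and that a subfamily names its parent in family (the function name and sample codes are hypothetical):

```python
def childless_families(families, languages):
    """Return the sorted codes of families with no child language or subfamily."""
    used = set()
    # Families referenced by languages.
    for data in languages.values():
        family = data.get(3) or data.get("family")
        if family:
            used.add(family)
    # Families referenced as the parent of another family.
    for code, data in families.items():
        parent = data.get("family")
        if parent and parent != code:  # a self-reference does not count as a child
            used.add(parent)
    return sorted(code for code in families if code not in used)
```

With a made-up family fam that has one language and an empty subfamily fam-empty, only fam-empty is returned.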

Codes in Module:scripts data must:

Have canonicalName.
Have at least one language that lists it as one of its scripts.
Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
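
The two script checks that account for most of the output can be sketched as follows. This Python illustration approximates the message format seen in the Output section; check_scripts is a hypothetical name, the sample data is made up, and validation of the characters pattern itself is omitted because Lua patterns cannot be checked with Python's regular expressions.

```python
def check_scripts(scripts, languages):
    """Return one message per script that is unused or lacks a characters pattern."""
    # Collect every script code that some language lists.
    used = set()
    for data in languages.values():
        used.update(data.get("scripts") or [])
    messages = []
    for code, data in sorted(scripts.items()):
        issues = []
        if code not in used:
            issues.append("is not used by any language")
        if not data.get("characters"):
            issues.append("has no characters listed for auto-detection")
        if issues:
            name = data["canonicalName"]
            messages.append(f"{name} script ({code}) " + " and ".join(issues) + ".")
    return messages
```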