Support tools for Latvian Treebank and Latvian morphologically annotated
corpora. The native data format of the Latvian corpora is an adjusted PML.
The corresponding PML schemas are available at
TrEd extension/lv-treebank/resources.
(c) AILab, Institute of Mathematics and Computer Science, University of Latvia, 2010-2026
- PmlCorporaTools - Perl scripts and XSLTs for various corpora data processing;
- LVTB2UD - transformer from the native annotation format to Universal Dependencies;
- TrEd extension - development snapshot of the lv-treebank module for TrEd (this enables TrEd to operate with PML files from Latvian Treebank);
- SketchEngine specs - specification files that are used together with .vert files generated by scripts in PmlCorporaTools;
- TigerXML Schemas (obsolete) - schemas for displaying Latvian Treebank in Tiger Search (data transformation can be done by applying PmlCorporaTools/LvCorporaTools/TreeTransf/lvpml2tiger.xsl to "knitted" PML files; "knitting" can be done by PmlCorporaTools/LvCorporaTools/UIs/TreeTransformatorUI.pm).
For each script in PmlCorporaTools there is a .bat file showing a sample
invocation on dummy data located in PmlCorporaTools/testdata. It is also
possible to launch these scripts without parameters to get information about
the expected parameter values and their meaning.
Some scripts have not been used for several years and might be obsolete, sorry.
Folder Docs containing datasplits was moved to the Treebank repo on 2017-12-11.
Folders ParserTools and ParserTools were deleted on 2026-03-02 as very outdated.
For PmlCorporaTools:
- Perl
  - XML::Simple
  - XSLT module (on Ubuntu, run 'sudo apt-get install libxml-libxslt-perl')
  - Treex::Core according to https://ufal.mff.cuni.cz/treex/install.html

For LVTB2UD:
- Java
- Morphological library https://github.com/LUMII-AILab/morphology
- install homebrew
- brew install cpanm
- brew install libxml2
- sudo cpanm XML::LibXML
- sudo cpanm -n PerlIO::Util
- sudo cpanm Moose
- moose-outdated | cpanm
- sudo cpanm Treex::Core
- treex -h
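
Once the modules are installed, a quick way to confirm that Perl can see them is to try loading each one. The sketch below is only an illustration: the function name check_perl_module is made up here, and it assumes that the "XSLT module" mentioned above is XML::LibXSLT (the module that the Ubuntu package libxml-libxslt-perl provides).

```shell
#!/bin/sh
# Report whether the perl found on PATH can load the given module.
# check_perl_module is a hypothetical helper, not part of this repository.
check_perl_module() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Modules listed in the requirements; XML::LibXSLT is an assumption
# about which XSLT module is meant.
missing=0
for module in XML::Simple XML::LibXSLT Treex::Core; do
    check_perl_module "$module" || missing=$((missing + 1))
done
echo "$missing module(s) missing"
```

The script only reports; it does not install anything and does not stop at the first missing module, so one run shows everything that still needs attention.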
Files PmlCorporaTools/*_sample.bat contain general descriptions and
commented-out Windows command samples for the main workflows. To follow through
a workflow on a Windows machine, create a copy of the necessary sample.bat and
update it according to your needs. For the convenience of Windows users,
.gitignore blocks scripts named /PmlCorporaTools/* - Copy.bat 😉
To follow through a workflow on a Unix machine, you have to create a similar
shell script, but it should be relatively easy, as all the interesting data
processing is done in platform independent (hopefully) Perl scripts.
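
When porting a workflow to a Unix shell script, it can help to first pull the commented-out command samples out of the relevant sample.bat. The helper below is a minimal sketch, not part of the repository: the function name bat_commented_lines is hypothetical, and it assumes the comments use the usual batch markers "::" or "REM" and that the files may have Windows line endings.

```shell
#!/bin/sh
# Print the commented-out lines of a .bat file with the comment marker removed.
# Assumption: comment lines start with "::" or "REM " (case-insensitive).
bat_commented_lines() {
    # Drop carriage returns first, then keep only comment lines,
    # stripping the marker and any whitespace around it.
    tr -d '\r' < "$1" |
        sed -n -E 's/^[[:space:]]*(::|[Rr][Ee][Mm][[:space:]])[[:space:]]*//p'
}

# Example use (path taken from the workflow list in this README):
#   bat_commented_lines PmlCorporaTools/convertLvtbToUd_sample.bat
```

The output still contains Windows-style commands and the descriptive comments, so it is a starting point for a shell script rather than something to run directly.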
- PmlCorporaTools/prepareForLvtbPublication_sample.bat - create a dataset for publishing LVTB in the native hybrid/PML format.
- PmlCorporaTools/convertLvtbToUd_sample.bat - convert the treebank to UD & conllu. Also, this contains notes on what checkups and preparation steps should be done for a UD release.
- PmlCorporaTools/postprocessConlluForSembank_sample.bat - create UD data for FullStack project Sembank.
- PmlCorporaTools/checkNormalizeSembankIds_sample.bat - ID verification before including treebank files into SemBank - this is what is done in Treebank moratorium.
- PmlCorporaTools/PmlCorporaTools/aTreeTransformator_sample.bat - convert the treebank to old dependency formats used before UD.
- For parameter specifics to create a PML fileset according to the current naming conventions, see PmlCorporaTools/LVK2LVTB-PML.readme.md
- PmlCorporaTools/prepareForLvtbInclusion_sample.bat - add a completely new, hand-annotated file to LVTB.
- Converting the treebank to TigerXML is done as follows: Unite (if needed), TreeTransformatorUI with step --ord TOKEN, apply lvpml2tiger.xsl
- Preprocessing morphocorpus files before putting them in the repository: CheckLvPml with param M (currently not used).
- ./preparePOSTagData.sh is used for preparing morphological data for LVTagger