# likumi-db

Latvijas Republikas likumu konsolidētās vēsturiskās versijas teksta formātā

Consolidated historical, plain-text versions of the laws of the Republic of Latvia

## Tree structure

incoming HTML in the 'intake' folder
cleaned text versions in the 'text' folder
diff comparison reports in the 'diff' folder

## Build process

If you are reading this for the first time, please read the Prerequisites section first.

Here's how we ultimately build the diffs:

Retrieve the home-page version of the law manually and place it in the intake/ folder.
Run python3 write_versions.py <(law file).html>. This produces a .ver file by extracting the historical-versions information from the HTML.
Manually add information about the law to toc.json. The important part is print_id, which is required to retrieve the cleaner "formatted for print" versions of the HTML.
Run python3 write_retrievers.py <(law short id)>, for example 'python3 write_retrievers.py satversme'. This loads toc.json and parses the .ver file to write a series of curl commands that retrieve the print versions of the law, for all available historical versions. The script is stored in the folder '(law short id)' (e.g. 'satversme').
Go into this folder and run the retrieval script. It retrieves and stores the historical-version .html files.
Run python3 clean_visibility.py <folder>. It processes all .html files and puts the cleaned versions in the 'clean/' subfolder.

This is necessary because the .html must be cleaned up a bit before it can be converted to raw .txt. To render .txt we use the text-only browser 'elinks', which currently does not support the CSS 'display:none' instruction, so the supposedly hidden text would be rendered into the .txt result. To avoid this, we explicitly remove such invisible sections from the source files, producing 'clean' intermediate .html.
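As an illustration of the cleaning step, here is a minimal sketch of how hidden elements could be removed with Python's standard `html.parser`. It is not the actual clean_visibility.py; the names `HiddenStripper` and `strip_hidden` are hypothetical. It assumes the hidden sections are marked with an inline `style="display:none"` attribute and that tags inside them are properly balanced. Comments and the doctype are dropped for brevity.

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumption: hidden sections carry an inline style such as "display:none".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenStripper(HTMLParser):
    """Re-emit HTML while skipping elements hidden via inline display:none."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    @staticmethod
    def _is_hidden(attrs):
        style = dict(attrs).get("style") or ""
        return bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if self.hidden_depth:
            if tag not in VOID_TAGS:
                self.hidden_depth += 1  # nested element inside hidden content
            return
        if self._is_hidden(attrs):
            if tag not in VOID_TAGS:
                self.hidden_depth = 1  # start skipping
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested region.
        if not self.hidden_depth and not self._is_hidden(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.hidden_depth:
            if tag not in VOID_TAGS:
                self.hidden_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.hidden_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html_text):
    """Return html_text without elements that are hidden by inline style."""
    parser = HiddenStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Text hidden through a stylesheet class rather than an inline style would not be caught by this sketch and would need a different check.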

Go back to the intake/ folder and run the ./to-txt-do <folder name> script (e.g. ./to-txt-do satversme). It in turn invokes the 'elinks' browser (you may need to build it from source). We use 'elinks' because, when it dumps the .txt result, it lets us specify a very long line width, so that almost every law paragraph is rendered as a single line, which again helps with the diff.
Run python3 clean_leadspaces.py <folder name> to remove the leading 3 spaces from all lines of the .txt files, which would otherwise also interfere with the diff.
To produce the diffs, run java -jar (built location)/LawDiff-app/target/lawdiff-app.jar in the law folder. It goes through all files in the 'txt/' folder and produces diff .html reports in the 'diff/' folder.
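The two text-preparation steps above (dumping with elinks, then stripping the indent) could be sketched roughly as follows. This is not the actual to-txt-do or clean_leadspaces.py; the function names and the width of 5000 are assumptions made for illustration, since the width actually used is not stated here. The `-dump` and `-dump-width` options are standard elinks command-line options.

```python
import subprocess
from pathlib import Path

DUMP_WIDTH = 5000  # assumed value; the real scripts may use a different width


def elinks_dump_command(html_path, width=DUMP_WIDTH):
    """Build an elinks command line that dumps html_path as wide plain text."""
    return ["elinks", "-dump", "-dump-width", str(width), str(html_path)]


def strip_lead_spaces(text, count=3):
    """Drop exactly `count` leading spaces from every line that starts with them."""
    prefix = " " * count
    lines = text.splitlines(keepends=True)
    return "".join(
        line[count:] if line.startswith(prefix) else line for line in lines
    )


def html_to_txt(html_path, txt_path):
    """Dump one cleaned .html file to .txt (requires elinks on the PATH)."""
    result = subprocess.run(
        elinks_dump_command(html_path),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    Path(txt_path).write_text(strip_lead_spaces(result.stdout), encoding="utf-8")
```

Lines that do not begin with three spaces are left unchanged, so running the stripping step twice on the same file would only affect text that was indented by six or more spaces.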

## Prerequisites

If you want to build or run any of this yourself:

You need the most recent 'elinks' browser (v0.13), built from source. See http://elinks.cz/download.html and http://elinks.cz/documentation/installation.html
Clone and build https://github.com/valters/lawdiff

Or download the app jar directly from http://repo.maven.apache.org/maven2/io/github/valters/lawdiff-app/1.0.0/lawdiff-app-1.0.0.jar

This is a consolidated jar that needs no other dependencies or Java classpath gymnastics to run.
