m4, GNU Make, libtool, Autoconf, automake, gettext, help2man (built with gettext), Perl, Flex, Python, Cython, tar and wget.
git clone https://github.com/rrthomas/recode.git
cd recode
./bootstrap
./configure
make check
make install
To make a release, you’ll need woger and github-release, suitably configured. Having tested and pushed all the changes, update the version number in configure.ac and write the NEWS entry, and push those too.
Then execute: make release
Recode is due for a major overhaul. I want to add a run-time dependency between Recode and Python, with the admitted goal of shifting the internals of Recode from C to Python.
To experiment with what Recode might become, and to try out new concepts more easily, I created a subsidiary, standalone Python project named Recodec, which reproduces a good part of Recode's functionality. My goal is now to merge Recodec back into Recode, rather than let the two slowly drift apart. Recode is going to be a mix of Python, C and Cython.
Recode 4 should be organised thus:
- The main program is written in Python, and through a Cython interface, calls the existing C API for doing the real work.
- The C API merely becomes able to use Cython-written steps internally, alongside the existing C steps, although no Cython steps exist yet.
- New Cython steps wrap many standard Python codecs, with some trickery to force Python codecs to take precedence over the existing, older Recode steps.
- Recode library initialization is moved from C to Python, and gets called through Cython from the C API.
- Initialization is extended to cover the Recodec Python API, which uses different tables and descriptive data.
- More steps from Recodec get moved into Recode, either coexisting with or taking over the previous wrapping of Python codecs.
- The remaining code from the Recodec engine gets moved into Recode, replacing C code having the same functionality.
- Special care is given to GNU libc or libiconv support, maybe going from the C side to the Python side.
- Proper documentation and decisions follow extensive comparison and diagnosis of multiple implementations of the same charsets or surfaces.
- Profiling allows fine-tuning of when and how Cython gets used over Python; standard Python codecs might even be cythonized in Recode.
- Program and library initialization get revised to avoid disk accesses and the building of descriptive structures, whenever possible.
- The main program directly links to the Python API rather than through the C API, while the C API becomes a separate facility.
Whenever the Python library offers a charset or a surface which Recode also has, the Python library codec is used. In some cases, this introduces differences. These will have to be resolved one by one, either by accepting that the Python library does better, by getting the Python team to improve some codecs, or by overriding these codecs from Recodec.
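The preference rule above can be sketched with the standard `codecs` module. This is only an illustration of the idea, not code from Recode or Recodec: the function name `pick_codec` and the `fallback_names` collection are hypothetical, standing in for whatever tables Recodec would actually provide.

```python
import codecs


def pick_codec(name, fallback_names=()):
    """Prefer the Python library codec; otherwise fall back.

    Returns ("python", CodecInfo) when the Python library knows `name`,
    or ("recodec", name) when only the (hypothetical) fallback set does.
    Raises LookupError when neither side knows the charset.
    """
    try:
        # codecs.lookup raises LookupError for unknown encodings.
        return "python", codecs.lookup(name)
    except LookupError:
        if name in fallback_names:
            return "recodec", name
        raise


# The Python library has a Latin-1 codec, so it wins.
origin, _ = pick_codec("latin-1")
print(origin)
```

Overriding a Python codec from Recodec, as mentioned above, would amount to checking the override table before calling `codecs.lookup`; that variant is left out to keep the sketch short.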
Other differences may occur, especially in the Asian charset area, from the fact that libiconv, the GNU libc recoding facilities, and the various contributors to the Python codecs project do not fully agree on how things should be done. Recodec is likely to offer configuration mechanisms to choose among the various possibilities, but is unlikely to attempt to rule on who is right and who is wrong! ☺
Issues about reversibility and canonicity, which were very present in Recode 3.X, are fading out. While some of these were moderately easy to implement, other cases stayed pending as fairly difficult to solve without a significant loss of efficiency. I think these issues are better abandoned than forever kept as half-hearted and not wholly dependable. Any user concerned about such things might try the reverse coding to find out if the original file is recoverable; some new option might automate a (costly) reversibility test.
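Such a reversibility test could look roughly like the following. It is a minimal sketch that uses plain Python codecs rather than Recode itself; the function name `is_reversible` is hypothetical, and no such option exists in Recode today.

```python
def is_reversible(data: bytes, source: str, target: str) -> bool:
    """Recode `data` from `source` to `target` and back again.

    Returns True only if the round trip reproduces the original bytes.
    A failure to decode or encode counts as "not reversible".
    """
    try:
        recoded = data.decode(source).encode(target)
        back = recoded.decode(target).encode(source)
    except UnicodeError:
        # Covers both UnicodeDecodeError and UnicodeEncodeError.
        return False
    return back == data


sample = "café".encode("latin-1")
print(is_reversible(sample, "latin-1", "utf-8"))   # True
print(is_reversible(sample, "latin-1", "ascii"))   # False: é has no ASCII form
```

The cost is visible in the sketch: the data is converted twice and the whole original has to be kept around for the comparison.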
One drawback of the whole move is that the Global Interpreter Lock in Python gets in the way of parallel execution of the code. This would have been more of a concern if the GNU libc recoding facilities relied on the Recode library, but as things stand now, I’m guessing that users will not be much impacted in practice.
- IETF references
- Character Mnemonics & Character Sets, by Keld Simonsen, 1992-06.
- UTF-7 - A Mail-Safe Transformation Format of Unicode, by David Goldsmith and Mark Davis, 1994-07.
- UTF-8, a transformation format of Unicode and ISO 10646, by François Yergeau, 1997-10.
- Various references
- Unicode charset mappings. The Unicode consortium makes available plenty of charset mappings for converting legacy charsets to Unicode.
- Normalisation et internationalisation: Inventaire et prospectives des normes clefs pour le traitement informatique du français. (392p.) or this other copy. This is a report, written in French, discussing charset issues and many other topics as well. Laurent Bourbeau and François Pinard, 1995-10.
- Recode specific
- ETL presentation
In 1999, the organisers of the m17n99 conference in Tsukuba, Japan, were kind enough to invite me. It was a fabulous trip and experience for me, and I met many extraordinary people there. At the conference, I presented the Translation Project, and Recode. The Recode presentation slides are available.
- libiconv
- This comprehensive charset converter library, by Bruno Haible, revolves around Unicode, and supports Asian encodings among many others. Even Recode uses it!
- tcs
- Here is the main recoding tool from the Plan9 project.
- yudit
- This GUI editor, by Gaspar Sinai, 1999-01, handles many encodings, among which UTF-8. It also installs uniconv, a recoding program, and uniprint, a printing tool.
- ucs-fonts
- These 6x13 fonts, by Markus Kuhn, 1998-11, cover Unicode characters besides the Asian sets, and simply replace the Linux fixed 6x13 font. They work nicely with yudit.
- MtRecode
- This charset converter is oriented towards SGML text manipulation. It may be freely downloaded for non-commercial, non-military use. Pointer given by Jean Véronis, 1996-06.
- sp
- This quite nice SGML structure analyser, by James Clark, contains internal C++ modules for handling many charsets.
- b2c
- This program, by Jörg Heitkötter, 1997-11, is able to generate interpreted character dumps, but properly embedded within complete C header files.
- PyRecode
- This wrapper, by Andreas Jung, provides Recode functionality to Python programs. Also see this link and this other link.