Text Processing
DocBook Doclet

DocBook Doclet is a Javadoc doclet that creates DocBook XML and UML class diagrams from Javadoc comments.

ELinks is an advanced, well-established, feature-rich text-mode Web (HTTP, FTP, etc.) browser. It can render both frames and tables, is highly customizable, and can be extended via Lua, Guile, Perl, or Ruby scripts. It has limited support for CSS and JavaScript.

doclifter helps with lifting documents with nroff markup to XML-DocBook. Lifting documents from presentation level to semantic level is hard, and a really good job requires human polishing. This tool aims to do everything that can be mechanized, and to preserve, in XML comments, any troff-level information that might have structural implications. TBL tables are translated into DocBook table markup, PIC into SVG, and EQN into MathML (relying on pic2svg and GNU eqn for the last two).

DocBook is an XML vocabulary which enables you to create document content in a presentation-neutral form that captures the logical structure of the content. Using the DocBook Project XSL stylesheets, you can publish DocBook content as HTML pages and PDF files and other formats, including man pages, HTML Help, and JavaHelp.

Verbiste is a French conjugation system implemented as a C++ library, a GNOME applet, and two command-line tools. It can conjugate verbs and analyze conjugated verbs to determine their mode, tense, and person. The knowledge base contains over 6700 verbs.

Flat File Extractor

ffe is a flat file extractor. It can be used for reading different flat file structures and displaying them in different formats. ffe can read fixed-length and separated text files and fixed-length binary files. It is a command-line tool developed under GNU/Linux. The main areas of use are extracting particular fields or records from a flat file, converting data from one format to another (e.g. from CSV to fixed length), verifying a flat file structure, serving as a testing tool for flat file development, and displaying flat file content in human-readable form.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters. Ganglia is currently in use on over 500 clusters around the world and has scaled to handle clusters with 2000 nodes.

Txt2tags converts a text file with minimal markup to HTML, XHTML, SGML, DocBook, LaTeX, Lout, Man page, Creole, Wikipedia, Google Code Wiki, PmWiki, DokuWiki, MoinMoin, MagicPoint, PageMaker, AsciiDoc, or ASCII Art. It is simple and fast and features automatic TOC, macros, filters, include, tools, GUI, CLI, and Web interfaces, translations, and extensive documentation.

TWiki is a flexible, powerful, and simple Web-based collaboration platform. It is suitable for dynamic intranets and knowledge bases, and for sharing and managing documents and collaborative projects. It resembles a normal Web site, but every page can be changed from a browser. It features automatic link generation, full-text search, group authorization, Web forms, reporting, change notification, file attachments, revision control of pages and attachments, a modular templating system with skins, hierarchical navigation based on the topic parenting feature, and more. Plugins can be used to enhance the program and build groupware applications.

4Suite is a Python-based toolkit for XML and RDF application development. It features a library of integrated tools for XML processing, implementing open technologies such as DOM, RDF, XSLT, XInclude, XPointer, XLink, XPath, XUpdate, RELAX NG, and XML/SGML Catalogs. Layered upon this is an XML and RDF data repository and server, which supports multiple methods of data access, query, indexing, transformation, rich linking, and rule processing, and provides the data infrastructure of a full database system, including transactions, concurrency, access control, and management tools. It also supports HTTP, RPC, SOAP, and FTP, plus APIs in Python and XSLT.

Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform applications.

Mono Project

Mono Project is an Open Source implementation of the various ECMA and .NET framework technologies for Unix, Mac OS X, and Windows. The project includes a compiler, a class library, and a CLI runtime engine.

AFT (Almost Free Text) is a document preparation system. It is mostly free form, meaning that there is little intrusive markup; AFT source documents look a lot like plain old ASCII text. It has a few rules for structuring your document, which have more to do with formatting your text than with embedding lots of commands, and it can produce many types of output (HTML, XHTML, LaTeX, roll-your-own XML, etc.); all that needs to be done is to edit a rule file. You can even customize your own rule files for specialized output.

Enca detects the encoding of text files on the basis of knowledge of their language. It can also convert them to other encodings, allowing you to recode files without knowing their current encoding. It supports most Central and East European languages, and a few Unicode variants independently of language.

OCRE is an optical character recognition (OCR) system that reads an image file and writes ASCII or Unicode characters.