Project Description
TextEditorLibrary is a C# DLL defining classes for programmatic manipulation of large text buffers, with powerful pattern-matching capability.

History
Once upon a time, before IDEs had even been dreamt of, and the DEC VAX 780/VMS was cool, I used to maintain a scientific application written in a few hundred FORTRAN77 source files. The EDT editor was fine for regular work, but sometimes you needed to make a global change - and EVE/TPU was the tool of choice, with its powerful scripting language. When they took away the VAX and gave me a Windows 3.1 PC, I tried to capture some of that power in FORTRAN77. Later I did it properly in FORTRAN90. Then they took away my FORTRAN90 compiler, so I lashed something together in C. But really what I needed was C# and Visual Studio Express, which are so good that I'm nearly back where I started.

Classes
The following classes are exposed:
  • Text - a text buffer: load it from file or StreamReader or string; move around it, insert text, delete text; write it back out again.
  • Mark - a bookmark in a Text, which moves with changes: move it, move to it; extract, insert, delete and move relative to it; build a Range with two of them.
  • Range - a pair of Marks defining a contiguous section of a Text, for extracting or replacing text.
  • Pattern - a text-matching primitive or a delegate or a hierarchical combination of Patterns, used for searching; easier to use and more powerful than regular expressions.
  • PatternCount - wrapper for an int used in Patterns to allow dynamic feedback into the search process.
  • Box - derived from the Windows Forms RichTextBox to display text to the user with formatting, and to manage user input.

Implementation
Text is implemented as a linked list of short (fixed length) StringBuilders, so it is efficient to insert or delete text in the middle of a multi-GB Text. Data from file can be loaded immediately when the Text is created, or loaded sequentially in a background thread, or randomly accessed on demand with minimal memory usage.
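
To make the idea concrete, here is a minimal sketch of a buffer built from a linked list of small StringBuilders. It is not the library's Text class: the name ChunkedBuffer, the tiny chunk size and the method signatures are all assumptions made for illustration, and file loading is left out entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch only: NOT the library's Text class.
public sealed class ChunkedBuffer
{
    private const int ChunkSize = 8; // deliberately tiny so splits are easy to see
    private readonly LinkedList<StringBuilder> chunks = new LinkedList<StringBuilder>();

    public long Length { get; private set; }

    public ChunkedBuffer(string initial)
    {
        for (int i = 0; i < initial.Length; i += ChunkSize)
        {
            int n = Math.Min(ChunkSize, initial.Length - i);
            chunks.AddLast(new StringBuilder(initial, i, n, ChunkSize));
        }
        if (chunks.Count == 0) chunks.AddLast(new StringBuilder(ChunkSize));
        Length = initial.Length;
    }

    // Walk the list to the chunk holding 'position'; 'offset' is the index inside it.
    private LinkedListNode<StringBuilder> Locate(long position, out int offset)
    {
        if (position < 0 || position > Length)
            throw new ArgumentOutOfRangeException("position");
        LinkedListNode<StringBuilder> node = chunks.First;
        long remaining = position;
        while (remaining > node.Value.Length && node.Next != null)
        {
            remaining -= node.Value.Length;
            node = node.Next;
        }
        offset = (int)remaining;
        return node;
    }

    // Only the affected chunk is edited; an overfull chunk is split into new nodes.
    public void Insert(long position, string text)
    {
        int offset;
        LinkedListNode<StringBuilder> node = Locate(position, out offset);
        StringBuilder sb = node.Value;
        sb.Insert(offset, text);
        Length += text.Length;
        if (sb.Length <= ChunkSize) return;

        string overflow = sb.ToString(ChunkSize, sb.Length - ChunkSize);
        sb.Length = ChunkSize; // truncate this chunk back to its fixed size
        LinkedListNode<StringBuilder> last = node;
        for (int i = 0; i < overflow.Length; i += ChunkSize)
        {
            int n = Math.Min(ChunkSize, overflow.Length - i);
            last = chunks.AddAfter(last, new StringBuilder(overflow, i, n, ChunkSize));
        }
    }

    // Removes 'count' characters, possibly spanning chunks; emptied chunks are unlinked.
    public void Delete(long position, long count)
    {
        if (count < 0 || position + count > Length)
            throw new ArgumentOutOfRangeException("count");
        int offset;
        LinkedListNode<StringBuilder> node = Locate(position, out offset);
        long left = count;
        while (left > 0)
        {
            int n = (int)Math.Min(left, node.Value.Length - offset);
            node.Value.Remove(offset, n);
            left -= n;
            LinkedListNode<StringBuilder> next = node.Next;
            if (node.Value.Length == 0 && chunks.Count > 1) chunks.Remove(node);
            node = next;
            offset = 0;
        }
        Length -= count;
    }

    public override string ToString()
    {
        var all = new StringBuilder();
        foreach (StringBuilder chunk in chunks) all.Append(chunk.ToString());
        return all.ToString();
    }
}
```

In this sketch an edit touches only the chunk it lands in, so its cost does not grow with the text that follows the edit point. Locating the chunk still means walking the list from the start; a cursor that remembers its own node would avoid that walk.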

Line ends are not used to divide text up, so text with very long lines or without any line breaks does not cause a problem. Carriage returns are (optionally) removed from text on import to simplify movement and searching across line ends.

No support is offered for characters outside Unicode Plane 0 (which take two C# chars, a surrogate pair), or for combining code points.
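
The snippet below shows why this matters for char-based positions. It uses only standard .NET calls and nothing from the library; the SurrogateDemo class is a hypothetical helper written for this illustration.

```csharp
using System;

// Hypothetical helper: standard .NET only, no TextEditorLibrary types.
public static class SurrogateDemo
{
    // Counts Unicode code points instead of UTF-16 code units (C# chars).
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate encodes one code point.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                i++;
            count++;
        }
        return count;
    }
}
```

For example, "a\U0001D11Eb" (a musical G clef between two letters) has a Length of 4 but only 3 code points, so an offset of 2 would land in the middle of the clef. Likewise "e\u0301" is two chars and two code points, but displays as a single accented letter.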

Operator overloads are provided whenever possible to make code compact and easy to read:
  • Mark + Mark is a Range
  • Mark + int is an offset Mark
  • Pattern & Pattern is a Pattern which matches the first Pattern followed by the second
  • Pattern | Pattern is a Pattern which matches the first Pattern or, if the first fails, the second
  • and so on.
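
For readers who have not met C# operator overloading, the sketch below shows how operators like those above can be declared. The types Pat, Pos and Span are hypothetical stand-ins invented for this example; they are not the library's Pattern, Mark and Range, and the real signatures may well differ.

```csharp
using System;

// Hypothetical stand-in for a pattern: wraps a function that returns the
// length matched at 'pos', or -1 on failure. NOT the library's Pattern class.
public sealed class Pat
{
    private readonly Func<string, int, int> match;

    public Pat(Func<string, int, int> match) { this.match = match; }

    public int MatchAt(string text, int pos) { return match(text, pos); }

    public static Pat Literal(string s)
    {
        return new Pat((text, pos) =>
            pos + s.Length <= text.Length &&
            string.CompareOrdinal(text, pos, s, 0, s.Length) == 0 ? s.Length : -1);
    }

    // One or more decimal digits.
    public static Pat Digits()
    {
        return new Pat((text, pos) =>
        {
            int i = pos;
            while (i < text.Length && char.IsDigit(text[i])) i++;
            return i > pos ? i - pos : -1;
        });
    }

    // a & b: a, then b starting where a finished.
    public static Pat operator &(Pat a, Pat b)
    {
        return new Pat((text, pos) =>
        {
            int n1 = a.match(text, pos);
            if (n1 < 0) return -1;
            int n2 = b.match(text, pos + n1);
            return n2 < 0 ? -1 : n1 + n2;
        });
    }

    // a | b: a, or b if a fails.
    public static Pat operator |(Pat a, Pat b)
    {
        return new Pat((text, pos) =>
        {
            int n = a.match(text, pos);
            return n >= 0 ? n : b.match(text, pos);
        });
    }
}

// Hypothetical stand-ins for a position and a pair of positions.
public struct Pos
{
    public readonly int Index;
    public Pos(int index) { Index = index; }

    public static Pos operator +(Pos p, int offset) { return new Pos(p.Index + offset); }
    public static Span operator +(Pos a, Pos b) { return new Span(a, b); }
}

public struct Span
{
    public readonly Pos Start;
    public readonly Pos End;
    public Span(Pos start, Pos end) { Start = start; End = end; }
    public int Length { get { return End.Index - Start.Index; } }
}
```

With these definitions, (Pat.Literal("WRITE") | Pat.Literal("PRINT")) & Pat.Literal("(") & Pat.Digits() reads much like a grammar rule. The parentheses are needed because C# gives & higher precedence than |. This sketch does no backtracking: if the first alternative of an | matches but a later & fails, the second alternative is not retried.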

Method overloads are provided wherever possible, so that arguments can be omitted and take default values.

IntelliSense markup is provided throughout.

It is not safe to access a Text (or its Marks and Ranges) from more than one thread. However, different threads can safely work on different Texts concurrently.

Examples of Use
A long time ago: the FORTRAN77 program had thousands of WRITE statements to the console, some referencing a FORMAT statement, some not. These each had to be changed to call a SUBROUTINE so I could re-direct the output. (The syntax is very different.)

Recently: I built an index page for an HTML help system (a couple of hundred files) in three steps: use TextEditorLibrary (TEL) to read the HTML and build a list of words (in the body, dodging tags and special characters, skipping stop words); use the list to manually fix spelling mistakes; use TEL to read the HTML again, and generate the index page.

Just now: the documentation page (what, there's a documentation page? Yes, and it has some amusing example code as well - please take a look) was generated from IntelliSense markup and the code it was marking.

Project
The main reason for putting this up is so that the effort I put into it gets a chance to be useful for someone else.

As a bonus I might get some feedback on how to do it properly; I'm sure I've made some questionable design decisions through ignorance. Someone might find a bug, or make a good enhancement suggestion.

Possible enhancements:
  • I could improve the performance of searches which start with a fixed string - maybe implement Knuth-Morris-Pratt?
  • Any other ideas?
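
As a starting point for the first suggestion, here is a minimal Knuth-Morris-Pratt sketch over a plain string. It is not part of the library; the Kmp class and its method names are assumptions, and a real version would have to walk the Text's internal storage instead of indexing a string.

```csharp
using System;

// Hypothetical sketch of Knuth-Morris-Pratt search; NOT part of the library.
public static class Kmp
{
    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    public static int[] BuildFailureTable(string pattern)
    {
        var failure = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
        return failure;
    }

    // Returns the index of the first occurrence at or after 'start', or -1.
    public static int IndexOf(string text, string pattern, int start = 0)
    {
        if (pattern.Length == 0) return start;
        int[] failure = BuildFailureTable(pattern);
        int k = 0; // number of pattern characters matched so far
        for (int i = start; i < text.Length; i++)
        {
            // On a mismatch, fall back in the pattern; 'i' never moves backwards.
            while (k > 0 && text[i] != pattern[k]) k = failure[k - 1];
            if (text[i] == pattern[k]) k++;
            if (k == pattern.Length) return i - k + 1;
        }
        return -1;
    }
}
```

The text index only ever moves forward, so each character is read once. That property may suit a buffer that is stored in pieces or loaded sequentially, since the search never has to step back into an earlier piece.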

Current Status
In version 3.0 I have added support for huge files (larger than 4 GiB), including efficient low-memory random access.

Last edited Aug 12, 2014 at 10:09 PM by Bodgel, version 15