Archive

Text Import - Part 2



Jim Nottingham

In Part 1, we considered how the various characters that we see on screen and paper are designated and also looked at the range of methods we can use to enter characters into the computer. In this part, we turn to the actual business of importing text from 'foreign' sources and specifically how we can filter out the unwanted control characters. Some of the word-processor (WP) and DTP applications for Acorn machines have built-in routines and facilities to allow this to be done in part but, as we all have access to Edit, we will use this powerful text editor - in particular its invaluable Find and Replace function - as the primary tool.

At the outset, let me stress that there will always be two or more methods of achieving the same result in Edit and I can only cover a small fraction of the entire repertoire in this article. So my intention is to deal with a limited number of situations we might meet in practice and suggest solutions, considering not only the 'how' but also the 'why', in some detail. My aim is to give you sufficient familiarity with the workings of Edit so that, when you meet a situation we have not discussed, you will quickly be able to deduce a working solution from the basic principles.

For convention throughout the article, anything I ask you to type in at the keyboard will be enclosed in <> brackets, so <Text><return> would mean type in the word "Text" and then press the return key. Similarly, <Alt-169> would mean hold down the Alt key, type in the numerals 169 (on the numeric keypad) and then release the Alt key.

Preparation

For the sake of this exercise, I will assume that you have either received the 'foreign' text on floppy disc or already imported it into your computer via a serial link or through a modem. At the end of the day, you will want to have the text converted to Acorn-speak and displayed by your WP or DTP application. So, for starters, load Edit and the WP/DTP package onto the iconbar.

My first and most important bit of advice is that you should make at least one back-up copy of what you receive. Having learnt the hard way, I always make two copies on different storage media as a matter of course, one as a working copy and - most importantly - one I can still get at should the original be corrupted. (I still come out in a sweat when I remember the floppy that arrived from Saudi Arabia, two weeks late, with a cracked case and a mangled metal slider - arrghh!)

When you have the text file copied onto your hard disc or whatever, you may well find the icon above the filename will represent a PC.

This is because the filetype has been set to the PC disc operating system - DOS. Some applications (e.g. Impression Publisher) will not accept such files so, initially, it is necessary to convert their filetype to Text. I always convert the files as a matter of course because I prefer to see Acorn-style icons on the desktop!

To do this, click <menu> over the file icon and follow through the Filer-File-Set type sub-menus. Delete "DOS" in the dialogue box and type in <Text><return>.

The file icon will change to the more familiar Text style and double-clicking on it will load the file into Edit as normal. If you will be processing multiple files, you can change their filetype at one go by selecting them all and, this time, stepping through the Filer-Selection-Set type sub-menus.

Importing ASCII text

The most common and straightforward situation is when you need to import some text which has been sent in the standard, ASCII format, so let's have a look at a practical example of that. On the monthly disc is the file Example1 which we can use so, for the moment, drop the file icon onto whichever WP or DTP package you have loaded on the iconbar.

Don't worry if you don't have the monthly disc; here is a truncated section of what the file should look like in your WP/DTP package window (with acknowledgements to Richard Torrens):

These days, a fax facility is almost a
necessity for running a business. When
people asked for our fax number, they
were most put out to find that we didn't
have one - so we invested in David
Pilling's ArcFax and bought ourselves
a fax modem.

What is a computer fax?

The Acorn computer prints by sending to
the printer a graphic image of the page
which is made up as a series of dots.
Normally, an electronic representation
of those dots is sent up the cable to the
printer. The fax modem can be thought of
as a 'printer' which turns these dots
into sounds which can be sent down the
telephone line to a remote receiver.

The main characteristic to note from this example is that the text does not fill the column width. This is because 'hard' linefeeds have become embedded (invisibly) in the 'foreign' text and have been imported with it. It is possible to re-format the text manually, line by line, but that's even less exciting than watching Corel Draw re-draw on a PC screen(!), so we need a better method.

Find and replace

Let's see how we can use Edit to help us. Close and discard the WP/DTP document and load the Example1 file into Edit. For reasons which will be discussed later, it is advisable to set the Edit display to something other than the System font (for clarity, I prefer Homerton). To do this, click <menu> in the Edit window and follow through the Display-Font sub-menus.

Initially, the caret will already be in the top left-hand corner but, as it can and will be elsewhere in the file during later activities, get used to pressing the <home> key to re-set it. Finally, press <f4>, which will open up Edit's Find text box.

In normal usage, this allows us to replace one string of text with another desired string, either singly or globally, for instance replacing "Archivers" with "Archive readers" throughout a document. This is a very powerful and flexible function and will handle not only text strings, but also individual or groups of odd characters including those from the top-bit set and even, as we shall see, control characters. I believe the only limitation is that, when using window-based character-select utilities such as !Chars, we cannot enter characters into the Edit Find/Replace dialogue boxes by clicking <select>, so we must either press <shift> (as described in Ed's note in Part 1) or fall back on the keyboard entry methods. For this reason, you may find it handy to have available the table included in Part 1 (reading specs from Ed. please, not me...).

Embedded linefeeds

To deal with the unwanted linefeeds in Example1, we can strip them out by entering the appropriate character in the Find: dialogue box and globally replacing it with something else. But what is the linefeed character? If we hunt through the table, we will find that a linefeed (LF) is the control character which has the ASCII decimal number 10. Unfortunately, a quirk of Edit is that we can't use the (Alt-xxx) system to enter the control characters in the ASCII range 00-31 into the Find: box (try it - you will get the superscript "¹" instead). So we are forced to use an alternative method, in this case by entering the equivalent hexadecimal number (&0a).

Magic characters

To enable this to work, we must first click on the "Magic characters" radio button in the Find text box which extends the window to display various options. (Users of RISC OS 2 will already see these options in the Find text window, but you will need to click on Magic characters anyway.)

As shown by "hex char", we could enter the linefeed character by typing <\x> followed by the appropriate 2-digit hex number (excluding the '&'). So, in this case, typing <\x0a> would do the trick. However, this is not exactly friendly so, again as shown, Edit allows us to type in <\n> instead, which represents a linefeed or what it calls a "newline" character.
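For readers who are more at home in a programming language than in Edit's dialogue boxes, the same equivalence can be checked in Python. This is only an illustrative sketch and not part of the Edit procedure: Python is my choice for the illustration, and the sample string and the name count_linefeeds are made up for the purpose.

```python
# Python accepts the same two escapes discussed above: "\x0a" is the
# character with hex code 0a, and "\n" is the shorthand for a newline.
LINEFEED = "\n"


def count_linefeeds(text: str) -> int:
    """Return how many linefeed (ASCII 10) characters the text contains."""
    return text.count(LINEFEED)


# Made-up sample: two wrapped lines, a paragraph break, then a heading line.
sample = "a fax facility is almost a\nnecessity\n\nWhat is a computer fax?\n"

print(ord(LINEFEED))            # decimal code of the linefeed character
print("\x0a" == LINEFEED)       # True if both spellings are the same character
print(count_linefeeds(sample))  # number of linefeeds in the sample
```

Counting the linefeeds before replacing anything is a cheap sanity check, in the same way that Edit reports the number of finds.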

So, having typed <\n> in the Find: box and pressed <return>, what do we replace it with? The answer is either a space - or nothing at all! Our problem is that this decision depends on where the text originally came from (i.e. the 'foreign' application used) so, initially, I always play safe by pressing the Space bar in the Replace with: box before pressing <return> (more on that anon).

This will bring up the Text found window, indicating that Edit has found the first instance of the linefeed character.

Normally, the first find would be marked in inverse text in the Edit window but, because these linefeed characters are 'invisible', it cannot do that; however, the caret has moved to the correct position of the linefeed, i.e. the end of the first line of text.

Next, click "End of file replace" which will bring up 38 finds - the number of lines of text and paragraph breaks in the file. Now, before doing anything else, look at the result in the Edit window. You will see that the text is now ranged across the full width of the window, confirming that the unwanted linefeeds have been stripped out successfully.

Paragraph breaks

Unfortunately, the double-spaces between the paragraphs, and either side of the heading, have also been stripped out! As Harry Enfield would say, we didn't want to do that... The straightforward reason for the hiccup is that, when you think about it, double-spacing is simply two linefeeds back-to-back (in the same way that we would normally press <return> twice to get double-spacing in a document). In these cases, Edit has simply found pairs of linefeeds, back-to-back, and obediently replaced them with two spaces.
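The hiccup is easy to reproduce outside Edit. The following is a minimal Python sketch of the same blanket replacement, assuming a short made-up sample; the function name naive_unwrap is hypothetical and simply stands in for the single global replace described above.

```python
def naive_unwrap(text: str) -> str:
    """Replace every linefeed with a space, as in the first attempt."""
    return text.replace("\n", " ")


# Made-up sample: one wrapped line, then a heading with a blank line either side.
sample = "a fax\nmodem.\n\nWhat is a computer fax?\n\nThe Acorn computer"

result = naive_unwrap(sample)
print(repr(result))
```

Each pair of back-to-back linefeeds comes out as two spaces, so the heading runs straight into the text on either side of it.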

Before we correct the error, look again at the text in the Edit window, specifically where the linefeeds used to be. You will see that - appropriately in this case - there is a single space, indicating that we were correct to have replaced the linefeeds with a space. If we hadn't, the words at the end of each line and the start of the next would have been joined together which would be a pain to untangle. Had there been double spaces (i.e. an unwanted space had been added in each case), this would have indicated that we should have replaced the linefeeds with nothing.
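The space-or-nothing decision could also be made before replacing anything, by looking at what precedes each linefeed. The sketch below is a hypothetical heuristic of my own in Python, not something Edit does for us: the name choose_replacement and the 'more than half the lines' threshold are assumptions made purely for illustration.

```python
def choose_replacement(text: str) -> str:
    """Guess whether single linefeeds should become a space or nothing.

    Assumed heuristic: if most wrapped lines already end with a space
    before the linefeed, adding another would give double spaces, so
    the linefeed should be replaced with nothing.
    """
    # Every piece except the last is followed by a linefeed; skip empty
    # pieces, which come from paragraph breaks.
    lines = [line for line in text.split("\n")[:-1] if line]
    if not lines:
        return " "  # nothing to judge by, so play safe with a space
    with_trailing_space = sum(1 for line in lines if line.endswith(" "))
    return "" if with_trailing_space * 2 > len(lines) else " "


print(repr(choose_replacement("almost a\nnecessity for\nrunning")))
print(repr(choose_replacement("almost a \nnecessity for \nrunning")))
```

If in doubt, the approach in the text still applies: replace with a space first, inspect the result and undo if double spaces appear.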

Back to the problem of how to retain paragraph spacing. In this case we've messed it up, so we can either go back one step by clicking on "Undo" in the Text found box and then clicking on "Stop" or, alternatively, discard the Edit file altogether and start again.

What we need to do is devise a method of getting Edit to recognise and strip out single linefeeds while ignoring double linefeeds. We can achieve this by running through the following procedure:

  1. Temporarily replace each double linefeed (\n\n) with something completely different (a 'dummy').
  2. Strip out the single linefeeds as above (the 'dummies' representing the double linefeeds will be disregarded).
  3. Replace/restore the 'dummies' with double linefeeds (or single linefeeds, if you prefer).

What we use as the temporary dummy is not important except that it must be uniquely different; that is, when we come to replace it with a double linefeed, there must be no possibility of inadvertently replacing a matching string in the wanted text. I've seen people using a variety of dummy strings; "ZCZC", "%$%" and the like. For this exercise, we will use "%%".
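The uniqueness requirement is simple to check mechanically. Here is a minimal Python sketch, assuming the candidate strings mentioned above; the function name pick_dummy is hypothetical.

```python
def pick_dummy(text: str, candidates=("%%", "ZCZC", "%$%")) -> str:
    """Return the first candidate string that does not occur in the text."""
    for dummy in candidates:
        if dummy not in text:
            return dummy
    raise ValueError("every candidate dummy already occurs in the text")


print(pick_dummy("These days, a fax facility is almost a necessity"))
print(pick_dummy("Sales rose by 50%% last year"))
```

In Edit itself, the equivalent check is to search for the proposed dummy first and confirm that nothing is found.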

Working procedure

So the suggested, full procedure for importing ASCII text with embedded linefeeds is as follows:

  1. Press <home> followed by <f4>
    "Find:" Type in <\n\n><return>
    "Replace with:" Type in <%%><return>
    Click on "End of file replace" (5 finds)
    Click on "Stop" (or press <return>)
    (Note: The five paragraph spaces - double linefeeds - have now been replaced with the "%%" dummy string.)
  2. Press <home> followed by <f4>
    "Find:" Type in <\n><return>
    "Replace with:" Press <space><return>
    Click on "End of file replace" (28 finds)
    Click on "Stop" (or press <return>)
    (Note: The 28 remaining single linefeeds have now been replaced with spaces.)
  3. Press <home> followed by <f4>
    "Find:" <%%><return>
    "Replace with:" <\n\n><return>
    Click on "End of file replace" (5 finds)
    Click on "Stop" (or press <return>)
    (Note: The five instances of "%%" have been replaced and the double linefeeds restored.)
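For comparison, the three passes above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of how Edit works internally: the function name unwrap, its parameters and the abridged sample text are all made up, and the counts it reports apply to that sample only, not to the Example1 file.

```python
def unwrap(text: str, dummy: str = "%%", joiner: str = " "):
    """Strip single linefeeds but keep paragraph breaks (double linefeeds).

    Returns the converted text and the number of replacements made in
    each of the three passes.
    """
    if dummy in text:
        raise ValueError("dummy string already occurs in the text")

    # Pass 1: park each double linefeed behind the dummy string.
    doubles = text.count("\n\n")
    text = text.replace("\n\n", dummy)

    # Pass 2: the remaining linefeeds are all single, so replace them.
    singles = text.count("\n")
    text = text.replace("\n", joiner)

    # Pass 3: swap the dummies back for double linefeeds.
    restored = text.count(dummy)
    text = text.replace(dummy, "\n\n")

    return text, (doubles, singles, restored)


# Abridged, made-up sample in the style of the Example1 text.
sample = (
    "These days, a fax facility is almost a\n"
    "necessity for running a business.\n"
    "\n"
    "What is a computer fax?\n"
    "\n"
    "The Acorn computer prints by sending to\n"
    "the printer a graphic image of the page."
)

converted, counts = unwrap(sample)
print(counts)
print(converted)
```

Passing joiner="" gives the 'replace with nothing' variant for text whose lines already end in a space.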

If all has gone well, the text in the Edit window will now be ranged across its full width but the original paragraph and heading spacings will have been retained. As proof of the pudding, open an Edit save box, drag it to the WP/DTP icon and marvel at your undoubted skill in converting the 'foreign' text into fully-formatted Acorn-speak.

Familiarity and (semi-)automation

For what seems such a straightforward problem, this might appear to be a very heavy-handed procedure. However, familiarity with it comes very quickly and what we have done here for a fairly trivial sample applies equally well for the majority of ASCII-text import problems that I have met. Helping Edit to massage a 30-page 'foreign' document into perfectly formatted text in a couple of minutes can be extremely satisfying.

Indeed, because it is a relatively standard procedure, it can be semi-automated by the use of an appropriate module built into applications such as the Impression family or by using a utility such as the wonderful Keystroke. I prefer to use the latter because we can capitalise on its inherent flexibility to get round the inevitable variations in foreign-text format which, on occasions, seem to upset the built-in routines. For example, the ASCII-text output option from my Magic Note is slightly odd-ball but, by pressing <Alt-L>, Keystroke converts it to Acorn-speak at the rate of around 2 secs/page. There's productivity for you.

Next month...

With Archive space at a premium this month, this is a convenient point to break off for the moment. In the final part, planned for next month, we will look at a more complex series of problems which are typical of those we might meet in text imported directly from common 'foreign' word-processors such as Word, WordPerfect, WordStar and the like. With our knowledge and experience to date, we shall have no difficulty using Edit to convert the text to pure Acorn-speak, honest...

