Jim Nottingham
Originally, this article was planned to cover how to get your Acorn machine to make sense of text once it had been imported from, for example, a laptop such as the Samsung 'Magic Note'. However, it seems text manipulation is one of the things readers have been asking about in the recent Archive questionnaires, so I provisionally agreed with Paul to extend the article to cover a wider range of 'foreign' text imports from Macs, IBM-compatibles and the like.
The idea seems to have caught on even before I put digits to keyboard and it has become clear that we should perhaps make the article as suitable for the beginner as for the relatively experienced reader. It has therefore grown from the original single page to the point where we ought to split it into two sessions. This first part will consider the overall problem of text import with some of the terminology and basic ground rules, and the second will build on this knowledge to introduce specific methods of converting imported text to Acorn-speak. Much of what will be discussed has already appeared piecemeal in the Hints and Tips columns of earlier issues of Archive but, for the benefit of recent subscribers, this is an opportunity to revisit the subject, pull it all together and add some more hints.
The problem
Part of my day job for 'UK plc' is to collate, edit and publish technical reports, incorporating material from the boffins worldwide. The graphics and text come in a myriad of formats ranging from something called ASCII to what looks like Zarathustrian. I can be sure of three things: firstly, foreign graphics import will not normally present a problem (thanks to an ever-widening range of transfer utilities such as ChangeFSI, ImageFS and Translator); secondly, the text will invariably come up as 'scribble' on the Acorn; thirdly, the material will always be late. This last factor means there simply isn't time to go back to the originator for a reformat of the text - or even to find out what the format is - and I have to make the best of what I get. So, of necessity, over a period and by dint of empirical sampling (posh phrase for suck it and see...), I've managed to deduce a number of ways of making some sense of what I'm given. These methods will be discussed in Part 2.
Commercial solutions
On the face of it, effective text import utilities seem to be in the minority. My own view is that this is probably to be expected as there are so many possible variations in format that, to be all-embracing, a program would have to be extremely clever. At present, some applications such as Impression Publisher do incorporate modules which purport to allow foreign text formats to be loaded directly. However, the range is by no means exhaustive and, in practice, individual modules do not appear to work very well. I think this is probably because there are significant formatting differences between the diverse versions of any one application (e.g. WordPerfect variants such as WP from my Magic Note, WP v5.1 for DOS, WP v5.2 for Windows, and so on). Clearly this problem is by no means limited to Acorn machines and there is no such thing as an industry standard. (There's an excellent review of the current state of affairs on p21 of the October '94 issue of Acorn User.)
All this may seem odd as, surely, text is text? This is primarily true, but it is not unusual for a single page of text to be interspersed and surrounded by literally pages of what appears to be scribble. These are the formatting commands used by the foreign application. It is our job to recognise the original text and devise ways of filtering out the 'noise'. As the proud owner of (any) Acorn machine - and unlike Macs and PCs - it won't cost you anything apart from your time because, in Edit, you already have an excellent piece of software to do the job.
ASCII codes
On to the terminology we will need to use, starting with the 'signal', i.e. the actual alpha-numeric characters that we will wish to finish up with on screen and, eventually, in print. Fortunately for us, Gerald Fitton wrote a very clear and informative section on this in a recent issue of Archive, so from now on I will assume you will have re-read that and understand the relevance of ASCII (pronounced "Askey") which is the acronym for the American Standard Code for Information Interchange.
To reiterate briefly, the 256 ASCII codes to which Gerald refers are basically sub-divided into three: the printer instruction codes (ASCII codes 0-31), the alphanumeric characters you see on your keyboard (covered by codes 32-127) and all the 'funny' characters you may wish to add in by some means (codes 128-255). The ASCII codes and the characters to which they relate are not presented very clearly in the various Acorn user-guides, if at all, so I've listed the so-called 'standard' set on the table. (In fact it's by no means standard but I'll discuss that later.)
Binary code
Gerald described the binary code system used by the computer, which always confuses me but, fortunately, we won't need to use it in this exercise. The only significance here is that, as was mentioned, when the codes are written out as eight binary digits, those for the 'funny' characters always start with a 1 instead of a 0, and so they are often called the 'top-bit set' characters.
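For readers who like to see such things spelled out, here is a minimal sketch in Python (not an Acorn tool, and nothing you need for the exercise) of the three-way split and the 'top-bit set' test. The function names classify and top_bit_set are my own, invented purely for illustration.

```python
def classify(code: int) -> str:
    """Return which of the article's three groups a code (0-255) belongs to."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    if code <= 31:
        return "control"       # printer instruction codes, 0-31
    if code <= 127:
        return "keyboard"      # the characters on the keyboard, 32-127
    return "top-bit set"       # the 'funny' characters, 128-255


def top_bit_set(code: int) -> bool:
    """True if the leftmost of the eight binary digits is a 1."""
    return (code & 0b10000000) != 0


# Show the decimal code, its eight binary digits, its group and the top-bit test.
for code in (13, 65, 169):
    print(code, format(code, "08b"), classify(code), top_bit_set(code))
```

Every code from 128 upwards has that leading 1, which is all the name 'top-bit set' means.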
Hexadecimal code
Just when we thought we'd avoided clever counting systems, in comes another - the hexadecimal system - often abbreviated to hex and, in print, usually preceded by the ampersand character (&). We good Europeans are quite used to working in decimal notation (0-9); hex is just another system, this time counting in sixteens (0-15). British readers of a certain age, like me, will find this relatively easy because we used to have to count in sixteens! (Hands up the wrinklies who remember the good old days when we had 16 ounces to the pound.)
The problem with representing hex numbers on screen or paper, using just the conventional decimal digits 0-9, is that we run out of characters. So the hex system uses the letters a-f (in either lower or upper case) to represent the six decimal numbers 10-15. Confused? So am I. Not to worry, I've listed the ASCII characters on the table in both decimal and hexadecimal formats so that we can work out the relationship and use whichever system is appropriate.
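If you would rather let a machine do the counting in sixteens, the conversion can be sketched in a few lines of Python. This is only an illustration: the helper names to_acorn_hex and from_acorn_hex are hypothetical, and the two-digit, lower-case, ampersand-prefixed layout is simply chosen to match the way hex numbers are written in this article.

```python
def to_acorn_hex(n: int) -> str:
    """Write a decimal code 0-255 as two hex digits with a leading ampersand."""
    return "&" + format(n, "02x")


def from_acorn_hex(text: str) -> int:
    """Turn something like '&1d' (or plain '1d') back into a decimal number."""
    return int(text.lstrip("&"), 16)


# &1d is one lot of sixteen plus thirteen (d), i.e. 1 * 16 + 13 = 29.
print(to_acorn_hex(13), to_acorn_hex(26), from_acorn_hex("&1d"))
```

The left-hand digit counts the sixteens and the right-hand digit counts the units, in just the same way as pounds and ounces.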
Why do we need hexadecimal? Well, have a look at the following which is a typical result of text imported direct from a 'foreign' word-processor into Edit:
[1d]
[00][09]Ð[02]@[02] [05]
[00][1d]Now is the winter of our discontent [0d]
Made glorious summer by this sun of York [0d]
And all the clouds that lour'd upon our house [0d]
In the deep bosom of the ocean buried.[1a]
In this short sample, the required text is easily recognised but there are a couple of 'funny' characters and some strange-looking numbers in square brackets, e.g. [1d]. In Edit and some other text-processors, a number in square brackets is used, conventionally, to represent an ASCII character whose number is given in hexadecimal format. For example, if you look up &0d and &1a in the table, you will see they are the same as the decimal numbers 13 and 26. We will need to devise a method to strip out all these funny hex numbers and this will be discussed in Part 2.
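To make the square-bracket convention concrete, here is a rough Python sketch that displays raw bytes in a similar way. It is not how Edit itself is written, and it rests on assumptions of mine: that only codes 0-31 are shown in brackets, that code 10 (line feed) simply starts a new line, and that everything else is shown as its Latin1 character. The name edit_style is hypothetical.

```python
def edit_style(raw: bytes) -> str:
    """Show raw bytes roughly as the sample above appears in Edit (assumed rules)."""
    out = []
    for b in raw:
        if b == 0x0A:
            out.append("\n")                         # assumption: line feed starts a new line
        elif b < 0x20:
            out.append(f"[{b:02x}]")                 # control code shown as [hex]
        else:
            out.append(bytes([b]).decode("latin-1")) # ordinary or top-bit set character
    return "".join(out)


# Hand-built bytes chosen to resemble the start of the sample, not a real imported file.
sample = b"\x00\x1dNow is the winter of our discontent \x0d\x0a"
print(edit_style(sample))
```

Seen this way, the [0d] at the end of each line of the sample is nothing more mysterious than code 13 sitting in the file just before the line break.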
Printer codes
The ASCII codes in the range 0-31 will not actually reproduce characters on screen but are used as coded commands, often embedded in the text, to tell the computer and/or printer to carry out a particular operation. Numerically, they are the exact equivalent of the Basic VDU commands so, for example, ASCII code 13, hex code &0d and VDU 13 all mean the same thing: Carriage Return. I've put a selection of the meanings of these codes on the table (from Beeb days, you may recognise VDU2/3 as Printer on/off).
Further considerations
That really concludes coverage of the terminology and ground rules we will need to be familiar with to progress to Part 2 of this article. However, having introduced the 'standard' ASCII character set and presented the table, we can usefully go on to consider allied topics which, although nothing to do with foreign text import, you may nevertheless find a worthwhile refresher.
Standardisation
Although the ASCII codes are supposedly a standard way of representing characters, they are by no means universal as, strictly speaking, they apply only to the ISO 8859/1 'Latin1 Alphabet' font. Your computer should be set to this default on delivery. If appropriate to your needs, you can configure the computer to use a different alphabet such as Latin 2-4, Cyrillic or Greek. The range of available alphabets and how to get the computer to use them will vary with the version of RISC OS you have, so see your User Guide for details. I believe that, apart from Hebrew, characters in the ASCII range 32-127 are standard. However, the ones in the range 128-255 may well vary with the alphabet you are using.
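The effect of changing alphabet can be illustrated with a short Python sketch. It uses Python's own ISO 8859 tables as stand-ins for the Acorn alphabets, which is an assumption on my part: the alphabets supplied with RISC OS may not match these tables character for character, particularly in the range 128-159. The names ALPHABETS and show are hypothetical.

```python
# Stand-in tables: Python's ISO 8859 codecs, assumed here to approximate
# the similarly named alphabets on the Acorn.
ALPHABETS = {
    "Latin1": "iso8859_1",
    "Latin2": "iso8859_2",
    "Cyrillic": "iso8859_5",
    "Greek": "iso8859_7",
}


def show(code: int) -> dict:
    """Return the character that one code (0-255) gives in each alphabet."""
    raw = bytes([code])
    # errors="replace" copes with the few codes a table leaves undefined.
    return {name: raw.decode(codec, errors="replace")
            for name, codec in ALPHABETS.items()}


print(show(65))    # a code in the range 32-127
print(show(169))   # a top-bit set code
```

A code in the lower range comes out the same whichever of these tables is used, whereas a top-bit set code may give a quite different character from one alphabet to the next.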
There will be other reasons why a supposedly standard Latin1 alphabet font, on paper and/or screen, will not give the characters listed on the table and you need to watch out for this. Some of the reasons are:
Entering top-bit set characters
The top-bit set characters, i.e. those in the ASCII range 128-255, do not appear on the keyboard, so how do we get them onto the screen and printed? Let us take a fairly common one as a working example; the © copyright character. In practice, we have a number of solutions available to us:
Whichever system or systems you use will depend on your personal taste but the options give a powerful set of choices. My own preference is to use <Alt> in conjunction with the numeric keypad as it is a convenient, universal method without need to call up another utility. However, as none of the user-guides include a convenient listing of the ASCII code characters, I always have a handy reference chart available. This is simply a cut-down version of the table so I'm including a drawfile version of ASCIIChars (45k).