
Text Import - Part 1

Jim Nottingham

Originally, this article was planned to cover how to get your Acorn machine to make sense of text once it had been imported from, for example, a laptop such as the Samsung 'Magic Note'. However, it seems text manipulation is one of the things readers have been asking about in the recent Archive questionnaires, so I provisionally agreed with Paul to extend the article to cover a wider range of 'foreign' text imports from Macs, IBM-compatibles and the like.

The idea seems to have caught on even before I put digits to keyboard and it has become clear that we should perhaps make the article as suitable for the beginner as for the relatively experienced reader. It has therefore grown from the original single page to the point where we ought to split it into two sessions. This first part will consider the overall problem of text import with some of the terminology and basic ground rules, and the second will build on this knowledge to introduce specific methods of converting imported text to Acorn-speak. Much of what will be discussed has already appeared piecemeal in the Hints and Tips columns of earlier issues of Archive but, for the benefit of recent subscribers, this is an opportunity to revisit the subject, pull it all together and add some more hints.

The problem

Part of my day job for 'UK plc' is to collate, edit and publish technical reports, incorporating material from the boffins worldwide. The graphics and text come in a myriad of formats ranging from something called ASCII to what looks like Zarathustrian. I can be sure of three things: Firstly, foreign graphics import will not normally present a problem (thanks to an ever-widening range of transfer utilities such as ChangeFSI, ImageFS and Translator); secondly, the text will invariably come up as 'scribble' on the Acorn; thirdly, the material will always be late. This last factor means there simply isn't time to go back to the originator for a reformat of the text - or even to find out what the format is - and I have to make the best of what I get. So, of necessity, over a period and by dint of empirical sampling (posh phrase for suck it and see...), I've managed to deduce a number of ways of making some sense of what I'm given. These methods will be discussed in Part 2.

Commercial solutions

On the face of it, effective text import utilities seem to be in the minority. My own view is that this is probably to be expected as there are so many possible variations in format that, to be all-embracing, a program would have to be extremely clever. Presently, some applications such as Impression Publisher do incorporate modules which purport to allow foreign text formats to be loaded directly. However, the range is by no means exhaustive and, in practice, individual modules do not appear to work very well. I think this is probably because there are significant formatting differences between the diverse versions of any one application (e.g. WordPerfect variants such as WP from my Magic Note, WP v5.1 for DOS, WP v5.2 for Windows, and so on). Clearly this problem is by no means limited to Acorn machines and there is no such thing as an industry standard. (There's an excellent review of the current state of affairs on p21 of the October '94 issue of Acorn User.)

All this may seem odd as, surely, text is text? This is primarily true, but it is not unusual for a single page of text to be interspersed and surrounded by literally pages of what appears to be scribble. These are the formatting commands used by the foreign application. It is our job to recognise the original text and devise ways of filtering out the 'noise'. If you are the proud owner of (any) Acorn machine then, unlike Mac and PC users, it won't cost you anything apart from your time because, in Edit, you already have an excellent piece of software to do the job.

ASCII codes

On to the terminology we will need to use, starting with the 'signal', i.e. the actual alpha-numeric characters that we will wish to finish up with on screen and, eventually, in print. Fortunately for us, Gerald Fitton wrote a very clear and informative section on this in a recent issue of Archive, so from now on I will assume you will have re-read that and understand the relevance of ASCII (pronounced "Askey") which is the acronym for the American Standard Code for Information Interchange.

To reiterate briefly, the 256 ASCII codes to which Gerald refers are basically sub-divided into three; the printer instruction codes (ASCII codes 0-31), the alphanumeric characters you see on your keyboard (covered by codes 32-127) and all the 'funny' characters you may wish to add in by some means (codes 128-255). The ASCII codes and the characters to which they relate are not presented very clearly in the various Acorn user-guides, if at all, so I've listed the so-called 'standard' set on the table. (In fact it's by no means standard but I'll discuss that later.)
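As a rough illustration of the three-way split just described, here is a minimal Python sketch. Python is a modern stand-in here (it is not something the article itself uses) and the function name `ascii_group` is purely hypothetical; the ranges are the ones given in the paragraph above.

```python
def ascii_group(code: int) -> str:
    """Return which of the article's three groups a character code falls into."""
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    if code <= 31:
        return "printer/control code"      # codes 0-31
    if code <= 127:
        return "keyboard character"        # codes 32-127
    return "top-bit set character"         # codes 128-255


for code in (13, 65, 169):
    print(code, ascii_group(code))
```

The three sample codes are Carriage Return, the letter A and the © character respectively.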

Binary code

Gerald described the binary code system used by the computer which always confuses me but, fortunately, we won't need to use that in this exercise. The only significance here is that, as was mentioned, the 'funny' characters always start with a binary number 1 instead of 0 and so are often called the 'top-bit set' characters.
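For anyone who does want to see the 'top bit' for themselves, the following Python sketch (a modern stand-in, with the hypothetical helper name `top_bit_set`) writes a code out as eight binary digits and tests the leading one. It assumes one character per 8-bit byte, as in the rest of this article.

```python
def top_bit_set(code: int) -> bool:
    """True when the first of the eight binary digits is 1, i.e. the code is 128 or above."""
    return (code & 0b10000000) != 0


for code in (65, 169):
    # format(code, "08b") writes the code as eight binary digits
    print(code, format(code, "08b"), top_bit_set(code))
```

The letter A (code 65) starts with a 0, while the © character (code 169) starts with a 1.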

Hexadecimal code

Just when we thought we'd avoided clever counting systems, in comes another - the hexadecimal system - often abbreviated to hex and, in print, usually preceded by the ampersand character (&). We good Europeans are quite used to working in decimal notation (0-9); hex is just another system, this time counting in sixteens (0-15). British readers of a certain age, like me, will find this relatively easy because we used to have to count in sixteens! (Hands up the wrinklies who remember the good old days when we had 16 ounces to the pound.)

The problem with representing hex numbers on screen or paper, using just the conventional decimal numbers 0-9, is that we run out of characters. So the hex system uses the lower-case letters a-f to represent the six decimal numbers 10-15. Confused? So am I. Not to worry, I've listed the ASCII characters on the table in both decimal and hexadecimal formats so that we can work out the relationship and use whichever system is appropriate.
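If the table is not to hand, the conversion between the two systems can be sketched in a few lines of Python (a modern stand-in language, not part of the original article). The helper names `to_acorn_hex` and `from_acorn_hex` are hypothetical; the '&' prefix and lower-case letters simply follow the convention described above.

```python
def to_acorn_hex(code: int) -> str:
    """Write a decimal code as two hex digits with the '&' prefix."""
    return "&" + format(code, "02x")


def from_acorn_hex(text: str) -> int:
    """Convert a string such as '&1a' back to its decimal value."""
    return int(text.lstrip("&"), 16)


for code in (13, 26, 169):
    hex_form = to_acorn_hex(code)
    print(code, hex_form, from_acorn_hex(hex_form))
```

Each hex digit counts sixteens in the left-hand column and units in the right-hand one, so &1a is one sixteen plus ten, or 26.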

Why do we need hexadecimal? Well, have a look at the following which is a typical result of text imported direct from a 'foreign' word-processor into Edit:

[1d]
[00][09]Ð[02]@[02] [05]
[00][1d]Now is the winter of our discontent [0d]
Made glorious summer by this sun of York [0d]
And all the clouds that lour'd upon our house [0d]
In the deep bosom of the ocean buried.[1a]

In this short sample, the required text is easily recognised but there are a couple of 'funny' characters and some strange-looking numbers in square brackets, e.g. [1d]. In Edit and some other text-processors, a number in square brackets is used, conventionally, to represent an ASCII character whose number is given in hexadecimal format. For example, if you look up &0d and &1a in the table, you will see they are the same as the decimal numbers 13 and 26. We will need to devise a method to strip out all these funny hex numbers and this will be discussed in Part 2.
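To make the bracket notation concrete, here is a small Python sketch that picks out the bracketed hex numbers from one line of the sample and lists their decimal values. It is only an illustration of the notation, not one of the methods to be covered in Part 2; Python is a modern stand-in and the name `bracket_codes` is hypothetical. It also assumes that every two-digit hex number in square brackets really is a code, so genuine text that happened to contain something like "[0d]" would be misread.

```python
import re

# One line of the sample above, as displayed by Edit.
sample = "[00][1d]Now is the winter of our discontent [0d]"

# A two-digit hex number in square brackets stands for one character code.
BRACKET_CODE = re.compile(r"\[([0-9a-fA-F]{2})\]")


def bracket_codes(text: str) -> list[int]:
    """Return the decimal values of all bracketed hex codes found in text."""
    return [int(digits, 16) for digits in BRACKET_CODE.findall(text)]


print(bracket_codes(sample))
```

For this line, the three codes found are 0, 29 and 13, the last being the Carriage Return at the end of the line.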

Printer codes

The ASCII codes in the range 0-31 will not actually reproduce characters on screen but are used as coded commands, often embedded in the text, to tell the computer and/or printer to carry out a particular operation. Numerically, they are the exact equivalent of the Basic VDU commands so, for example, ASCII code 13, Hex code &0d and VDU 13 all mean the same thing: Carriage Return. I've put a selection of the meanings of these codes on the table (from Beeb days, you may recognise VDU2/3 as Printer on/off).
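The equivalence of the decimal and hex forms, and the fact that these codes do not produce visible characters, can be checked with a short Python sketch (a modern stand-in; Basic's VDU command itself is not reproduced here). The names are hypothetical.

```python
CARRIAGE_RETURN = 13

# Decimal 13 and hex 0d are the same number written two ways.
same_number = CARRIAGE_RETURN == 0x0D

# Python writes the Carriage Return character as "\r".
is_carriage_return = chr(CARRIAGE_RETURN) == "\r"

# None of the codes 0-31 counts as a printable character.
non_printing = [code for code in range(32) if not chr(code).isprintable()]

print(same_number, is_carriage_return, len(non_printing))
```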

Further considerations

That really concludes coverage of the terminology and ground rules we will need to be familiar with to progress to Part 2 of this article. However, having introduced the 'standard' ASCII character set and presented the table, we can usefully go on to consider allied topics which, although nothing to do with foreign text import, you may nevertheless find a worthwhile refresher.

Standardisation

Although the ASCII codes are supposedly a standard way of representing characters, they are by no means universal as, strictly speaking, they apply only to the ISO 8859/1 'Latin1 Alphabet' font. Your computer should be set to this default on delivery. If appropriate to your needs, you can configure the computer to use a different alphabet such as Latin 2-4, Cyrillic or Greek. The range of available alphabets and how to get the computer to use them will vary with the version of RISC OS you have, so see your User Guide for details. I believe that, apart from Hebrew, characters in the ASCII range 32-127 are standard. However, the ones in the range 128-255 may well vary with the alphabet you are using.
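The effect of changing alphabet can be sketched in Python, which has built-in tables for the published ISO 8859 alphabets. This is a modern stand-in under stated assumptions: the codec names below are Python's own, they follow the ISO tables rather than Acorn's versions of them, and the helper name `show` is hypothetical.

```python
def show(code: int, alphabet: str) -> str:
    """Return the character that a single code stands for in the given alphabet."""
    return bytes([code]).decode(alphabet)


# Latin 1, Latin 2, Cyrillic and Greek, using Python's codec names.
for alphabet in ("latin-1", "iso8859_2", "iso8859_5", "iso8859_7"):
    # Code 65 is in the common range 32-127; code 169 is in the top-bit set.
    print(alphabet, show(65, alphabet), show(169, alphabet))
```

Code 65 comes out as A every time, whereas code 169 is © in Latin 1 but a different character in, for example, the Cyrillic alphabet.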

There will be other reasons why a supposedly standard Latin1 alphabet font, on paper and/or screen, will not give the characters listed on the table and you need to watch out for this. Some of the reasons are:

Printers will not necessarily replicate the characters you see on screen. With PostScript printers for example, the font resident in the printer must be an exact replica of the outline font you are using for screen display. Elsewhere, the printer-driver may not be perfect and - a common example - give you a hash (#) when you wanted a £.
Many fonts, particularly those from PD sources or 'fancy' fonts, will not reproduce all the characters in the top-bit set. Some do not include any characters in the ASCII range above 127. If you try to enter a character which has not been added to the set, not surprisingly, it will not be displayed on screen. In some applications, you might get the equivalent hex number displayed, in square brackets. In others, you will apparently get nothing at all. I say "apparently" because, in fact, you are actually getting a blank character with zero width.
Some font suppliers have their own minor variations, usually in the ASCII range 128-143. The Electronic Font Foundry, for instance, have what they call the 'EFF Extensions' and even Acorn's fonts are not 100% standard to ISO 8859.
Acorn's bit-map System font is an odd-ball and produces some different characters in the ASCII range 128-159.
Some fonts are designed and produced for a specific purpose and the character set is almost wholly different from the standard with which we are familiar. Examples are foreign fonts such as Bengali, or symbol fonts such as Dingbats or MathGreek.

Entering top-bit set characters

The top-bit set characters, i.e. those in the ASCII range 128-255, do not appear on the keyboard, so how do we get them onto the screen and printed? Let us take a fairly common one as a working example; the © copyright character. In practice, we have a number of solutions available to us:

Text utilities - RISC OS 3.1 machines come with the Acorn program !Chars in the Apps directory and there are others. Again, Gerald Fitton has covered this well in his article. The only point I would add is that, when using the utility, it often helps to select the required outline font in the !Chars window as this displays the particular range of characters included in that font. This also works for the foreign and symbol fonts. If we don't do this, the Acorn System font will be displayed by default and, as discussed above, we may well get something completely different. To enter the © character into a text-processor, all we need to do is position the caret where we want the character to appear, and click <select> on © in the !Chars window. This method is probably the most user-friendly and is universal, i.e. it always works for any available character and for any outline font. Usually, this is used for entering top-bit set characters, although I know somebody who has a dicky key on his keyboard which won't produce a 5 or % so, rather than go to the expense of a repair or replacement, he uses !Chars instead!
(Health Warning: using !Chars can seriously damage your document's health. Let me explain. An alternative way of entering characters with !Chars is to place the pointer over the desired character and press <shift>. So, if you accidentally leave the pointer over the !Chars window while you carry on typing, every time you press <shift>, you will add whatever character happens to be under the pointer at the time - this can be very disconcerting if you don't know what is happening. Ed.)
Alt key + numeric keypad - We can input any ASCII character into a text-processor by positioning the caret where we want the character entered, pressing and holding down <Alt>, typing the 2- or 3-digit ASCII code (using decimal as listed on the table) on the numeric keypad and then releasing <Alt>. Nothing will appear to happen until we release <Alt>, at which point the character will appear. For example, for the © character, we would need Alt + 169. This method is also universal but assumes the character is included in the font used.
Alt key + character - In a similar way to the previous case, you can sometimes use the Alt key in conjunction with a designated character on the main keyboard. However, this method is not available for all the top-bit set characters and varies between different versions of RISC OS - the ones that are available on the Risc PC (i.e. v3.50) are listed on the table. So, to enter the © character, we can press <Alt> and type in <C>, and the character will immediately appear. Note that in this case, because the 'control' letter is an upper-case character, we also have to press the Shift key, so we actually need to type <shift-alt-c>. As another example, typing <alt-4> is a quick way of getting a ¼ character. The codes for RISC OS 3.1 characters are listed in Archive 6.1 p9. The codes for RISC OS 2 and OS 3.0 characters are listed in Archive 5.1 p10.
Accents - All the accented characters in the top-bit set can be entered by pressing <alt> and a designated key, releasing them and then typing the unaccented character, at which point the accented version will appear. These combinations are shown on the table, in italics to distinguish them from the previous option. So, for example, if we press and release the Alt and ] keys, and then type A, we will get À (ASCII code 192). A few more details are given in Archive 6.2 pp8/9.
'Hard' characters - There are two characters in the top-bit set which, on the face of it, are identical to their keyboard counterparts but which can be put to good use in particular circumstances. These are the 'hard space' (ASCII code 160) and the 'hard hyphen' (ASCII code 173). The hard space can be used when we might prefer to keep together two elements normally separated by a space and which otherwise may be split onto two lines by the text processor. My postcode YO4 2EY is a typical example. We can enter a hard space by using !Chars (click on the 'space' just before the ¡ character), typing <alt-160> or typing <alt-space>. The hard space is also useful for putting spaces into disc filenames (I always think 'Read Me' looks more elegant than 'ReadMe' or 'Read_Me' but that's a personal thing). Similarly, many text processors can split hyphenated words or phrases onto two lines and this may reduce clarity (e.g. you wouldn't want "<shift-ctrl-f4>" to be split). In this event, entering hard hyphens by using !Chars, typing <alt-173> or typing <alt-hyphen> will prevent this happening. The three longer hyphens or 'dashes' (ASCII codes 151, 152 and 153) are also 'hard' characters.
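The behaviour of the hard space can be imitated with Python's standard `textwrap` module, which breaks lines at ordinary spaces but not at character 160. This is only a modern stand-in for what a text processor does, not a description of any Acorn application; the sentence and the line width of 21 are arbitrary choices for the demonstration.

```python
import textwrap

HARD_SPACE = chr(160)  # the 'hard space', ASCII code 160

ordinary = "Please post it to YO4 2EY"
hard = "Please post it to YO4" + HARD_SPACE + "2EY"

# With an ordinary space, the postcode may be split across two lines.
wrapped_ordinary = textwrap.wrap(ordinary, width=21)

# With a hard space, the postcode is treated as one unbreakable word.
wrapped_hard = textwrap.wrap(hard, width=21)

print(wrapped_ordinary)
print(wrapped_hard)
```

In the first case the line ends after YO4 and 2EY is left on its own; in the second, the whole postcode moves to the next line together.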

Whichever system or systems you use will depend on your personal taste but the options give a powerful set of choices. My own preference is to use <Alt> in conjunction with the numeric keypad as it is a convenient, universal method without need to call up another utility. However, as none of the user-guides include a convenient listing of the ASCII code characters, I always have a handy reference chart available. This is simply a cut-down version of the table so I'm including a drawfile version of ASCIIChars (45k).
