Jim Nottingham
Firstly, my grateful thanks to those readers who took the trouble to respond to Part 1. In truth, I had feared it was over-detailed and merely covered old ground but, from the incidence of "well, I never knew that" comments, it appears I was on the right track after all. Please do keep those helpful comments coming.
In Part 2, we looked at the problems of filtering out unwanted line-feeds which, invariably, are embedded in and imported with so-called 'standard' ASCII text, whatever the source. In doing so, I am confident you will have become sufficiently familiar with the workings of Edit's Find and Replace function to be able to press on and deal with the majority of other problems you are likely to meet.
Processing compound problems
In principle, by applying exactly the same procedures we considered in Part 2, we can deal with imported documents with embedded control codes of increasing complexity. This includes even those which, at first impression, appear to be total scribble. However, the law of diminishing returns comes in here. By that I mean that, if you need to import perhaps only a couple of short paragraphs, it may prove easier and quicker just to strip out the offending codes using the delete key, repetitively, rather than go through some complex Find and Replace procedures. You will soon get used to which option to employ under given circumstances.
Before we consider another example in detail, there is an alternative Edit function we can introduce to do the same job as the Magic characters system but which is rather more user-friendly; this is the wildcard feature. To use it, click on the "Wildcarded expressions" radio button which will open up an alternative series of options (not available in the RISC OS 2 version of Edit; continue to use Magic characters - or upgrade to OS 3.10!).
The options look a little more complicated but, in practice, are easier to use than the magic character system. Single characters are used (e.g. $ represents a line-feed) and the character can be entered not only by typing it in as normal but also by simply clicking on the appropriate option box.
Now to look at a more complex example which is an amalgam of many real situations we may meet, including bits from those well-known foreigners Multimate, MS Word, Word-Perfect and Wordstar - plus a bit of my Magic Note in native text format thrown in for good measure. My aim in purposely making it such an eye-watering (but typical) sample is that, if we can hack this, we shall have gained the confidence to manipulate anything the foreigners can throw at us.
For this exercise, download Example2, and make a working copy (for the moment, don't change the filetype to Text). The actual text is almost the same as used in Example1. That was 1,250 bytes long but, as the file size has increased to almost 10KB, clearly we have picked up lots of 'scribble' in the importing. For starters, drag the Example2 file onto your WP/DTP icon and - immediately - you will hit a major problem in that, in all probability, it won't load properly.
Publisher, for example, says the file is not understood and insults us by displaying a picture of a PC! Ovation is not as rude but fails to display any more than half a page of nonsense. This is a very common example of how embedded control codes can seriously interact with our system so, as a starter, convert the filetype to Text as described in Part 2.
To see the extent of the problem, discard the WP/DTP document and load Example2 into Edit. At first sight, we have simply imported pages of utter scribble which seems to be made up of thousands of numbers in square brackets (= hex) with the occasional alpha-numeric character embedded, e.g. "[01]ü[1a]". However, if you scroll to the bottom, you'll see what on a clear day just might be the text we are looking for. For those readers without the examples disc, here is a very much cut down extract from the original file:
Pages of scribble followed by...
.....................[00µ[00]}[00][00][00]´[00]Ð[00]Ð
[1d]These¹days,¹a¹fax¹facility¹is¹almost¹a¹
necessity¹for¹running¹a¹business.¹¹When¹
people¹asked¹for¹our¹fax¹number,¹they¹
were¹most¹put¹out¹to¹find¹that¹we¹didn©t¹
have¹one¹-¹so¹we¹invested¹in¹David¹
Pilling©s¹ArcFax¹(¹35)¹and¹bought¹
ourselves¹a¹fax¹modem¹(¹199.99).[0d]
[0d]
[1d]
[00][09]Ð[02]@[02]¹[05]
[00][1d]What¹is¹a¹computer¹fax?[0d]
[0d]
[1d]
[00][09]Ð[02]@[02]¹[05]
[00][1d]The¹computer¹prints¹by¹sending¹to¹
the¹printer¹a¹graphic¹image¹of¹the¹page¹
which¹is¹made¹up¹as¹a¹series¹of¹dots.¹¹
Normally,¹an¹electronic¹representation¹
of¹these¹dots¹is¹sent¹up¹the¹cable¹to¹the¹
printer.¹¹The¹fax¹modem¹can¹be¹thought¹of¹
as¹a¹©printer©¹which¹turns¹these¹dots¹
into¹sounds¹which¹can¹be¹sent¹down¹the¹
telephone¹line¹to¹a¹remote¹receiver.[1a]Ô[08]
3"[1d][00]Ñf[03][00][00][00]Ñ[01]#[00]
[01]................................
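For readers who would like to see how the bracketed numbers in the extract relate to the underlying bytes, here is a minimal Python sketch. It is not how Edit itself works internally; the function name `edit_style` is hypothetical, and decoding the top-bit-set characters as Latin-1 is an assumption made purely for illustration.

```python
def edit_style(data: bytes) -> str:
    """Render bytes roughly as in the extract: control codes as [xx] hex.

    Assumptions: line-feeds (hex 0a) are shown as real line breaks, and
    all other bytes are decoded as Latin-1.
    """
    out = []
    for b in data:
        if b == 0x0A:
            out.append("\n")              # keep line-feeds as line breaks
        elif b < 0x20 or b == 0x7F:
            out.append(f"[{b:02x}]")      # control code shown as two hex digits
        else:
            out.append(bytes([b]).decode("latin-1"))
    return "".join(out)


print(edit_style(b"\x1dThese\xb9days\x0d"))  # prints: [1d]These¹days[0d]
```

Byte 185 (hex b9) comes out as the superscript 1 and byte 13 (hex 0d) as [0d], which matches what we see in the extract.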
General procedure
Clearly, Edit is going to have to work extra hard this time and you will have to deal with this in a very controlled fashion if you are not to lose sight of what you have done. So my recommendations for a general procedure are three-fold:
Tactics and techniques
Let's put the general procedures into practice. Firstly, we could quickly get rid of the thousands of control characters appearing before the text starts. This is very easy to do in Edit; simply select-drag to mark whole blocks from the top of the file and then delete them (<ctrl-X>). Continue for some pages until eventually you come to the start of the wanted text. Finally, select and delete the few lines of code characters after the text, in the same manner.
Drop an Edit save box onto your WP/DTP icon and the application should now accept and display the shortened file sensibly, so we are on the right path. Save the Edit file as an interim result and discard the WP/DTP document.
Looking at the result in the Edit window, it is now obvious that, throughout the text file, gaps between words are filled by a superscript 1 ("¹") where there should be spaces (exported from the Magic Note 'native' word-processor). Looking at the ASCII table included in Part 1, we find we can enter the "¹" character either by typing <alt-185> (using the numeric keypad) or by <alt-1> (main keyboard). So we can deal with the problem globally by using a straightforward Edit Find/Replace procedure:
Press <home> followed by <f4>
"Find:" Type in <alt-185><return>
"Replace with:" Press <space><return>
Click on "End of file replace" (249 replaced)
Click on "Stop" (or press <return>)
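The same global replacement can be sketched outside Edit. The following Python fragment is only an illustration of the idea, assuming the file has been read in as raw bytes; the helper name `replace_byte` is hypothetical. It swaps every byte 185 for a space and reports how many were changed, much as Edit reports its "replaced" count.

```python
def replace_byte(data: bytes, code: int, replacement: bytes):
    """Replace every occurrence of one byte value; return (new data, count)."""
    target = bytes([code])
    count = data.count(target)            # how many replacements will be made
    return data.replace(target, replacement), count


text, n = replace_byte(b"These\xb9days,\xb9a\xb9fax", 185, b" ")
print(text, n)  # prints: b'These days, a fax' 3
```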
Once again, drop an Edit save box onto your WP/DTP icon and note that the result is now becoming fairly readable, although the formatting still needs work. Discard that document and save the interim Edit file.
Probably the next most common occurrence is the ASCII 141 character at the end of each line. This is an unusual feature, imported from Wordstar v3.3. The characters are in fact carriage returns (CR), but most other packages export them as ASCII 13 control characters, as shown in the table; on screen, those would appear in the equivalent hex number format ([0d]). There is no need to worry about these differences: we can again perform an Edit global Find/Replace to strip out the carriage returns. In the example file, the ASCII 141 character is always preceded by a space so, in this instance, there is no need to replace it with another space. In the table, the character is listed as ASCII decimal number 141, so the procedure is:
Press <home> followed by <f4>
"Find:" Type <alt-141><return>
"Replace with:" Press <return> (i.e. 'nothing')
Click on "End of file replace" (28 replaced)
Click on "Stop" (or press <return>)
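Deleting a character outright is only safe because, in this file, each one follows a space. As a hedged sketch of how that assumption could be checked before stripping, here is a small Python function; the name `strip_if_after_space` is hypothetical and the sample bytes are made up for illustration.

```python
def strip_if_after_space(data: bytes, code: int = 141) -> bytes:
    """Delete every byte `code`, but only if each one follows a space.

    Raises ValueError if any occurrence is not preceded by a space,
    since deleting it would then run two words together.
    """
    bad = [i for i, b in enumerate(data)
           if b == code and (i == 0 or data[i - 1] != 0x20)]
    if bad:
        raise ValueError(f"byte {code} not preceded by a space at offsets {bad}")
    return data.replace(bytes([code]), b"")


print(strip_if_after_space(b"is almost a \x8d\nnecessity"))
# prints: b'is almost a \nnecessity'
```

If the check fails, the safer course is the one used for the superscript 1: replace with a space rather than with nothing.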
Viewing the result in our WP/DTP package window shows that the carriage return characters have gone from the 28 lines of text. Discard the document and save the Edit temporary file as usual.
We are still left with the (invisible) line-feeds which are stopping the text filling the full width of the window. However, unlike in Example1 where paragraph spacing was achieved by using double line-feeds, in Example2 it is brought about by line-feeds plus a unique string of 17 control characters ([0d]....[1d]). For this reason, it is unnecessary to use the 'dummy' procedure this time, so we can strip out the single line-feeds using a simplified procedure, noting that the Wildcarded expression for a line-feed ('Newline') is the $ character. We can either type this in or enter it by clicking on the "Newline" box. This time, we play safe and replace each (invisible) line-feed character with a space, so:
Press <home> followed by <f4>
"Find:" Click on "Newline" (or type in<$>)
"Replace with:" Press <space><return>
Click on "End of file replace" (48 replaced)
Click on "Stop" (or press <return>)
There were far more line-feed characters replaced (48) than carriage returns (28) because, this time, we have stripped out the LFs between paragraphs and either side of the heading.
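To see why we play safe with a space here, consider this short Python sketch (the function name `join_lines` is hypothetical and the sample text is invented). Replacing a line-feed with nothing can fuse the last word of one line to the first word of the next, whereas replacing it with a space cannot.

```python
def join_lines(data: bytes, joiner: bytes = b" "):
    """Replace every line-feed with `joiner`; return (new data, count)."""
    count = data.count(b"\n")
    return data.replace(b"\n", joiner), count


wrapped = b"When\npeople asked"
print(join_lines(wrapped))       # prints: (b'When people asked', 1)
print(join_lines(wrapped, b""))  # prints: (b'Whenpeople asked', 1)
```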
Dropping an Edit save box on the WP/DTP package icon shows us that the individual paragraphs are now formatting properly, although we still need to sort out the paragraph spacing. This is slightly more tricky than usual because the 'foreign' package (Wordstar v6.0) has given us the rather unfriendly 17-character string to deal with.
This is an instance where, with only a few paragraphs to import, it would be appropriate simply to use the delete key. However, for the exercise (and bearing in mind it will work equally well for a 100-page document), we will do it the hard way! Fortunately, we don't need to type in the complete string of complex characters as we can use the Wildcarded expressions' "Any" function. In theory, as there are no other hex characters ([xx]) remaining in the file, we could simply type in, say, the first <[0d]> followed by 16 wildcards (full stops) to represent the full string. In this case, it is a unique solution and will work but, for other circumstances, we must ensure it is a unique occurrence and, if necessary, type in the full string as it stands.
For this exercise, we will use the start and end characters with 15 intermediate wildcard characters. To do this, we will enter the first and last hex numbers, separated by 15 full stops. The two hex numbers are entered in both cases by clicking on the Wildcarded expressions' "Hex" box, which puts a cross in the Find: dialogue box, before we type in the relevant hex number. We can enter the full stops either from the keyboard or by clicking 15 times on "Any". As a replacement for the string, we will want to enter a couple of line-feed characters to achieve the double-spacing between the paragraphs and either side of the heading. So, using the normal procedure and having entered or typed in the data, Edit's Find text window will look like this:
Pressing <return> and clicking on "End of file replace" gives us 5 replaced. Dropping the result onto our WP/DTP icon shows that we are virtually there, with just a few minor anomalies to deal with.
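For comparison, the same "first character, 15 wildcards, last character" idea can be written as a regular expression. This is a sketch in Python rather than in Edit's own wildcard syntax; the name `respace_paragraphs` is hypothetical, and the 17 filler bytes in the example are an assumption, loosely reconstructed from the printed extract.

```python
import re

# Hex 0d, then any 15 bytes, then hex 1d: 17 bytes in total.
# re.DOTALL lets "." match every byte value, including line-feeds.
PATTERN = re.compile(rb"\x0d.{15}\x1d", re.DOTALL)


def respace_paragraphs(data: bytes):
    """Swap each 17-byte separator for two line-feeds; return (data, count)."""
    return PATTERN.subn(b"\n\n", data)


# Illustrative separator only (17 bytes), not taken from the real file.
separator = b"\x0d \x0d \x1d \x00\x09\xd0\x02@\x02 \x05 \x00\x1d"
sample = b"a fax modem." + separator + b"What is a computer fax?"
print(respace_paragraphs(sample))
# prints: (b'a fax modem.\n\nWhat is a computer fax?', 1)
```

As in Edit, the pattern is only safe while the start and end codes do not occur 16 characters apart anywhere else in the file.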
Odds and ends
Although it would be quite reasonable to edit out the remaining anomalies manually, we may as well complete the exercise using Edit as a general procedure:
The finished result (at last!)
So what have we achieved? In fact, a great deal (in more ways than one). We have taken an unseen text file which may well have been imported from an unidentified source and, at first sight, appeared to be scribble. But we have progressively massaged it to the point that it has become 100% readable by our word-processor or DTP package, which is just what we set out to do.
To achieve this, we didn't need to have any knowledge of the source application, software version number or host system, nor any technical expertise. We just needed to have a modicum of familiarity with the way Edit's Find/Replace function is handled.
We have done this without any direct expense because the only tool we used - Edit - came 'free' with our computer.
As I said - a great deal... Thanks Acorn, what would we do without you?
Read, learn and practise...
Don't worry if you found working through this complex example hard going; the learning curve is very steep. I suggest you go through it at least a couple more times for consolidation and then you should find it will take you only a few minutes to convert almost any other imported file so that it is fully readable by your WP/DTP package.
"Almost any"? Well, in practice, I haven't actually come across any imported file which I haven't managed to get Edit to convert successfully, given enough time and application. That said, I have to say that, in order to meet a very tight timescale, I once had no option but to massage a 50-page PostScript file. The finished result was testimony to Edit's productivity but, my word, doing it destroyed any hope of slowing the pace of my near-terminal baldness. Incidentally, does anyone know of a PostScript reader for Acorn machines?
In these articles, I cannot hope to have covered all possible situations you might meet, or every nuance of Edit's Find/Replace facilities. If you would like to take things further, I recommend you read the notes on Wildcarded expressions in the manual, especially the examples on p14 of the RISC OS 3 Applications Guide or p260 of the Risc PC User Guide.