#101873 - 2003-06-05 03:12 PM HTMLtoText, has anyone done that?
masken Offline
MM club member
*****

Registered: 2000-11-27
Posts: 1222
Loc: Gothenburg, Sweden
Has anyone made a parser that converts the HTML codes for special characters etc. into plain text? I.e., input one string and get a plain-text formatted version out?

I.e.:
code:
"Implac&#225;vel" >>> "Implacável"

If not, I'm thinking of writing such a thing (UDF)... need some pointers though.

What about putting a conversion list, like this one or this one for example, in an array, and then somehow replace the characters? How do I optimize the loops in the best manner?

Or, since these are all CHR() codes (although it seems to be a different table), perhaps one can use that in a smart way?

Edit
hmm.. I guess specifying arrays with 150+ entries in each is hardly the best approach heh.. guessing there must be a COM object that can handle all this... problem is finding it [Big Grin]


[ 05. June 2003, 16:24: Message edited by: masken ]
_________________________
The tart is out there

#101874 - 2003-06-05 04:38 PM Re: HTMLtoText, has anyone done that?
Kdyer Offline
KiX Supporter
*****

Registered: 2001-01-03
Posts: 6241
Loc: Tigard, OR
Have you had a look at something like this?

http://www.ascii.cl/htmlcodes.htm

Kent
_________________________
Utilize these resources:
UDFs (Full List)
KiXtart FAQ & How to's

#101875 - 2003-06-05 04:51 PM Re: HTMLtoText, has anyone done that?
masken
yeah, just like the samples I provided too... problem is I can't use KiX's own CHR() function to replace these characters, which means setting up huge arrays and looping through them character by character, or finding some COM object etc. that can do the conversion....

Or?
_________________________
#101876 - 2003-06-05 06:16 PM Re: HTMLtoText, has anyone done that?
Lonkero Administrator Offline
KiX Master Guru
*****

Registered: 2001-06-05
Posts: 22346
Loc: OK
with IE com you can automate the conversion.

or just simply search for the %-char.
that is not actually so much HTML as URI encoding...

indeed, in BBchecker there was this kind of thing, but as I found that IE handles most of the cases automatically, I don't use it anymore.
_________________________
!

download KiXnet

#101877 - 2003-06-06 08:15 AM Re: HTMLtoText, has anyone done that?
masken
Yes lonk, but in what object should one look for a conversion?

I want this for when grabbing a webpage (HTML source)...
_________________________
#101878 - 2003-06-06 10:23 AM Re: HTMLtoText, has anyone done that?
Lonkero
I had almost ready code but deleted it already...
wasn't pleased with it [Wink]

actually, it's all about chr() with some exceptions like NBSP or LT/GT.
_________________________
#101879 - 2003-06-07 12:14 AM Re: HTMLtoText, has anyone done that?
masken
Lonk, the CHR() codes in KiX don't translate to the codes on a webpage. Well, the ones below 100 do I think, but not the rest [Frown]
_________________________
#101880 - 2003-06-06 01:01 PM Re: HTMLtoText, has anyone done that?
masken
Ok, this works, with some limitations. Please comment:

code:
;FUNCTION      HTMLtoText()
;
;ACTION Convert a HTML source string into plain text
;
;AUTHOR masken (masken|at|home.se)
;
;VERSION 1.0
;
;DATE CREATED 2003-06-06
;
;DATE MODIFIED -
;
;KIXTART 4.12+
;
;SYNTAX HTMLtoText([STRING])
;
;PARAMETERS STRING
; The string you want to convert.
;
;RETURNS 1 If all HTML codes were converted
; 2 If there were no codes to convert
; 3 If not all codes could be converted
;
;REMARKS * If there's a ";" in the $STRING (ie; the HTML source),
; the conversion will go wrong (there shouldn't be any though).
; * Only handles decimal codes. Some webpage code might have hex codes.
; * Doesn't handle alternatives, like "&quot;" and "&nbsp;" for example.
;
; The list of decimal codes used for the arrays were taken from here:
; http://tergestesoft.com/~eddysworld/spechars.htm
;
;DEPENDENCIES none
;
;EXAMPLE $htmlstring = "Implac&#225;vel"
; $result = HTMLtoText($htmlstring)
; ; $result will be: "Implacável"
;
;KIXTART BBS http://www.kixtart.org/cgi-bin/ultimatebb.cgi?ubb=get_topic&f=14&t=000765
;
;===============================================================================

Function HTMLtoText($string)
;| Exit if there's nothing to convert
IF INSTR($string, "&#") = 0
EXIT 2
ENDIF

DIM $Code[157]
$Code[0] = "&#09;"
$Code[1] = "&#10;"
$Code[2] = "&#13;"
$Code[3] = "&#32;"
$Code[4] = "&#33;"
$Code[5] = "&#34;"
$Code[6] = "&#35;"
$Code[7] = "&#36;"
$Code[8] = "&#37;"
$Code[9] = "&#38;"
$Code[10] = "&#39;"
$Code[11] = "&#40;"
$Code[12] = "&#41;"
$Code[13] = "&#42;"
$Code[14] = "&#43;"
$Code[15] = "&#44;"
$Code[16] = "&#45;"
$Code[17] = "&#46;"
$Code[18] = "&#47;"
$Code[19] = "&#58;"
$Code[20] = "&#59;"
$Code[21] = "&#60;"
$Code[22] = "&#61;"
$Code[23] = "&#62;"
$Code[24] = "&#63;"
$Code[25] = "&#64;"
$Code[26] = "&#91;"
$Code[27] = "&#92;"
$Code[28] = "&#93;"
$Code[29] = "&#94;"
$Code[30] = "&#95;"
$Code[31] = "&#96;"
$Code[32] = "&#123;"
$Code[33] = "&#124;"
$Code[34] = "&#125;"
$Code[35] = "&#126;"
$Code[36] = "&#130;"
$Code[37] = "&#131;"
$Code[38] = "&#132;"
$Code[39] = "&#133;"
$Code[40] = "&#134;"
$Code[41] = "&#135;"
$Code[42] = "&#136;"
$Code[43] = "&#137;"
$Code[44] = "&#138;"
$Code[45] = "&#139;"
$Code[46] = "&#140;"
$Code[47] = "&#145;"
$Code[48] = "&#146;"
$Code[49] = "&#147;"
$Code[50] = "&#148;"
$Code[51] = "&#149;"
$Code[52] = "&#150;"
$Code[53] = "&#151;"
$Code[54] = "&#152;"
$Code[55] = "&#153;"
$Code[56] = "&#154;"
$Code[57] = "&#155;"
$Code[58] = "&#156;"
$Code[59] = "&#159;"
$Code[60] = "&#160;"
$Code[61] = "&#161;"
$Code[62] = "&#162;"
$Code[63] = "&#163;"
$Code[64] = "&#164;"
$Code[65] = "&#165;"
$Code[66] = "&#166;"
$Code[67] = "&#167;"
$Code[68] = "&#168;"
$Code[69] = "&#169;"
$Code[70] = "&#170;"
$Code[71] = "&#171;"
$Code[72] = "&#172;"
$Code[73] = "&#173;"
$Code[74] = "&#174;"
$Code[75] = "&#175;"
$Code[76] = "&#176;"
$Code[77] = "&#177;"
$Code[78] = "&#178;"
$Code[79] = "&#179;"
$Code[80] = "&#180;"
$Code[81] = "&#181;"
$Code[82] = "&#182;"
$Code[83] = "&#183;"
$Code[84] = "&#184;"
$Code[85] = "&#185;"
$Code[86] = "&#186;"
$Code[87] = "&#187;"
$Code[88] = "&#188;"
$Code[89] = "&#189;"
$Code[90] = "&#190;"
$Code[91] = "&#191;"
$Code[92] = "&#192;"
$Code[93] = "&#193;"
$Code[94] = "&#194;"
$Code[95] = "&#195;"
$Code[96] = "&#196;"
$Code[97] = "&#197;"
$Code[98] = "&#198;"
$Code[99] = "&#199;"
$Code[100] = "&#200;"
$Code[101] = "&#201;"
$Code[102] = "&#202;"
$Code[103] = "&#203;"
$Code[104] = "&#204;"
$Code[105] = "&#205;"
$Code[106] = "&#206;"
$Code[107] = "&#207;"
$Code[108] = "&#208;"
$Code[109] = "&#209;"
$Code[110] = "&#210;"
$Code[111] = "&#211;"
$Code[112] = "&#212;"
$Code[113] = "&#213;"
$Code[114] = "&#214;"
$Code[115] = "&#215;"
$Code[116] = "&#216;"
$Code[117] = "&#217;"
$Code[118] = "&#218;"
$Code[119] = "&#219;"
$Code[120] = "&#220;"
$Code[121] = "&#221;"
$Code[122] = "&#222;"
$Code[123] = "&#223;"
$Code[124] = "&#224;"
$Code[125] = "&#225;"
$Code[126] = "&#226;"
$Code[127] = "&#227;"
$Code[128] = "&#228;"
$Code[129] = "&#229;"
$Code[130] = "&#230;"
$Code[131] = "&#231;"
$Code[132] = "&#232;"
$Code[133] = "&#233;"
$Code[134] = "&#234;"
$Code[135] = "&#235;"
$Code[136] = "&#236;"
$Code[137] = "&#237;"
$Code[138] = "&#238;"
$Code[139] = "&#239;"
$Code[140] = "&#240;"
$Code[141] = "&#241;"
$Code[142] = "&#242;"
$Code[143] = "&#243;"
$Code[144] = "&#244;"
$Code[145] = "&#245;"
$Code[146] = "&#246;"
$Code[147] = "&#247;"
$Code[148] = "&#248;"
$Code[149] = "&#249;"
$Code[150] = "&#250;"
$Code[151] = "&#251;"
$Code[152] = "&#252;"
$Code[153] = "&#253;"
$Code[154] = "&#254;"
$Code[155] = "&#255;"

DIM $Char[157]
$Char[0] = CHR(9)
$Char[1] = CHR(10)
$Char[2] = CHR(13)
$Char[3] = " "
$Char[4] = "!"
$Char[5] = CHR(34)
$Char[6] = "#"
$Char[7] = CHR(36)
$Char[8] = "%"
$Char[9] = "&"
$Char[10] = CHR(39)
$Char[11] = "("
$Char[12] = ")"
$Char[13] = "*"
$Char[14] = "+"
$Char[15] = ","
$Char[16] = "-"
$Char[17] = "."
$Char[18] = "/"
$Char[19] = ":"
$Char[20] = ";"
$Char[21] = "<"
$Char[22] = "="
$Char[23] = ">"
$Char[24] = "?"
$Char[25] = "@"
$Char[26] = "["
$Char[27] = "\"
$Char[28] = "]"
$Char[29] = "^"
$Char[30] = "_"
$Char[31] = "`"
$Char[32] = "{"
$Char[33] = "|"
$Char[34] = "}"
$Char[35] = "~"
$Char[36] = "‚"
$Char[37] = "ƒ"
$Char[38] = "„"
$Char[39] = "…"
$Char[40] = "†"
$Char[41] = "‡"
$Char[42] = "ˆ"
$Char[43] = "‰"
$Char[44] = "Š"
$Char[45] = "‹"
$Char[46] = "Œ"
$Char[47] = "‘"
$Char[48] = "’"
$Char[49] = "“"
$Char[50] = "”"
$Char[51] = "•"
$Char[52] = "–"
$Char[53] = "—"
$Char[54] = "˜"
$Char[55] = "™"
$Char[56] = "š"
$Char[57] = "›"
$Char[58] = "œ"
$Char[59] = "Ÿ"
$Char[60] = " "
$Char[61] = "¡"
$Char[62] = "¢"
$Char[63] = "£"
$Char[64] = "¤"
$Char[65] = "¥"
$Char[66] = "¦"
$Char[67] = "§"
$Char[68] = "¨"
$Char[69] = "©"
$Char[70] = "ª"
$Char[71] = "«"
$Char[72] = "¬"
$Char[73] = "­"
$Char[74] = "®"
$Char[75] = "¯"
$Char[76] = "°"
$Char[77] = "±"
$Char[78] = "²"
$Char[79] = "³"
$Char[80] = "´"
$Char[81] = "µ"
$Char[82] = "¶"
$Char[83] = "·"
$Char[84] = "¸"
$Char[85] = "¹"
$Char[86] = "º"
$Char[87] = "»"
$Char[88] = "¼"
$Char[89] = "½"
$Char[90] = "¾"
$Char[91] = "¿"
$Char[92] = "À"
$Char[93] = "Á"
$Char[94] = "Â"
$Char[95] = "Ã"
$Char[96] = "Ä"
$Char[97] = "Å"
$Char[98] = "Æ"
$Char[99] = "Ç"
$Char[100] = "È"
$Char[101] = "É"
$Char[102] = "Ê"
$Char[103] = "Ë"
$Char[104] = "Ì"
$Char[105] = "Í"
$Char[106] = "Î"
$Char[107] = "Ï"
$Char[108] = "Ð"
$Char[109] = "Ñ"
$Char[110] = "Ò"
$Char[111] = "Ó"
$Char[112] = "Ô"
$Char[113] = "Õ"
$Char[114] = "Ö"
$Char[115] = "×"
$Char[116] = "Ø"
$Char[117] = "Ù"
$Char[118] = "Ú"
$Char[119] = "Û"
$Char[120] = "Ü"
$Char[121] = "Ý"
$Char[122] = "Þ"
$Char[123] = "ß"
$Char[124] = "à"
$Char[125] = "á"
$Char[126] = "â"
$Char[127] = "ã"
$Char[128] = "ä"
$Char[129] = "å"
$Char[130] = "æ"
$Char[131] = "ç"
$Char[132] = "è"
$Char[133] = "é"
$Char[134] = "ê"
$Char[135] = "ë"
$Char[136] = "ì"
$Char[137] = "í"
$Char[138] = "î"
$Char[139] = "ï"
$Char[140] = "ð"
$Char[141] = "ñ"
$Char[142] = "ò"
$Char[143] = "ó"
$Char[144] = "ô"
$Char[145] = "õ"
$Char[146] = "ö"
$Char[147] = "÷"
$Char[148] = "ø"
$Char[149] = "ù"
$Char[150] = "ú"
$Char[151] = "û"
$Char[152] = "ü"
$Char[153] = "ý"
$Char[154] = "þ"
$Char[155] = "ÿ"

$MaxTries = LEN("$string") / 5
WHILE INSTR($string, "&#") <> 0 AND $Tries < $MaxTries
$Tries = $Tries + 1
$CodeStart = INSTR("$string", "&#")
$CodeEnd = INSTR("$string", ";") + 1
$CodeToReplace = SUBSTR("$string", $CodeStart, $CodeEnd - $CodeStart)
$CodeAPos = ASCAN($Code, $CodeToReplace)
;| ASCAN() returns -1 when the code is not in the table
IF $CodeAPos <> -1
$CharToInsert = $Char[$CodeAPos]
$string = SUBSTR("$string", 1, $CodeStart - 1) + $CharToInsert + SUBSTR("$string", $CodeEnd, LEN("$string"))
ENDIF
LOOP
$HTMLtoText = $string
IF INSTR($HTMLtoText, "&#") <> 0
EXIT 3
ELSE
EXIT 1
ENDIF
EndFunction



[ 06. June 2003, 13:07: Message edited by: masken ]
_________________________
#101881 - 2003-06-06 01:26 PM Re: HTMLtoText, has anyone done that?
masken
hmm... the KiX .chm says that arrays are limited to 60 entries?

Also, perhaps it's a good idea to merge the arrays into one, like:
code:
$Code[5] = "&#34;" + "&&" + CHR(34) + "&&" + "&quot;"

...for example, and then use:
code:
$result = SPLIT($Code[5], "&&")

or something?
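
A rough sketch of the merged-table idea, written in Python rather than KiXtart so it is easy to try out. The `PACKED` entries and the `build_lookup` helper are invented for illustration; they only mimic the "character && code && code" layout suggested above. Splitting each entry once up front gives a plain code-to-character lookup, so later searches can be exact matches.

```python
# Hypothetical packed entries in the style of the post: the character first,
# then every code that should be converted to it, joined with "&&".
PACKED = [
    '"' + "&&" + "&#34;" + "&&" + "&quot;",
    "<" + "&&" + "&#60;" + "&&" + "&lt;",
    "á" + "&&" + "&#225;" + "&&" + "&aacute;",
]


def build_lookup(packed, sep="&&"):
    """Map every code found in the packed entries to its character."""
    lookup = {}
    for entry in packed:
        char, *codes = entry.split(sep)
        for code in codes:
            lookup[code] = char
    return lookup


LOOKUP = build_lookup(PACKED)
print(LOOKUP["&quot;"], LOOKUP["&#60;"], LOOKUP["&aacute;"])
```

One thing to watch in this layout: every code starts with "&", so an entry for the "&" character itself would collide with the "&&" delimiter (in Python, at least). A separator that cannot occur in the data may be safer.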

Edit
I'm working on this.. think I've found a decent solution, will post it later [Smile]


[ 06. June 2003, 15:00: Message edited by: masken ]
_________________________
#101882 - 2003-06-06 03:27 PM Re: HTMLtoText, has anyone done that?
Lonkero
masken, arrays are limited... that's true.
if I remember correctly the limit is about 2 million... well, the same as integer-limit.

as for the char codes:
I can't fight for that.

just wait a sec and I'll prove it with something.
_________________________
#101883 - 2003-06-06 03:40 PM Re: HTMLtoText, has anyone done that?
Lonkero
yep.
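this simple udf converted the numeric representation (&#) just fine:
code:
function html2text($in)
dim $,$t
$in=split($in,"&#")
;element 0 is the text before the first code, so start at 1
for $=1 to ubound($in)
$t=split($in[$],";")[0]
;$t holds the digits; skip them and the ";" when re-attaching the rest
$in[$]=chr(val($t))+substr($in[$],len($t)+2)
next
$html2text=join($in,"")
endfunction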
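
The split-on-"&#" approach can also be sketched in Python, purely as an illustration. `decode_numeric` is a made-up name, and Python's `chr()` works on Unicode code points, which is an assumption that may not match what KiX's CHR() returns for values above 127.

```python
def decode_numeric(text):
    """Replace decimal references such as &#225; with the matching character."""
    parts = text.split("&#")
    out = [parts[0]]  # text before the first reference is copied unchanged
    for part in parts[1:]:
        digits, sep, rest = part.partition(";")
        is_number = sep == ";" and digits.isascii() and digits.isdigit()
        if is_number and int(digits) <= 0x10FFFF:
            out.append(chr(int(digits)) + rest)
        else:
            # Not a well-formed decimal reference: put the text back as it was.
            out.append("&#" + part)
    return "".join(out)


print(decode_numeric("Implac&#225;vel"))
```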

_________________________
#101884 - 2003-06-06 03:42 PM Re: HTMLtoText, has anyone done that?
Lonkero
btw, if your udf was 1.0 is this 1.1 or 2.0? [Wink]
_________________________
#101885 - 2003-06-07 10:35 AM Re: HTMLtoText, has anyone done that?
masken
hehe.. thanks m8 [Wink]

btw; I took the liberty to make your code readable too [Razz]
code:
FUNCTION html2text($TxtToConvert)
DIM $Count,$Code
$TxtToConvert = SPLIT($TxtToConvert, "&#")
;|Element 0 is the text before the first code, so start at 1
FOR $Count = 1 TO UBOUND($TxtToConvert)
$Code = SPLIT($TxtToConvert[$Count], ";")[0]
;|Skip the digits and the ";" when re-attaching the rest of the text
$TxtToConvert[$Count] = CHR(VAL($Code)) + SUBSTR($TxtToConvert[$Count], LEN($Code) + 2)
NEXT
$html2text = JOIN($TxtToConvert, "")
ENDFUNCTION

I'll look into whether CHR() is really useful here... but if you look at the good old ASCII table displayer, you'll see that the codes don't match?
code:
BREAK ON
CLS

SETASCII("on")
$x = 0 ;First character code to display
$a = 1
CLS
WHILE $a < 6 ;Change this value to get more or less numbers
$y = 1
WHILE $y <= 80
$z = 1
WHILE $z <= 20
AT($z,$y) "$x " + CHR($x)
$z = $z + 1
$x = $x + 1
LOOP
$y = $y + 10
LOOP
? SHELL "%COMSPEC% /c pause"
$a = $a + 1
LOOP
EXIT

_________________________
#101886 - 2003-06-07 10:24 PM Re: HTMLtoText, has anyone done that?
Lonkero
that is actually what I did.
I checked against your charts and they matched.
_________________________
#101887 - 2003-06-09 09:41 AM Re: HTMLtoText, has anyone done that?
Richard H. Administrator Offline
Administrator
*****

Registered: 2000-01-24
Posts: 4946
Loc: Leatherhead, Surrey, UK
Be careful when talking about ASCII codes (and tables).

Standard ASCII is 7 bit, and so only covers codes 0-127.

The "extended" ASCII sets are 8 bit (0-255), and the extra symbols are used for national language replacement and things like line drawing characters.

The biggest problem is that the characters above 127 are not constant. This is why in DOS you used to have to load "codeset" pages, which installed a local version of the symbols - these are often called things like "Latin-8" or "IBM Line drawing".

We still have code pages even now, especially in email and HTML: check out the "charset=" header found in mail headers and on web server responses, which defines an ISO character set. You may occasionally get email from the Internet with a warning that the character set is not supported, when it comes from a country which uses a different alphabet. In the Roman alphabet world we would have trouble dealing with kanji or Cyrillic alphabets unless those code pages have been installed.

Printers also have code pages: when you select a font in, say, Word, the application will translate it into typeface, weight, and character set, amongst other things.

What all this means is that the symbols which appear may well be different on different PCs for the same number. Indeed, the symbol that appears in your word-processing document may not be the one that appears on the page when you print it.

If you want to see the same things you need to be sure that the devices are set the same.

The adoption of Unicode will avoid these problems, as each symbol will have a unique number.
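
A small Python illustration of the point about code pages: the same 8-bit value decodes to a different symbol depending on which table is applied. The three codecs below (Windows Western European, the original IBM PC set, and a Cyrillic one) are just examples picked for this sketch.

```python
# One 8-bit value, as it might be stored in a file or sent over the wire.
BYTE = bytes([225])

# Example code pages chosen for this sketch.
CODECS = ("cp1252", "cp437", "koi8_r")

decoded = {codec: BYTE.decode(codec) for codec in CODECS}

for codec, symbol in decoded.items():
    # Show the symbol together with its Unicode number, which is unique.
    print(codec, symbol, "U+%04X" % ord(symbol))
```

The Unicode number printed next to each symbol is what stays the same everywhere; the byte value 225 on its own says nothing until the code page is known.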

#101888 - 2003-06-09 10:03 AM Re: HTMLtoText, has anyone done that?
Lonkero
masken, your code has the problem of setting ascii to on [Wink]

indeed, we are now playing with "western european" char-codes.
but I won't even dream of handling character sets separately.
something like KOI-8 will keep me away from those.
_________________________
#101889 - 2003-06-09 04:45 PM Re: HTMLtoText, has anyone done that?
masken
Yeah... you're right about the ASCII codes there, Richard. But I think in webpages it's a static table? I.e., the one below? This would also make sure any local OS language gets the codes right?

Right now though... why doesn't this work? [Frown] ASCAN() always returns -1, and the IF NOT case still kicks in too [Confused]

code:
FUNCTION HTMLtoText($string)
;| Exit if there's nothing to convert
IF INSTR($string, "&") = 0 OR INSTR($string, ";") = 0
EXIT 2
ENDIF

DIM $Codes[157]
$Codes[0] = CHR(9) + "&&" + "&#09;"
$Codes[1] = CHR(10) + "&&" + "&#10;"
$Codes[2] = CHR(13) + "&&" + "&#13;"
$Codes[3] = " " + "&&" + "&#32;"
$Codes[4] = "!" + "&&" + "&#33;"
$Codes[5] = CHR(34) + "&&" + "&#34;" + "&&" + "&quot;"
$Codes[6] = "#" + "&&" + "&#35;"
$Codes[7] = CHR(36) + "&&" + "&#36;"
$Codes[8] = "%" + "&&" + "&#37;"
$Codes[9] = "&" + "&&" + "&#38;" + "&&" + "&amp;"
$Codes[10] = CHR(39) + "&&" + "&#39;"
$Codes[11] = "(" + "&&" + "&#40;"
$Codes[12] = ")" + "&&" + "&#41;"
$Codes[13] = "*" + "&&" + "&#42;"
$Codes[14] = "+" + "&&" + "&#43;"
$Codes[15] = "," + "&&" + "&#44;"
$Codes[16] = "-" + "&&" + "&#45;"
$Codes[17] = "." + "&&" + "&#46;"
$Codes[18] = "/" + "&&" + "&#47;"
$Codes[19] = ":" + "&&" + "&#58;"
$Codes[20] = ";" + "&&" + "&#59;"
$Codes[21] = "<" + "&&" + "&#60;" + "&&" + "&lt;"
$Codes[22] = "=" + "&&" + "&#61;"
$Codes[23] = ">" + "&&" + "&#62;" + "&&" + "&gt;"
$Codes[24] = "?" + "&&" + "&#63;"
$Codes[25] = "@" + "&&" + "&#64;"
$Codes[26] = "[" + "&&" + "&#91;"
$Codes[27] = "\" + "&&" + "&#92;"
$Codes[28] = "]" + "&&" + "&#93;"
$Codes[29] = "^" + "&&" + "&#94;"
$Codes[30] = "_" + "&&" + "&#95;"
$Codes[31] = "`" + "&&" + "&#96;"
$Codes[32] = "{" + "&&" + "&#123;"
$Codes[33] = "|" + "&&" + "&#124;"
$Codes[34] = "}" + "&&" + "&#125;"
$Codes[35] = "~" + "&&" + "&#126;"
$Codes[36] = "‚" + "&&" + "&#130;"
$Codes[37] = "ƒ" + "&&" + "&#131;"
$Codes[38] = "„" + "&&" + "&#132;"
$Codes[39] = "…" + "&&" + "&#133;"
$Codes[40] = "†" + "&&" + "&#134;"
$Codes[41] = "‡" + "&&" + "&#135;"
$Codes[42] = "ˆ" + "&&" + "&#136;"
$Codes[43] = "‰" + "&&" + "&#137;"
$Codes[44] = "Š" + "&&" + "&#138;"
$Codes[45] = "‹" + "&&" + "&#139;"
$Codes[46] = "Œ" + "&&" + "&#140;"
$Codes[47] = "‘" + "&&" + "&#145;"
$Codes[48] = "’" + "&&" + "&#146;"
$Codes[49] = "“" + "&&" + "&#147;"
$Codes[50] = "”" + "&&" + "&#148;"
$Codes[51] = "•" + "&&" + "&#149;"
$Codes[52] = "–" + "&&" + "&#150;"
$Codes[53] = "—" + "&&" + "&#151;"
$Codes[54] = "˜" + "&&" + "&#152;"
$Codes[55] = "™" + "&&" + "&#153;" + "&&" + "&trade;"
$Codes[56] = "š" + "&&" + "&#154;"
$Codes[57] = "›" + "&&" + "&#155;"
$Codes[58] = "œ" + "&&" + "&#156;"
$Codes[59] = "Ÿ" + "&&" + "&#159;"
$Codes[60] = " " + "&&" + "&#160;" + "&&" + "&nbsp;"
$Codes[61] = "¡" + "&&" + "&#161;" + "&&" + "&iexcl;"
$Codes[62] = "¢" + "&&" + "&#162;" + "&&" + "&cent;"
$Codes[63] = "£" + "&&" + "&#163;" + "&&" + "&pound;"
$Codes[64] = "¤" + "&&" + "&#164;" + "&&" + "&curren;"
$Codes[65] = "¥" + "&&" + "&#165;" + "&&" + "&yen;"
$Codes[66] = "¦" + "&&" + "&#166;" + "&&" + "&brvbar;"
$Codes[67] = "§" + "&&" + "&#167;" + "&&" + "&sect;"
$Codes[68] = "¨" + "&&" + "&#168;" + "&&" + "&uml;"
$Codes[69] = "©" + "&&" + "&#169;" + "&&" + "&copy;"
$Codes[70] = "ª" + "&&" + "&#170;" + "&&" + "&ordf;"
$Codes[71] = "«" + "&&" + "&#171;" + "&&" + "&laquo;"
$Codes[72] = "¬" + "&&" + "&#172;" + "&&" + "&not;"
$Codes[73] = "­" + "&&" + "&#173;" + "&&" + "&shy;"
$Codes[74] = "®" + "&&" + "&#174;" + "&&" + "&reg;"
$Codes[75] = "¯" + "&&" + "&#175;" + "&&" + "&macr;"
$Codes[76] = "°" + "&&" + "&#176;" + "&&" + "&deg;"
$Codes[77] = "±" + "&&" + "&#177;" + "&&" + "&plusmn;"
$Codes[78] = "²" + "&&" + "&#178;" + "&&" + "&sup2;"
$Codes[79] = "³" + "&&" + "&#179;" + "&&" + "&sup3;"
$Codes[80] = "´" + "&&" + "&#180;" + "&&" + "&acute;"
$Codes[81] = "µ" + "&&" + "&#181;" + "&&" + "&micro;"
$Codes[82] = "¶" + "&&" + "&#182;" + "&&" + "&para;"
$Codes[83] = "·" + "&&" + "&#183;" + "&&" + "&middot;"
$Codes[84] = "¸" + "&&" + "&#184;" + "&&" + "&cedil;"
$Codes[85] = "¹" + "&&" + "&#185;" + "&&" + "&sup1;"
$Codes[86] = "º" + "&&" + "&#186;" + "&&" + "&ordm;"
$Codes[87] = "»" + "&&" + "&#187;" + "&&" + "&raquo;"
$Codes[88] = "¼" + "&&" + "&#188;" + "&&" + "&frac14;"
$Codes[89] = "½" + "&&" + "&#189;" + "&&" + "&frac12;"
$Codes[90] = "¾" + "&&" + "&#190;" + "&&" + "&frac34;"
$Codes[91] = "¿" + "&&" + "&#191;" + "&&" + "&iquest;"
$Codes[92] = "À" + "&&" + "&#192;" + "&&" + "&Agrave;"
$Codes[93] = "Á" + "&&" + "&#193;" + "&&" + "&Aacute;"
$Codes[94] = "Â" + "&&" + "&#194;" + "&&" + "&Acirc;"
$Codes[95] = "Ã" + "&&" + "&#195;" + "&&" + "&Atilde;"
$Codes[96] = "Ä" + "&&" + "&#196;" + "&&" + "&Auml;"
$Codes[97] = "Å" + "&&" + "&#197;" + "&&" + "&Aring;"
$Codes[98] = "Æ" + "&&" + "&#198;" + "&&" + "&AElig;"
$Codes[99] = "Ç" + "&&" + "&#199;" + "&&" + "&Ccedil;"
$Codes[100] = "È" + "&&" + "&#200;" + "&&" + "&Egrave;"
$Codes[101] = "É" + "&&" + "&#201;" + "&&" + "&Eacute;"
$Codes[102] = "Ê" + "&&" + "&#202;" + "&&" + "&Ecirc;"
$Codes[103] = "Ë" + "&&" + "&#203;" + "&&" + "&Euml;"
$Codes[104] = "Ì" + "&&" + "&#204;" + "&&" + "&Igrave;"
$Codes[105] = "Í" + "&&" + "&#205;" + "&&" + "&Iacute;"
$Codes[106] = "Î" + "&&" + "&#206;" + "&&" + "&Icirc;"
$Codes[107] = "Ï" + "&&" + "&#207;" + "&&" + "&Iuml;"
$Codes[108] = "Ð" + "&&" + "&#208;" + "&&" + "&ETH;"
$Codes[109] = "Ñ" + "&&" + "&#209;" + "&&" + "&Ntilde;"
$Codes[110] = "Ò" + "&&" + "&#210;" + "&&" + "&Ograve;"
$Codes[111] = "Ó" + "&&" + "&#211;" + "&&" + "&Oacute;"
$Codes[112] = "Ô" + "&&" + "&#212;" + "&&" + "&Ocirc;"
$Codes[113] = "Õ" + "&&" + "&#213;" + "&&" + "&Otilde;"
$Codes[114] = "Ö" + "&&" + "&#214;" + "&&" + "&Ouml;"
$Codes[115] = "×" + "&&" + "&#215;" + "&&" + "&times;"
$Codes[116] = "Ø" + "&&" + "&#216;" + "&&" + "&Oslash;"
$Codes[117] = "Ù" + "&&" + "&#217;" + "&&" + "&Ugrave;"
$Codes[118] = "Ú" + "&&" + "&#218;" + "&&" + "&Uacute;"
$Codes[119] = "Û" + "&&" + "&#219;" + "&&" + "&Ucirc;"
$Codes[120] = "Ü" + "&&" + "&#220;" + "&&" + "&Uuml;"
$Codes[121] = "Ý" + "&&" + "&#221;" + "&&" + "&Yacute;"
$Codes[122] = "Þ" + "&&" + "&#222;" + "&&" + "&THORN;"
$Codes[123] = "ß" + "&&" + "&#223;" + "&&" + "&szlig;"
$Codes[124] = "à" + "&&" + "&#224;" + "&&" + "&agrave;"
$Codes[125] = "á" + "&&" + "&#225;" + "&&" + "&aacute;"
$Codes[126] = "â" + "&&" + "&#226;" + "&&" + "&acirc;"
$Codes[127] = "ã" + "&&" + "&#227;" + "&&" + "&atilde;"
$Codes[128] = "ä" + "&&" + "&#228;" + "&&" + "&auml;"
$Codes[129] = "å" + "&&" + "&#229;" + "&&" + "&aring;"
$Codes[130] = "æ" + "&&" + "&#230;" + "&&" + "&aelig;"
$Codes[131] = "ç" + "&&" + "&#231;" + "&&" + "&ccedil;"
$Codes[132] = "è" + "&&" + "&#232;" + "&&" + "&egrave;"
$Codes[133] = "é" + "&&" + "&#233;" + "&&" + "&eacute;"
$Codes[134] = "ê" + "&&" + "&#234;" + "&&" + "&ecirc;"
$Codes[135] = "ë" + "&&" + "&#235;" + "&&" + "&euml;"
$Codes[136] = "ì" + "&&" + "&#236;" + "&&" + "&igrave;"
$Codes[137] = "í" + "&&" + "&#237;" + "&&" + "&iacute;"
$Codes[138] = "î" + "&&" + "&#238;" + "&&" + "&icirc;"
$Codes[139] = "ï" + "&&" + "&#239;" + "&&" + "&iuml;"
$Codes[140] = "ð" + "&&" + "&#240;" + "&&" + "&eth;"
$Codes[141] = "ñ" + "&&" + "&#241;" + "&&" + "&ntilde;"
$Codes[142] = "ò" + "&&" + "&#242;" + "&&" + "&ograve;"
$Codes[143] = "ó" + "&&" + "&#243;" + "&&" + "&oacute;"
$Codes[144] = "ô" + "&&" + "&#244;" + "&&" + "&ocirc;"
$Codes[145] = "õ" + "&&" + "&#245;" + "&&" + "&otilde;"
$Codes[146] = "ö" + "&&" + "&#246;" + "&&" + "&ouml;"
$Codes[147] = "÷" + "&&" + "&#247;" + "&&" + "&divide;"
$Codes[148] = "ø" + "&&" + "&#248;" + "&&" + "&oslash;"
$Codes[149] = "ù" + "&&" + "&#249;" + "&&" + "&ugrave;"
$Codes[150] = "ú" + "&&" + "&#250;" + "&&" + "&uacute;"
$Codes[151] = "û" + "&&" + "&#251;" + "&&" + "&ucirc;"
$Codes[152] = "ü" + "&&" + "&#252;" + "&&" + "&uuml;"
$Codes[153] = "ý" + "&&" + "&#253;" + "&&" + "&yacute;"
$Codes[154] = "þ" + "&&" + "&#254;" + "&&" + "&thorn;"
$Codes[155] = "ÿ" + "&&" + "&#255;" + "&&" + "&yuml;"

$MaxTries = LEN("$string") / 5
WHILE INSTR($string, "&") <> 0 AND $Tries < $MaxTries
$Tries = $Tries + 1
$CodeFound = 0
$CodeStart = INSTR("$string", "&")
$CodeEnd = INSTR("$string", ";") + 1
$CodeToReplace = SUBSTR("$string", $CodeStart, $CodeEnd - $CodeStart)
$CodeLength = LEN("$CodeToReplace")

;---TEST
? "CodeStart: " + $CodeStart
? "CodeEnd: " + $CodeEnd
? "CodeToReplace: " + CHR(34) + $CodeToReplace + CHR(34)
? "CodeLength: " + $CodeLength
;---/TEST

IF $CodeLength > 3 AND $CodeLength < 9
;|All codes are between 4-8 characters long.
$CodeAPos = ASCAN($Codes, $CodeToReplace)
? "CodeAPos: " + $CodeAPos
IF NOT $CodePos < 0
;|The code exists
$CodeFound = 1
$CharToInsert = $Codes[$CodeAPos]
;---TEST
? "CharToInsert: " + $CharToInsert
;---/TEST
$CharToInsert = SPLIT($CharToInsert, "&&")[0]
;---TEST
? "CharToInsert: " + $CharToInsert
;---/TEST
$string = SUBSTR("$string", 1, $CodeStart - 1) + $CharToInsert + SUBSTR("$string", $CodeEnd, LEN("$string"))
ENDIF
ENDIF
IF $CodeFound <> 1
;|we need to skip the part which isn't convertible before the next loop
$stringNoChar = $stringNoChar + SUBSTR("$string", 1, $CodeEnd)
$string = SUBSTR("$string", $CodeEnd + 1, LEN("$string"))
ENDIF
LOOP
$HTMLtoText = $string
IF INSTR($HTMLtoText, "&#") <> 0
EXIT 3
ELSE
EXIT 1
ENDIF
ENDFUNCTION



[ 09. June 2003, 16:55: Message edited by: masken ]
_________________________
#101890 - 2003-06-09 04:58 PM Re: HTMLtoText, has anyone done that?
Lonkero
no.
afaik, the codes from webpages are relative to their codepage.
thus, your array is still "useless" as it still is replaceable with simple chr()...
_________________________
#101891 - 2003-06-09 05:13 PM Re: HTMLtoText, has anyone done that?
Richard H.
Actually, the "standard" symbols are taken from a couple of code pages - more information here

The idea is that it is up to the renderer to decide what the character should look like. That information is contained in the HTTP headers which are sent before the HTML data - you won't see them if you "view" the HTML text, but you can get hold of them through the COM interface.

If you are simply converting the characters from the "&#nnn;" format you should leave them as whatever appears by using the CHR() function, otherwise you are translating to a completely different character.

If you are writing a rendering application then you can change the characters to whatever you want, but you need to be aware that if you ignore the code set which the webserver is expecting you to have installed you may end up with odd results.
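
As a hedged sketch of how the "charset=" information could be used once the header is in hand: the Python below pulls the charset out of a Content-Type value and decodes the raw bytes with it. `charset_of` and `decode_body` are hypothetical helper names, and falling back to ISO-8859-1 when no charset is given is an assumption of this sketch.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw response bytes using the charset the server announced."""
    return raw.decode(charset_of(content_type))


print(charset_of("text/html; charset=UTF-8"))
print(decode_body(b"Implac\xe1vel", "text/html; charset=windows-1252"))
```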

#101892 - 2003-06-09 05:17 PM Re: HTMLtoText, has anyone done that?
masken
Oh, thanks Lonk... ASCAN() needs perfect array matches, not partial... guess the manual could be a bit more precise there than "Searches an array for an element containing the same value as an expression." [Wink]

But there's still a need for an array for the "alternative" codes (last in some of the array entries above).

Ok Richard, thanks [Smile] The HTML in this case is the returned page from another function, that Lonkero wrote:
code:
; Loads a webpage into a variable
Function GetPage($URL)
DIM $htmldata
$htmldata = CreateObject("microsoft.XMLhttp")
$htmldata.open("GET",$URL,not 1)
$htmldata.send
$getpage=$htmldata.responsetext ;or responsebody
EndFunction

So I guess what you're saying is that I should use Lonkero's CHR() based conversion for numeric format entries (with SETASCII("ON") in KiX), and an array for the Name format codes?
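
One way that split could look, sketched in Python under stated assumptions: decimal references go through `chr()` (Unicode code points in Python), and named codes go through a lookup table. `NAMED` is a deliberately tiny, hypothetical subset of the table, and `html_to_text` is an illustrative name, not the UDF from this thread.

```python
import re

# Hypothetical, deliberately small subset of the named codes.
NAMED = {
    "quot": '"',
    "amp": "&",
    "lt": "<",
    "gt": ">",
    "nbsp": "\u00a0",
    "aacute": "\u00e1",
}

# Matches either "&#digits;" or "&name;".
_ENTITY = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _replace(match):
    body = match.group(1)
    if body.startswith("#"):
        code = int(body[1:])
        # Numeric form: convert directly, no table needed.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    # Named form: look it up, and leave unknown names untouched.
    return NAMED.get(body, match.group(0))


def html_to_text(text):
    """Convert numeric and (known) named codes in one pass."""
    return _ENTITY.sub(_replace, text)


print(html_to_text("Implac&#225;vel &amp; &quot;x&quot;"))
```

Anything not in the table is left untouched, so unknown codes stay visible instead of being silently dropped. In Python itself, the standard library's `html.unescape()` already covers the full set.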


[ 09. June 2003, 17:30: Message edited by: masken ]
_________________________