I have a large text file that contains page breaks. I'd like to split it into separate text files, one per page, based on the page breaks. Is this possible using the Split function or some other way? Thanks.
This is a little dirty and probably could be optimized, but I think it works. It will create files named after the original with a _#####.ext suffix, in the same directory as the original. See what you get.
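For anyone reading along without the script, here is a rough sketch of the same idea in Python rather than KiXtart. It assumes the page break is a form feed, Chr(12), and the `split_pages` name and the five-digit numbering are only my illustration of the _#####.ext naming described above, not the original code.

```python
from pathlib import Path

FORM_FEED = "\x0c"  # ASCII 12, assumed to be the page-break character


def split_pages(source: Path) -> list[Path]:
    """Write each form-feed-delimited page of `source` to its own file."""
    # newline="" keeps the original line endings untouched.
    with source.open("r", newline="") as handle:
        pages = handle.read().split(FORM_FEED)

    written = []
    for number, page in enumerate(pages, start=1):
        # report.txt -> report_00001.txt, report_00002.txt, ...
        target = source.with_name(f"{source.stem}_{number:05d}{source.suffix}")
        with target.open("w", newline="") as handle:
            handle.write(page)
        written.append(target)
    return written
```

Calling `split_pages(Path("yourtextfile.txt"))` returns the list of files it wrote. If the source ends with a form feed, the last file will simply be empty.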
Good point, and no reason that shouldn't work. Like I said though, it could use some work / tlc (last night I was rushing to get it done before leaving for dinner.)
Registered: 2000-01-24
Posts: 4946
Loc: Leatherhead, Surrey, UK
Originally Posted By: Young 'un
By the way, how were you able to determine that page break was a chr(12)?
Us old timers know the ASCII control set forwards and backwards.
CHR(12) is a standard character used in device control, primarily for printers, teletypes and old style character terminals. It is a "form feed" character, which in the days of continuous fan-fold paper meant "advance to the top of the next page".
For modern page printers (laser jet) it means "eject the current page" and for character terminals it means "clear the screen".
Control characters do other things too - carriage return, horizontal and vertical tabs, bell (or beep), and the good old "introduce a non-standard sequence" escape character.
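As a quick reference, the control characters mentioned above can be put in a small lookup table. This is only a sketch in Python (not KiXtart); the codes are the standard ASCII ones, while the `CONTROL_CHARS` table and the `describe` helper are names made up for illustration.

```python
# A few of the ASCII control characters mentioned above, keyed by code.
CONTROL_CHARS = {
    7: ("BEL", "bell or beep"),
    9: ("HT", "horizontal tab"),
    10: ("LF", "line feed"),
    11: ("VT", "vertical tab"),
    12: ("FF", "form feed (page break)"),
    13: ("CR", "carriage return"),
    27: ("ESC", "escape, introduces a device-specific sequence"),
}


def describe(char: str) -> str:
    """Return a one-line description of a single character by its code."""
    code = ord(char)
    name, meaning = CONTROL_CHARS.get(code, ("?", "not in this table"))
    return f"Chr({code}) = {name}: {meaning}"
```

For example, `describe("\f")` reports the form feed, since `"\f"` is the same character as Chr(12).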
Richard, thanks for the good background and information about ASCII.
Now that I'm able to successfully split the document by page break, I have a new dilemma. I need to search the file for some text, shown as "Findme" below in the subset of the file. The codes (4012F, 98966) under FindMe may or may not exist and they'll always be different. I need to determine if they exist or not. My question is, after I find the "FindMe" text in the file, how do I have the script search two lines below to see if any text exists?
Thanks
Start subset of file
================================================================================
Example Code Description Mod1 Mod2 Mod3 Level
--------------------------------------------------------------------------------
99441 Description here 1 1
The following script will search your original file and output the information that you are looking for.
This assumes that lines are terminated by carriage-return / line-feed pairs.
It requires the LoadFile() UDF as before.
Code:
Break ON
$=SetOption("Explicit","ON")

Dim $sSource,$asPages,$iPageCount,$sSentinelText,$iIndex,$asLines

$sSource=".\findme.txt"
$sSentinelText="Findme"

$asPages=LoadFile($sSource,Chr(12))

For $iPageCount=0 to UBound($asPages)
    "Working on page # "+(1+$iPageCount)+@CRLF
    $iIndex=InStr($asPages[$iPageCount],$sSentinelText)
    If Not $iIndex " String '"+$sSentinelText+"' not found on page"+@CRLF EndIf
    While $iIndex
        $asPages[$iPageCount]=SubStr($asPages[$iPageCount],$iIndex+Len($sSentinelText))
        $asLines=Split($asPages[$iPageCount]+@CRLF+@CRLF+@CRLF,@CRLF)
        If (""+$asLines[2]+$asLines[3])=""
            " String '"+$sSentinelText+"' found on page, but no data present"+@CRLF
        Else
            " First code : "+Split($asLines[2])[0]+@CRLF
            " Second code: "+Split($asLines[3])[0]+@CRLF
        EndIf
        $iIndex=InStr($asPages[$iPageCount],$sSentinelText)
    Loop
Next
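The same lookup can be sketched line by line in Python (not KiXtart) for comparison. It mirrors the script above in checking the lines two and three below the sentinel and taking the first whitespace-delimited token from each. The `codes_after` name and the `offsets` parameter are hypothetical, and it assumes the page has already been split off at the form feed.

```python
def codes_after(page: str, sentinel: str, offsets=(2, 3)) -> list[list[str]]:
    """For each line containing `sentinel`, return the first token on the
    lines at the given offsets below it (empty string when nothing is there)."""
    # Normalise CRLF to LF, then split on LF only.
    lines = page.replace("\r\n", "\n").split("\n")
    results = []
    for index, line in enumerate(lines):
        if sentinel not in line:
            continue
        found = []
        for offset in offsets:
            target = index + offset
            text = lines[target] if target < len(lines) else ""
            tokens = text.split()
            found.append(tokens[0] if tokens else "")
        results.append(found)
    return results
```

An empty list means the sentinel was not on the page. A result such as `["", ""]` means it was found but no codes follow it.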
By the way, how were you able to determine that page break was a chr(12)?
I didn't know for certain what the page break code was, but the way I confirmed it was to use "type" in the cmd shell. Once in the cmd shell, I typed: type yourtextfile.txt, and it displayed the contents as well as the page break symbols on the screen. Then I just wrote a little code that would check the first letter of each line and display its ASCII code: asc($letter). This spit out 12 for the page break.
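A variation on that check, sketched in Python rather than KiXtart: instead of looking only at the first letter of each line, count every control byte in the file. The `control_codes` name is hypothetical, and treating every byte below 32 as a control code is my assumption about what is worth reporting.

```python
from collections import Counter


def control_codes(data: bytes) -> Counter:
    """Count every byte below 32 that appears in `data`, keyed by ASCII code."""
    return Counter(byte for byte in data if byte < 32)
```

Feeding it the raw bytes of the file (read in binary mode) should show 13 and 10 for the line endings, plus a 12 for each page break if the file uses form feeds.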
#200408 - 2010-10-28 02:16 AM Re: split on page break
[Re: Richard H.]
Glenn Barnas, KiX Supporter
Registered: 2003-01-28
Posts: 4401
Loc: New Jersey
Well, personally, I think an ASCII primer should be required reading in grade school.
Despite Unicode and MultiByte and even the occasional UniCycle, basic scripting and page formatting still rely on control codes, and a basic understanding of them is important.
I regularly use US, FS, SOT, EOT and other "separator" characters in my scripts. Chr(31) is a valid ASCII code that works well (and somewhat officially) as a delimiter in split, join, and even message strings when I need a delimiter and can't use a printable character.
In fact, I regularly transmit arrays of arrays via socket communications in Kix where the outer array (record) is delimited with Chr(31) and the inner (field) array is delimited with Chr(30). Works exceptionally well and doesn't interfere with the payload.
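A minimal sketch of that nested-delimiter idea, in Python rather than KiXtart. The `pack` and `unpack` names are made up for illustration, and it assumes every field is a string that contains neither delimiter.

```python
RECORD_SEP = chr(31)  # outer delimiter between records, as in the post above
FIELD_SEP = chr(30)   # inner delimiter between fields, as in the post above


def pack(records: list[list[str]]) -> str:
    """Flatten an array of arrays into one delimited message string."""
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)


def unpack(message: str) -> list[list[str]]:
    """Rebuild the array of arrays from a message produced by pack()."""
    return [record.split(FIELD_SEP) for record in message.split(RECORD_SEP)]
```

ASCII names Chr(30) the record separator (RS) and Chr(31) the unit separator (US), so the roles here are the reverse of the standard names. Either assignment works as long as both ends of the connection agree on it.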
Just to throw an alternative solution out there, here's a simple script that accomplishes both tasks - break a file into separate page files, and locate a string on each page and determine if the line(s) that follow contain data:
Code:
; This script relies on the FileIO function
Call '%KIXLIBPATH%\FileIO.kxf'

$FPath = 'c:\temp\' ; location of file(s)
$File = 'test.txt' ; name of file

$aData = FileIO($FPath + $File, 'R') ; read the original file
$aData = Join($aData, @CRLF) ; Combine into a single string
$aData = Split($aData, Chr(12)) ; break on FormFeed chars

; This block will break the original file into separate files per page
; This solves the original request, but is not needed to continue
; to the next section which checks for data after a specific string
; on each page
For $Page = 0 to UBound($aData) ; enumerate pages
    $SubFile = Right('0000' + $Page, 4) + '_' + $File ; create page filename "0000_filename.txt"
    $aSubData = Split($aData[$Page],@CRLF) ; create array of lines for current page
    $ = FileIO($FPath + $SubFile, 'W', $aSubData) ; write the sub-file
Next

; at this point, we have an array of pages as a simple string
; Using a similar logic block, we can search each page for a string and then check the next two lines
For $Page = 0 to UBound($aData) ; enumerate pages
    $aSubData = Split($aData[$Page],@CRLF) ; create array of lines for current page
    $SearchStart = AScan($aSubData, 'findme', 1, , 1) ; locate the line with the search phrase
    If $SearchStart >= 0 ; was it found? AScan returns -1 if not
        ; the search phrase was found in the current page on the line represented by $SearchStart
        ; Check the next two lines for data, but only if at least 2 more lines are present on the page
        If UBound($aSubData) >= $SearchStart + 2
            If $aSubData[$SearchStart + 1] ; FindMe + 1 has data
                ; do something, such as
                'Page ' ($Page + 1) ' - search line 1 contains ' $aSubData[$SearchStart + 1] ?
            EndIf
            If $aSubData[$SearchStart + 2] ; FindMe + 2 has data
                ; do something, such as
                'Page ' ($Page + 1) ' - search line 2 contains ' $aSubData[$SearchStart + 2] ?
            EndIf
        EndIf
    EndIf
Next
Glenn