#200385 - 2010-10-26 11:05 PM split on page break
booey Offline
Getting the hang of it

Registered: 2005-07-25
Posts: 76
Loc: USA
I have a large text file that contains page breaks. I'd like to split it into separate text files, one per page, at each page break. Is this possible using the split function or some other way?
Thanks.

#200386 - 2010-10-26 11:07 PM Re: split on page break [Re: booey]
Allen Administrator Offline
KiX Supporter
*****

Registered: 2003-04-19
Posts: 4562
Loc: USA
Should be possible. Can you provide a sample of what the page break looks like?
#200387 - 2010-10-26 11:15 PM Re: split on page break [Re: Allen]
booey Offline
I attached a sample of the file. I can see the page break in Textpad, but not so much in Notepad.
Thanks.


Attachment: Document2.txt
#200388 - 2010-10-27 01:00 AM Re: split on page break [Re: booey]
Allen Administrator Offline
This was a fun puzzle... Your page break was a chr(12) character.

Requires Loadfile() -
http://www.kixtart.org/forums/ubbthreads.php?ubb=showflat&Number=165959

This is a little dirty and probably could be optimized, but I think it works. It creates one file per page in the same directory as the original, named after the original with a _##### suffix before the extension. See what you get.

break off

$filename="d:\temp\222.txt"
$RC=pagebreak($filename)

function pagebreak($filename)
  dim $count,$line,$ffh,$rc,$folder,$ext,$file
  if exist($filename)
    ; split the path into folder, base name and extension (assumes a 3-letter extension)
    $folder=left($filename,instrrev($filename,"\"))
    $ext=right($filename,4)
    $file=left(right($filename,(len($filename)-len($folder))),-4)
    $count=1
    $ffh=freefilehandle()
    for each $line in loadfile($filename,@CRLF)
      if asc(left($line,1))=12
        ; a line starting with a form feed begins the next page; the line itself is not written
        $count=$count+1
      else
        ; mode 5 = create if missing + write, so each line is appended to the current page file
        if open($ffh,$folder + $file + "_" + right("0000"+ $count,5) + $ext,5)=0
          $rc=writeline($ffh,$line + @CRLF)
          $rc=close($ffh)
        endif
      endif
    next
  else
    exit 2
  endif
endfunction

#200390 - 2010-10-27 09:28 AM Re: split on page break [Re: Allen]
Richard H. Administrator Offline
Administrator
*****

Registered: 2000-01-24
Posts: 4946
Loc: Leatherhead, Surrey, UK
Out of interest, why not pass the Form Feed character to the LoadFile() UDF?

That way you'll get back an array of pages.

#200394 - 2010-10-27 02:48 PM Re: split on page break [Re: Richard H.]
Allen Administrator Offline
Good point, and no reason that shouldn't work. Like I said though, it could use some work / tlc (last night I was rushing to get it done before leaving for dinner.)
#200396 - 2010-10-27 03:13 PM Re: split on page break [Re: Allen]
Allen Administrator Offline
Here's Richard's suggestion in action. Still needs more error checking, etc.

function pagebreak($filename)
  dim $i,$ffh,$rc,$folder,$ext,$file,$pages
  if exist($filename)
    $folder=left($filename,instrrev($filename,"\"))
    $ext=right($filename,4)
    $file=left(right($filename,(len($filename)-len($folder))),-4)
    $ffh=freefilehandle()
    $pages=loadfile($filename,chr(12))
    for $i= 0 to ubound($pages)
      if open($ffh,$folder +  $file + "_" + right("0000"+ ($i+1),5) + $ext,5)=0
        $rc=writeline($ffh,$pages[$i])
        $rc=close($ffh)
      endif
    next
  else
    exit 2
  endif
endfunction

#200398 - 2010-10-27 03:53 PM Re: split on page break [Re: Allen]
booey Offline
You guys are amazing. That works great. By the way, how were you able to determine that the page break was a chr(12)?
#200400 - 2010-10-27 04:52 PM Re: split on page break [Re: booey]
Richard H. Administrator Offline
 Originally Posted By: Young 'un
By the way, how were you able to determine that the page break was a chr(12)?


Us old timers know the ASCII control set forwards and backwards.

CHR(12) is a standard character used in device control, primarily for printers, teletypes and old-style character terminals. It is the "form feed" character, which in the days of continuous fan-fold paper meant "advance to the top of the next page".

For modern page printers (laser jet) it means "eject the current page" and for character terminals it means "clear the screen".

Control characters do other things too - carriage return, horizontal and vertical tabs, bell (or beep), and the good old "introduce a non-standard sequence" escape character.

For more information (though Lord knows why you'd want to): see the WIKI page: http://en.wikipedia.org/wiki/ASCII
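
To make the control-code idea concrete, here is a minimal sketch that pokes at the form feed character from a script. The sample string is made up for illustration; only built-in functions (Chr, Asc, InStr, Split, UBound) are used.

```kixtart
; Sketch only: $sSample is an invented three-page string, not the poster's file.
Dim $sSample

$sSample = "page one" + Chr(12) + "page two" + Chr(12) + "page three"

; Chr() builds the character from its code, Asc() turns it back into the code
"Form feed is character code " + Asc(Chr(12)) + @CRLF

; InStr() reports the 1-based position of the first form feed (0 if there is none)
"First form feed at position " + InStr($sSample, Chr(12)) + @CRLF

; Splitting on the form feed yields one array element per page
"Pages in sample: " + (UBound(Split($sSample, Chr(12))) + 1) + @CRLF
```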

#200401 - 2010-10-27 05:27 PM Re: split on page break [Re: Richard H.]
booey Offline
Richard, thanks for the good background and information about ASCII.

Now that I'm able to successfully split the document by page break, I have a new dilemma. I need to search the file for some text, shown as "Findme" below in the subset of the file. The codes (4012F, 98966) under FindMe may or may not exist and they'll always be different. I need to determine if they exist or not. My question is, after I find the "FindMe" text in the file, how do I have the script search two lines below to see if any text exists?

Thanks

Start subset of file

================================================================================
Example Code Description Mod1 Mod2 Mod3 Level
--------------------------------------------------------------------------------
99441 Description here 1 1

================================================================================
Findme Description Mod1 Mod2 Mod3 Level
--------------------------------------------------------------------------------
4012F Description1 1 1
98966 Description2 1 1

#200402 - 2010-10-27 06:14 PM Re: split on page break [Re: booey]
Richard H. Administrator Offline
The following script will search your original file and output the information that you are looking for.

This assumes that lines are terminated by carriage-return / line-feed pairs.

It requires the LoadFile() UDF as before.
 Code:
Break ON

$=SetOption("Explicit","ON")

Dim $sSource,$asPages,$iPageCount,$sSentinelText,$iIndex,$asLines

$sSource=".\findme.txt"
$sSentinelText="Findme"

$asPages=LoadFile($sSource,Chr(12))

For $iPageCount=0 to UBound($asPages)
	"Working on page # "+(1+$iPageCount)+@CRLF
	$iIndex=InStr($asPages[$iPageCount],$sSentinelText)
	If Not $iIndex "   String '"+$sSentinelText+"' not found on page"+@CRLF EndIf
	While $iIndex
		$asPages[$iPageCount]=SubStr($asPages[$iPageCount],$iIndex+Len($sSentinelText))
		$asLines=Split($asPages[$iPageCount]+@CRLF+@CRLF+@CRLF,@CRLF)
		If (""+$asLines[2]+$asLines[3])=""
			"   String '"+$sSentinelText+"' found on page, but no data present"+@CRLF
		Else
			"   First code : "+Split($asLines[2])[0]+@CRLF
			"   Second code: "+Split($asLines[3])[0]+@CRLF
		EndIf
		$iIndex=InStr($asPages[$iPageCount],$sSentinelText)
	Loop
Next

#200403 - 2010-10-27 09:15 PM Re: split on page break [Re: Richard H.]
booey Offline
Richard,

Thank you very much. I'm envious of your and others' KiXtart skills. I should be able to take it from here.

Thanks to all who helped me with this.

#200404 - 2010-10-27 09:45 PM Re: split on page break [Re: booey]
Allen Administrator Offline
 Quote:
By the way, how were you able to determine that the page break was a chr(12)?


I didn't know for certain what the page break code was, but the way I confirmed it was to use "type" in the cmd shell. Once in cmd, I typed: type yourtextfile.txt, and it displayed the contents as well as the page break symbols on the screen. Then I just wrote a little code that would check the first letter of each line and display the ASCII code: asc($letter). This spit out 12 for the page break.
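
That "little code" was not posted, so the following is only a rough sketch of what such a check might look like. It assumes the sample path used earlier in the thread and relies only on the built-in Open, ReadLine, Asc and Left functions; the variable names are made up for illustration.

```kixtart
; Sketch only: prints the ASCII code of the first character of every non-empty line.
; The path is the sample path from earlier in the thread; adjust it to your file.
Dim $sFile, $iHandle, $sLine, $iLineNo, $RC

$sFile = "d:\temp\222.txt"
$iLineNo = 0
$iHandle = FreeFileHandle()

If Open($iHandle, $sFile, 2) = 0          ; mode 2 = open for reading
    $sLine = ReadLine($iHandle)
    While @ERROR = 0                      ; ReadLine sets @ERROR to non-zero at end of file
        $iLineNo = $iLineNo + 1
        If Len($sLine) > 0
            "Line " + $iLineNo + ": first character code = " + Asc(Left($sLine, 1)) + @CRLF
        EndIf
        $sLine = ReadLine($iHandle)
    Loop
    $RC = Close($iHandle)
Else
    "Could not open " + $sFile + @CRLF
EndIf
```

Any line that reports a code below 32 starts with a control character; a form feed shows up as 12.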

#200408 - 2010-10-28 02:16 AM Re: split on page break [Re: Richard H.]
Glenn Barnas Administrator Offline
KiX Supporter
*****

Registered: 2003-01-28
Posts: 4401
Loc: New Jersey
Well, personally, I think an ASCII primer should be required reading in grade school. ;)

Despite Unicode and MultiByte and even the occasional UniCycle ;) , basic scripting and page formatting still rely on control codes, and a basic understanding of them is important.

I regularly use US, FS, SOT, EOT and other "separator" characters in my scripts. Chr(31) is a valid ASCII code that works well (and somewhat officially) as a delimiter in split, join, and even message strings when I need a delimiter and can't use a printable character.

In fact, I regularly transmit arrays of arrays via socket communications in Kix where the outer array (record) is delimited with Chr(31) and the inner (field) array is delimited with Chr(30). Works exceptionally well and doesn't interfere with the payload.
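
A minimal sketch of that record/field scheme, leaving out the socket part entirely. The variable names are invented for illustration and the sample payload reuses the codes from the file excerpt earlier in the thread. For reference, ASCII names Chr(30) "record separator" and Chr(31) "unit separator"; the sketch follows the usage described above rather than those names.

```kixtart
; Sketch only: pack two records into one string and unpack them again.
Dim $sRecordSep, $sFieldSep, $aRec1, $aRec2, $sPacket, $sRecord, $aFields

$sRecordSep = Chr(31)     ; separates records (outer array) in the scheme above
$sFieldSep  = Chr(30)     ; separates fields (inner array) in the scheme above

; Pack: join the fields of each record, then join the records
$aRec1 = "4012F", "Description1"
$aRec2 = "98966", "Description2"
$sPacket = Join($aRec1, $sFieldSep) + $sRecordSep + Join($aRec2, $sFieldSep)

; Unpack: split into records, then split each record into fields
For Each $sRecord In Split($sPacket, $sRecordSep)
    $aFields = Split($sRecord, $sFieldSep)
    "Code: " + $aFields[0] + "  Description: " + $aFields[1] + @CRLF
Next
```

Because neither delimiter is printable, the payload can contain spaces, commas or tabs without confusing the split.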

Just to throw an alternative solution out there, here's a simple script that accomplishes both tasks - break a file into separate page files, and locate a string on each page and determine if the line(s) that follow contain data:
 Code:
; This script relies on the FileIO function
Call '%KIXLIBPATH%\FileIO.kxf'

$FPath = 'c:\temp\'					; location of file(s)
$File = 'test.txt'					; name of file

$aData = FileIO($FPath + $File, 'R')			; read the original file
$aData = Join($aData, @CRLF)				; Combine into a single string
$aData = Split($aData, Chr(12))				; break on FormFeed chars

; This block will break the original file into separate files per page
; This solves the original request, but is not needed to continue 
; to the next section which checks for data after a specific string
; on each page
For $Page = 0 to UBound($aData)				; enumerate pages
  $SubFile = Right('0000' + $Page, 4) + '_' + $File	; create page filename "0000_filename.txt"
  $aSubData = Split($aData[$Page],@CRLF)		; create array of lines for current page
  $ = FileIO($FPath + $SubFile, 'W', $aSubData)		; write the sub-file
Next


; at this point, we have an array of pages as a simple string
; Using a similar logic block, we can search each page for a string and then check the next two lines

For $Page = 0 to UBound($aData)				; enumerate pages
  $aSubData = Split($aData[$Page],@CRLF)		; create array of lines for current page
  $SearchStart = AScan($aSubData, 'findme', 1, , 1)	; locate the line with the search phrase
  If $SearchStart > 0				; was it found? AScan returns -1 if not
    ; the search phrase was found in the current page on the line represented by $SearchStart
    ; Check the next two lines for data, but only if at least 2 more lines are present on the page
    If UBound($aSubData) >= $SearchStart + 2
      If $aSubData[$SearchStart + 1] 			; FindMe + 1 has data
        ; do something, such as
        'Page ' ($Page + 1) ' - search line 1 contains ' $aSubData[$SearchStart + 1] ?
      EndIf
      If $aSubData[$SearchStart + 2] 			; FindMe + 2 has data
        'Page ' ($Page + 1) ' - search line 2 contains ' $aSubData[$SearchStart + 2] ?
        ; do something
      EndIf
    EndIf
  EndIf
Next
Glenn
_________________________
Actually I am a Rocket Scientist! :D
