Possible Istorage.Read Error

Started by David Maruca, November 17, 2010, 03:10:55 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

David Maruca

I've been beating my head over this one and would appreciate any help.

What is happening: I'm making successive calls to Istream.Read to parse an Excel file, and everything is running perfect up to one point. Once that point is reached, the next call to Read reads in stuff from who knows where!

I've edited just about everything out except what is needed for debugging, so this will be easy to see if you take the time to load it.

1) Save the bas and xls files to the same directory.
2) Load the bas file and set a breakpoint on line 66. (ghettoDebugAssert = ghettoDebugAssert)
3) Run the program in debug mode until the first break at position &H8B0. This is the last correct read incase you want to step through it. It will read a number format into an array.
4) The next break is at position &H8CE which is directly after the number format in 3. I've placed a zero seek just beforehand so you can verify that qNewPos is reading the correct position in the file.

The very next read into the Identifier type should read in 2 words which are 00E0 and 0014 once you take into account endians, but it reads in 041E and 0018 instead. I've attached the binary Workbook file if you want to fire it up in a hex editor to follow along. I've searched the entire file for this 1E041800 sequence but it does not exist. Can you account for the sudden garbage read? FYI I have manually checked the cbRead and hr result on every read call and they all return expected values so it can't be that.

I guess my next step will be to load this up in olly to see what is going on. Any help you can provide would be a lifesaver.


Frederick J. Harris

I havn't looked at your program or data David, and I don't know if this comment will help, but about two weeks ago I was working with IStorage and IStream interfaces a good bit, and I discovered somewhat to my dismay that while the analogy to directories and files is pretty close, its not exact.  For example, there is no counterpart that I know of to the Win Api GetFileSize() function.  When an IStream is created its allocated in a 'block' type manner, say for example 2048 bytes.  If you Write, say, 256 bytes to it, there is no way to recall that number.  If you read past the 256 bytes you wrote into the 2048 byte buffer, you'll get S_OK return values, even though you've read beyond what you wrote.  So I was wondering if your bizarre read was reading some control code of COM marking the end of a IStream/IStorage block?

In terms of the problem I was having with not being able to obtain the count of bytes written into an IStream, I decided to open a seperate IStream within my IStorage to record the count of bytes written into other IStreams within the IStorage.  That worked out quite well. 

To be perfectly honest, I didn't know there was any published info on the internal structure of *.xls files.  I know there was at one time, but I thought it was all to be done now through Excel interfaces.

José Roca

Quote
For example, there is no counterpart that I know of to the Win Api GetFileSize() function.  When an IStream is created its allocated in a 'block' type manner, say for example 2048 bytes.  If you Write, say, 256 bytes to it, there is no way to recall that number.

You can use the Stat method of the IStream interface, that fills a STATSTG structure with diverse information. One of the members of this structure, cbSize, is a QUAD that specifies the size, in bytes, of the stream or byte array.

Frederick J. Harris

#3
That was my first thought too Jose, but it doesn't seem to work.  The number that gets returned is the block size of the IStream object, whose rough counterpart would be a sector on a file system I guess.  In fact, I asked the question over on the Code Project Forum, and the reply there I got was that it doesn't seem to be possible to determine the size written into a IStream after it is closed.  The individual there told me he checked one of his company's applications that uses IStreams, and they were storing the length written into the first four bytes.  So that's how I came up with my idea of just creating a seperate IStream within the IStorage to persist metadata, as it were, relating to other IStreams within the IStorage.  By the way, I downloaded and examined your IStorage/IStream examples and was really happy to see you had implemented those Structured Storage Interfaces in your Includes.  I had been working on this in C++ but when I did a PowerBASIC translation it was exactly the same.  I know everyone is always thanking you for doing all those wonderful translations you do, and sometimes I don't because so many other folks are always thanking you, but I too appreciate them.

If you want I can easily post an example demonstrating the problem I described above.  Maybe there is a way to do it but I sank about a day into it and finally did what I did and it worked out OK.

José Roca

That sounds strange. There is an API function available in Shlwapi.inc, but is a wrapper that calls the Stat method and returns the cbSize value:


DECLARE FUNCTION IStream_Size IMPORT "SHLWAPI.DLL" ALIAS "IStream_Size" ( _
  BYVAL pstm AS IStream _                              ' __in IStream *pstm
, BYREF pui AS QUAD _                                  ' __out ULARGE_INTEGER *pui
) AS LONG                                              ' HRESULT


If it also fails, then a possible workaround can be to call the Read method of the IStream interface with a buffer greater than the possible length of the stream (with a maximum of 2147483647& characters because PB quads are signed) and then the size will be the number of characters read returned in the pcbRead parameter.

I have a wrapper function in Ole2Utils.inc that has worked in my tests:


' ========================================================================================
' Reads numchars at the current Seek position. If numchars = 0 or -1, it returns all the
' content of the stream.
' ========================================================================================
FUNCTION IStream_ReadText (BYVAL pIStream AS IStream, BYVAL numchars AS LONG) AS STRING

  LOCAL hr AS LONG
  LOCAL plibNewPosition AS QUAD
  LOCAL pcbRead AS DWORD
  LOCAL tstatstg AS STATSTG
  LOCAL strText AS STRING

  IF ISNOTHING(pIStream) THEN EXIT FUNCTION
  IF numchars < -1 THEN EXIT FUNCTION
  IF numchars = 0 THEN numchars = -1

  IF numchars = -1 THEN
     hr = pIStream.Seek(0, %STREAM_SEEK_SET, plibNewPosition)
     IF hr <> %S_OK THEN EXIT FUNCTION
     hr = pIStream.Stat(tstatstg, %STATFLAG_NONAME)
     IF hr <> %S_OK THEN EXIT FUNCTION
     IF tstatstg.cbsize < 1 THEN EXIT FUNCTION
     IF tstatstg.cbsize <= 2147483647& THEN
        numchars = tstatstg.cbsize
     ELSE
        numchars = 2147483647&
     END IF
  END IF

  strText = SPACE$(numchars)
  hr = pIStream.Read(STRPTR(strText), LEN(strText), pcbRead)
  IF hr <> %S_OK THEN EXIT FUNCTION
  IF pcbRead < numchars THEN strText = LEFT$(strText, pcbRead)
  FUNCTION = strText

END FUNCTION
' ========================================================================================


Frederick J. Harris

Hi Jose!

     I just dug up the code I was working with a couple weeks ago that pretty concisely demonstrates the problem.  I've just glanced through your post above quiuckly and am going to examine it in more detail after I post this.  Itsounds like you have some ideas there that might as you say be a work around.  Anyway, this code is unfortunately C++, but I'm sure you'll understand it.  I call the IStream::Stat method to get the cbSize data which is quite awkward in C/C++ due to the LARGE_INTEGER struct (PB makes this much easier!!!!).  The particular IStream I previously wrote 16 bytes into, i.e., four four byter integers.  The Index struct just has one four byte member.  I arbitrarily ran the loop up to 8 and you can see the output.  Disheartingly, pcbRead comes out with '4' for all reads - even beyond the end of the IStream (only first four are ledgitimate).  S_OK was returned for all reads (even beyond end).  And of course cbSize returned 4096.  When I saw this all I could say/think was WOW!  THIS IS A PROBLEM! 


//CmpStr10  -- compiled as C++
#include <objbase.h>
#include <stdio.h>
#include <string.h>

typedef struct tagINDEX
{
DWORD         dwPlotRecord;         // 4
}INDEX;                              //==
                                     // 4

int main(void)
{
wchar_t szFile[]=L"C:\\Code\\VStudio\\VC++6\\Projects\\COM\\CompStor\\CmpStr07\\Release\\Data.dat";
IStream* pIdxStream=NULL;
IStorage* pStorage=NULL;
char szBuffer[128];
ULONG pcbRead=0;
DWORD grfMode;
STATSTG sts;
unsigned i;
HRESULT hr;
INDEX idx;

wprintf(L"szFile = %s\n",szFile);
grfMode=STGM_DIRECT | STGM_READ | STGM_SHARE_EXCLUSIVE;
hr=StgOpenStorage(szFile,NULL,grfMode,NULL,0,&pStorage);
if(SUCCEEDED(hr))
{
    printf("StgOpenStorage() Succeeded!\n");
    hr=pStorage->OpenStream(L"Index", NULL, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &pIdxStream);
    if(SUCCEEDED(hr))
    {
       printf("pStorage->OpenStream(Index) Succeeded!\n");
       hr=pIdxStream->Stat(&sts,1);
       if(SUCCEEDED(hr))
       {
          printf("pIdxStream->Stat(&sts,1) Succeeded!\n");
          printf("sts.cbSize.LowPart  = %u\n",sts.cbSize.LowPart);
          printf("sts.cbSize.HighPart = %u\n",sts.cbSize.HighPart);
       }
       printf("\ni\tpcbRead\tidx.dwPlotRecord\tHRESULT\n");
       printf("===============================================\n");
       for(i=1; i<=8; i++)
       {   
           hr=pIdxStream->Read(&idx,sizeof(INDEX),&pcbRead);
           switch(hr)
           {
              case S_OK:
                strcpy(szBuffer,"S_OK");
                break;
              case S_FALSE:
                strcpy(szBuffer,"S_FALSE");
                break;
              default:
                strcpy(szBuffer,"Don't Know What hr Means!");
                break; 
           }
           printf("%u\t%u\t%u\t\t\t%s\n",i,pcbRead,idx.dwPlotRecord,szBuffer); 
       }
       pIdxStream->Release();
    }
    pStorage->Release();
}
getchar();

return 0;
}

/*
szFile = C:\Code\VStudio\VC++6\Projects\COM\CompStor\CmpStr07\Release\Data.dat
StgOpenStorage() Succeeded!
pStorage->OpenStream(Index) Succeeded!
pIdxStream->Stat(&sts,1) Succeeded!
sts.cbSize.LowPart  = 4096
sts.cbSize.HighPart = 0

i       pcbRead idx.dwPlotRecord        HRESULT
===============================================
1       4       1                       S_OK
2       4       3                       S_OK
3       4       4                       S_OK
4       4       5                       S_OK
5       4       0                       S_OK
6       4       0                       S_OK
7       4       0                       S_OK
8       4       0                       S_OK
*/


Frederick J. Harris


David Maruca

#7
Quote from: Frederick J. Harris on November 17, 2010, 04:20:22 PM
To be perfectly honest, I didn't know there was any published info on the internal structure of *.xls files.  I know there was at one time, but I thought it was all to be done now through Excel interfaces.

They have been published for a while now. I've been slowly working at them off and on. You can check out two posts I made on pb forums here:

Part 1 - http://www.powerbasic.com/support/pbforums/showthread.php?t=43427
Part 2 - http://www.powerbasic.com/support/pbforums/showthread.php?t=43986

You can find links to the documentation in part 1. Part 2 has the full code from the OP. The thing is that the Test.xls both here and in my example in the pbforums are almost identical. The one in the pb forums works perfectly and dumps all cell data to a text file. All I did to cause this error was change the format of the date and then all of a sudden Istream doesn't work any more. I am quite confident that my program above has not reached the end of the stream. I extracted the binary Workbook dump using 7-Zip and it confirms my belief that I am not trying to read past the end of the stream. There is a call to IStream_GetSize which returns &H4253 which is greater than &H8CE where the error occurs.

I had originally thought I had a misunderstanding from the BIFF docs about the layout of the xls file when I started getting crashes once I changed the date layout. Now that I've finally taken the time to track this error down I don't believe that is the case. The crash occurred because due to the garbage read it was causing array out of bounds errors.

I did try loading the program in olly to debug it, but it looks like PB obfuscates the asm output because I've never seen anything like that. That avenue is too much time for what it's worth.

My next thought would be to either see if I get the same behavior in C++ or abandon Istream entirely and just read the data directly.

Edit: I can and do use Excel interfaces, but I deal with Excel files in such a volume that Excel causes too much slowdown. I've already integrated xlsx parsing using XML parsers and ECMA 376. I am very close now to getting what I want from xls files. Once that is done it will be a dream come true.

Frederick J. Harris

Quote
All I did to cause this error was change the format of the date and then all of a sudden Istream doesn't work any more.

I know this isn't funny to you David, but to me it is!  We use Excel *.xls files in a major way in my Timber Sale System here where I work.  Whenever I get a call that something crashed or isn't working right the first thing I do is check my log file and check any date info (there are only a few in quite a lot of data).  Almost invariably that's where the problem is.  Excel and dates will absolutely brutalize you!

Frederick J. Harris

#9
Here's basically that same C++ program converted to PowerBASIC Console Compiler 5.  I've attached the doc compound file it reads, which contains mostly binary data.  Its in Data.zip and named Data.dat.  Only the 1st four reads are valid data - everything else is garbage.  The output follows the program, but anyone should be able to reproduce it him/herself from the attached file.


#Compile Exe
#Dim All
#Include "ObjIdl.inc"

Function PBMain() As Long
 Local hr,iValue,pcbRead As Long
 Local pIdxStream As IStream
 Local pStorage As IStorage
 Local strFile As String
 Local stats AS STATSTG
 Register i As Long

 strFile = UCode$("Data.dat")
 hr=StgOpenStorage(Strptr(strFile), Nothing, %STGM_DIRECT Or %STGM_READ Or %STGM_SHARE_EXCLUSIVE, %NULL, 0, pStorage)
 If SUCCEEDED(hr) Then
    Print "StgOpenStorage() Succeeded!"
    strFile = UCode$("Index")
    hr=pStorage.OpenStream(Byval Strptr(strFile), %NULL, %STGM_READ Or %STGM_SHARE_EXCLUSIVE, %NULL, pIdxStream)
    If SUCCEEDED(hr) Then
       Print "pStorage.OpenStream(Index) Succeeded!"
       hr=pIdxStream.Stat(stats, %STATFLAG_NONAME)
       If SUCCEEDED(hr) Then
          Print "pIdxStream.Stat(stats,1) Succeeded!"
          Print "stats.cbSize  = " stats.cbSize
          Print
          Print " i          iValue        pcbRead          HRESULT"
          Print " =================================================="
          For i=1 To 12
            hr=pIdxStream.Read(Varptr(iValue),SizeOf(iValue),pcbRead)
            If SUCCEEDED(hr) Then
               Print i, iValue, pcbRead, "SUCCEEDED"
            Else
               Print i, iValue, pcbRead, "FAILED"
            End If
          Next i
       End If
       Set pIdxStream = Nothing
    End If
    Set pStorage = Nothing
 End If
 Waitkey$

 PBMain=0
End Function

'StgOpenStorage() Succeeded!
'pStorage.OpenStream(Index) Succeeded!
'pIdxStream.Stat(stats,1) Succeeded!
'stats.cbSize  =  4096
'
' i          iValue        pcbRead          HRESULT
' ==================================================
' 1             1             4            SUCCEEDED
' 2             3             4            SUCCEEDED
' 3             4             4            SUCCEEDED
' 4             5             4            SUCCEEDED
' 5             0             4            SUCCEEDED
' 6             0             4            SUCCEEDED
' 7             0             4            SUCCEEDED
' 8             0             4            SUCCEEDED
' 9             0             4            SUCCEEDED
' 10            0             4            SUCCEEDED
' 11            0             4            SUCCEEDED
' 12            0             4            SUCCEEDED



José Roca

This works:


#COMPILE EXE
#DIM ALL
#INCLUDE "ObjIdl.inc"

TYPE INDEX
   dwPlotRecord AS DWORD
END TYPE

' ========================================================================================
' Main
' ========================================================================================
SUB TestCreateStorage

   LOCAL hr AS LONG
   LOCAL pStorage AS IStorage
   LOCAL pStream AS IStream
   LOCAL wszName AS STRING

   wszName = UCODE$("Data.dat" & $NUL)
   hr = StgCreateDocFile(STRPTR(wszName), _
        %STGM_CREATE OR %STGM_DIRECT OR %STGM_SHARE_EXCLUSIVE OR %STGM_READWRITE, _
        %NULL, pStorage)
   IF FAILED(hr) THEN
      ? "StgCreateDocFile failure: " & HEX$(hr)
      EXIT SUB
   END IF

   wszName = UCODE$("Index" & $NUL)
   hr = pStorage.CreateStream(STRPTR(wszName), _
        %STGM_DIRECT OR %STGM_CREATE OR %STGM_READWRITE OR %STGM_SHARE_EXCLUSIVE, _
        0, 0, pStream)
   IF FAILED(hr) THEN
      ? "IStorage.CreateStream failure"
      EXIT SUB
   END IF

   LOCAL i AS LONG
   LOCAL idx AS INDEX
   LOCAL cbWritten AS DWORD
   FOR i = 1 TO 8
      idx.dwPlotRecord = i
      hr = pStream.Write(VARPTR(idx), SIZEOF(INDEX), cbWritten)
      ? "hr = " hr, "cbWritten = " cbWritten
   NEXT

   ? "Storage created"

   pStream = NOTHING
   pStorage = NOTHING

END SUB
' ========================================================================================

' ========================================================================================
SUB TestOpenStorage

   LOCAL hr AS LONG
   LOCAL pStorage AS IStorage
   LOCAL pStream AS IStream
   LOCAL wszName AS STRING

   wszName = UCODE$("Data.dat" & $NUL)
   hr = StgOpenStorageEx(STRPTR(wszName), %STGM_SHARE_EXCLUSIVE OR %STGM_READ, _
        %STGFMT_STORAGE, 0, BYVAL %NULL, %NULL, $IID_IStorage, pStorage)
   IF FAILED(hr) THEN
      ? "StgOpenStorageEx failure: " & HEX$(hr)
      EXIT SUB
   END IF

   wszName = UCODE$("Index" & $NUL)
   hr = pStorage.OpenStream(STRPTR(wszName), %NULL, _
        %STGM_SIMPLE OR %STGM_READ OR %STGM_SHARE_EXCLUSIVE, _
        %NULL, pStream)
   IF FAILED(hr) THEN
      ? "IStorage,OpenStream failure: " & HEX$(hr)
      EXIT SUB
   END IF

   LOCAL i AS LONG
   LOCAL idx AS INDEX
   LOCAL cbRead AS DWORD
   FOR i = 1 TO 8
      hr = pStream.Read(VARPTR(idx), SIZEOF(INDEX), cbRead)
      ? "hr = " hr, "cbRead = " cbRead, "Data = " idx.dwPlotRecord
   NEXT


   pStream = NOTHING
   pStorage = NOTHING

END SUB
' ========================================================================================

' ========================================================================================
FUNCTION PBMAIN

'   TestCreateStorage   ' Unrem this to create the file
   TestOpenStorage

   WAITKEY$

END FUNCTION
' ========================================================================================


Frederick J. Harris

That's interesting work you are doing with the BIFF format David.  I'll have to think on it for awhile to see if I can make use of it.  At tis time I'm strictly using automation to read/write Excel data.  It does really bug me though!  I've written this horrendously fast PowerBASIC Api program to crunch our somewhat voluminous data, and the bottleneck is always Excel.  What a slug!

I kind of have a love/hate relationship with Excel.  I love it because lots of folks have used it to make all these forms we use, and it didn't take me too long to figure out that every time somebody makes a form to enter some kind of data, that's one data entry screen I don't have to create myself.  The hate part comes in because its a big slow pig.  So yea, what you are doing is interesting.  The last word I had on it was that it wasn't documented anymore, and only COM was the 'official' access to it.  But that comment is only something I picked up on some C++ discussion forum somewhere.  I havn't researched it like you.

David Maruca

Quote from: Frederick J. Harris on November 17, 2010, 07:55:35 PM
I know this isn't funny to you David, but to me it is!  We use Excel *.xls files in a major way in my Timber Sale System here where I work.  Whenever I get a call that something crashed or isn't working right the first thing I do is check my log file and check any date info (there are only a few in quite a lot of data).  Almost invariably that's where the problem is.  Excel and dates will absolutely brutalize you!

Most of my data scrubbing code relates to Excel dates. It's because Excel only stores numbers or text. Dates are just a display effect. Today I helped someone debug an Excel macro that wasn't working because all of the dates "11/15/2010" were actually stored as text so comparisons were invalid. The only way to reliably convert it was to convert it manually by loading an array and using variable = CDbl(CDate(variable)) - 1462. 1462 only if the workbook uses 1904 dates. No combination of copy/paste and formatting would remove the text flag and make excel parse the date outside of VB.

David Maruca

Quote from: Frederick J. Harris on November 17, 2010, 08:21:26 PM
That's interesting work you are doing with the BIFF format David.  I'll have to think on it for awhile to see if I can make use of it.  At tis time I'm strictly using automation to read/write Excel data.  It does really bug me though!  I've written this horrendously fast PowerBASIC Api program to crunch our somewhat voluminous data, and the bottleneck is always Excel.  What a slug!

Are you familiar with the Range.Value2 property? It's the fastest way to extract from and put into Excel files using automation.

José Roca

And the Stat method returns a correct value:


#COMPILE EXE
#DIM ALL
#INCLUDE "ObjIdl.inc"

TYPE INDEX
   dwPlotRecord AS DWORD
END TYPE

' ========================================================================================
' Main
' ========================================================================================
SUB TestCreateStorage

   LOCAL hr AS LONG
   LOCAL pStorage AS IStorage
   LOCAL pStream AS IStream
   LOCAL wszName AS STRING

   wszName = UCODE$("Data.dat" & $NUL)
   hr = StgCreateDocFile(STRPTR(wszName), _
        %STGM_CREATE OR %STGM_DIRECT OR %STGM_SHARE_EXCLUSIVE OR %STGM_READWRITE, _
        %NULL, pStorage)
   IF FAILED(hr) THEN
      ? "StgCreateDocFile failure: " & HEX$(hr)
      EXIT SUB
   END IF

   wszName = UCODE$("Index" & $NUL)
   hr = pStorage.CreateStream(STRPTR(wszName), _
        %STGM_DIRECT OR %STGM_CREATE OR %STGM_READWRITE OR %STGM_SHARE_EXCLUSIVE, _
        0, 0, pStream)
   IF FAILED(hr) THEN
      ? "IStorage.CreateStream failure"
      EXIT SUB
   END IF

   LOCAL i AS LONG
   LOCAL idx AS INDEX
   LOCAL cbWritten AS DWORD
   FOR i = 1 TO 8
      idx.dwPlotRecord = i
      hr = pStream.Write(VARPTR(idx), SIZEOF(INDEX), cbWritten)
      ? "hr = " hr, "cbWritten = " cbWritten
   NEXT

   ? "Storage created"

   pStream = NOTHING
   pStorage = NOTHING

END SUB
' ========================================================================================

' ========================================================================================
SUB TestOpenStorage

   LOCAL hr AS LONG
   LOCAL pStorage AS IStorage
   LOCAL pStream AS IStream
   LOCAL wszName AS STRING

   wszName = UCODE$("Data.dat" & $NUL)
   hr = StgOpenStorageEx(STRPTR(wszName), %STGM_SHARE_EXCLUSIVE OR %STGM_READ, _
        %STGFMT_STORAGE, 0, BYVAL %NULL, %NULL, $IID_IStorage, pStorage)
   IF FAILED(hr) THEN
      ? "StgOpenStorageEx failure: " & HEX$(hr)
      EXIT SUB
   END IF

   wszName = UCODE$("Index" & $NUL)
   hr = pStorage.OpenStream(STRPTR(wszName), %NULL, _
        %STGM_SIMPLE OR %STGM_READ OR %STGM_SHARE_EXCLUSIVE, _
        %NULL, pStream)
   IF FAILED(hr) THEN
      ? "IStorage,OpenStream failure: " & HEX$(hr)
      EXIT SUB
   END IF

   LOCAL stats AS STATSTG
   hr = pStream.Stat(stats, %STATFLAG_NONAME)
   ? "Stat - hr = " hr, "cbSize = " stats.cbSize

   LOCAL i AS LONG
   LOCAL idx AS INDEX
   LOCAL cbRead AS DWORD
   FOR i = 1 TO 8
      hr = pStream.Read(VARPTR(idx), SIZEOF(INDEX), cbRead)
      ? "hr = " hr, "cbRead = " cbRead, "Data = " idx.dwPlotRecord
   NEXT


   pStream = NOTHING
   pStorage = NOTHING

END SUB
' ========================================================================================

' ========================================================================================
FUNCTION PBMAIN

'   TestCreateStorage   ' Unrem this to create the file
   TestOpenStorage

   WAITKEY$

END FUNCTION
' ========================================================================================


Probably it is because you both are using the deprecated StgOpenStorage function instead of StgOpenStorageEx. If I try to use StgOpenStorage in my computer, it returns a %STG_E_OLDFORMAT error.