Protection Considerations

Started by Donald Darden, August 18, 2007, 11:37:55 PM

Donald Darden

The average forum discussion on protection usually centers on protecting a program from being pirated, or keeping original source code and concepts from being exposed.  Some of these views get pretty extreme, such as the guy that figures out how to write a program that says "Hello, Mark!" instead of "Hello, World!" and wants to protect it.

If you program long enough, you come to recognize that stealing embedded code takes a lot of work and experience, and is not the real risk.  Having a marketable program illegally copied and used without due payment is the nightmare to be contended with.

But what is hardly ever discussed is the vulnerability of any data that is needed by the customer, which your program has recourse to.  Take for example any customer accounts that your program might process or access.  If you specify a  file format for that data that essentially leaves the data exposed to any other program that can recognize that format, then the customer accounts are vulnerable to being hacked with little fanfare.

Now let's face it:  Customers are often slow to consider these matters themselves.  Our model of centralized processing is that we are safe within the confines of the computer room or building.  But with internet connections, the world at large is now able to try and gain access to our premises.  And not only are our customer's records and information at risk, but often the accounts and information related to other customers and businesses as well.

At the present state of art, there is little concern for the manner in which programs work and data is exposed internally to the PC or mainframe, and much more attention to trying to secure the premises and boundaries through proxy servers, firewalls, and a number of resident programs that monitor for suspicious activity or signature behaviors.

And when these precautions fail, which they have been known to do in the face of determined hackers, the fault is generally heaped on those responsible for system security, rather than being laid as well to our present inability to secure the data within the confines of the computer.

Some indications that this may change are already evident.  You can now buy software that encrypts the entire contents of your hard drive, so that even if the computer itself is stolen (such as a laptop), the information, both programs and data files, is rendered useless without the right password.

There are a number of problems with this approach:  First, it is assumed that there is only one user of the computer in question, so if it is shared, all users have to have access to the same password.  Thus, there is a question of real accountability on an individual basis.  Second, it depends upon the human agent, and this is often the point of least certainty when it comes to security.  People can betray you intentionally, or act in a way that jeopardizes your efforts to protect and secure the information entrusted to them.  Third, it might prove impractical to ever change the password, and it could happen that you lose the employee, then have no way to recover the drive's contents in their absence.

When it comes to securing the customer's data directly, it may not be possible if that customer intends to use other software tools and programs on that data as well.  Then you are forced to accept and use whatever data format is being presently subscribed to.  However, if you are contemplating building a vertical application for the customer from scratch, then it might be worth considering a proprietary data format that involves encryption or some method of encoding as a means of securing it from prying eyes and rogue programs.

Let's say you are contemplating offering the customer a secure alternative to his present mode of operation that is cobbled together with Excel and Access and some scripting tools.  You've shown him that his present environment is easy to hack, and vulnerable to everything from a single drive failure to deliberate acts of sabotage and theft.  Not only is he vulnerable, but he could be liable as well, if a court of law found him negligent.  It may not be the rule of thumb today, but a time is likely to come when companies will be liable for the safeguards that they fail to employ to protect their customer data.

You realize that if the customer accepts your secure alternative, it gives you the opportunity to lock that customer into your approach for the foreseeable future.  Any additional programming needs will have to be able to work conjunctively with your model, and most likely you will have a role as the architect or consultant needed to make that happen.  So from your perspective as well as the customer's, there is good reason to plan to go with this model.

The first part of the model involves making the program user specific, but not user dependent.  That is, each authorized user must be able to access and run the program, and there needs to be an audit trail related to each user's activities.  But the user may leave at some point, and other users may need to be added from time to time, and losing an employee should not put the company's programs or data at risk.  There has to be a way to remove an employee's access to both, so that they can no longer make use of either.

The second part of the model means ensuring the validity, privacy, and survivability of the data.  You can adopt a plan that allows the data to be made redundant on a range of drives, and secured with physical backups off premise, and you can use encryption or encoding as methods to protect the privacy of the contents.  The validity of the data depends on human factors and recourse to other sources, so it is really hard to define in detail except in the specifics of each situation.  You get into considerations of what qualifies as trusted sources, digital signatures, private and public encryption keys, methods of data acquisition, and so on.

What I do want to look at briefly is the manner of securing data locally.  The art of hiding information in plain sight is an old one, but most of the history that is involved has been limited by the human agent.  It had to be clever, but not too difficult, because people had to be able to do it by hand.  With computers now able to handle the difficult parts for us at a very rapid rate, it can now be both clever and quite involved in nature.  And there are certain operations of the computer architecture, such as XOR and shift instructions, that are ready made for data conversions.

Have you ever heard of the trick of casting out nines?  You can take any number, such as 3248097, and by a process of adding digits together, reduce the result to a single digit 0 through 8.  For instance, if you add 3+2+4 in this sequence, you get 9, but with any number 9 or greater, you deduct 9, and the result here would be 0.  The next digits, 8+0+9+7, give 24; add its digits to the running 0 (0+2+4) and you get 6.  It makes absolutely no difference which way you solve this problem for these digits, the outcome will always be 6.
This trick was one of many used in the past to help validate whether adding up a column of numbers, or performing addition or multiplication, gave a true result.
It worked because a certain amount of useful information was able to be retained in the process that could be used to verify the original values.  But it was a destructive process, because most of the original information was discarded.  In other words, if I knew that casting out nines gave me a result of 6, I could not use that six to determine the original number of digits, what the individual digits were, or the sequence that they appeared in. 
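
The digit-by-digit procedure above can be sketched in a few lines. This is only an illustration, written in Python rather than PowerBASIC for brevity; the function name cast_out_nines and the second number 51762 are made up for the example.

```python
def cast_out_nines(n: int) -> int:
    """Reduce a non-negative integer to a check digit 0..8 by adding
    its digits and deducting 9 whenever the running total reaches 9."""
    total = 0
    for digit in str(n):
        total += int(digit)
        if total >= 9:
            total -= 9
    return total


# Checking an addition: the check digit of the sum must agree with the
# check digit obtained from the check digits of the two addends.
a, b = 3248097, 51762
expected = cast_out_nines(cast_out_nines(a) + cast_out_nines(b))
print(cast_out_nines(a))                    # 6, as in the worked example
print(cast_out_nines(a + b) == expected)    # True when the addition is right

# Information is lost: swapping two digits gives the same check digit.
print(cast_out_nines(3248079))              # also 6
```

The last line shows the destructive side of the trick: many different numbers collapse to the same check digit, so the 6 cannot be turned back into the original.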

Obviously casting out nines is not a good method for sending code, because too much information becomes lost in the process.  Typically, a good encoding process cannot lose information, otherwise it loses value.  However it can gain information and often does.  It can also be restructured so that the code is not in its original sequence.  It can undergo substitution where something else is used in its place - letters and digits can be interchanged, or other symbols employed instead for example.  And it can be divided up into several different messages that can be recombined later.  Another method involves external referencing, where you might have a book, map, or chart that contains the actual result, and only use references in your code.  Not knowing what the external reference is, or not having an exact copy of it, would thwart efforts to read the code, as long as similar references were made to different targeted items rather than all to the same one.

Whatever method is used in encoding information, an exact counter-method has to be employed to unravel it and restore it to original form.  Sometimes the two methods also have to share a password or reference key to complete the process.  The problem with using a password is that it becomes embedded, but a userid/password file could be used to look up or validate a reference key, which would add significant protection.  Remove a userid and password, and a vengeful ex-employee would be no further threat, since they never knew or handled the reference key directly.

One method of encryption that has come into vogue is built on very large prime numbers and uses a pair of related keys, one called the private key and the other the public key.  This form of encryption is best where two or more parties have to be able to exchange data, and you do not trust the other party to hold the private key, but you do trust them with the public key.  The downside with this method is twofold:  First, your computer must be able to do exact arithmetic on very large numbers, a process that takes up quite a bit of time, even for a computer.  The other is that two parts of the process are known to the public, namely the public key and the manner in which it is related to the private key, so with enough time and processing power it is theoretically possible to determine the private key at some point in the future.  With today's processors, the time required is measured in millennia, but with the power of processors doubling every 18 months or so, quantum computers on the far horizon, and new strides being made in mathematics as well, a breakthrough may come where the prime-based codes of the present suddenly become as easy to break as reading yesterday's newspaper.
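
To make the relationship between the public and private parts concrete, here is a toy sketch of an RSA-style scheme in Python. The post does not name a specific algorithm, so RSA is an assumption on my part, and the primes 61 and 53 are absurdly small illustration values; real keys use primes hundreds of digits long.

```python
# Toy numbers only, chosen so the arithmetic can be followed by hand.
p, q = 61, 53                    # the secret primes
n = p * q                        # 3233, published as part of the public key
e = 17                           # public exponent, also published
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)              # private exponent (needs Python 3.8 or later)

message = 65
cipher = pow(message, e, n)      # anyone holding (n, e) can encrypt
plain = pow(cipher, d, n)        # only the holder of d can decrypt


def recover_private_exponent(n: int, e: int) -> int:
    """Factor n by trial division and rebuild d.
    Feasible here only because n is tiny."""
    f = next(k for k in range(2, n) if n % k == 0)
    return pow(e, -1, (f - 1) * (n // f - 1))


print(plain == message)                        # True
print(recover_private_exponent(n, e) == d)     # True
```

The second function is the attack the paragraph worries about: everything it needs is public, and only the size of n stands between the public key and the private one.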

Since the information in a private data set in a computer is not intended to be shared with any outside parties, you can focus on the best ways to protect it without concerns of making a part of that process known to third parties.  You don't need multiple keys, you can adopt a method to suit yourself, and you simply have to address the method and manner of protecting access via your program, and any other programs within your scope.       

Donald Darden

A secret message can be encrypted in its entirety by one method, or it can be subjected to multiple methods of encryption.  Once the encryption process is complete, it becomes static and unchangeable.  But if you want to encrypt a data base, there is a profound difference - the data base is dynamic, subject to frequent changes, additions, deletions, and restructuring due to various sorts.
The encryption process is not only going to get in the way of all that, it may lead to data corruption and nonrecovery if it is not implemented with some care.

One of the key differences then is that the data base method must be independent of any aspect of the data base, such as its size, record position in the data base, and so on.  Each record, possibly each field in each record, must be recoverable regardless of the treatment of any remaining records (or fields).

There are two techniques that are supportive of this concept, because they avoid any direct alteration of the data itself.  The first is diffusion, which is splitting up each record into multiple parts and distributing the results into various files scattered in different folders on the drive.  Without knowing the exact relationship and sequence for recovering the data from each independent file, putting individual records back together would be difficult.  Adding and maintaining some meaningless files would complicate the problem more, like throwing pieces from one jigsaw puzzle in with another.

The other method involves scrambling the contents of each record in a way to make it unreadable.  But there has to be a rule as to how it was scrambled, so that it can be unscrambled by applying the same rule in reverse.

You can also change the data by modifying each character or each word or dword combination.  The XOR and rotate commands in the ASM and some BASIC languages are particularly good for this, as no bits are actually lost, they are merely flipped or moved, and thus there is a counter process for restoring them.

Keep in mind that the typical PC is filled with thousands of files, and nobody knows what they all are.  Hiding additional files is not a real challenge, and by avoiding names that give away their nature or use, or extensions that indicate what type of file they are or what program is associated with them, you can make it hard to track them down.  And if your method of encryption prevents the contents from being easily read by a person spying on your contents, you have effectively blocked even the most determined hacker from knowing if he has the goods in hand or not.

One of the problems that has to be considered is the nature of the data base file or records, and even the fields.  How are they separated?  Are they null-byte terminated?  Are they separated by tabs?  Do they end in a CRLF?  Or are they all fixed length, and the offsets can be calculated externally?  Are they associated with pointers and length indicators?  Are they referenced by external indexes?  These are important considerations, because you do not want to trample over your boundaries by accident.  Nor do you want to accidentally duplicate any separation data when you change your other characters to something different.  So here are some points to consider:

If you employ fixed length records, then you can scramble the contents of each record as you like.  If you are using null-terminated strings, then you cannot allow any null-byte to be generated as a result of your encryption process.  If you use CRLF to mark the end of a record, you cannot produce a CRLF as part of the string contents (and you may have some problems if you generate either a CR or LF character by itself).  If you intend to use comma, semi-colon, or tab to mark field separations, and you are encrypting on field boundaries, then you cannot allow these characters to result from your encryption method either.

Unfortunately, XOR and rotate instructions are no guarantee that you will not accidentally produce an unacceptable result at certain times.  They are best employed if dealing with fixed length records where you have no restrictions.
In other cases, some form of code substitution or swapping characters about would be the best choice.  Code substitution would take all the allowable characters as a group, and you would use some method of choosing a successor to the present character.  In code swapping, you would just switch the existing characters around so as to scramble the contents.
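
As a rough illustration of code substitution over a group of allowable characters, here is a small Python sketch. The alphabet, the shift value 11, and the function name substitute are all assumptions made for the example; the point is only that a character outside the allowed group (comma, tab, CR, LF, null) can never come out of the process.

```python
import string

# Assumed alphabet: every character a field may contain, and no separators.
ALPHABET = string.ascii_letters + string.digits + " .-/$"
SHIFT = 11   # hypothetical key; any value from 1 to len(ALPHABET) - 1 works


def substitute(text: str, shift: int) -> str:
    """Replace each character with the one `shift` places later in
    ALPHABET, wrapping around at the end of the group."""
    out = []
    for ch in text:
        pos = ALPHABET.index(ch)   # raises ValueError on a disallowed character
        out.append(ALPHABET[(pos + shift) % len(ALPHABET)])
    return "".join(out)


field = "Mapleton"
hidden = substitute(field, SHIFT)
restored = substitute(hidden, -SHIFT)   # the counter-method: shift back
print(hidden)
print(restored == field)                # True
```

A fixed shift is about the weakest successor rule there is. A rule that varies by position or by record would follow the same pattern while being harder to spot.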

The techniques that can be employed here are virtually endless.  And this is good, because the chances are that whatever method you adopt, if it is not an obvious first choice, will leave the hacker in the dark.  Only by studying your program in operation and deducing your method, or by observing the data in memory, is he likely to capture any part of it.  And there are easier pickings around out there, so it is like any good lock, it discourages the casual burglar.
Any real theft would almost have to be done by an insider, and you can even prevent most employees from being aware of the safeguards involved.

Let's talk about one record, and show some of the simple things that can be done to render it unreadable:
ACCOUNT RECORD
000123456 Mapleton, John Edward 071-23-9810, VISA 0123-4567-8901 05/10 $129.85 07/23/2007 Linksys Router

The first method just involves dispersing the contents of this record into a number of files, four characters at a time:
Quote
0001 --> file A
2345 --> file B
6 Ma --> file C
plet --> file D
on,  --> file E
John --> file F
 Edw --> file G
ard  --> file H
071- --> file I
23-9 --> file J
810, --> file K
 VIS --> file L
A 01 --> file M
23-4 --> file N
567- --> file O
8901 --> file P
 05/ --> file Q
10 $ --> file R
129. --> file S
85 0 --> file T
7/23 --> file U
/200 --> file V
7 Li --> file W
nksy --> file X
s Ro --> file Y
uter --> file Z
I've made no pretense of scrambling anything, but without telling you the name of each file and the order in which the data is to be recovered, you have been left with the difficult job of deciphering what needs to be done to recover the data.  It becomes even harder when additional data from other records are added to each file as well.  And the sequence of files could be rotated -- for instance, for the second record, you begin writing to file B through Z, then finally A.  For the third record, you begin writing to file C through Z, then to A and B.
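
The dispersal and rotation just described might look something like the following Python sketch. It keeps the 26 "files" as lists in a dictionary rather than touching the disk, and it assumes every record is exactly 104 characters (26 chunks of four) and that records are added in order; the names disperse and recover are made up for the example.

```python
import string

RECORD = ("000123456 Mapleton, John Edward 071-23-9810, VISA 0123-4567-8901 "
          "05/10 $129.85 07/23/2007 Linksys Router")
NAMES = string.ascii_uppercase     # stand-ins for files A..Z
CHUNK = 4


def disperse(store: dict, record: str, record_no: int) -> None:
    """Append each 4-character chunk to a bucket, starting one bucket
    later for each successive record (A first, then B first, ...)."""
    chunks = [record[i:i + CHUNK] for i in range(0, len(record), CHUNK)]
    for k, chunk in enumerate(chunks):
        name = NAMES[(record_no + k) % len(NAMES)]
        store.setdefault(name, []).append(chunk)


def recover(store: dict, record_no: int, n_chunks: int) -> str:
    """Rebuild one record by walking the buckets in the same rotated order."""
    parts = []
    for k in range(n_chunks):
        name = NAMES[(record_no + k) % len(NAMES)]
        parts.append(store[name][record_no])
    return "".join(parts)


store = {}
disperse(store, RECORD, 0)
second = RECORD.replace("Mapleton", "Appleton")   # a same-length second record
disperse(store, second, 1)
print(store["A"])                        # ['0001', 'uter']
print(recover(store, 1, 26) == second)   # True
```

Bucket A ends up holding the first chunk of the first record and the last chunk of the second, which is the rotation at work: without the rule, the contents of any one bucket say very little.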

For your distracting files, you could make them useful by an overlap process.  Instead of starting at every fourth character, they could begin with an offset of two.  That way they overlap the original set, but are not identical.  In some cases they could even help recover data if any files become corrupted.
Quote
0001 --> file A             0123 --> file AB
2345 --> file B             456  --> file BC
6 Ma --> file C             Mapl --> file CD
plet --> file D             eton --> file DE
on,  --> file E             , Jo --> file EF
John --> file F             hn E --> file FG
 Edw --> file G             dwar --> file GH
ard  --> file H             d 07 --> file HI
071- --> file I             1-23 --> file IJ
23-9 --> file J             -981 --> file JK
810, --> file K             0, V --> file KL
 VIS --> file L             ISA  --> file LM
A 01 --> file M             0123 --> file MN
23-4 --> file N             -456 --> file NO
567- --> file O             7-89 --> file OP
8901 --> file P             01 0 --> file PQ
 05/ --> file Q             5/10 --> file QR
10 $ --> file R              $12 --> file RS
129. --> file S             9.85 --> file ST
85 0 --> file T              07/ --> file TU
7/23 --> file U             23/2 --> file UV
/200 --> file V             007  --> file VW
7 Li --> file W             Link --> file WX
nksy --> file X             sys  --> file XY
s Ro --> file Y             Rout --> file YZ
uter --> file Z             er00 --> file ZA
Since we've broken the record up into segments of four bytes, we can treat these as DWORDS if we want.  Now PowerBasic gives us a neat command that we can use to reverse the order of every four characters, to make the results less meaningful if chanced upon:  STRREVERSE$().  This is what our final output to the several files would look like:
Quote
1000 --> file A             3210 --> file AB
5432 --> file B              654 --> file BC
aM 6 --> file C             lpaM --> file CD
telp --> file D             note --> file DE
 ,no --> file E             oJ , --> file EF
nhoJ --> file F             E nh --> file FG
wdE  --> file G             rawd --> file GH
 dra --> file H             70 d --> file HI
-170 --> file I             32-1 --> file IJ
9-32 --> file J             189- --> file JK
,018 --> file K             V ,0 --> file KL
SIV  --> file L              ASI --> file LM
10 A --> file M             3210 --> file MN
4-32 --> file N             654- --> file NO
-765 --> file O             98-7 --> file OP
1098 --> file P             0 10 --> file PQ
/50  --> file Q             01/5 --> file QR
$ 01 --> file R             21$  --> file RS
.921 --> file S             58.9 --> file ST
0 58 --> file T             /70  --> file TU
32/7 --> file U             2/32 --> file UV
002/ --> file V              700 --> file VW
iL 7 --> file W             kniL --> file WX
yskn --> file X              sys --> file XY
oR s --> file Y             tuoR --> file YZ
retu --> file Z             00re --> file ZA
Reversing STRREVERSE$() is simple:  Just repeat the operation.  Other techniques are also possible, such as just reversing the two end characters, or the two characters in the middle.  You could even use CVDWD() to change the 4 bytes to a double-word (DWORD) number, then HEX$(,8) to change it to an eight digit hex value.  To convert it back, use VAL("&H"+stringname) and MKDWD$() to put it back as it was.
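
For readers without PowerBASIC at hand, the same two ideas can be tried in Python. This is a loose analogue rather than a translation: slicing stands in for STRREVERSE$, and struct stands in for the CVDWD/HEX$ and VAL/MKDWD$ pairs. The function names are made up, and little-endian byte order is assumed, as on a PC.

```python
import struct


def reverse_chunk(chunk: str) -> str:
    """Reverse a chunk; doing it twice restores the original."""
    return chunk[::-1]


def chunk_to_hex(chunk: str) -> str:
    """Read 4 characters as a little-endian DWORD and show it as 8 hex digits."""
    (value,) = struct.unpack("<I", chunk.encode("latin-1"))
    return f"{value:08X}"


def hex_to_chunk(hex_text: str) -> str:
    """Turn the 8 hex digits back into the original 4 characters."""
    return struct.pack("<I", int(hex_text, 16)).decode("latin-1")


print(reverse_chunk("6 Ma"))                  # aM 6
print(chunk_to_hex("0001"))                   # 31303030
print(hex_to_chunk(chunk_to_hex("nksy")))     # nksy
```

Note that the hex form doubles the size of what is stored, so it trades space for contents that no longer look like text at a glance.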

I am not advocating one approach over another.  But as a programmer, whatever concerns you have about protecting your program, you have to consider that your client has that much concern and more about protecting his data, his business, and his reputation.  I would suggest that you at least look into the matter to the extent of making him aware that he is at some risk, and work on your own methods of protecting the data that his livelihood depends upon, and that his customers trust him with.  If you do so, that could help put you at the head of the pack.

Donald Darden

Working with files also leaves some other telltales that you might want to deal with.  For instance, if the date-time stamps on all the files are recent or pretty much the same, this may help a hacker deduce the ones you are likely using.  So redating the files may help hide the fact that you are using them.  Another clue could be that all the files might reside at the same depth when it comes to the directory tree, so creating a few staggered folder levels can help mask their use.  And of course the filenames should bear little resemblance to each other, even in length or naming conventions.  You also might break up your pattern of how many characters to write so that the several files are of different sizes.
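
Redating a file takes only a line or two in most languages. Here is a minimal Python sketch using the standard os.utime call; the helper name redate and the choice of 400 days are just illustration. Be aware that this sets the access and modification times only, so a creation time kept separately by the file system may still give the file away.

```python
import os
import tempfile
import time


def redate(path: str, days_ago: float) -> float:
    """Set a file's access and modification times to `days_ago` days
    before now, and return the timestamp that was applied."""
    stamp = time.time() - days_ago * 86400
    os.utime(path, (stamp, stamp))    # (access time, modification time)
    return stamp


# Try it on a throwaway file so nothing real is touched.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
wanted = redate(demo_path, 400)
print(abs(os.path.getmtime(demo_path) - wanted) < 2)   # True
os.remove(demo_path)
```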

Years ago, I learned that as a soldier, being completely hidden is a more effective defence than just being well protected.  If they can't see you, they are less likely to attack you.  If they know where your cover is, they can and will search for its weak point or call in big guns to root you out or kill you.

This is the philosophy of hiding in plain sight.  You can't completely avoid the evidence that your program and the data files reside on the system, but you can make it so that their presence and relationships to each other are made totally inobvious.

In an effort to please those that might prefer to use familiar BASIC syntax to effect changes, I also used some fairly obvious techniques to alter the string characters and show how effective this can be.  But there is an even more effective way, which is to use pointers and employ the XOR or rotate steps that were mentioned above.  Now remember, I advised that this should only occur with fixed length data.  Well, if you split the record into fixed length segments, this would work.  So that can be discussed in more detail here.

When using pointers, you can specify the type and size of the data that is associated with that pointer.  Obviously a string pointer points to some aspect of a string.  A STRPTR points to the first character of a string.  A VARPTR, in PowerBasic, when used with dynamic strings, points to a descriptor of that string, which includes both the STRPTR and the number of characters in the string (the string length, or LEN).

Using the STRPTR, and any offset from the first character, you can reach any portion of the string.  Pretty much what MID$() permits you to do.  But the use of a pointer comes without the safeguards that MID$() has, so you must use it at your own risk.

The advantage is that you could establish a different pointer reference, such as for a DWORD, then set it to point somewhere within the string.  You can then manipulate that part of the string as though it were a DWORD value at that location.  Thus you could rotate it left, rotate it right, or perform an XOR against it with some known value, and this would change that portion of the string in place in memory.  Simple and quick, and with other offsets, you can repeat or perform other acts on the string at will.

Using XOR is a very powerful tool because it allows you to incorporate an external value into the process.  This could be your elusive reference key.  If they do not know what the original data was, or have the reference key, then efforts to unravel the code will be made super hard.  But with the power of computers, and large amounts of data to process, it is still conceivable that the reference key can be discovered through exhaustive analysis.

There is a way to extend the reference value way beyond the scope of any efforts to analyze it, and that is by incorporating it mathematically with any equation that produces an irrational number.  This means it cannot be expressed as an exact quantity n/m, where n and m are integers.  There are an infinite number of irrational numbers, such as PI or the natural base e.  If you choose a well known irrational number, you might make it too easy for a hacker to discover your trick.  But you can always find others.

Note that you can produce a number like 1/3 and get a nonterminating fraction like .33333333333333333333.  But this is no good for our purposes, because it repeats, and if it repeats, it is the same as using it over and over again.  You need a nonrepeating fraction, one that never repeats, no matter how far it is extended.  Since the reference key can then be used to generate a nonrepeating fraction, you can make an XOR mask as long as you could possibly need it to be.  By not exposing that reference key, or the manner in which the mask is generated, you have created a coding method that is virtually unbreakable.
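
One way to picture such a mask is to take the square root of a reference key that is not a perfect square and use its decimal digits. The Python sketch below is only one assumed construction, not something the post prescribes: the key 21385, the names sqrt_mask and xor_with_mask, and the choice of three digits per mask byte are all made up for illustration, and how much protection results still rests on keeping the key and the method hidden.

```python
from math import isqrt


def sqrt_mask(key: int, length: int) -> bytes:
    """Build `length` mask bytes from the decimal expansion of sqrt(key).
    Three fractional digits (000..999) are folded into each byte."""
    if isqrt(key) ** 2 == key:
        raise ValueError("key must not be a perfect square")
    digits_needed = 3 * length
    # isqrt(key * 10**(2*d)) equals floor(sqrt(key) * 10**d), so the
    # digits are exact and free of floating point rounding.
    scaled = isqrt(key * 10 ** (2 * digits_needed))
    fraction = str(scaled)[-digits_needed:]
    return bytes(int(fraction[i:i + 3]) % 256
                 for i in range(0, digits_needed, 3))


def xor_with_mask(data: bytes, key: int) -> bytes:
    """XOR the data with the mask; applying it twice restores the data."""
    mask = sqrt_mask(key, len(data))
    return bytes(b ^ m for b, m in zip(data, mask))


secret = xor_with_mask(b"Mapleton, John Edward", 21385)
print(secret != b"Mapleton, John Edward")    # True
print(xor_with_mask(secret, 21385))          # b'Mapleton, John Edward'
```

Because the mask depends on position, a record stored at a fixed offset always gets the same stretch of mask, which keeps each record recoverable independently of the others.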

Note that once you establish the reference key and method of creating the XOR mask, you are locked into its continued use.  Your only means of getting away from it is to create a whole new XOR mask, and unencrypt your data base with the old mask while re-encrypting it with the new.  You can certainly do this if the need arises, but that one reference key is too valuable to share with anyone.  You need another method for giving people the ability to use the reference key without actually disclosing it to them.

You can do this by having the reference key and XOR method embedded in the program itself.  Then require any user to have an authorized userid and password to access and run the program.  This requires a userid and password management file that only the system administrator can access.  The SA adds or removes users as required as part of the business, and the initial password is set so that it expires on first time use.  The designated user must then set a new password at that time.  The program is able to change the password portion of the SA's file for a user that can successfully log in.  The SA cannot read the password stored, since the program uses its encryption power to render the stored password unreadable and unrecoverable by normal means.
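
A common way to make a stored password unreadable and unrecoverable is to keep only a salted one-way hash of it. The post speaks of the program's own encryption, so the technique swapped in here is an assumption: the sketch uses PBKDF2 from Python's standard hashlib, and the userid "jmapleton", the iteration count, and the function names are all hypothetical.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000    # assumed work factor; higher is slower to guess


def make_entry(password: str) -> tuple:
    """Return (salt, digest) for storage; the password itself is never kept."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def check_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest from the offered password and compare."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                    salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


# The SA's file, reduced to a dictionary for illustration.
users = {"jmapleton": make_entry("first-login")}

salt, digest = users["jmapleton"]
print(check_password("first-login", salt, digest))   # True
print(check_password("guess", salt, digest))         # False

del users["jmapleton"]          # removing the user ends their access
print("jmapleton" in users)     # False
```

The SA can add, reset, or delete an entry without ever being able to read a password back out, which is the property the paragraph asks for.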

Every unique user has their own userid, and only they should know the current
password.  Every activity involving the data base can be logged against the specific user.  This gives the owner full accountability of who did what and when.  The log should also be hidden and encrypted so that its journalling of events cannot be compromised.   The absence or corruption of the userid/password file or journal log should be the cause of an alarm, and should also be correctable by the SA in an effort to restore the program and data base to normal use as quickly as possible.

A lot of programmers simply will not go to this length to protect a company's data.  First, they are not prepared to introduce this technology on their own.  Second, the customer probably does not require it, being somewhat unaware or unconcerned with the risks involved.  Third, it takes a lot of work and forethought as to how this can all be done successfully.  You will note that many products may be on the market already that attempt to handle some or all of the requirements set forth for a particular customer, and it may prove more cost effective and easier to go that route.  But this discussion will likely have caused you to think harder on the topic than you had cause to do before.

Donald Darden

#3
So much for what I consider the foundation for why protection might be necessary, and the general areas where protection might be needed.  Now let's look at some methods for achieving some degree of protection.

First, I want to introduce two functions:  One is named Encode(), and the other is named Decode().  Each is passed a string, and returns a string of the same length.  The purpose of each function should be evident from its name.

The example program code also creates an endless series of strings of various lengths, then fills them randomly with spaces and capital letters.  This is the aa string.  The bb string receives the encoded version of aa after it is processed by Encode().  The IF 1 THEN allows you to see what happens when the contents of aa are modified and used in bb.  If you change the IF 1 THEN to IF 0 THEN to suppress this code, the rest of the program will just continue until the initial contents of aa and the Decode(bb) results differ.  Properly done and implemented, this program will just loop because aa will always equal Decode(bb).

#COMPILE EXE
#DIM ALL
#DEBUG ERROR ON
#REGISTER NONE

FUNCTION encode(sourcestr AS STRING) AS STRING
  LOCAL aa AS STRING
  LOCAL a, b AS LONG
  aa=sourcestr
  b=LEN(aa)-3
  IF b>0 THEN
    a=STRPTR(aa)
    ! mov eax,a
    ! mov ecx,b
ecode:
    ! mov edx,[eax]
    ! ror edx,1
    ! mov [eax],edx
    ! inc eax
    ! loop ecode
  END IF
  FUNCTION=aa
END FUNCTION

FUNCTION decode(sourcestr AS STRING) AS STRING
  LOCAL aa AS STRING
  LOCAL a, b AS LONG
  aa=sourcestr
  b=LEN(aa)-3
  IF b>0 THEN
    a=STRPTR(aa)+b-1
    ! mov eax,a
    ! mov ecx,b
dcode:
    ! mov edx,[eax]
    ! rol edx,1
    ! mov [eax],edx
    ! dec eax
    ! loop dcode
  END IF
  FUNCTION=aa
END FUNCTION

FUNCTION PBMAIN
  LOCAL aa, bb AS STRING
  LOCAL a, b AS LONG
  RANDOMIZE
  COLOR 15,1
  CLS
  DO
    aa=SPACE$(RND*5000+1)
    FOR a=1 TO LEN(aa)
      IF RND>.85 THEN INCR a
      MID$(aa,a)=CHR$(65+RND*26)
    NEXT
    LOCATE 1,1
    INCR b
    PRINT b,LEN(aa)
    bb=encode(aa)
    IF 1 THEN
      PRINT LEFT$(aa,SCREENX-1)
      PRINT LEFT$(bb,SCREENX-1)
      WAITKEY$
    END IF
    IF decode(bb)<>aa THEN
      PRINT
      COLOR 15,1
      PRINT aa
      COLOR 14,2
      PRINT bb
      WAITKEY$
    END IF
  LOOP
END FUNCTION

Note that Encode() and Decode() perform complementary functions.  One does a rotate right, the other a rotate left.  They agree in the amount of rotation used.  One processes the string from left to right, the other from right to left.
The rotations could have been on byte, word, or dword boundaries, but by using dword (four consecutive bytes) and advancing only one byte at a time, the shifting becomes compounded over portions of the string.  Had I elected to use word or byte boundaries instead, the results would have been different.  I could have begun with rotate left, followed by a rotate right for the decode stage.
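
For anyone who wants to trace the overlapping rotation without an assembler, here is my reading of the two routines as a Python sketch. It is not a replacement for the listing above: it works on bytes rather than a PowerBASIC string, and the helper names _ror32 and _rol32 are made up.

```python
def _ror32(v: int) -> int:
    """Rotate a 32-bit value right by one bit."""
    return ((v >> 1) | (v << 31)) & 0xFFFFFFFF


def _rol32(v: int) -> int:
    """Rotate a 32-bit value left by one bit."""
    return ((v << 1) | (v >> 31)) & 0xFFFFFFFF


def encode(data: bytes) -> bytes:
    buf = bytearray(data)
    for i in range(len(buf) - 3):              # left to right, one byte per step
        v = int.from_bytes(buf[i:i + 4], "little")
        buf[i:i + 4] = _ror32(v).to_bytes(4, "little")
    return bytes(buf)


def decode(data: bytes) -> bytes:
    buf = bytearray(data)
    for i in range(len(buf) - 4, -1, -1):      # right to left, undoing each step
        v = int.from_bytes(buf[i:i + 4], "little")
        buf[i:i + 4] = _rol32(v).to_bytes(4, "little")
    return bytes(buf)


sample = b"MAPLETON  JOHN EDWARD"
print(encode(sample) != sample)            # True
print(decode(encode(sample)) == sample)    # True
```

Because each step overlaps the previous one by three bytes, decoding has to undo the steps in exactly the opposite order, which is why one routine walks forward and the other backward. Strings shorter than four bytes pass through unchanged, as in the original.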

Bit shifting obviously works, but the XOR function really helps mask the original contents.  XOR has to work against some known quantity or value, which has to be exactly the same at each stage of the encode and decode sequence, but can be made to vary between stages.  This woul dmake it much harder to detect the pattern, since it would not only involve an unknown that the outsider would have to discover, but it might not even be a constant.
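
A small Python sketch of an XOR whose value varies from one byte to the next may help here. The starting key 21385 is just an arbitrary example value, and rotating the key left by one bit between bytes is one assumed way of making it vary.

```python
def xorcode(data: bytes, key: int = 21385) -> bytes:
    """XOR each byte with the low byte of a 32-bit key, rotating the key
    left by one bit after every byte.  Running it twice restores the data."""
    out = bytearray(data)
    for i in range(len(out)):
        out[i] ^= key & 0xFF                            # mask for this byte
        key = ((key << 1) | (key >> 31)) & 0xFFFFFFFF   # vary it for the next
    return bytes(out)


scrambled = xorcode(b"Mapleton, John Edward")
print(scrambled != b"Mapleton, John Edward")    # True
print(xorcode(scrambled))                       # b'Mapleton, John Edward'
```

The same call serves for both encoding and decoding, as long as it starts from the same key each time. With a 32-bit key rotated one bit per byte, the mask repeats every 32 bytes, so a longer or less regular key schedule would be needed to avoid that pattern.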

Donald Darden

#4
If you are going to include XOR as one of the functions for encoding and decoding your lines of text or whatever, then I suggest just creating a separate function for that purpose and calling it as needed.  The advantage is that XOR forms its own natural complement, unless you decide to make it more complicated.

I am going to include a separate XOR process with the previous example used above.  I am using an embedded constant at the start of each call to the XOR operation, but I rotate it between uses, which changes its effects on any subsequent character codes that it gets XOR'ed with.

#COMPILE EXE
#DIM ALL
#DEBUG ERROR ON
#REGISTER NONE


FUNCTION xorcode(sourcestr AS STRING) AS STRING
  LOCAL aa AS STRING
  LOCAL a, b AS LONG
  aa=sourcestr
  b=LEN(aa)
  IF b>0 THEN
    a=STRPTR(aa)
    ! mov esi,a
    ! mov ecx,b
    ! mov edx,21385  'example of a reference key
xorit:
    ! mov al,[esi]
    ! xor al,dl
    ! mov [esi],al
    ! inc esi        'advance to the next byte of the string
    ! rol edx,1
    ! loop xorit
  END IF
  FUNCTION=aa
END FUNCTION

FUNCTION encode(sourcestr AS STRING) AS STRING
  LOCAL aa AS STRING
  LOCAL a, b AS LONG
  aa=sourcestr
  a=STRPTR(aa)
  b=LEN(aa)-3
  IF b>0 THEN
    ! mov eax,a
    ! mov ecx,b
ecode:
    ! mov edx,[eax]
    ! ror edx,1
    ! mov [eax],edx
    ! inc eax
    ! loop ecode
  END IF
  FUNCTION=xorcode(aa)
END FUNCTION

FUNCTION decode(sourcestr AS STRING) AS STRING
  LOCAL aa AS STRING
  LOCAL a, b AS LONG
  aa=xorcode(sourcestr)
  b=LEN(aa)-3
  IF b>0 THEN
    a=STRPTR(aa)+b-1
    ! mov eax,a
    ! mov ecx,b
dcode:
    ! mov edx,[eax]
    ! rol edx,1
    ! mov [eax],edx
    ! dec eax
    ! loop dcode
  END IF
  FUNCTION=aa
END FUNCTION

FUNCTION PBMAIN
  LOCAL aa, bb AS STRING
  LOCAL a, b AS LONG
  RANDOMIZE
  COLOR 15,1
  CLS
  DO
    aa=SPACE$(RND*5000+1)
    FOR a=1 TO LEN(aa)
      IF RND>.85 THEN INCR a
      MID$(aa,a)=CHR$(65+RND*26)
    NEXT
    LOCATE 1,1
    INCR b
    PRINT b,LEN(aa)
    bb=encode(aa)
    IF 1 THEN
      PRINT LEFT$(aa,SCREENX-1)
      PRINT LEFT$(bb,SCREENX-1)
      WAITKEY$
    END IF
    IF decode(bb)<>aa THEN
      PRINT
      COLOR 15,1
      PRINT LEFT$(aa,SCREENX*(SCREENY-1)/2)
      COLOR 14,2
      PRINT LEFT$(bb,SCREENX*(SCREENY-1)/2)
      WAITKEY$
    END IF
  LOOP
END FUNCTION

The advantage of using multiple encoding methods should be obvious.  But just to reiterate some of them, you are forcing the hacker to deduce the following:
(1)  What method(s) were involved
(2)  How those methods were implemented
(3)  The sequence in which those methods were used
(4)  In the case of XOR, what the constant or source reference was.
(5)  For any operations spanning multiple bytes, whether you worked left to right or the reverse, or even in any particular byte, word, or dword sequence
(6)  With multiple byte encoding methods, you also have an issue with the starting and ending points - if someone attempted to decode a whole file at once, it would fail if the encoding were done on one record at a time, even if they got everything else exactly right.

Note that the XOR function I created only works on one byte at a time.  I could have made it work on word or dword references, much as I did with the Encode and Decode functions, but then I would have to deal with two functions, one to go left to right, and the other from right to left, and I also would have had to go to extra lengths to determine the current shift state of the XOR reference value as I attempted to undo the encoding done earlier.

If every customer has a different reference value unique to their copy of the program, then they cannot read each other's records.  That is another power of using the XOR method, because it is very key dependent.  You have to have the right version of the program, you have to be able to use the program (remember the userid and password specification earlier?), and only then is the corresponding data extractable from the encoded data files.
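
To show the key dependence in a form that is easy to try out, here is a small Python model of the byte-at-a-time XOR with a rotating reference value.  The key parameter and the second key value 99999 are assumptions of mine for illustration; the 21385 is the example key from the listing.

```python
def rol32(value, count=1):
    """Rotate a 32-bit value left by count bits."""
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

def xorcode(data, key=21385):
    """XOR each byte with the low byte of the key, rotating the key after each byte."""
    out = bytearray(data)
    k = key & 0xFFFFFFFF
    for i in range(len(out)):
        out[i] ^= k & 0xFF
        k = rol32(k)
    return bytes(out)

record = b"Taylor, John Richard"
masked = xorcode(record, 21385)
print(xorcode(masked, 21385) == record)   # same key restores the record
print(xorcode(masked, 99999) == record)   # a different customer's key does not
```

The same function serves for both directions, since the key starts from the same value and goes through the same rotations each time it is called.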

Donald Darden

#5
Since we are talking about customer data, we might want to turn 90 degrees at this point and think about the manner in which customer records are built up.  I've mentioned databases several times, but only portions of the customer data may actually work well in typical data structures.

Every list represents a type of data structure, usually with an x-y or row-column referencing system.  A phone list, or mail list, or book list, or list of expenses are all examples.  These are familiar concepts and translate well into arrays with one or two indexes.  In Excel or another spreadsheet, you frequently break a record up into a number of cells, often separating specific fields from each other.  The alignment is commonly on vertical columns, and we have headers to identify the contents of the individual cells below that point.  You might have header columns named Last Name, First Name, MI, Phone No, Address, City, ST, ZIP, DOB, and so on.  It's a pretty good way to consolidate similar information and categorize it.

But suppose you wanted to expand your basic list to handle customer accounts.  You might want to add business address, shipping address, credit lender, credit card number, expiration code, account number, items bought, items paid for, items shipped, when shipped, returns, issued RMAs, account balance, and so on.

The trouble in some of these cases is that the data is neither static nor necessarily on a one-for-one basis.  The customer may have a long running account, and may have moved a number of times, specified different shipping addresses for different orders, used different credit cards on different purchases, had multiple transactions, and each transaction involves a number of items, a quantity of items, weight and shipping costs, different carriers or shipping methods, and on and on.

Giving up on the spreadsheet approach, you may decide to try a TYPE structure and use a whole range of field elements, trying to work out the maximum size of each field, and all the possible fields that the client would ever use for his customer accounts.  Now there are several things wrong with this approach.  First, you are asking the client for the very last word on how many fields, the absolute maximum on the length of every field, and other qualifiers that the client can't possibly know.  Even if you explore all his existing records, the maximums you come up with would still prove to be inadequate at some point in the future.  And his business might change, which would necessitate a lot of redesign work, code revision, and database rework.

More likely, you may decide that one type structure would be inadequate to represent everything.  So you dream up multiple types that all bear at least part of the information.  You may decide that you will have type structures for the individual, for the account, for each order, for each transaction, another for each shipment, and so on.  You hope to use some common linking mechanism, possibly based on the account number, to pull everything together.

Alright, so let's say you somehow succeed in outlining an approach that looks like it should work and do what you want it to do.  But then the client says he wants to be able to look at a customer's whole history of purchases and prior transactions that goes back years.  You realize that your model has only taken into account the current and last known information - the data in a flat model would be overwritten by updates and changes, even by new orders.

In looking about for a better model, you might realize that a log file or type of diary forms a written record that reflects changes made and when, and can even indicate who was responsible.  Let's just call this a journal, although the same name is used with a slightly different purpose in some other applications.  You could also call it a ledger if you prefer.

The idea of a journal or ledger is that once written, it is supposed to be inalterable.  (I use inalterable rather than unalterable to indicate that this is an initiative-based effort rather than an actual physical restraint.)  You merely note any changes through additional entries.  Thus you can begin at the beginning and work to the present, or from the present and work back to the beginning, or review the state of the account at any point.  You can also review any transactions that had happened up to that point as well.
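
The notion of replaying a journal to see an account as of any point can be sketched in a few lines of Python.  The entry layout here (when, who, changes) and the function names are only assumptions for the sake of the example, not a proposed file format.

```python
def append_entry(journal, when, who, changes):
    """Add an entry to the end of the journal; earlier entries are never touched."""
    journal.append({"when": when, "who": who, "changes": dict(changes)})

def state_at(journal, count):
    """Replay the first count entries and return the account fields as of that point."""
    state = {}
    for entry in journal[:count]:
        state.update(entry["changes"])
    return state

journal = []
append_entry(journal, "2007-08-01", "clerk1", {"name": "Taylor, John Richard", "city": "Middlesex"})
append_entry(journal, "2007-08-15", "clerk2", {"city": "Santa Fe"})

print(state_at(journal, 1)["city"])   # the city as first recorded
print(state_at(journal, 2)["city"])   # the city after the later change
```

Nothing is overwritten.  The old address is still in the first entry, and each entry shows when the change was made and by whom.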

The question is then, if you finally elect to rely primarily on a journal or ledger approach, how do you make this work?

Think of this as though it were a movie film.  The movie film resides in a series of containers, each representing about 15 minutes of playing time when loaded into a projector.  But you don't see the whole 15 minutes at once, you have to wait until a portion of the film is framed in front of a light and focused through a lens in order to be recognized.  If the film has to be stopped, you have a marker to where you are in watching the film, and you have the choice to continue from there, or begin again at the beginning.  With some projectors, you can even play it backwards if you choose.

Let's begin with the idea of creating a single frame that will become a part of that long film.  The frame would be representative of any form data that we have collected, which might be typified by the use of some type structures.  We might actually begin our process by describing the forms or the type structures we are going to use, just in case they change later.  Or we might arbitrarily just begin with linked data fields that are simply strings with associative names.  And it could be some combination as well.

Here is an example:  We imagine the user wants to begin with a new account, so we provide that as a menu option.  If that option is picked, we want a minimum set of questions answered and verified so that we can assign a new account number and begin tracking whatever might follow.  So the user indicates that all the data fields are complete, and we want to first validate, or have validated, the information provided.  Then we automatically assign a unique account number.  This will be our first journal entry for this account.  But rather than write the form data to the file alone, we are going to write the field names followed by the data.  And we will encode it to protect the contents as well, using a technique similar to those already discussed.
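
Writing the field names along with the data might look something like the following Python sketch.  The name=value-per-line layout and the names pack_fields and unpack_fields are my own assumptions; any layout that keeps each name attached to its value would serve, and the result would then be passed through whatever encoding step you have settled on.

```python
def pack_fields(fields):
    """Write each field as name=value on its own line and return the bytes."""
    lines = []
    for name, value in fields.items():
        if "=" in name or "\n" in name or "\n" in value:
            raise ValueError("field cannot be represented in this simple layout")
        lines.append(name + "=" + value)
    return "\n".join(lines).encode("utf-8")

def unpack_fields(blob):
    """Recover the field names and values written by pack_fields()."""
    fields = {}
    for line in blob.decode("utf-8").split("\n"):
        if line:
            name, value = line.split("=", 1)
            fields[name] = value
    return fields

entry = {"account": "100234", "last name": "Taylor", "first name": "John"}
blob = pack_fields(entry)
print(unpack_fields(blob) == entry)
```

Since every entry carries its own field names, an entry written next year with a field nobody has thought of yet can still be read back by the same routine.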

Now we have many choices at this point:  We could add it to a general ledger, or we could begin a new journal just for this account.  We could even do both, as a means of trying to keep our data safe.  If we do this in a general ledger, we need to mark the point where this new account begins so that we can return to it later.  We can do that as part of the initial account information, then write the whole thing into the new account journal file to start it off.

When the customer adds another credit card, places an order, gives a different shipping address, or whatever, our processes have to allow for this.  In order to manage some of it, we have to let the customer see his account data and make changes.  When the customer confirms the changes, we can write either the individual changes or the whole account record out to the journalling processes again.  But we probably want to keep a copy of the account record with the current information somewhere handy, which probably would be a record of accounts file.

The shopping cart is the way we relate to new purchases, and what we do is essentially a checkout and bagger operation, with the transactions going into the journal files.  In our account journal, we can merely append entries to what already exists, since it is all about that one account.  But in our general ledger, we have to recognize that we have a threading situation, since everything else is going in there as well, for all accounts and all other activities.  Now a thread is really more like a linked chain in this situation.  You have a record that has to be embedded among other records that are not related, and you have to point to the last related record, and allow a place for a pointer to the next related record.  That is two fields that have to be made part of the journal entry.  Each journal entry also has to have a length field included.  Now if we were not encoding the data, then we could probably get by with ASCIIZ strings which are null-byte terminated, but our encoded data can accidentally turn portions of our record into a null byte or any other character or character combination, so it is best to have the length clearly marked for the purpose of getting the exact number of characters back.  This is the form our general ledger record might have:

[prior rec ptr][next rec ptr][rec len][####################]

The first three square bracket sets represent the prior record pointer or offset in the file, the next record pointer or offset, and the length.  The actual record entry, encoded, is represented by the square brackets with pound signs between.  For the individual account file, you just need the [prior rec ptr][rec len] and the record, or the [prior rec ptr][next rec ptr] and the record.  That is because without intervening records from other sources in that file, one of the fields can be deduced.  On the other hand, if you retain the three sets of brackets, it gives you another form of sanity check to the contents of your files.
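
Here is a rough Python sketch of that record layout, using an in-memory bytearray in place of the ledger file.  The 32-bit field widths, the NONE marker for "no such record", and the function names are all assumptions on my part; the point is only to show the chain of prior and next pointers threading one account's records through a ledger shared with others.

```python
import struct

HEADER = struct.Struct("<III")   # prior rec ptr, next rec ptr, rec len
NONE = 0xFFFFFFFF                # marker for "no such record"; offset 0 is a valid record

def append_record(ledger, prior, payload):
    """Append a record, link it to the prior related record, return its offset."""
    offset = len(ledger)
    ledger += HEADER.pack(prior, NONE, len(payload)) + payload
    if prior != NONE:
        # fill in the "next" slot that was left open in the prior record
        struct.pack_into("<I", ledger, prior + 4, offset)
    return offset

def walk_back(ledger, offset):
    """Yield the payloads of one chain, from the newest record to the oldest."""
    while offset != NONE:
        prior, _next, length = HEADER.unpack_from(ledger, offset)
        start = offset + HEADER.size
        yield bytes(ledger[start:start + length])
        offset = prior

ledger = bytearray()
a1 = append_record(ledger, NONE, b"account 1001 opened")
b1 = append_record(ledger, NONE, b"account 1002 opened")
a2 = append_record(ledger, a1, b"1001\x00order")   # payload with an embedded null byte
print(list(walk_back(ledger, a2)))
```

The explicit length is what lets the third record come back intact even though it contains a null byte, which would have cut an ASCIIZ string short.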

You probably have to deal with other journals as well.  For instance, each order has to deal with existing inventory, orders to the packing and shipping department, notification to the shipping company, the printing of shipping material, and of course charges or payments.

The interesting thing about the journalling approach is that you can invent new pieces as you go along, and you don't have to be concerned about field sizes or anticipating everything.  Suppose a customer wants to add a second credit card to his account, or make payments via PayPal.  You figure out how you want to update the account info and use it, but it does not nullify or alter any accounts that do not have that requirement, and the only effect is that you now might have additional field names like credit#2, that show up in the journal process.  When they come up in the future, your program can reflect them as appropriate.

Donald Darden

Over the years, I've read many posts from aspiring programmers that wanted to learn how to transition to programming full time.  Back when I got started, being able to program was considered an art, and there was little competition.  You could pretty much call yourself that, and the work came looking for you.  In fact, it was common to see a poster with a chimp's picture with the words: "Two weeks ago I couldn't spell Programmer, now I are one".

This is not true any more.  Lots of people program now, even if it is just managing a number of applications with script files and processing some simple information on a computer.  So the degree of how much programming is involved, and what specific knowledge is required, become the real issues.  And there has been a lot of specialization involved.

The discussion here has been with some of those people in mind, the ones that have asked what it takes to become a programmer.  A lot of people decide that it really requires a continuous and ever deeper study of the art of programming itself.  It is sort of like an aspiring painter that enrolls in one art course after another, always striving to learn new methods, techniques, and ways of achieving effects.  There is no doubt that there is value in doing this, but where is the transition from being a mere student of the art, to becoming an artist?

Painters have to know what to paint.  Programmers have to know what to program.  Painters can look around for possible subjects, and be creative, hoping that others will appreciate their finished work.  Others paint on commission, where someone else decides what needs to be painted.  Programmers may face similar choices.  But people don't buy programs for their beauty as static objects, they buy them for their functionality or entertainment value.

If you have followed the earlier discussion, it's probably occurred to you that in an effort to write a business level application, that it would probably help to know something about the business itself, or something about how businesses operate.  If you aren't into business management, you might have found some of the discussion more into things you weren't really aware of, or had failed to consider on your own.  If you have a background in business, you might have been amused by the many considerations that were not even considered, such as taxes, commissions, coupons, cash flow, the various roles that agents have (sales, advertising and promotions, order processing, customer service, tech support, warehouse, quality assurance, buyers, returns, and others).

You might get the idea that maybe you would need to know quite a bit about the needs of the client in order to write a program that would integrate into his operations effectively.  That would be a good thought.  It suggests that perhaps the role of the programmer is not really about programming as much as it is putting the computer to work to benefit the client.  If you are going to be a one person business, you have to be able to meet the client on his own grounds, and the more you know about what he has to deal with, the more you have to offer when discussing your role and function.

You can also consider the team approach, where you join with people that have complementary knowledge and skills.  For instance, if you look at the needs of small businesses that want to grow, and you think there is a future for you there, then either you or someone in your team should have the expertise to offer to help make this happen.  And you may find that it is less about writing new code than finding existing code and processes that would fit right in and work for that client.  It could easily turn out that as the team programmer, you have little to contribute on your own, except for your knowledge of what is available and best suited in each instance.

Like a doctor who once had visions of being a successful surgeon, you might find your life relegated to listening to people cough, voice their complaints, looking at test results, and prescribing medication.  Your future in programming may not be what you envision it to be, because just as the doctor found, you may be outclassed in your preferred area, and forced to serve the client's needs rather than sticking to your original goals.

Donald Darden

Now let's look at some of the many, and often very good, reasons for avoiding the use of any type of data protection, and possibly for not trying to protect your program as well.

The first is the element of trust.   Do you like dealing with excessively suspicious people?  It tends to be offensive, doesn't it?  When you work for a client, you expect him to trust you and your work.  So if you write code that locks him into a dependency on your software to access it, it heightens the need for trust to an extreme degree.  You may have to work hard at building that trust factor in your relationship with your client by being frank and direct, and avoiding any signs of withholding critical information from him.

Another factor is temptation.  There is no doubt that keeping secrets is a form of power, because it gives you a way to avoid oversight and supervision.  Even if you are as honest as the day is long, the client or someone else may question just how sure they can be about whatever it is you are masking with your code.  You may have to deal with accusations and suspicion, which can really damage your relationship with your client.

A third factor is the lack of standards.  There is no question that at least a part of your code is noncompliant with established standards.  Now that is not in itself a bad thing, because standards are meant for information exchange and the use of proven, common elements, but this can also make you be seen as a renegade or someone who is going against the norm.

A fourth factor, as strange as it may seem, could be a legal one.  It may in fact be against some laws or regulations for you to render data into a form that cannot be easily read by the government.  This is a very murky area at best.  There is always a struggle between what the government wants to know, how much it can legally know, and how much right to privacy you or your company is entitled to.  Just because you feel it's your business, the government may question your need to keep such secrets.  You already see where the government is doing its best to secure the right to access records and to be given a back door into methods of public encryption, and it is believed to be watching all manner of communications for threats against the country, the government, or members of the government.

A fifth factor is the matter of audits.  Audits are performed by third parties that validate existing records and transactions.  Audits can be internal or external, and on the behalf of the owners or by another party, such as the IRS or other government agency.  It could be ordered by a bankruptcy court, or even requested by some owners to ensure that management has been doing a good job.  It could be done by management to ensure that employees have been honest and doing their jobs as intended, or to explain unexpected losses.

There seems to be no doubt that any efforts to conceal data from being accessed by hackers and unscrupulous employees will run counter to the needs of others who feel entitled to access it for independent verification.  Now you could position yourself that you will help any legitimate claimant to that data to access it, even provide basic tools for the purpose that are not generally available, but whenever someone else is forced to change the way they do things to the way you allow, it creates friction, anger, and often distrust.

The code given above shows how simple the act of encoding and decoding information really is, and there are so many methods available that it may seem strange that it is not done more frequently.  But most business and computing matters benefit from a high degree of openness and cooperation.  It just seems to go counter to those involved with either to hide so much.  However, if you really want to protect your data, if it is that important, you may have to think about doing the unthinkable.  And one way to help keep the negative side from getting out of control is to keep it to yourself.

There are probably some fairly happy compromises that can be worked out as well.  For instance, you might be able to arrange to encode records within a standards-based data structure.  People could access the data structure, and the decoding just has to be done internally to the program using the database.  The decode function could be made a DLL that the auditor could call as part of their auditing methods.  It's just something that needs to be thought out.

Another consideration:  Much of a personal record is written in a way that you can tell if it has been tampered with.  For instance the name Taylor, John Richard is self verifying simply because we can read it.  If you saw instead, drahciR nhoJ ,rolyaT, you might correctly reason that this is not the original content, and even work out the changes that took place.  But digits are different.  If you saw a number 888-555-1234, or its reverse 432-155-5888, or a rotated variant 488-855-5123, then it's difficult for you or the computer to determine if the number is in fact valid, unless you can determine constraints on that particular type of data.  For instance, if you can deduce that this is probably a telephone number field, you could try to match up the area code with those associated with that address.  But it might be just enough of a change to prevent 99.9% of the hackers out there from being able to extract and use the information in that file.

Suppose you had a complete record for this individual:

Taylor, John Richard 0123-45-6789 888-555-1234 2212 Eastside Road, Middlesex, NM, 12345

If you just took all the digits provided in this record and rotated them one place to the right in place of the next digit, then this record would become:

Taylor, John Richard 5012-34-5678 988-855-5123 4221 Eastside Road, Middlesex, NM, 21234

The record still seems to pass self-validation, and most people on sight would believe that it is correct as it stands.  But this is another case of hiding in plain sight.  It is very easy to set the contents right, but first you would have to guess the rule of change that was used.  And there are many possible rules that can be used, and the validity of each attempt would require substantial work, making it an unattractive prospect.  And we haven't even discussed digit manipulation yet, such as subtracting each digit from 9 and using the result, or adding some offset value, such as beginning with 1 and incrementing up with each digit, and retaining just the last digit as a replacement digit.
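
The digit rotation used on the Taylor record, and the subtract-from-9 variation, can be tried out with a short Python sketch.  The function names rotate_digits and nines_complement are mine, and only the ASCII digits 0 through 9 are treated as digits.

```python
DIGITS = "0123456789"

def rotate_digits(text, places=1):
    """Rotate all digits in text to the right by places, leaving other characters alone."""
    digits = [c for c in text if c in DIGITS]
    if not digits:
        return text
    places %= len(digits)
    rotated = digits[-places:] + digits[:-places]
    source = iter(rotated)
    return "".join(next(source) if c in DIGITS else c for c in text)

def nines_complement(text):
    """Replace each digit d with 9 - d; applying it twice restores the original."""
    return "".join(str(9 - int(c)) if c in DIGITS else c for c in text)

plain = "Taylor, John Richard 0123-45-6789 888-555-1234 2212 Eastside Road, Middlesex, NM, 12345"
hidden = rotate_digits(plain)
print(hidden)
print(rotate_digits(hidden, -1) == plain)   # rotating back one place undoes it
```

Rotating by a negative count undoes the change, so the same routine handles both directions.  The two rules can also be applied one after the other, which leaves an outsider with that much more to guess.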

Take those registration codes that you sometimes get when someone sells you a program that you can download and install on your computer.  The current trend is to get or give you a registration name, such as John R. Taylor.  Then they use that exact name to generate a registration key, such as 12804-4BCVD-5KN06-LP5JR.  Don't let the digits and letters fool you.  Many are the result of adding some offset value or indexing into a string of replacement values, such as using the string "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" or some variant as a way of obscuring the original source.  They can't keep you from copying the program to another computer, but they can always trace it back to its source and the original transaction because it has to be used with a valid registration name and matching registration key, so John Taylor is now responsible for where it ends up.  And if John Taylor becomes recognized as the source of abuse, his account can be flagged so that he cannot get any upgrades or added support.

The critical ingredient here is that while two pieces appear unrelated, there is actually a specific relationship between the two that only the program provider knows.  And the program provider is unlikely to make that secret known, which helps reserve his entitlement to market that program openly and receive some return for it.  People who create methods for generating registration keys have another advantage:  Their methods do not have to have a complementary method to support them.  That is because all they have to do is take the original source and generate the registration key again, then verify that the two registration keys are exactly the same.
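
As an illustration of a key scheme that needs no complementary method, here is a Python sketch that derives a key from the registration name and checks it by generating it again.  I am using HMAC-SHA256 with a vendor secret as the hidden relationship, which is my own choice for the example and not a claim about how any particular vendor does it; the replacement string is the one quoted above, and the names make_key and is_valid are made up.

```python
import hashlib
import hmac

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def make_key(name, secret):
    """Derive a 4x5 character registration key from the name and a vendor secret."""
    digest = hmac.new(secret, name.encode("utf-8"), hashlib.sha256).digest()
    # index into the replacement string with the first 20 bytes of the digest
    chars = [ALPHABET[b % len(ALPHABET)] for b in digest[:20]]
    return "-".join("".join(chars[i:i + 5]) for i in range(0, 20, 5))

def is_valid(name, key, secret):
    """Regenerate the key from the name and compare it with the one supplied."""
    expected = make_key(name, secret)
    return hmac.compare_digest(expected.encode("utf-8"), key.encode("utf-8"))

secret = b"known only to the program provider"
key = make_key("John R. Taylor", secret)
print(is_valid("John R. Taylor", key, secret))
print(is_valid("John Taylor", key, secret))
```

There is no function here that turns a key back into a name, and none is needed.  The name has to match exactly, down to the middle initial, for the regenerated key to agree.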

Donald Darden

From the prior discussion, it should be recognized that steps to protect the client's data require the client's understanding, cooperation, and agreement.  The pros and possible cons need to be spelled out in some detail, and some provision made for audits, if nothing else.

But when it comes to protecting your own programs, that is a matter that you have to work out for yourself.  When you write programs, your exact relationship with the client becomes a factor in deciding who actually owns the program and the source code.  If it's your source code, then you have to protect it.

It might make sense to keep a copy of your source code on the client's site.  After all, if the client needs a fix or support, having the code there along with the compiler tools means having the tools in place for hands-on or possibly remote support.  But all that is then accessible to anyone else having access to the client's premises.  You might also have fewer concerns about keeping the exact version for that client intact and available, and having adequate backups and distribution of those backups.

But even if you do not keep the source code on site, just the mere presence of your executable and support files exposes your program to being hacked.  Making your program only work in the presence of the right userid and password is a nice concept, but when someone examines your program intently, they can generally determine how this is actually implemented.  For instance, since your program probably calls on the system via API calls, they can look in your program for where that happens.  They can intercept messages in the message loop as well, and use keystroke loggers and other spyware to uncover what the user is doing that enables the program.

There are generally four points of attack on your program:  First, most people might attempt to examine it as it is stored on the hard drive or other media.  This is the way most novices would tackle the problem.  Someone who is more advanced in the art, or who is using available hacker tools, may attempt to examine it in memory after it gets loaded, on the premise that all external encryption methods have since been torn away.  And the third approach is to look at the nature of your program, how it has to fit the requirements of the operating system in order to be executed there, and look for ways to unmask what it does through the system calls.  The fourth approach, already mentioned, is an attempt to intercept or monitor what the user is doing in order to activate and use the program.

There is no way to prevent these types of assaults on your program, except to recognize that they are not part of a legitimate business model.  In other words, it does not serve the client to let the program be stolen and used elsewhere, and your continued success is needed by the client to help him stay in business because he might need your services again later.  But someone who might break this rule would be a disgruntled employee or unscrupulous individual who breaks bonds of trust for some monetary or other gain.

Efforts to defeat an abuser have varied.  For instance, you can make your program call home automatically and secure permission to run.  If the program is stolen or being run from a different computer than originally installed on, that permission can be denied.  The client then has to negotiate to get the program reinstated to your good graces.  Another technique is to link your program to some external device, generally called a dongle.  This unique device has to be present in order for your program to run, and limits your program to the computer where the dongle resides.  Some people use the computer's MAC address as a form of modern dongle, since each one is supposed to be unique.

But if a hacker can analyze your code to the point of finding the decision point where a branch instruction is used to control program access, they can alter the instruction so that the program will continue to run, regardless.  To try and prevent this type of attack, some programmers provide any number of branches, each one possibly validating another qualifier for the program to run, and the hacker may spend a lot of time trying to search out each one and alter it, yet still finding there must be more because the program still won't run.

I encountered one scheme where the programmer adopted his own memory management scheme so that he could obscure how that memory was used and what parts represented data and what parts represented code.  I've seen where programmers store information in reverse order on a hard drive, and where, in the older X86 architecture, they used odd combinations of segment and offset addressing, which made it harder to determine where references were located.

Most concerns about protecting one's programs involve techniques, but in some cases, people want to punish any abuse by wiping out hard drives, causing the data to become corrupted, causing the system to crash, or something of that nature.  My advice is, don't even consider it.  Aside from questions of what criminal acts you might be charged with or liabilities you might incur, it will tarnish your reputation and ruin your prospects.  No company would willingly do business with anyone that walked around with a loaded gun in their hand, ready to shoot, and they would see your program as representing a real threat to their business.