jehna/humanify

Expand reserved Keywords list for renaming variables or functions

Opened this issue · 7 comments

Hello, I found this repository and the capabilities from using it is miles better than Obfuscate io and other similar de-obfuscation websites.

However, there are many more reserved keywords that are not addressed in your humanify/src/openai/is-reserved-word.ts file. I found that the W3 schools website gave a more comprehensive list than what is found on the MDN website.(which looks to be where the reserved list you made seems to have been generated from)

https://www.w3schools.com/js/js_reserved.asp & https://github.com/AnanthaRajuC/Reserved-Key-Words-list-of-various-programming-languages/blob/master/language-files/JavaScript%20Reserved%20Words.md & https://www.geeksforgeeks.org/javascript-reserved-words/

Most notably the keyword crypto was generated by your tool and will NOT work as it is a reserved HTML keyword.

One possibility of preventing this problem from occurring is by making sure all generated keywords have the _ character appended to the leftmost side of the keyword. similar to how C programmers dealt with changes without breaking older code. https://stackoverflow.com/questions/56211904/why-does-c-have-keywords-starting-with-underscore
I do, also, notice that obfuscation tools that rename vars to hex keywords will also have the underscore character appended to the leftmost side. likely to make sure all generated keywords are not accidentally used elsewhere.

Instead of how C uses the underscore to declare new reserved keywords, we will be taking advantage that NONE of the reserved keywords have the underscore character. Thus future proofing it for the foreseeable future.

For now, adding the keywords in the W3 schools lists would work just fine. However, there are also many more other keywords that could be used that we may not know about and appending the underscore to all generated keywords will prove to be a more robust method.

Also, I wish to make a note that the GPT will produce varying names. Seems like the first one I was given produced subpar results but subsequent retries made better names. This was tested a few days apart. I wonder if the GPT changes based on input given or something because this new iteration of names for this specific JS file is just better.

I also made an attempt to rewrite some of the code to use GPT-4o, and the new openAI package version. I then compared the returned output from your version of Humanify and my updated version of Humanify. Strangely enough your older package produced better names than the GPT-4o version. https://github.com/Acters/humanify-ng/tree/GPT4o

Most notably the keyword crypto was generated by your tool and will NOT work as it is a reserved HTML keyword.

Hmm. This is an interesting view; e.g. crypto is not a reserved Javascript keyword per se, so it's completely valid to do:

const crypto = 'foo'

But it is a global variable in browsers, which will cause shadowing to occur, which can break any code that is depending on the global variable.

Very good that you brought this up, I've completely missed this. I'll fix this in the upcoming versions.

I think I'd need to grab a list of all global browser variables to fix this reliably. Didn't find anything sufficient with a quick googling, so I'm thinking if I would add Playwright so I could do Object.keys(window) in an actual browser window and get a list of global variables

I suggest using https://www.npmjs.com/package/globals and adding an option to choose node or browser

Maybe we can think of adding extra checks if the renamed variables clash with one another? I haven't seen evidence of this happening but it could be an unknown corner case.

Maybe we can think of adding extra checks if the renamed variables clash with one another?

See also:

I think I'd need to grab a list of all global browser variables to fix this reliably. Didn't find anything sufficient with a quick googling, so I'm thinking if I would add Playwright so I could do Object.keys(window) in an actual browser window and get a list of global variables

@jehna I think this might come handy to get a list of all reserved browser keywords.

// Utility to get all property names from an object, including inherited ones
function getAllPropertyNames(obj) {
  let props = new Set();

  while (obj) {
    Object.getOwnPropertyNames(obj).forEach(prop => props.add(prop));
    obj = Object.getPrototypeOf(obj);
  }

  return [...props];
}
const documentProperties = getAllPropertyNames(document);
console.log("All document properties (including inherited):", documentProperties);

const windowProperties = getAllPropertyNames(window);
console.log("All window properties (including inherited):", windowProperties);

However, there are many more reserved keywords that are not addressed in your humanify/src/openai/is-reserved-word.ts file. I found that the W3 schools website gave a more comprehensive list than what is found on the MDN website.(which looks to be where the reserved list you made seems to have been generated from)

w3schools.com/js/js_reserved.asp & AnanthaRajuC/Reserved-Key-Words-list-of-various-programming-languages@master/language-files/JavaScript%20Reserved%20Words.md & geeksforgeeks.org/javascript-reserved-words

One possibility of preventing this problem from occurring is by making sure all generated keywords have the _ character appended to the leftmost side of the keyword. similar to how C programmers dealt with changes without breaking older code

While debugging solutions to #117, @j4k0xb mentioned @babel/type's toIdentifier / isValidIdentifier:

The proper fix would be to use toIdentifier from @babel/types

@j4k0xb @jehna From a 2sec search I couldn't find rendered docs, but here's the relevant source for each:

toIdentifier definitely seems like a more robust approach than the current 'prefix with _' approach for sure.

Though I wonder if a 'proper fix' should also involve tweaking how we prompt for/filter the suggestions coming back from the LLM itself as well. Like instead of just forcing an invalid suggestion to be valid (with toIdentifier), we could detect that it's invalid (with isValidIdentifier) and then provide that feedback to the LLM, asking it to give a new suggestion; probably with some max retry limit; after which we could fall back to using the invalid suggestion run through toIdentifier, or log a warning and leave it un-renamed or similar.

Originally posted by @0xdevalias in #117 (comment)

toIdentifier calls isValidIdentifier, which calls isKeyword / isStrictReservedWord, which uses the list of reserved words here:

I also note that humanify/src/openai/is-reserved-word.ts doesn't exist anymore in the new v2 version of humanify; so it seems it was only related to v1:

Though it seemed to contain at least a few words that weren't included in Babel's list, from a quick manual diff:

abstract
boolean
byte
char
double
final
float
goto
int
long
native
short
synchronized
throws
transient
volatile

I'm not sure how important those are, and why they were on the original reserved list but not in Babel's.


Also, I wish to make a note that the GPT will produce varying names. Seems like the first one I was given produced subpar results but subsequent retries made better names. This was tested a few days apart. I wonder if the GPT changes based on input given or something because this new iteration of names for this specific JS file is just better.

See also:


This is an interesting view; e.g. crypto is not a reserved Javascript keyword per se, so it's completely valid to do:

const crypto = 'foo'

But it is a global variable in browsers, which will cause shadowing to occur, which can break any code that is depending on the global variable.

Very good that you brought this up, I've completely missed this. I'll fix this in the upcoming versions.

I suggest using npmjs.com/package/globals and adding an option to choose node or browser

If this is to be included, I think that it should be a configurable option; so that we can pick not just between node/browser, but also choose to potentially disable it entirely.