improve script/style extract

Describe the problem

Svelte uses regexes to extract information about script and style nodes in .svelte files.

Recent changes increased the complexity of these regexes to improve their accuracy and to allow angle brackets in attributes, e.g. generics="<...>"

svelte/packages/svelte/src/compiler/preprocess/index.js

Line 257 in 9926347

    
           /<!--[^]*?-->|<style((?:\s+[^=>'"/]+=(?:"[^"]*"|'[^']*'|[^>\s]+)|\s+[^=>'"/]+)*\s*)(?:\/>|>([\S\s]*?)<\/style>)/g;

These regexes for script and style tags also extract the content to be passed to svelte.preprocess.
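To make the extraction concrete, here is a minimal sketch of how the style regex quoted above yields attributes and content through its two capture groups. The names `style_regex`, `source` and `style_blocks` are mine and the component source is made up. The comment alternative appears to be there so that tags inside HTML comments are consumed and skipped. This is not how svelte.preprocess wires things up internally.

```javascript
// Regex copied verbatim from the snippet quoted above.
const style_regex = /<!--[^]*?-->|<style((?:\s+[^=>'"/]+=(?:"[^"]*"|'[^']*'|[^>\s]+)|\s+[^=>'"/]+)*\s*)(?:\/>|>([\S\s]*?)<\/style>)/g;

// Hypothetical component source, for illustration only.
const source = [
  '<script>let a = 1;</script>',
  '<!-- <style>ignored</style> -->',
  '<p>hi</p>',
  '<style lang="scss">p { color: red; }</style>'
].join('\n');

// Group 1 holds the raw attribute string, group 2 the tag content.
// Matches of the comment alternative are dropped here.
const style_blocks = [...source.matchAll(style_regex)]
  .filter((match) => !match[0].startsWith('<!--'))
  .map((match) => ({ attributes: match[1], content: match[2] }));

console.log(style_blocks);
```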

Due to their complexity, their performance degrades when processing larger files, and they are hard to read and maintain (case in point: #9551).

They are also replicated in other svelte tooling that also needs access to this information without running a full svelte.parse.

Describe the proposed solution

I'd like to suggest a few improvements.

extract these into a utility, either exported from svelte or published as a new utility library
add more tests that check all possible combinations. E.g. the current implementation does not allow whitespace other than a space after <script, or any whitespace preceding the closing bracket, as in </script >
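As an example of the kind of case such tests could pin down, the sketch below checks only the style regex quoted above against a closing tag with whitespace before its bracket. The helper name `matches_style` is hypothetical, and the script regex is not shown in this issue, so it is not exercised here.

```javascript
// Regex copied verbatim from the snippet quoted above.
const style_regex = /<!--[^]*?-->|<style((?:\s+[^=>'"/]+=(?:"[^"]*"|'[^']*'|[^>\s]+)|\s+[^=>'"/]+)*\s*)(?:\/>|>([\S\s]*?)<\/style>)/g;

// Hypothetical helper: true if the regex finds anything in the input.
// String.prototype.match is used so the global regex keeps no state between calls.
function matches_style(input) {
  return input.match(style_regex) !== null;
}

const cases = [
  '<style>p { color: red; }</style>',
  '<style>p { color: red; }</style >'
];

for (const input of cases) {
  console.log(JSON.stringify(input), matches_style(input));
}
```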

Fixing those would make the regex slower and more complex again, but I think we can set reasonable limits on what a svelte component is allowed to use for declaring script and style blocks. That opens up an entirely different avenue:

replace the regex with a string-walking extractor that explicitly seeks only top-level script blocks at the beginning of a .svelte file and style blocks at the end. This allows us to skip the entire template block in the middle.

The resulting extraction code would be a bit larger than a regex, but much safer and with better performance characteristics (speed and memory). A super simple start can be found here: https://jsben.ch/ODOHu

For backwards compatibility, we could offer a fallback to either a full parse or the previous regex approach, but most projects using prettier-plugin-svelte with script-template-style ordering should have no issue with it.
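A rough sketch of the string-walking idea, independent of the linked benchmark, could look like the following. `extract_block` and `extract` are hypothetical names, not existing Svelte exports. The sketch assumes the first `<script` found from the start and the last `<style` found from the end are the top-level blocks. It does not handle self-closing tags, a second (module) script, comments, or tag names that merely start with `script` or `style`.

```javascript
// Hypothetical helper, not part of Svelte: locate one block by walking the string.
function extract_block(source, tag, from_end = false) {
  const open = '<' + tag;
  const close = '</' + tag + '>';
  // Scripts are searched from the start, styles from the end,
  // so the template in the middle is never scanned character by character.
  const start = from_end ? source.lastIndexOf(open) : source.indexOf(open);
  if (start === -1) return null;

  // Walk to the end of the opening tag, skipping quoted attribute values
  // so that something like generics="<T>" does not end the tag early.
  let i = start + open.length;
  let quote = null;
  while (i < source.length) {
    const ch = source[i];
    if (quote) {
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '>') {
      break;
    }
    i++;
  }
  if (i >= source.length) return null;

  const content_start = i + 1;
  const content_end = source.indexOf(close, content_start);
  if (content_end === -1) return null;

  return {
    attributes: source.slice(start + open.length, i).trim(),
    content: source.slice(content_start, content_end),
    start,
    end: content_end + close.length
  };
}

// Hypothetical entry point: script from the top, style from the bottom.
function extract(source) {
  return {
    script: extract_block(source, 'script'),
    style: extract_block(source, 'style', true)
  };
}

const component = [
  '<script lang="ts" generics="T extends Array<string>">let x: T;</script>',
  '<p>{x}</p>',
  '<style>p { color: red; }</style>'
].join('\n');

console.log(extract(component));
```

Restricting where blocks may appear is what keeps this simple: anything the walker cannot find under these assumptions would go to the fallback path.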

Alternatives considered

rethink parsing entirely and make this extract no longer needed
leave as is

Importance

would make my life easier