trapd00r/LS_COLORS

Integrate from fileinfo.com list of file extensions

rpdelaney opened this issue · 24 comments

Should be some kind of metadata, but maybe not in the existing metadata group

Edit: This was originally about .vcf and .vcard extensions but it sprawled into something much more ambitious.

I've added a few filetypes in my personal branch. About 2500 of them, so I didn't want to destroy the work you've done with sorting files in the master branch, and sorting all of these would be way too tedious. Just a heads up. :)

Showing 1 changed file with 3,038 additions and 651 deletions.

Holy ... what? Where did you get all this?

I scraped data from the Wikipedia article on file extensions. Wanted to send you a PM about it but there's no such feature yet on GitHub it seems. :)

Hahaha. Holy crap. You don't get any performance issues or anything with that? When you type env does your terminal explode?

It might be possible to organize these (relatively) rapidly using a script that prints Wikipedia's description of a filetype and gives you buttons to hit for which category to put it in. But that's still 3000 bloody mouse clicks.
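A keyboard-driven version of that idea might look like the sketch below. It assumes a tab-separated input file of `extension<TAB>description` lines; the function name `sort_extensions` and the two example categories are hypothetical, not anything that exists in this repo.

```shell
# Hypothetical sorter: show each "extension<TAB>description" line from file $1,
# read a one-letter answer from stdin, and print "extension category" on stdout.
sort_extensions() {
    # The data file is read on fd 3 so that stdin stays free for the answers.
    while IFS=$(printf '\t') read -r ext desc <&3; do
        printf '%s: %s  [a]udio [i]mage [s]kip? ' "$ext" "$desc" >&2
        read -r key || break
        case "$key" in
            a) echo "$ext audio" ;;
            i) echo "$ext image" ;;
            # any other key skips the entry
        esac
    done 3< "$1"
}

# Usage (interactive): sort_extensions wiki_fileext.tsv > sorted.txt
```

That is still one keypress per extension, so it only reduces the cost per entry, not the number of entries.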

In #109 I planned to make a CONTRIBUTING.md. Maybe we could stick them in there and invite people to add them to the appropriate category, but then we might as well just hyperlink to the Wikipedia article you got them from. I dunno what's best.

env 0.00s user 0.00s system 51% cpu 0.007 total
I haven't noticed any performance drops whatsoever. :)

Yeah, I don't know the best way to approach this either. If the wiki description for each extension read something like audio/mp3, image/jpeg or whatever, it would be possible to do this programmatically...
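If the descriptions did carry a MIME-style type, the categorization could be a plain lookup on the part before the slash. A minimal sketch, assuming made-up category names and a hypothetical helper `category_for_mime`:

```shell
# Hypothetical helper: map a MIME-style type such as "audio/mp3" to a category.
# The category names are illustrative, not the groups used in LS_COLORS.
category_for_mime() {
    case "${1%%/*}" in          # keep only the part before the first slash
        audio) echo audio ;;
        image) echo image ;;
        video) echo video ;;
        text)  echo document ;;
        *)     echo unsorted ;;
    esac
}

category_for_mime audio/mp3
category_for_mime image/jpeg
```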

However, I've made a somewhat clean dump of the extensions and descriptions if anyone's up for sorting all of these out somehow: https://github.com/trapd00r/LS_COLORS/blob/japh/wiki_fileext.txt

I figured using libmagic would work wonders (file uses it: file foo.*). The issue, though, is that it guesses the filetype based on the first few bytes of a file, so you can't just touch all of these 3k file.extensions since they'll be empty. You'll have to actually create the files in question.
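The empty-file problem is easy to see with two files that share an extension but not contents. A minimal sketch, assuming the `file` command is installed; `detect_type` is a hypothetical helper name:

```shell
# Hypothetical helper: print the MIME type that file(1) detects from the contents.
detect_type() {
    file -b --mime-type "$1"
}

if command -v file >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    : > "$tmp/empty.png"                                # empty file, .png name
    printf '#!/bin/sh\necho hi\n' > "$tmp/script.png"   # shell script, .png name
    detect_type "$tmp/empty.png"
    detect_type "$tmp/script.png"
    rm -rf "$tmp"
fi
```

Both files are named `.png`, yet the reported types differ and neither is an image, because only the contents are inspected.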

Here's the database that libmagic uses: https://github.com/threatstack/libmagic/tree/master/magic/Magdir

ftftft

Give me a few days...

Why days? I could cross-reference that super fast in sqlite. If this is a lot of work for you, stand back! I got this.

Still not sure we want to do this though. env is going to fill up my terminal buffer...

Sure, go ahead. I added everything in a dictionary: https://gist.github.com/trapd00r/554f03450ed114fee191e794c87b0215

I am not sure either, but there are no performance issues so why not, really. :D

Great, that will be super easy to parse.

Some of these are kind of giving me lulz though. '9.PNG' => "NinePatchDrawable Image", really? But I'll probably only include those that I can cross-reference with libmagic.

I use direnv and environment variables for various purposes so I often do env | grep -i foo. I'm not going to enjoy all the extra accidental collisions with LS_COLORS, especially since each false match will scroll everything off my terminal buffer. Might just have to write a wrapper of some kind that extracts what I need with some explicit exclusion of LS_COLORS so I don't ever accidentally hit it. Edit: Now that I think about it I bet there is something that could handle this for me. I'll look around.
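One possible shape for such a wrapper is sketched below. It assumes the LS_COLORS value is a single line (as `dircolors` output is); the name `envgrep` is hypothetical.

```shell
# Hypothetical wrapper: grep the environment case-insensitively,
# dropping the LS_COLORS line first so it can never match.
envgrep() {
    env | grep -v '^LS_COLORS=' | grep -i -e "$1"
}

# Usage: envgrep foo
```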

Anyway, the point is my use case is probably not the normal one, so if performance is really that much of a non-issue then there's little reason not to include these if we can automate the categorization.

Also, would it be a goal to automate the scraping / categorization? That seems horrendously over-engineered but people will be updating the list on Wikipedia ...

Edit: A script to build LS_COLORS out of some kind of database (dunno what format yet, probably simple JSON would do it) could be useful regardless. That would enable us to do things like have names/labels for the colors themselves and then associate extensions with the named labels, etc.
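As a rough illustration of the label idea, the sketch below joins two small whitespace-separated tables (plain text rather than JSON, to keep it dependency-free) and prints entries in the `*.ext=code:` form used in the LS_COLORS variable. The function name, labels, and colour codes are all made up for the example.

```shell
# Hypothetical builder.
#   $1: file of "label colour-code" lines
#   $2: file of "extension label" lines
build_ls_colors() {
    awk '
        # First file: remember the colour code for each label.
        NR == FNR { code[$1] = $2; next }
        # Second file: emit one entry per extension whose label is known.
        ($2 in code) { printf "*.%s=%s:", $1, code[$2] }
        END { print "" }
    ' "$1" "$2"
}

labels=$(mktemp); exts=$(mktemp)
printf '%s\n' 'audio 38;5;137' 'image 38;5;97' > "$labels"
printf '%s\n' 'mp3 audio' 'flac audio' 'png image' > "$exts"
build_ls_colors "$labels" "$exts"
rm -f "$labels" "$exts"
```

Changing the colour for a whole group then means editing one line in the label table.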

Yeah, forgot to tell you but the extensions in my dict above are scraped from fileinfo.com - their descriptions were a lot better (and also more extensions). And yeah, some of them are pretty bonkers...

I'm all for automation, I'll tinker more with this tomorrow after a good night's sleep...

We could cheat and scrape from their already defined categories but not sure if every filetype is categorized. Maybe it's good enough anyway.

Edit: If you're going to scrape anything, do note that only 500 results are shown by default - you'll have to scroll down and click "view full list".

https://github.com/trapd00r/LS_COLORS/tree/motherofgod/bin/scrape_fileinfo

  • auto-scrape from fileinfo.com
  • every file extension categorized
  • every file extension commented
  • a valid LS_COLORS file generated on STDOUT, including folding markers for vim :)

Btw. This wasn't an issue with the entries I scraped from Wikipedia (only 2500+), but this is over 11k entries and, welp, we run into the 128 KiB limit per env var (with 4 KiB pages).

MAX_ARG_STRLEN is a constant defined as PAGE_SIZE*32 in /path/to/linux/headers/include/uapi/linux/binfmts.h. It cannot be changed without recompiling the kernel.

It's kind of a big deal because:

git⸢motherofgod」% eval $(dircolors -b ./auto.LS_COLORS)
% env
zsh: argument list too long: env
% perl -ehi
zsh: argument list too long: perl
% date
zsh: argument list too long: date
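To catch this before exporting a generated value, its length can be compared against the per-string limit. A minimal sketch, assuming Linux and that `getconf PAGE_SIZE` reports the kernel page size; both helper names are hypothetical.

```shell
# Per-string limit on Linux: MAX_ARG_STRLEN = PAGE_SIZE * 32.
max_arg_strlen() {
    echo $(( $(getconf PAGE_SIZE) * 32 ))
}

# Bytes counted for the variable: "LS_COLORS=" + value + trailing NUL.
ls_colors_bytes() {
    printf 'LS_COLORS=%s' "${LS_COLORS-}" | wc -c | awk '{ print $1 + 1 }'
}

if [ "$(ls_colors_bytes)" -gt "$(max_arg_strlen)" ]; then
    echo "LS_COLORS is too long to export" >&2
fi
```

The check is meant to run while LS_COLORS is still an unexported shell variable; once it is exported, no external command can start, as the session above shows.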

Heh. Maybe some kind of shell extension could delegate highlighting to a subprocess. Zsh might be able to do that with a plugin, but I use bash. Regardless, even if it were workable to do that delegation, we'd actually really have to worry about performance now. Some directories have thousands of files in them. And only really weird people are going to want to install extensions like that just so that they can have a special color for LogonStudio Windows Vista Logon Screen. Speaking for myself, I may be weird, but I'm not likely to be among those weird people.[1]

What I'm saying is, I think we need two things:

  1. Figure out what the upper limit actually is, in concrete terms, for how many file types we can support in a cross-platform way (read: without doing anything radical like what I described above).
  2. Figure out a way to reduce these into a list of types that constitute the low hanging fruit, within that limit.

[1]: Speaking of which, most of these file types have no significance in a *nix environment, which is where 99% of users of LS_COLORS will be.

Given that all extensions use the ECMA-48 spec notation and each extension has 5 chars, we could do roughly 9k (13 chars per entry). And agreed, a curated list would work better, but then this whole automation thing falls short.
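The estimate can be reproduced with a little arithmetic. The sketch below assumes the Linux per-string limit of PAGE_SIZE * 32 and takes the bytes-per-entry figure as a parameter; 13 is the estimate from the comment above, not a measured value, and `max_entries` is a hypothetical name.

```shell
# Rough capacity: how many fixed-size entries fit under the per-string limit.
max_entries() {
    limit=$(( $(getconf PAGE_SIZE) * 32 ))
    overhead=11                  # "LS_COLORS=" (10 bytes) plus the trailing NUL
    echo $(( (limit - overhead) / ${1:-13} ))
}

max_entries        # 13 bytes per entry (the default)
max_entries 20     # longer entries leave room for fewer of them
```

With 4 KiB pages the default comes out at around 10k entries, the same ballpark as the figure above. Real entries vary in length, so this is only an upper bound for planning.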

I might be able to trim the list quite effectively, I happened to write a thing while playing around with this...

https://github.com/trapd00r/File-Extension

That's cool. Let me look into this and get back to you.

Btw, are we concerned about who holds the copyright for the descriptions of the file types at fileinfo.com? I haven't looked into that at all.

Kind of small potatoes but if you have ImageMagick installed, identify -list format is a pretty handy list of graphics extensions with descriptions.

For folks concerned with blowing up their environment, try something like this:
Use dircolors to set the variable, but strip the export line before you eval the output, so LS_COLORS stays a plain shell variable instead of an environment variable.
Then set an alias that expands the current LS_COLORS value. This should be in your rc file and not your profile so that it's executed on every interactive shell invocation.

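# Only run if dircolors is available.
type dircolors >/dev/null 2>&1 && {
    # sed '$d' drops the trailing "export LS_COLORS" line, so the variable is set but not exported.
    eval "$({ dircolors -b "${XDG_CONFIG_HOME}/sh/dir_colors" 2>/dev/null || dircolors -p | dircolors -b ; } | sed '$d')"
}
# Pass LS_COLORS to ls alone, as a per-command assignment.
alias ls="LS_COLORS=\"\${LS_COLORS:-${LS_COLORS}}\" ls ${COLOR_OPTS} -h --time-style=long-iso"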

On my way to Zanzibar right now but I stumbled upon this on Hacker News:
http://fileformats.archiveteam.org/wiki/Category:File_formats_by_extension

Pretty comprehensive and with a lot of information on each type.

Would it make sense to compile this file list into a YAML file, à la vivid's config? This could be done as part of #195.

  • These are two big pieces of work with a lot of risk of breakage. They should be done separately, to make the work easier to perform and easier to roll back in the worst case.
  • The migration to vivid should be done first.