[FEATURE REQUEST] Scrape the git documentation website to build the json

Question

ReedyBear opened this issue 4 years ago · 14 comments

Saw this on Twitter (a link to reddit) & thought it was really cool. Looked at it and had an idea I thought I ought to share. It's something I may be interested in working on in my free time, but I can't make any promises. I might be interested in integrating it into an open source git wrapper library I've made, too.

Is your feature request related to a problem? Please describe.
Only a few commands are currently explained & writing the rest of them by hand would be very time consuming, I imagine.

Describe the solution you'd like
To scrape the git documentation site and generate json for git-commands.ts
At first glance, it seems like the scraping may be rather straightforward. The structure of the pages appears to be pretty simple, and the flag is separated from the description, semantically.

Describe alternatives you've considered
Trying to find already scraped/parsed git documentation or a library that does the scraping.

Additional context
These docs use a <dt> for each -flag & a <dd> for the description. The description might need to be cleaned up, unless you want commands.ts to have html in the descriptions.

Links could be scraped from https://git-scm.com/docs but it might be easier to just make an array of doc page urls to scrape. https://git-scm.com/docs/git-add is where the image comes from.
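
As a rough illustration of the `<dt>`/`<dd>` idea above, here is a minimal sketch rather than the prototype's actual code. It assumes each group is one or more `<dt>` elements followed by a single `<dd>`, and that the lists are not nested. The `ScrapedFlag` shape and the names `stripHtml` and `parseFlags` are hypothetical. A regex keeps the example dependency-free; a real scraper would more likely use a proper HTML parser.

```typescript
// Hypothetical output shape for illustration; the real git-commands.ts
// structure may differ.
interface ScrapedFlag {
  name: string; // preferred spelling without leading "--", e.g. "dry-run"
  aliases: string[]; // remaining spellings, e.g. ["-n"]
  description: string; // plain text, tags removed
}

// Remove tags, decode a few common entities, and collapse whitespace.
function stripHtml(html: string): string {
  return html
    .replace(/<[^>]+>/g, "")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

// Walk <dt>/<dd> elements in document order: consecutive <dt>s collect
// flag spellings, and the next <dd> closes the group with its description.
function parseFlags(html: string): ScrapedFlag[] {
  const flags: ScrapedFlag[] = [];
  const element = /<(dt|dd)\b[^>]*>([\s\S]*?)<\/\1>/g;
  let spellings: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = element.exec(html)) !== null) {
    const text = stripHtml(match[2]);
    if (match[1] === "dt") {
      spellings.push(text);
      continue;
    }
    if (spellings.length === 0) continue; // <dd> with no preceding <dt>
    // Prefer a long "--flag" spelling as the name when one exists.
    const long = spellings.find((s) => s.startsWith("--")) ?? spellings[0];
    flags.push({
      name: long.replace(/^--/, ""),
      aliases: spellings.filter((s) => s !== long),
      description: text,
    });
    spellings = [];
  }
  return flags;
}
```

Calling `parseFlags` on the HTML of a single doc page would return one entry per `<dd>`, which could then be serialized with `JSON.stringify`.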

Answer 1 · 2021-02-21T01:11:28.000Z

https://github.com/git/htmldocs <- Might be easier scraping this repo than scraping their website.

Answer 2 · 2021-02-21T06:00:17.000Z

I have a working prototype in my fork: https://github.com/Taeluf/what-the-git/blob/main_scrape-docs/dev/index.html

For https://git-scm.com/docs/git-add it outputs:

Full output

[
	{
		"name": "<pathspec>…",
		"aliases": [],
		"description": "Files to add content from.  Fileglobs (e.g. *.c) can\nbe given to add all matching files.  Also a\nleading directory name (e.g. dir to add dir/file1\nand dir/file2) can be given to update the index to\nmatch the current state of the directory as a whole (e.g.\nspecifying dir will record not just a file dir/file1\nmodified in the working tree, a file dir/file2 added to\nthe working tree, but also a file dir/file3 removed from\nthe working tree). Note that older versions of Git used\nto ignore removed files; use --no-all option if you want\nto add modified or new files but ignore removed ones.\n\nFor more details about the <pathspec> syntax, see the pathspec entry\nin gitglossary[7].",
		"isString": true
	},
	{
		"name": "dry-run",
		"aliases": [
			"-n"
		],
		"description": "Don’t actually add the file(s), just show if they exist and/or will\nbe ignored.",
		"isString": false
	},
	{
		"name": "verbose",
		"aliases": [
			"-v"
		],
		"description": "Be verbose.",
		"isString": false
	},
	{
		"name": "force",
		"aliases": [
			"-f"
		],
		"description": "Allow adding otherwise ignored files.",
		"isString": false
	},
	{
		"name": "interactive",
		"aliases": [
			"-i"
		],
		"description": "Add modified contents in the working tree interactively to\nthe index. Optional path arguments may be supplied to limit\noperation to a subset of the working tree. See “Interactive\nmode” for details.",
		"isString": false
	},
	{
		"name": "patch",
		"aliases": [
			"-p"
		],
		"description": "Interactively choose hunks of patch between the index and the\nwork tree and add them to the index. This gives the user a chance\nto review the difference before adding modified contents to the\nindex.\n\nThis effectively runs add --interactive, but bypasses the\ninitial command menu and directly jumps to the patch subcommand.\nSee “Interactive mode” for details.",
		"isString": false
	},
	{
		"name": "edit",
		"aliases": [
			"-e"
		],
		"description": "Open the diff vs. the index in an editor and let the user\nedit it.  After the editor was closed, adjust the hunk headers\nand apply the patch to the index.\n\nThe intent of this option is to pick and choose lines of the patch to\napply, or even to modify the contents of lines to be staged. This can be\nquicker and more flexible than using the interactive hunk selector.\nHowever, it is easy to confuse oneself and create a patch that does not\napply to the index. See EDITING PATCHES below.",
		"isString": false
	},
	{
		"name": "update",
		"aliases": [
			"-u"
		],
		"description": "Update the index just where it already has an entry matching\n<pathspec>.  This removes as well as modifies index entries to\nmatch the working tree, but adds no new files.\n\nIf no <pathspec> is given when -u option is used, all\ntracked files in the entire working tree are updated (old versions\nof Git used to limit the update to the current directory and its\nsubdirectories).",
		"isString": false
	},
	{
		"name": "no-ignore-removal",
		"aliases": [
			"-A",
			"--all"
		],
		"description": "Update the index not only where the working tree has a file\nmatching <pathspec> but also where the index already has an\nentry. This adds, modifies, and removes index entries to\nmatch the working tree.\n\nIf no <pathspec> is given when -A option is used, all\nfiles in the entire working tree are updated (old versions\nof Git used to limit the update to the current directory and its\nsubdirectories).",
		"isString": false
	},
	{
		"name": "ignore-removal",
		"aliases": [
			"--no-all"
		],
		"description": "Update the index by adding new files that are unknown to the\nindex and files modified in the working tree, but ignore\nfiles that have been removed from the working tree.  This\noption is a no-op when no <pathspec> is used.\n\nThis option is primarily to help users who are used to older\nversions of Git, whose \"git add <pathspec>…\" was a synonym\nfor \"git add --no-all <pathspec>…\", i.e. ignored removed files.",
		"isString": false
	},
	{
		"name": "intent-to-add",
		"aliases": [
			"-N"
		],
		"description": "Record only the fact that the path will be added later. An entry\nfor the path is placed in the index with no content. This is\nuseful for, among other things, showing the unstaged content of\nsuch files with git diff and committing them with git commit\n-a.",
		"isString": false
	},
	{
		"name": "refresh",
		"aliases": [],
		"description": "Don’t add the file(s), but only refresh their stat()\ninformation in the index.",
		"isString": false
	},
	{
		"name": "ignore-errors",
		"aliases": [],
		"description": "If some files could not be added because of errors indexing\nthem, do not abort the operation, but continue adding the\nothers. The command shall still exit with non-zero status.\nThe configuration variable add.ignoreErrors can be set to\ntrue to make this the default behaviour.",
		"isString": false
	},
	{
		"name": "ignore-missing",
		"aliases": [],
		"description": "This option can only be used together with --dry-run. By using\nthis option the user can check if any of the given files would\nbe ignored, no matter if they are already present in the work\ntree or not.",
		"isString": false
	},
	{
		"name": "no-warn-embedded-repo",
		"aliases": [],
		"description": "By default, git add will warn when adding an embedded\nrepository to the index without using git submodule add to\ncreate an entry in .gitmodules. This option will suppress the\nwarning (e.g., if you are manually performing operations on\nsubmodules).",
		"isString": false
	},
	{
		"name": "renormalize",
		"aliases": [],
		"description": "Apply the \"clean\" process freshly to all tracked files to\nforcibly add them again to the index.  This is useful after\nchanging core.autocrlf configuration or the text attribute\nin order to correct files added with wrong CRLF/LF line endings.\nThis option implies -u.",
		"isString": false
	},
	{
		"name": "chmod=(+|-)x",
		"aliases": [],
		"description": "Override the executable bit of the added files.  The executable\nbit is only changed in the index, the files on disk are left\nunchanged.",
		"isString": false
	},
	{
		"name": "--pathspec-from-file=<file>",
		"aliases": [],
		"description": "Pathspec is passed in <file> instead of commandline args. If\n<file> is exactly - then standard input is used. Pathspec\nelements are separated by LF or CR/LF. Pathspec elements can be\nquoted as explained for the configuration variable core.quotePath\n(see git-config[1]). See also --pathspec-file-nul and\nglobal --literal-pathspecs.",
		"isString": true
	},
	{
		"name": "pathspec-file-nul",
		"aliases": [],
		"description": "Only meaningful with --pathspec-from-file. Pathspec elements are\nseparated with NUL character and all other characters are taken\nliterally (including newlines and quotes).",
		"isString": false
	},
	{
		"name": "",
		"aliases": [],
		"description": "This option can be used to separate command-line options from\nthe list of files, (useful when filenames might be mistaken\nfor command-line options).",
		"isString": false
	}
]

Answer 3 · 2021-02-21T06:07:23.000Z

I see some issues with this:

an empty "name": "" (the entry for the -- separator)
Names containing < and > are not cleaned up
chmod=(+|-)x <- what to do with that?
Some descriptions are VERY long
- Should they stay long?
- Should they include html? (They don't right now.)

todos:

Clean up the code
Add command name output (right now it's just the flags)
Scrape from an array of urls (easier than scraping the page that lists the urls)
Convert to typescript (not something I know how to do nor particularly want to do)
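
One possible way to handle the name problems listed above (the empty name, names containing `<` and `>`, and `chmod=(+|-)x`) is to classify each raw `<dt>` text before storing it. This is only a sketch under assumptions: the `NormalizedFlag` shape, the three `kind` values, and the `normalizeFlag` function are hypothetical and not part of the prototype. Everything after the first `=` is treated as a value hint rather than as part of the name.

```typescript
// Hypothetical classification of a raw <dt> text; not the prototype's format.
type FlagKind = "option" | "positional" | "separator";

interface NormalizedFlag {
  kind: FlagKind;
  name: string; // e.g. "chmod", "pathspec", or "--" for the separator
  valueHint: string | null; // e.g. "(+|-)x" or "<file>", when present
}

function normalizeFlag(raw: string): NormalizedFlag {
  const text = raw.trim();

  // A bare "--" separates options from file names; keep it instead of
  // letting dash-stripping turn it into an empty name.
  if (text === "--") {
    return { kind: "separator", name: "--", valueHint: null };
  }

  // "<pathspec>…" style entries are positional arguments, not flags.
  const positional = /^<([^>]+)>/.exec(text);
  if (positional !== null) {
    return { kind: "positional", name: positional[1], valueHint: null };
  }

  // Strip one or two leading dashes, then split on the first "=".
  const body = text.replace(/^-{1,2}/, "");
  const eq = body.indexOf("=");
  if (eq === -1) {
    return { kind: "option", name: body, valueHint: null };
  }
  return {
    kind: "option",
    name: body.slice(0, eq),
    valueHint: body.slice(eq + 1),
  };
}
```

With this approach `--chmod=(+|-)x` would become the name `chmod` with the hint `(+|-)x`, and a non-null `valueHint` could drive a field like `isString`.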

Answer 4 · 2021-02-21T10:53:39.000Z

This is a good idea, and I actually thought about it, but I'm seeing a few issues for that solution, which is why I didn't try to implement it further:

As this is a front-end tool, I'm afraid the request time will go up drastically because the app would have to parse a webpage for every request.
The purpose of the app is to explain things simply, and the git docs seem to me a little too complicated. It's really convoluted and not very straight to the point for beginners (lots of jargon, complicated-ish terms and it really lacks examples).

I'm skeptical about that solution. Would it maybe work better if we made small articles to explain complicated terms? Embedded in the HTML of the page? Or if we just hovered over some terms and it gave us a definition for it?

Answer 5 · 2021-02-22T21:24:12.000Z

Per-request scraping would definitely be yucky. The scraping could be done as part of a build process, or just done once, with the json stored in the repo.

Yeah, I definitely agree the git docs can be complicated & more refined explanations would be better... But then would you rather give no explanation or a complicated explanation? I lean toward complex, but also have been so overwhelmed by complicated in-depth documentation before that it wasn't helpful, so idk.

The hover-for-definition (or maybe click to be mobile friendly) seems like a good idea. Or maybe a checkbox or toggle button to switch between hand-crafted explanations & git-doc definitions.

Something like this, but in html/css & not horribly ugly:

My thinking there is to toggle results on the full page, instead of one flag at a time as hovering would do.

Answer 6 · 2021-02-22T21:27:29.000Z

Or at least do the scraping to generate placeholder json in the repo with "description": "No description available" or something. Having the structure in place could make it much easier to fill out explanations as free time & desire allows.
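
A minimal sketch of that placeholder idea, assuming the flag names have already been scraped for each doc page url. The `PlaceholderCommand` shape and the functions `commandNameFromUrl` and `buildPlaceholders` are hypothetical names for illustration, and the output is not claimed to match the structure git-commands.ts actually uses.

```typescript
// Hypothetical shapes for the generated placeholder json.
interface PlaceholderFlag {
  name: string;
  aliases: string[];
  description: string;
}

interface PlaceholderCommand {
  name: string; // e.g. "add"
  docsUrl: string; // page the flags were scraped from
  flags: PlaceholderFlag[];
}

interface ScrapedPage {
  url: string; // e.g. "https://git-scm.com/docs/git-add"
  flagNames: string[]; // flag names found on that page
}

const PLACEHOLDER_DESCRIPTION = "No description available";

// Derive the command name from the last path segment of the doc url,
// dropping the "git-" prefix.
function commandNameFromUrl(url: string): string {
  const segments = url.split("/").filter((part) => part.length > 0);
  const last = segments.pop() ?? "";
  return last.replace(/^git-/, "");
}

// Build one placeholder entry per page, with every description left
// for a human to fill in later.
function buildPlaceholders(pages: ScrapedPage[]): PlaceholderCommand[] {
  return pages.map((page) => ({
    name: commandNameFromUrl(page.url),
    docsUrl: page.url,
    flags: page.flagNames.map((name) => ({
      name,
      aliases: [],
      description: PLACEHOLDER_DESCRIPTION,
    })),
  }));
}
```

The result could be written to the repo once with `JSON.stringify(buildPlaceholders(pages), null, 2)`, so no scraping happens at request time.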

Answer 7 · 2021-02-24T07:12:54.000Z

Your solution would make sense. However, what bothers me is that we would have to mix generated docs with written docs, and I would rather not have docs at all for one command and just say that it's not supported and then give a link to the git documentation rather than giving them the git documentation description. I feel like that would be a bit more intuitive. What do you think about that implementation instead?

Answer 8 · 2021-02-24T07:17:30.000Z

May I suggest an alternative solution? I was thinking about making a less generic error message, and instead of showing that it's not valid, we should show that it's not implemented yet, and return a link to the git documentation.

Answer 9 · 2021-02-24T16:46:33.000Z

I think that sounds good. I get disliking the mixing of generated & written docs. Creates a less consistent user experience.

The less generic error message & linking to docs would be good. Might want to differentiate then, between "command not supported yet" and "command doesn't exist"

Answer 10 · 2021-02-24T17:08:28.000Z

What would you want to do with unsupported flags? (when the command itself & SOME flags are supported)

Answer 11 · 2021-02-24T21:27:16.000Z

> The less generic error message & linking to docs would be good. Might want to differentiate then, between "command not supported yet" and "command doesn't exist"

That's exactly what I was thinking about. Do you think you could implement something like this?

> What would you want to do with unsupported flags? (when the command itself & SOME flags are supported)

I think we can have another type of error so the user knows what they didn't do right. That would be the best answer: different kinds of error messages for different errors, instead of the really generic, confusing one I implemented.
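
The distinct error kinds discussed here could be modeled as a discriminated union. This is a hedged sketch only: `LookupResult`, `KNOWN_COMMANDS`, `SUPPORTED`, `lookup`, and `describe` are hypothetical names, the sample data is made up for illustration, and the docs link simply follows the `https://git-scm.com/docs/git-<command>` pattern seen earlier in the thread.

```typescript
// Hypothetical result type: one variant per situation discussed above.
type LookupResult =
  | { kind: "ok"; command: string; flags: string[] }
  | { kind: "unknown-command"; command: string }
  | { kind: "unsupported-command"; command: string; docsUrl: string }
  | { kind: "unsupported-flag"; command: string; flag: string; docsUrl: string };

// Illustrative data only: commands git has vs. commands the app explains.
const KNOWN_COMMANDS = new Set<string>(["add", "commit", "rebase"]);
const SUPPORTED = new Map<string, string[]>([
  ["add", ["--dry-run", "--verbose"]],
]);

function lookup(command: string, flags: string[]): LookupResult {
  if (!KNOWN_COMMANDS.has(command)) {
    return { kind: "unknown-command", command };
  }
  const docsUrl = `https://git-scm.com/docs/git-${command}`;
  const supportedFlags = SUPPORTED.get(command);
  if (supportedFlags === undefined) {
    return { kind: "unsupported-command", command, docsUrl };
  }
  // Report the first flag that has no hand-written explanation yet.
  const missing = flags.find((flag) => supportedFlags.indexOf(flag) === -1);
  if (missing !== undefined) {
    return { kind: "unsupported-flag", command, flag: missing, docsUrl };
  }
  return { kind: "ok", command, flags };
}

// Turn each variant into a specific message instead of one generic error.
function describe(result: LookupResult): string {
  switch (result.kind) {
    case "ok":
      return `git ${result.command} is supported.`;
    case "unknown-command":
      return `git ${result.command} does not exist.`;
    case "unsupported-command":
      return `git ${result.command} is not supported yet. See ${result.docsUrl}`;
    case "unsupported-flag":
      return `${result.flag} is not supported yet for git ${result.command}. See ${result.docsUrl}`;
  }
}
```

Keeping the variants separate lets the UI show a docs link only where one makes sense, which covers both "command not supported yet" and the partially supported case where only some flags are missing.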

Answer 12 · 2021-02-25T16:41:51.000Z

I'll think about it & get back to you. It wouldn't be for a week or two, probably. Not a lot of free time for side projects this week/weekend.

Answer 13 · 2021-03-15T18:58:01.000Z

I'm not likely to get around to doing this. There's a slim chance I get a spark of motivation at some point & do it, but it's not likely. I have too many other side projects to work on when I have free time.

Answer 14 · 2021-03-15T19:40:33.000Z

Alright dude, it's fine. Thank you for the interest in my project still! I'm going to still leave the issue open for anyone willing to implement a solution for this.