scipy/docs.scipy.org

Remove numpy documentation from google searches

mattip opened this issue · 12 comments

Searching Google for "numpy standard normal" today shows me these top three results:

https://docs.scipy.org/doc/numpy-1.15.0/reference/...
No information is available for this page.
Learn why

numpy.random.standard_normal — NumPy v1.20.dev0 Manual
numpy.org › devdocs › reference › random › generated
Draw samples from a standard Normal distribution (mean=0, stdev=1). Note. New code should use the standard_normal method of a default_rng() instance ...

numpy.random.standard_normal — NumPy v1.18 Manual
numpy.org › stable › reference › random › generated
Draw samples from a standard Normal distribution (mean=0, stdev=1). Note. New code should use the standard_normal method of a default_rng() instance ...

The top result is still from this site, even though the robots.txt prevents showing it.

Apparently we could use this Google tool https://www.google.com/webmasters/tools/removals to remove the results entirely. Should we?

pv commented

Maybe the issue is that Google has not yet re-crawled the pages, or has somehow ignored the changes? Maybe the robots.txt entry prevents it from re-crawling? Should we remove that entry instead? A comment from an SEO expert would be useful here. The thing is that if you just remove the old docs, you'll probably lose the weight from all the existing links pointing to them.

E.g. https://www.google.com/search?q=numpy+array also has docs.scipy.org/doc/numpy/ as its first hit, even though that redirects to numpy.org.

I have now changed the redirect from /doc/numpy to numpy.org/doc to "Moved Permanently" (301) instead of a temporary redirect. Moreover, since #38 the old docs have been down-weighted with a <link rel="canonical"> tag pointing to the new docs, so in principle the canonical ones should come first. This seems to work for SciPy (and the google-juice from the old docs presumably feeds into the new ones).
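To spot-check that an old page really carries the canonical tag, something like the following could be used. It is only a minimal sketch: the `CanonicalFinder` class, the `find_canonical` helper, and the sample markup (including the target URL) are made up for illustration and are not taken from the actual docs.scipy.org templates.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in `html`, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical markup standing in for an old documentation page.
old_page = """<html><head>
<title>numpy.random.standard_normal</title>
<link rel="canonical" href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_normal.html" />
</head><body></body></html>"""

print(find_canonical(old_page))
```

A page without the tag yields `None`, which would flag pages where the canonical link is missing.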

pv commented

I have now removed the Disallow: /doc/numpy* line from robots.txt, as it apparently was not doing what it was expected to do. We can add it back if we are sure it's the optimal thing to do...

Let's wait a few weeks to see how this pans out, then we can see if anything needs adjusting.
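A rough way to see why the Disallow rule may have worked against us: a crawler that honours robots.txt cannot fetch a blocked page at all, so it never gets to see the redirect or the canonical tag on it. The sketch below uses the standard library's `urllib.robotparser`. Assumptions: the stdlib parser treats rules as plain path prefixes, so the trailing `*` is dropped from the rule here; the `crawl_allowed` helper and the page URL are hypothetical; Google's own matching may differ in details.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_lines, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical old page; the real paths on docs.scipy.org may differ.
OLD_PAGE = "https://docs.scipy.org/doc/numpy-1.15.0/reference/index.html"

# Prefix form of the removed rule (wildcard dropped for the stdlib parser).
with_rule = ["User-agent: *", "Disallow: /doc/numpy"]
# An empty Disallow means nothing is blocked.
without_rule = ["User-agent: *", "Disallow:"]

print(crawl_allowed(with_rule, OLD_PAGE))     # blocked: crawler never sees the page
print(crawl_allowed(without_rule, OLD_PAGE))  # allowed: crawler can follow the redirect
```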

pv commented

The Google docs seem to say that robots.txt is indeed the wrong tool for this:
https://support.google.com/webmasters/answer/6062608?hl=en

You may want to add a similar JavaScript solution to numpy.org. Otherwise, I guess Google may end up preferring the docs of some old version, and people will be confused when landing on them (e.g. also via sources other than Google). The rel=canonical tags it adds also automatically down-weight the obsolete doc versions.

Checking today (two weeks later), the old scipy page is still the top search result for the Google query https://www.google.com/search?hl=en&q=numpy%20standard%20normal. DuckDuckGo does rank the scipy.org pages lower (https://duckduckgo.com/?q=numpy+standard+normal&t=hk&ia=web), but they are still there.

pv commented

When I look at the Google Search Console now, it shows that Google last indexed most of the pages in early June (and some in April), and that most pages are still blocked by robots.txt. So one probably needs to wait longer (or submit a sitemap to them to trigger the update).
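If we went the sitemap route, a minimal file in the sitemaps.org format could be generated along these lines. This is only a sketch: `build_sitemap` is a hypothetical helper, the URL list is illustrative, and whether Google re-crawls any sooner after a submission is not guaranteed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Illustrative list; a real sitemap would enumerate the pages to re-crawl.
pages = ["https://docs.scipy.org/doc/numpy/"]
print(build_sitemap(pages))
```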

Thanks, I will set a reminder for two more weeks.

Two weeks later, the 1.15 pages are still the top two links :(

I wish Google would show the documentation for Python 3.8 by default instead of polluting the results with 2.7.

pv commented

The numpy.random pages are maybe the worst cases, because they were moved/changed in the numpy random rewrite (and Google indexed some of them while the canonical link was still invalid, before #42).

For other numpy functions, the numpy.org links now generally seem to be on top.

pv commented

Google now seems to have numpy.org/doc/stable on top.

Cool. I will close this then. Thanks @pv. Please reopen or open a new issue if there is more we can do.