vezaynk/Sitemap-Generator-Crawler

Blacklist not working

Kristiansky opened this issue · 9 comments

I have added a few pages from my site to the blacklist array, but despite that, they appear in the sitemap every time.

$blacklist = array( "/de/", "/de/*", "/private/", "/private/*", "*.jpg", "*.png", );
This is my blacklist array. When I open the XML file:
<url> <loc>https://www.mywebsite.com/de</loc> <changefreq>daily</changefreq> <priority>1</priority> </url> <url> <loc>https://www.mywebsite.com/de/sonnenschirme</loc> <changefreq>daily</changefreq> <priority>1</priority> </url>

$blacklist needs absolute URLs. For /de/ you would want either https://website.com/de/ or */de/.
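For context, here is a minimal sketch of how a wildcard blacklist check along these lines can work. The function name and the fnmatch-style matching are illustrative assumptions, not necessarily how the script implements it internally:

<?php
// Illustrative only: assumes each blacklist entry is compared against the
// absolute URL using shell-style wildcards, as PHP's fnmatch() does.
function is_blacklisted(array $blacklist, string $url): bool
{
    foreach ($blacklist as $pattern) {
        // fnmatch() matches the whole string, so "/de/" alone never matches
        // an absolute URL, while a full pattern such as "*/de/*" does.
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

var_dump(is_blacklisted(array("*/de/*"), "https://www.mywebsite.com/de/agb")); // bool(true)
var_dump(is_blacklisted(array("/de/"), "https://www.mywebsite.com/de/agb"));   // bool(false)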

I have changed the array as you told me:
$blacklist = array( "https://www.mywebsite.com/de/", "https://www.mywebsite.com/de/*", "https://www.mywebsite.com/private/", "https://www.mywebsite.com/private/*", "*.jpg", "*.png", );
But the /de links still appear in the sitemap. 😞

Post the full config, I'll take a look.

<?php
/*
Sitemap Generator by Slava Knyazev. Further acknowledgements in the README.md file.

Website: https://www.knyz.org/
I also live on GitHub: https://github.com/knyzorg
Contact me: Slava@KNYZ.org
*/

//Make sure to use the latest revision by downloading from github: https://github.com/knyzorg/Sitemap-Generator-Crawler

/* Usage
Usage is pretty straightforward:
- Configure the crawler by editing this file.
- Select the file to which the sitemap will be saved
- Select URL to crawl
- Configure blacklists, accepts the use of wildcards (example: http://example.com/private/* and *.jpg)
- Generate sitemap
- Either send a GET request to this script or run it from the command line (refer to README file)
- Submit to Google
- Set up a CRON job to execute this script every so often

It is recommended you don't remove the above for future reference.
*/

// Default site to crawl
$site = "https://www.may-online.com/en";

// Default sitemap filename
$file = "../sitemap-generated.xml";
$permissions = 0644;

// Depth of the crawl, 0 is unlimited
$max_depth = 0;

// Show changefreq
$enable_frequency = true;

// Show priority
$enable_priority = true;

// Default values for changefreq and priority
$freq = "daily";
$priority = "1";

// Add lastmod based on server response. Unreliable and disabled by default.
$enable_modified = false;

// Disable this for a misconfigured but tolerable SSL server.
$curl_validate_certificate = true;

// The pages will be excluded from crawl and sitemap.
// Use for excluding non-HTML files to increase performance and save bandwidth.
$blacklist = array(
	"https://www.may-online.com/de/",
	"https://www.may-online.com/de/*",
	"https://www.may-online.com/private/",
	"https://www.may-online.com/private/*",
	"*.jpg",
	"*.png",
);

// Enable this if your site requires GET arguments to function
$ignore_arguments = false;

// Not yet implemented. See issue #19 for more information.
$index_img = false;

//Index PDFs
$index_pdf = true;

// Set the user agent for crawler
$crawler_user_agent = "Mozilla/5.0 (compatible; Sitemap Generator Crawler; +https://github.com/knyzorg/Sitemap-Generator-Crawler)";

// Header of the sitemap.xml
$xmlheader = '<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">';

// Optionally configure debug options
$debug = array(
	"add" => true,
	"reject" => true,
	"warn" => true
);


//Modify only if configuration version is broken
$version_config = 2;

Here's what's generated:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://www.may-online.com/de</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/impressum</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/agb</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/datenschutz</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/restaurant-cafe</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme/mezzo</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/unternehmen/referenzen</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
</urlset>

Found the issue. The blacklist seems to be ignored when the URL is reached via a redirect. Oh, the pleasures of parsing the web!

By the way, to format code blocks, it's 3 backticks. The initial $site is trusted and is never checked against blacklists.
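For anyone reading along, the idea behind such a fix can be sketched roughly as follows, assuming the crawler fetches pages with cURL and follows redirects. The names here are illustrative, not the script's actual functions:

<?php
// Sketch only: if the blacklist is checked only before the request, a page
// that redirects into a blacklisted section can still slip into the sitemap.
// Re-checking the URL cURL actually ended up on closes that gap.
function final_url(string $url, bool $validate_certificate = true): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, $validate_certificate);
    curl_exec($ch);
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
    curl_close($ch);
    return $effective;
}

// Both the requested URL and the effective URL should be matched against
// $blacklist before the page is added to the sitemap.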

It's refusing to crawl the /en site. The reformatter chokes on it, probably related to the redirection.

FYI, redirecting from the root is bad practice.

I was wrong. It is not related to the redirect. Your link looks like this: <a href=" https://www.may-online.com/en">en</a>. That is not okay. The space before the https:// makes it invalid. Web browsers are smart enough to remove it; my script is not.
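A rough illustration of the workaround, assuming hrefs are pulled out of the HTML and then resolved. The extraction below is deliberately simplified and is not the crawler's actual parser:

<?php
// Stray whitespace inside href values makes them invalid as-is; browsers
// trim it silently, so the crawler can do the same before resolving links.
$html = '<a href=" https://www.may-online.com/en">en</a>';

preg_match_all('/href="([^"]*)"/i', $html, $matches);
foreach ($matches[1] as $href) {
    $clean = trim($href); // " https://www.may-online.com/en" -> "https://www.may-online.com/en"
    echo $clean, "\n";
}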

I was supposed to close this issue via commit.
The redirection bug was fixed in b894362.