Blacklist not working
Kristiansky opened this issue · 9 comments
I have added a few pages from my site to the blacklist array, but despite that, they appear in the sitemap every time.
$blacklist = array( "/de/", "/de/*", "/private/", "/private/*", "*.jpg", "*.png", );
This is my blacklist array. When I open the XML file:
<url>
<loc>https://www.mywebsite.com/de</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.mywebsite.com/de/sonnenschirme</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
$blacklist needs absolute URLs. For /de/ you would want either https://website.com/de/ or */de/.
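For illustration only (a minimal sketch, not necessarily how the script matches internally): if you think of each blacklist entry as a wildcard pattern tested against the full absolute URL, it becomes clear why a relative entry like /de/ can never match.

```php
<?php
// Sketch only: wildcard blacklist matching against absolute URLs,
// assuming fnmatch() semantics ("*" matches any run of characters).
function is_blacklisted(string $url, array $blacklist): bool
{
    foreach ($blacklist as $pattern) {
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

$url = "https://www.mywebsite.com/de/sonnenschirme";
var_dump(is_blacklisted($url, ["/de/", "/de/*"]));                  // false: relative patterns never match
var_dump(is_blacklisted($url, ["https://www.mywebsite.com/de/*"])); // true
var_dump(is_blacklisted($url, ["*/de/*"]));                         // true
```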
I have changed the array as you told me:
$blacklist = array( "https://www.mywebsite.com/de/", "https://www.mywebsite.com/de/*", "https://www.mywebsite.com/private/", "https://www.mywebsite.com/private/*", "*.jpg", "*.png", );
But the /de links still appear in the sitemap. 😞
Post the full config, I'll take a look.
<?php
/*
Sitemap Generator by Slava Knyazev. Further acknowledgements in the README.md file.
Website: https://www.knyz.org/
I also live on GitHub: https://github.com/knyzorg
Contact me: Slava@KNYZ.org
*/
//Make sure to use the latest revision by downloading from github: https://github.com/knyzorg/Sitemap-Generator-Crawler
/* Usage
Usage is pretty straightforward:
- Configure the crawler by editing this file.
- Select the file to which the sitemap will be saved
- Select URL to crawl
- Configure blacklists, accepts the use of wildcards (example: http://example.com/private/* and *.jpg)
- Generate sitemap
- Either send a GET request to this script or run it from the command line (refer to README file)
- Submit to Google
- Set up a cron job to execute this script every so often
It is recommended you don't remove the above for future reference.
*/
// Default site to crawl
$site = "https://www.may-online.com/en";
// Default sitemap filename
$file = "../sitemap-generated.xml";
$permissions = 0644;
// Depth of the crawl, 0 is unlimited
$max_depth = 0;
// Show changefreq
$enable_frequency = true;
// Show priority
$enable_priority = true;
// Default values for changefreq and priority
$freq = "daily";
$priority = "1";
// Add lastmod based on server response. Unreliable and disabled by default.
$enable_modified = false;
// Disable this for a misconfigured but tolerable SSL server.
$curl_validate_certificate = true;
// The pages will be excluded from crawl and sitemap.
// Use for excluding non-HTML files to increase performance and save bandwidth.
$blacklist = array(
"https://www.may-online.com/de/",
"https://www.may-online.com/de/*",
"https://www.may-online.com/private/",
"https://www.may-online.com/private/*",
"*.jpg",
"*.png",
);
// Enable this if your site requires GET arguments to function
$ignore_arguments = false;
// Not yet implemented. See issue #19 for more information.
$index_img = false;
//Index PDFs
$index_pdf = true;
// Set the user agent for crawler
$crawler_user_agent = "Mozilla/5.0 (compatible; Sitemap Generator Crawler; +https://github.com/knyzorg/Sitemap-Generator-Crawler)";
// Header of the sitemap.xml
$xmlheader ='<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">';
// Optionally configure debug options
$debug = array(
"add" => true,
"reject" => true,
"warn" => true
);
//Modify only if configuration version is broken
$version_config = 2;
Here's what's generated:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.may-online.com/de</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/sonnenschirme</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/impressum</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/agb</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/datenschutz</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/sonnenschirme/restaurant-cafe</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/sonnenschirme/ampelschirme</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/sonnenschirme/ampelschirme/mezzo</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://www.may-online.com/de/unternehmen/referenzen</loc>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
</urlset>
Found the issue. The blacklist seems to be ignored when a URL is reached via a redirect. Oh, the pleasures of parsing the web!
By the way, to format code blocks, it's 3 backticks. The initial $site is trusted and is never checked against the blacklist.
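For what it's worth, here is a rough sketch (not the project's actual code) of what a redirect-aware check could look like: follow the redirect with curl, then also test the effective URL, i.e. where the chain actually landed, against the blacklist.

```php
<?php
// Hypothetical sketch: re-check the blacklist against the URL that
// curl actually ended up on after following redirects.
$blacklist = ["https://www.may-online.com/de/*", "*/de/*"];

$ch = curl_init("https://www.may-online.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow 3xx responses
$html = curl_exec($ch);

// Where did the redirect chain actually land? (e.g. .../de)
$final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

$rejected = false;
foreach ($blacklist as $pattern) {
    if (fnmatch($pattern, $final_url)) {   // same wildcard match as above
        $rejected = true;
        break;
    }
}
echo $rejected
    ? "Rejected (redirect target is blacklisted): $final_url\n"
    : "Accepted: $final_url\n";
```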
For some reason, it's refusing to go to the /en site. The reformatter chokes on it, probably something related to the redirection.
FYI, redirecting from the root is bad practice.
I was wrong. It is not related to the redirect. Your link looks like this: <a href=" https://www.may-online.com/en">en</a>. That is not okay: the space before the https:// makes the URL invalid. Web browsers are smart enough to remove it; my script is not.
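The real fix is to remove the stray space from the markup, but a crawler could also be defensive about it. A minimal sketch (again, not the project's code) of trimming whitespace from extracted href values before they are resolved:

```php
<?php
// Hypothetical sketch: tolerate sloppy markup by trimming whitespace
// from href attributes before the URL is resolved and queued.
$html = '<a href=" https://www.may-online.com/en">en</a>';

$doc = new DOMDocument();
@$doc->loadHTML($html);                        // suppress warnings on loose HTML

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = trim($a->getAttribute('href'));    // " https://..." -> "https://..."
    echo $href . "\n";                          // https://www.may-online.com/en
}
```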