- About package
- Installation
- Setup
- Render page
- Options
- Set up bot's user agents
- Cache lifetime (TTL)
- Set ignored routes
- Differentiate Phantomjs spider rendering from normal web browsing
- Supported redirects
- On/Off debug messages
- Response statuses
- Enable 404 page and correct responses (IR)
- Enable 404 page and correct responses (FR)
- Important notes
- How to install Phantomjs to server
- Testing
- Test with CURL
- Test with Google
- Original Spiderable documentation
This is a branch of the standard Meteor spiderable
package, with some code merged from the
ongoworks:spiderable
package. Primarily, it lengthens the timeout to 30 seconds and raises the
size limit to 10MB. All results are cached in a Mongo collection, by default for 3 hours (180 minutes).
This package ignores all SSL errors in favor of page fetching.
This package supports "real response-code" and "real headers": if your route returns a 301
response code with some headers, the package returns the same code and the same headers. This package also supports JavaScript redirects.
This package was tested with iron-router and flow-router, with and without the following packages:
This package has a built-in caching mechanism. By default it stores results for 3 hours; to change the storage period, set Spiderable.cacheLifetimeInMinutes to another value in minutes.
meteor add jazeee:spiderable-longer-timeout
import { Spiderable } from 'meteor/jazeee:spiderable-longer-timeout';
On both server and client, this tells Spiderable that everything is ready. Spiderable will wait for Meteor.isReadyForSpiderable to be true, which allows finer control over when content is ready to be published.
Router.onAfterAction( function () {
if (this.ready()) {
Meteor.isReadyForSpiderable = true;
}
});
An array of regular expressions matching the user agents of bots that we want to serve statically but that do not obey the _escaped_fragment_ protocol.
Optionally set or extend the Spiderable.userAgentRegExps list.
Spiderable.userAgentRegExps.push(/^vkShare/i);
Default Bots:
/facebookExternalHit/i
/linkedinBot/i
/twitterBot/i
/googleBot/i
/bingBot/i
/yandex/i
/google-structured-data-testing-tool/i
/yahoo/i
/MJ12Bot/i
/tweetmemeBot/i
/baiduSpider/i
/Mail\.RU_Bot/i
/ahrefsBot/i
/SiteLockSpider/
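As a rough illustration of how a list like this can be applied, the sketch below tests a User-Agent header against an array of regular expressions. The function name isBotUserAgent and the shortened sample array are hypothetical, and the package's internal matching logic may differ.

```javascript
// Hypothetical subset of the default list above, plus the custom vkShare pattern.
const userAgentRegExps = [
  /facebookExternalHit/i,
  /twitterBot/i,
  /googleBot/i,
  /Mail\.RU_Bot/i,
  /^vkShare/i,
];

// Returns true when any pattern matches the given User-Agent string.
// Assumption: the real package may combine this with other checks.
function isBotUserAgent(userAgent, regExps) {
  if (typeof userAgent !== 'string') return false;
  return regExps.some((re) => re.test(userAgent));
}

console.log(isBotUserAgent('Mozilla/5.0 (compatible; Googlebot/2.1)', userAgentRegExps));
console.log(isBotUserAgent('Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0', userAgentRegExps));
```

Note that a pattern anchored with ^, such as /^vkShare/i, only matches when the user agent string starts with that text.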
How long cached Spiderable results should be stored (in minutes). Note:
- Should be set before Meteor.startup
- Value should be {Number} in minutes
- To set a new cache lifetime you need to drop the createdAt_1 index
- Default value: 180 (3 hours)
Spiderable.cacheLifetimeInMinutes = 60; // 1 hour in minutes
If you want to change your cache lifetime, first drop the cache index. To do so, run in the Mongo console:
db.SpiderableCacheCollection.dropIndex('createdAt_1');
/* or */
db.SpiderableCacheCollection.dropIndexes();
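To make the units concrete: MongoDB TTL indexes are configured in seconds (expireAfterSeconds), while this option is given in minutes. The sketch below assumes, without confirmation from the package source, that the cache expires entries through a TTL index on createdAt. The helper names ttlSeconds and isCacheFresh are hypothetical.

```javascript
// Convert a lifetime in minutes into the seconds value a MongoDB TTL index expects.
function ttlSeconds(cacheLifetimeInMinutes) {
  if (!Number.isFinite(cacheLifetimeInMinutes) || cacheLifetimeInMinutes <= 0) {
    throw new TypeError('cacheLifetimeInMinutes must be a positive number');
  }
  return Math.round(cacheLifetimeInMinutes * 60);
}

// True while an entry created at `createdAt` is still within its lifetime.
// Assumption: freshness is measured from the createdAt timestamp only.
function isCacheFresh(createdAt, cacheLifetimeInMinutes, now = new Date()) {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs < ttlSeconds(cacheLifetimeInMinutes) * 1000;
}

console.log(ttlSeconds(180)); // 10800 seconds, the default 3 hours
```

An existing TTL index keeps its old expiry value, which is consistent with the requirement to drop the index before a new lifetime takes effect.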
Spiderable.ignoredRoutes is an array of strings: routes that Spiderable should ignore, that is, never pre-render for spiders (for example, file downloads). This is a server only parameter.
For more info see this thread.
Spiderable.ignoredRoutes.push('/cdn/storage/Files/');
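One way to picture such a check is a simple prefix test, sketched below. This is an assumption for illustration only: the package may match routes differently (for instance by substring), and isIgnoredRoute is a hypothetical helper, not part of the package API.

```javascript
// Sample list using the route from the example above.
const ignoredRoutes = ['/cdn/storage/Files/'];

// Assumption: a request is ignored when its path starts with any listed string.
function isIgnoredRoute(requestPath, routes) {
  return routes.some((route) => requestPath.startsWith(route));
}

console.log(isIgnoredRoute('/cdn/storage/Files/report.pdf', ignoredRoutes));
console.log(isIgnoredRoute('/blogs', ignoredRoutes));
```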
Spiderable.customQuery is an additional GET query that will be appended to the HTTP request.
This option may help to build different client logic for requests from PhantomJS and from normal users.
- If true, Spiderable will append ___isRunningPhantomJS___=true to the query
- If a String, Spiderable will append String=true to the query
Spiderable.customQuery = true;
// or
Spiderable.customQuery = '_fromPhantom_'
// Usage:
Router.onAfterAction( function () {
if (Meteor.isClient && _.has(this.params.query, '___isRunningPhantomJS___')) {
Session.set('___isRunningPhantomJS___', true);
}
});
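The two cases described for customQuery can be sketched with the standard URL API. This is an illustration of the resulting request URL, not the package's code; appendCustomQuery is a hypothetical name, and the exact way the package builds the query string is an assumption.

```javascript
// Build the URL PhantomJS would request, given a customQuery setting.
// Assumption: the marker is added as an ordinary query parameter with value "true".
function appendCustomQuery(url, customQuery) {
  const parsed = new URL(url);
  if (customQuery === true) {
    parsed.searchParams.set('___isRunningPhantomJS___', 'true');
  } else if (typeof customQuery === 'string' && customQuery.length > 0) {
    parsed.searchParams.set(customQuery, 'true');
  }
  // Any other value leaves the URL unchanged.
  return parsed.toString();
}

console.log(appendCustomQuery('http://localhost:3000/blogs?page=2', true));
console.log(appendCustomQuery('http://localhost:3000/blogs', '_fromPhantom_'));
```

When a custom string is used, the client-side check should look for that string instead of ___isRunningPhantomJS___.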
Show or hide the server's console messages. Set Spiderable.debug to true to show them.
- Default value: false
Spiderable.debug = true;
Memory allocation for PhantomJS (in bytes). Note:
- Should be set before Meteor.startup
- Value should be {Number} in bytes
- Default value: 10485760 (10MB)
Spiderable.bufferSize = 10 * 1024 * 1024; // 10MB in bytes
Request timeout length. Note:
- Should be set before Meteor.startup
- Value should be {Number} in milliseconds
- Default value: 30000 (30 seconds)
Spiderable.requestTimeout = 30 * 1000; // 30 seconds in milliseconds
You can send any response status from PhantomJS. This behavior is controlled via a special HTML/JADE comment:
- 201 - <!-- response:status-code=201 -->
- 401 - <!-- response:status-code=401 -->
- 403 - <!-- response:status-code=403 -->
- 500 - <!-- response:status-code=500 -->
This directive accepts any 3-digit value, so you may return any standard or custom response code.
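A minimal sketch of how such a directive could be read from rendered HTML is shown below. The regular expression, the extractStatusCode name and the fallback to 200 are assumptions made for illustration; the package's own parsing may differ.

```javascript
// Extract a status code from a "response:status-code" HTML comment.
// Returns defaultStatus when no directive is present.
function extractStatusCode(html, defaultStatus = 200) {
  // Optional whitespace is allowed on both sides, so both
  // "<!-- response:status-code=404 -->" and "<!--response:status-code=404-->" match.
  const match = /<!--\s*response:status-code=(\d{3})\s*-->/.exec(html);
  return match ? Number(match[1]) : defaultStatus;
}

console.log(extractStatusCode('<!-- response:status-code=404 --><h1>404</h1>'));
console.log(extractStatusCode('<h1>Hello</h1>'));
```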
- Create the template you want to return when a page is not found
- Set iron router's notFoundTemplate
- Include a comment <!-- response:status-code=404 --> in your template. This way, we can ensure spiderable sends a 404 status code in the response headers
- Enable iron router's dataNotFound plugin. See below or read more about iron-router plugins
Router.configure({
notFoundTemplate: '_404'
});
Router.plugin('dataNotFound', {
notFoundTemplate: Router.options.notFoundTemplate
});
template(name="_404")
// response:status-code=404
h1 404
h3 Oops, page not found
p Sorry, the page you requested does not exist or was deleted
<template name="_404">
<!--response:status-code=404-->
<h1>404</h1>
<h3>Oops, page not found</h3>
<p>Sorry, the page you requested does not exist or was deleted</p>
</template>
- Create the template you want to return when a page is not found
- Include a comment <!-- response:status-code=404 --> in your template. This way, we can ensure spiderable sends a 404 status code in the response headers
- Set flow router's notFound property. See below or read more about flow-router not found routes
// With layout
FlowRouter.notFound = {
action() {
BlazeLayout.render('_layout', {content: '_404'});
}
}
// Without layout
FlowRouter.notFound = {
action() {
BlazeLayout.render('_404');
}
}
template(name="_404")
// response:status-code=404
h1 404
h3 Oops, page not found
p Sorry, the page you requested does not exist or was deleted
<template name="_404">
<!--response:status-code=404-->
<h1>404</h1>
<h3>Oops, page not found</h3>
<p>Sorry, the page you requested does not exist or was deleted</p>
</template>
window.location.href = 'http://example.com/another/page';
window.location.replace('http://example.com/another/page');
Router.go('/another/page');
Router.current().redirect('/another/page');
Router.route('/one', function () {
this.redirect('/another/page');
});
Set Meteor.isReadyForSpiderable to true when your route is finished, in order to publish.
Meteor.isRouteComplete=true is deprecated, but it will work until at least 2015-12-31, after which I'll remove it...
See code for details
If you deploy your application with meteor bundle, you must install
phantomjs (http://phantomjs.org) somewhere in your $PATH. If you use Meteor Up, then meteor deploy can do this for you.
Spiderable.originalRequest is also set to the http request. See issue 1.
Test your site by appending a query to your URLs: URL?_escaped_fragment_=
as in http://your.site.com/path?_escaped_fragment_=
Run curl against your localhost, or against your host name if you are on production, like:
curl http://localhost:3000/?_escaped_fragment_=
curl http://localhost:3000/ -A googlebot
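When building test URLs, it can help to check that the _escaped_fragment_ parameter really ends up in the query string. The sketch below uses the standard URL API; hasEscapedFragment and the default base URL are hypothetical and are not part of the package.

```javascript
// True when the URL carries the _escaped_fragment_ query parameter,
// even with an empty value (as in "?_escaped_fragment_=").
function hasEscapedFragment(requestUrl, base = 'http://localhost:3000') {
  // The base is only used when requestUrl is relative, e.g. "/blogs?...".
  const parsed = new URL(requestUrl, base);
  return parsed.searchParams.has('_escaped_fragment_');
}

console.log(hasEscapedFragment('/?_escaped_fragment_='));
console.log(hasEscapedFragment('/blogs?test=1'));
```

A URL without the ? separator before _escaped_fragment_= is treated as part of the path, so it will not be detected as a spider request by a check like this.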
Use the Fetch as Google tools to scan your site. Tips:
- Observe your server logs using tail -f or mup logs -f
- Run Fetch as Google and observe that it takes 3-5 minutes before displaying results.
- Use an uncommon URL to help you identify your request in the logs. Consider adding an extra URL query parameter. For example:
# Simple test with test=1 query
curl "http://localhost:3002/blogs?_escaped_fragment_=&test=1"
# Set the date in the query, which will show up in Meteor logs, with a unique date. (Turn on `Spiderable.debug=true`)
TEST=`date "+%Y%m%d-%H%M%S"`; echo $TEST; curl "http://localhost:3000/blogs?_escaped_fragment_=&test=${TEST}"
Interpreting Fetch as Google results:
- The tool will not actually hit your server right away.
- It appears to provide a simple scan result without the extra ?_escaped_fragment_= component.
- Wait several minutes more. Google appears to request the page, which will show up in your logs as Spiderable successfully completed.
- Search on Google using site:your.site.com
- Make sure Google lists all relevant pages.
- Look at Google's cached version of the pages, to make sure it is fully rendered.
- Make sure that Google sees the pages with all data subscriptions complete.
PhantomJS can be temperamental, and can be a challenge to work with.
If PhantomJS is failing on your server, you can try running it directly to help debug what is broken.
On the server console, try running phantomjs --version
Also, you can run this package's PhantomJS script. In order to do so, you'd need to find the phantom_script.js file.
# Find phantom_script.js
PHANTOM_SCRIPT=$(find /opt/YOUR_WEB_APP/app/ -name phantom_script.js)
# Verify that you found just one
echo ${PHANTOM_SCRIPT}
# Try running phantomjs with that script
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false ${PHANTOM_SCRIPT} http://localhost
# Verify that it succeeded (should return 0)
echo $?
spiderable is part of Webapp. It's
one possible way to allow web search engines to index a Meteor
application. It uses the AJAX Crawling specification
published by Google to serve HTML to compatible spiders (Google, Bing,
Yandex, and more).
When a spider requests an HTML snapshot of a page the Meteor server runs the client half of the application inside phantomjs, a headless browser, and returns the full HTML generated by the client code.
In order to have links between multiple pages on a site visible to spiders, apps
must use real links (e.g. <a href="/about">) rather than simply re-rendering
portions of the page when an element is clicked. Apps should render their
content based on the URL of the page and can use HTML5 pushState
to alter the URL on the client without triggering a page reload. See the Todos
example for a demonstration.
When running your page, spiderable will wait for all publications
to be ready. Make sure that all of your publish functions
either return a cursor (or an array of cursors), or eventually call
this.ready(). Otherwise, the phantomjs executions will fail.