SEO Proxy proof of concept

This webservice is a proof of concept showing that static versions of the pages of a client-side JavaScript app can be generated on the server with a purely Java-based solution.

Because this is a proof of concept, it has not been tested in a production setting and is unlikely to be kept up to date.

This webservice was designed to do the following things:

  • Act as a proxy for the JS app that runs on a separate webserver (another Tomcat, an Apache, …)
  • Requests for HTML pages are handled by the HTMLUnitService, which uses HTMLUnit to generate a snapshot of the JS app and returns it as static HTML
  • These static HTML pages are cached in memory (15 minutes by default)
  • Requests for assets (css, js, images, …) will be passed on to the server hosting the JS app
  • Because the original JavaScript links are kept intact, clients with JavaScript enabled will run the JS app when the page is loaded. The JS app will likely need to be modified to handle the fact that the initial content is already rendered. In the dsember prototype this was done by simply throwing away the pre-rendered contents before starting the app.
  • Clients without JavaScript support will simply see the static page, which is virtually identical to the one the JS app would generate.
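
The split between HTML requests and asset requests described above could be sketched as follows. This is a minimal illustration and not the project's actual code: the class name RequestRouter, the Route enum, and the rule that an Accept header containing text/html marks an HTML request are all assumptions made for this example.

```java
import java.util.Locale;

/** Hypothetical helper: decides whether a request is rendered or passed through. */
final class RequestRouter {

    enum Route { RENDER_SNAPSHOT, PROXY_TO_APP }

    /**
     * Assumption: a request counts as an "HTML request" when its Accept
     * header lists text/html. The real servlet may use a different rule.
     */
    static Route route(String acceptHeader) {
        if (acceptHeader == null) {
            // No Accept header: treat it as an asset request and pass it on.
            return Route.PROXY_TO_APP;
        }
        String accept = acceptHeader.toLowerCase(Locale.ROOT);
        return accept.contains("text/html") ? Route.RENDER_SNAPSHOT : Route.PROXY_TO_APP;
    }

    public static void main(String[] args) {
        System.out.println(route("text/html,application/xhtml+xml")); // RENDER_SNAPSHOT
        System.out.println(route("text/css,*/*;q=0.1"));              // PROXY_TO_APP
    }
}
```

Keeping the decision in a small pure function like this makes it easy to swap in another rule, such as one based on the file extension of the request path.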

Crawlers and regular users are served the same pre-rendered page because Google Scholar considers serving the static page only to its crawlers, and not to regular users, a form of cloaking. However, this app can easily be adjusted to serve the static version only to crawlers, either by looking for _escaped_fragment_ in the query string or by inspecting the User-Agent of the client.
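
A crawler-only mode along those lines might look like the sketch below. It is not part of this repository: the class name CrawlerDetector and the short list of User-Agent markers are assumptions for illustration, and the list is deliberately incomplete.

```java
import java.util.Locale;

/** Hypothetical helper: guesses whether a request comes from a crawler. */
final class CrawlerDetector {

    // Illustrative, incomplete list of lower-case User-Agent fragments (an assumption).
    private static final String[] BOT_MARKERS = {"googlebot", "bingbot", "yandex", "baiduspider"};

    static boolean isCrawler(String queryString, String userAgent) {
        // Check 1: the _escaped_fragment_ parameter in the query string.
        if (queryString != null && queryString.contains("_escaped_fragment_")) {
            return true;
        }
        // Check 2: a known crawler marker in the User-Agent header.
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : BOT_MARKERS) {
            if (ua.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

A servlet could call isCrawler with the request's query string and User-Agent header, render the snapshot only when it returns true, and proxy the request to the JS app otherwise.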

Installation

  • Start from a working DSpace codebase.
  • Clone this repository to [dspace-src]/dspace/modules/seo-proxy
  • Add the following profile to your [dspace-src]/dspace/modules/pom.xml:
<profile>
    <id>dspace-seo-proxy</id>
    <activation>
        <file>
            <exists>seo-proxy/pom.xml</exists>
        </file>
    </activation>
    <modules>
        <module>seo-proxy</module>
    </modules>
</profile>
  • If you're using a different DSpace version than 5.4, adjust <version> in the <parent> section of [dspace-src]/dspace/modules/seo-proxy/pom.xml
  • Configure at least the targetUri parameter in [dspace-src]/dspace/modules/seo-proxy/web.xml to match your JS app host.
  • Rebuild DSpace
  • Deploy the seo-proxy webapp.

Configuration options

There are two provided implementations of org.dspace.seoproxy.AbstractHTMLUnitServlet. The difference between them is how they decide that the page has finished loading.

HTMLUnitServlet

org.dspace.seoproxy.HTMLUnitServlet simply waits a fixed amount of time after the page has loaded, to let AJAX operations finish, before it takes the snapshot.

Set the waitTimeInMs init-param in web.xml to control how long it waits. The default is 2000 ms. For example:

<servlet>
   <servlet-name>HTMLUnitServlet</servlet-name>
   <servlet-class>org.dspace.seoproxy.HTMLUnitServlet</servlet-class>
   <init-param>
       <param-name>targetUri</param-name>
       <param-value>http://localhost:4200</param-value>
   </init-param>
   <init-param>
       <param-name>waitTimeInMs</param-name>
       <param-value>1000</param-value>
   </init-param>
</servlet>
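
How a servlet might read this parameter and fall back to the 2000 ms default can be sketched as follows. The class InitParams and its method are hypothetical, and treating blank, malformed, or negative values as "use the default" is an assumption, not documented behaviour of HTMLUnitServlet.

```java
/** Hypothetical helper: parses the waitTimeInMs init-param with a fallback. */
final class InitParams {

    // Default taken from the text above.
    static final long DEFAULT_WAIT_TIME_MS = 2000L;

    /** rawValue is the init-param string, or null when the parameter is not set. */
    static long waitTimeInMs(String rawValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return DEFAULT_WAIT_TIME_MS;
        }
        try {
            long parsed = Long.parseLong(rawValue.trim());
            // Assumption: a negative wait makes no sense, so use the default instead.
            return parsed >= 0 ? parsed : DEFAULT_WAIT_TIME_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_WAIT_TIME_MS;
        }
    }
}
```

In a servlet, rawValue would typically come from getServletConfig().getInitParameter("waitTimeInMs").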

You can use this implementation if you're starting out with your own JS app, but it is recommended that you write your own implementation of AbstractHTMLUnitServlet that decides when the page is finished based on knowledge of the JS app.
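
One common way to replace a fixed wait with an app-aware one is to poll a readiness condition until it holds or a timeout expires. The sketch below shows only that polling loop. ReadinessPoller is a hypothetical class, and how the condition would be wired into a subclass of AbstractHTMLUnitServlet is not shown, because that class's API is not described here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: waits for an app-specific "page is ready" condition. */
final class ReadinessPoller {

    /**
     * Polls isReady until it returns true or timeoutMs has elapsed.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier isReady, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (isReady.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The isReady condition is where knowledge of the JS app goes, for example a check that a marker element the app adds once rendering is done is present in the page.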

DSEmberServlet

org.dspace.seoproxy.DSEmberServlet is somewhat optimized to work with the dsember DSpace UI prototype. It uses knowledge of how Ember works to decide when the page has finished loading.

It has no configuration options.

General

Both implementations accept the following options for the cache:

  • maxCacheSize: the maximum number of pages in the cache, default 10000
  • cacheExpireDurationInMinutes: the number of minutes before a cached page is automatically removed, default 15

e.g.:

<servlet>
   <servlet-name>HTMLUnitServlet</servlet-name>
   <servlet-class>org.dspace.seoproxy.DSEmberServlet</servlet-class>
   <init-param>
       <param-name>targetUri</param-name>
       <param-value>http://localhost:4200</param-value>
   </init-param>
   <init-param>
       <param-name>maxCacheSize</param-name>
       <param-value>1000</param-value>
   </init-param>
   <init-param>
       <param-name>cacheExpireDurationInMinutes</param-name>
       <param-value>60</param-value>
   </init-param>
</servlet>
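
To make the meaning of the two cache options concrete, here is a minimal in-memory cache with a size limit and a time-based expiry, written with the standard library only. It is a sketch and not the cache used by this project: the class SnapshotCache, its methods, the injected clock, and the choice to evict the oldest inserted page first are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical cache of rendered pages, bounded by size and by age. */
final class SnapshotCache {

    private static final class CachedPage {
        final String html;
        final long expiresAtMs;

        CachedPage(String html, long expiresAtMs) {
            this.html = html;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final long ttlMs;
    private final LongSupplier clockMs; // injected so that expiry can be tested
    private final Map<String, CachedPage> pages;

    SnapshotCache(final int maxCacheSize, long cacheExpireDurationInMinutes, LongSupplier clockMs) {
        this.ttlMs = cacheExpireDurationInMinutes * 60_000L;
        this.clockMs = clockMs;
        // Insertion-ordered map that drops its oldest entry once it grows past the limit.
        this.pages = new LinkedHashMap<String, CachedPage>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedPage> eldest) {
                return size() > maxCacheSize;
            }
        };
    }

    synchronized void put(String url, String html) {
        pages.put(url, new CachedPage(html, clockMs.getAsLong() + ttlMs));
    }

    /** Returns the cached HTML, or null if the page is absent or has expired. */
    synchronized String get(String url) {
        CachedPage page = pages.get(url);
        if (page == null) {
            return null;
        }
        if (clockMs.getAsLong() >= page.expiresAtMs) {
            pages.remove(url);
            return null;
        }
        return page.html;
    }
}
```

In production the clock would be System::currentTimeMillis. Passing it in as a parameter lets a test advance time without sleeping.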