Primary language: R. License: Apache License 2.0 (Apache-2.0).

Disclaimer:

This repository is a detached fork of the following project: https://github.com/Mu-Sigma/RImpala

Because the original project is inactive and my comments and fixes were ignored, I decided to continue development in a separate GitHub project. Why not a regular fork? Forks apparently do not appear in searches, so anyone having problems with outdated versions would find it harder to locate this one.

#RImpala

RImpala is an R package that helps you connect to Cloudera Impala and execute distributed queries. Impala supports JDBC integration, and RImpala uses this feature to establish a connection between R and Impala.

##Installing RImpala

To use this package you must also have access to a Hadoop cluster running Cloudera Impala with at least one populated table defined in the Hive Metastore.

###Install JDBC jars for RImpala

  • Download the Impala JDBC zip file to the client machine that you will use to connect to Impala servers.
  • Extract the contents of the zip file to a location of your choosing. For example:
    • On Linux, you might extract this to a location such as /opt/jars/.
    • On Windows, you might extract this to a folder such as C:\Program Files\impala-jars.
  • We will use this location in rimpala.init()
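
Before pointing rimpala.init() at the extraction folder, it can help to confirm that the folder really contains jar files. The following is a minimal base-R sketch; `find_jdbc_jars` is a hypothetical helper name and is not part of RImpala.

```r
# Hypothetical helper (not part of RImpala): list the jar files in a folder
# and fail early with a readable message if the folder is missing or empty.
find_jdbc_jars <- function(jar_dir) {
  if (!dir.exists(jar_dir)) {
    stop("JDBC jar folder does not exist: ", jar_dir)
  }
  jars <- list.files(jar_dir, pattern = "\\.jar$",
                     full.names = TRUE, ignore.case = TRUE)
  if (length(jars) == 0) {
    stop("No .jar files found in: ", jar_dir)
  }
  jars
}
```

For example, `find_jdbc_jars("/opt/jars/")` would return the full paths of the extracted jars on a Linux client, or stop with an error if nothing was extracted there.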

###Install RImpala

  1. Compressed package: R CMD INSTALL RImpala_0.1.6_nullfixed.tar.gz

  2. Source code: R CMD INSTALL ./RImpala

##Loading RImpala and connecting to Impala

  1. Find the IP of the machine and the port where the Impala service is running.

  2. Find the location where you unzipped the JDBC jars in the section above.

  3. Launch R

  4. Run the following in R:

        library("RImpala")
        rimpala.init(libs="/path/to/JDBC/jars/")
        result = rimpala.query("your query")

By default, rimpala.init() searches "/usr/lib/impala" for the JDBC jars.
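
The steps above mention the IP and port of the Impala service, but the snippet does not show where they are used. The sketch below wraps a whole session in one function. It assumes this fork still exposes rimpala.connect(IP, port) and rimpala.close() as the upstream RImpala package does; `run_impala_query` is a hypothetical wrapper name, and the default host and port (localhost, 21050) are assumptions to replace with your own values.

```r
# Sketch only: rimpala.connect() and rimpala.close() are assumed to behave as
# in the upstream RImpala package; check help(package = "RImpala") first.
run_impala_query <- function(query,
                             jar_dir = "/usr/lib/impala",
                             ip = "localhost",
                             port = "21050") {
  library("RImpala")
  rimpala.init(libs = jar_dir)           # load the JDBC jars
  rimpala.connect(IP = ip, port = port)  # assumed signature, see note above
  on.exit(rimpala.close(), add = TRUE)   # close the connection on exit
  rimpala.query(query)                   # return the query result
}
```

A call would then look like `result <- run_impala_query("your query", jar_dir = "/path/to/JDBC/jars/")`. Registering the close with on.exit() means the connection is released even if the query fails.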

Here are links to more information on Cloudera Impala:

##Requirements