Disclaimer:
This repository is a detached fork of the following project: https://github.com/Mu-Sigma/RImpala
Because the upstream project is inactive and my comments and fixes were ignored, I decided to continue development in a separate GitHub project. Why not a regular fork? Because forks apparently do not appear in searches, which would make this version harder to find for anyone having problems with outdated versions.
#RImpala
RImpala is an R package that helps you connect to Cloudera Impala and execute distributed queries. Impala supports JDBC integration, and RImpala uses this feature to establish a connection between R and Impala.
##Installing RImpala
To use this package you must also have access to a Hadoop cluster running Cloudera Impala with at least one populated table defined in the Hive Metastore.
###Install JDBC jars for RImpala
- Download the Impala JDBC zip file to the client machine that you will use to connect to Impala servers.
- Extract the contents of the zip file to a location of your choosing.
For example:
- On Linux, you might extract this to a location such as /opt/jars/.
- On Windows, you might extract this to a folder such as C:\Program Files\impala-jars.
- We will use this location in rimpala.init().
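Before passing the extraction folder to rimpala.init(), it can help to confirm that it actually contains jar files. The following is a minimal sketch in base R; `check_jdbc_jars` is a hypothetical helper and not part of RImpala, and it assumes the jars sit directly in the folder rather than in a subfolder.

```r
# Hypothetical helper (not part of RImpala): return the .jar files found
# directly inside jar_dir, or stop with a clear message if there are none.
check_jdbc_jars <- function(jar_dir) {
  # file.info() returns NA for a missing path, so isTRUE() covers both the
  # "does not exist" and the "is not a directory" cases.
  if (!isTRUE(file.info(jar_dir)$isdir)) {
    stop("Not a directory: ", jar_dir)
  }
  jars <- list.files(jar_dir, pattern = "\\.jar$", full.names = TRUE)
  if (length(jars) == 0) {
    stop("No .jar files found in: ", jar_dir)
  }
  jars
}

# Example with the Linux location mentioned above; adjust to your own path.
# jars <- check_jdbc_jars("/opt/jars/")
```

If the zip file unpacks into a nested folder, point the helper (and later rimpala.init()) at the folder that directly holds the jars.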
###Install RImpala
- Compressed package:

        R CMD INSTALL RImpala_0.1.6_nullfixed.tar.gz

- Source code:

        R CMD INSTALL ./RImpala

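The same installation can probably be done from inside an R session using base R's install.packages() with `repos = NULL`. This is a sketch, not the project's documented procedure; `install_rimpala` is a hypothetical wrapper, and the file name is the one used in the command above.

```r
# Hypothetical wrapper (not part of RImpala) around install.packages().
# pkg_path may be the .tar.gz archive or the source directory.
install_rimpala <- function(pkg_path) {
  if (!file.exists(pkg_path)) {
    stop("Package source not found: ", pkg_path)
  }
  # repos = NULL tells R to install from the local path instead of a
  # repository; type = "source" builds the package from source.
  install.packages(pkg_path, repos = NULL, type = "source")
}

# Usage (uncomment one, run from the folder holding the package):
# install_rimpala("RImpala_0.1.6_nullfixed.tar.gz")
# install_rimpala("./RImpala")
```

rJava must already be installed for either route to succeed, since it is listed under Requirements.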
##Loading RImpala and connecting to Impala
- Find the IP address of the machine and the port where the Impala service is running.
- Find the location where you unzipped the JDBC jars in the section above.
- Launch R.
- Load the package, initialize it with the jar location, and run a query:

        library("RImpala")
        rimpala.init(libs="/path/to/JDBC/jars/")
        result = rimpala.query("your query")

By default, rimpala.init() searches "/usr/lib/impala" for the JDBC jars.
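The steps above ask for the IP and port of the Impala service, but the snippet does not show where they are used. A fuller session might look like the sketch below. It assumes that the package provides rimpala.connect(IP, port) and rimpala.close(), as described in the upstream RImpala documentation; check help(package = "RImpala") for the exact signatures in this version. `run_impala_query` is a hypothetical wrapper, and "21050" is Impala's usual JDBC port, which may differ on your cluster.

```r
# Sketch only: rimpala.connect() and rimpala.close() are ASSUMED from the
# upstream RImpala documentation; verify them before relying on this.
# run_impala_query is a hypothetical wrapper, not part of the package.
run_impala_query <- function(query, ip, port = "21050",
                             libs = "/usr/lib/impala") {
  library("RImpala")
  rimpala.init(libs = libs)              # load the JDBC jars from 'libs'
  rimpala.connect(IP = ip, port = port)  # assumed signature
  on.exit(rimpala.close(), add = TRUE)   # assumed; closes the connection on exit
  rimpala.query(query)                   # return the query result to the caller
}

# Usage (needs a reachable Impala service; values are placeholders):
# result <- run_impala_query("your query", ip = "localhost",
#                            libs = "/path/to/JDBC/jars/")
```

Wrapping the calls in a function means the connection is closed even when the query fails.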
Here are links to more information on Cloudera Impala:
##Requirements
- Java (>= 1.5)
- R (>= 2.7.0)
- rJava (>= 0.5-0)
- Impala JDBC driver jars
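Part of this list can be checked from an R session. The sketch below uses only base R; `check_requirements` is a hypothetical helper, and it only tests whether a `java` executable is on the PATH, not which Java version it is.

```r
# Hypothetical helper (not part of RImpala): report which of the
# requirements above appear to be met on this machine.
check_requirements <- function() {
  r_ok <- getRversion() >= "2.7.0"
  # system.file() returns "" when the package is not installed.
  rjava_installed <- nzchar(system.file(package = "rJava"))
  rjava_ok <- rjava_installed && packageVersion("rJava") >= "0.5-0"
  # Sys.which() returns "" when no 'java' executable is found on the PATH.
  java_found <- nzchar(Sys.which("java"))
  c(R = r_ok, rJava = rjava_ok, java_on_path = unname(java_found))
}

# Usage: prints a named logical vector, one entry per check.
# print(check_requirements())
```

The JDBC driver jars are not covered here because their location depends on where you extracted them.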