big-data-europe/docker-hive

LOAD DATA unable to locate files


Hello,

I am trying to load data from files into Hive using this docker-compose environment, but it is unable to locate the files. I am PUTting the data files through the REST API of the name node (port 50070, if I remember correctly) without problems: I can see the files through the file browser and by running hdfs commands inside the name node container. But the hive-server container doesn't seem to see the same directory tree as the name node, so the Hive LOAD DATA instruction with the uploaded path fails. Inside the hive-server container, hdfs obviously doesn't have that file either. It is only able to load files that I create locally in the hive-server container (using a volume, because I didn't find an editor :p).
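For reference, this is roughly how I am uploading the files (the standard WebHDFS two-step PUT; host names and paths here are just examples from my setup):

# Step 1: ask the name node to create the file; it answers with a 307 redirect
# whose Location header points at a data node (no file data is sent here).
curl -i -X PUT "http://localhost:50070/webhdfs/v1/user/root/kv1.txt?op=CREATE&user.name=root"
# Step 2: PUT the file body to the Location returned by the redirect.
curl -i -X PUT -T kv1.txt "<Location header from step 1>"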

I haven't changed anything in the compose file, apart from using version 3, creating a common network, and defining dependencies instead of links.

Would you be able to point me in the right direction? How can I change the Hive configuration to make sure I am writing to the same name node?

I am new to Hadoop and Hive, so I apologize for any architectural misconceptions.

Thanks,
Yeray

I'm not sure I understand correctly what you are trying to do, but if you use LOAD DATA with a path from another container, then it is expected to fail, because the hive container and the other container do not share a filesystem.

Hi @gmouchakis, that is right. What I am trying to do is push data into HDFS through the REST API and then use it from Hive, pretty much as I do in production environments with my company's cluster.

I also discovered that the /user/hive/warehouse path created by Hive does not appear on the HDFS filesystem defined by this docker-compose environment, which means the Hive container is somehow pointing to a different HDFS, but I still can't find the root of the problem.

If I am right, Hive uses the name node to find out which data node holds the data in the HDFS filesystem, but that doesn't seem to happen in this containerized environment. Everything should happen over HTTP interactions between the docker containers, shouldn't it?

Hi @enanablancaynumeros! It was a stupid mistake on my side: Hadoop inside the Hive container was not set up to work with the remote Hadoop. It's fixed now.

I've also updated Hadoop to 2.8. If you still have this problem, feel free to reopen the issue.
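If you want to double-check that both containers now point at the same HDFS, one way (assuming the hdfs client is on the PATH in both containers and the container names match the compose file) is to compare fs.defaultFS:

docker exec -it namenode hdfs getconf -confKey fs.defaultFS
docker exec -it hive-server hdfs getconf -confKey fs.defaultFS
# both should print the same URI, something like hdfs://namenode:8020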

I've broken the JDBC connector now, give me a moment.

First, start up the system as described in README.md.

Connect to hive:

➜  docker-hive git:(master) ✗ docker exec -it hive-server bash
root@d16fce776021:/opt# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
17/04/24 14:52:53 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.1.1 by Apache Hive
0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH './../examples/files/kv1.txt' OVERWRITE INTO TABLE pks;
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:74 Table not found 'pks' (state=42S02,code=10001)
0: jdbc:hive2://localhost:10000> CREATE TABLE pks (foo INT, bar STRING);
No rows affected (0.171 seconds)
0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH './../examples/files/kv1.txt' OVERWRITE INTO TABLE pks;
No rows affected (0.291 seconds)
0: jdbc:hive2://localhost:10000> SELECT * FROM pks;
+----------+----------+--+
| pks.foo  | pks.bar  |
+----------+----------+--+
| 238      | val_238  |
| 86       | val_86   |
| 311      | val_311  |
...

Connect to namenode and check that Hive is writing to the right HDFS:

➜  docker-hive git:(master) ✗ docker exec -it namenode bash
root@d15fdc2345a6:/# hadoop fs -ls /tmp/hive/root
Found 2 items
drwx------   - root supergroup          0 2017-04-24 14:46 /tmp/hive/root/50d7fa4d-e731-4a2b-8ed2-bd9b2c7dbfd7
drwx------   - root supergroup          0 2017-04-24 14:35 /tmp/hive/root/69eb874b-f76d-4446-acb9-babb80211e17

Now if you do the same thing from the hive-server container, you will see the same output:

➜  docker-hive git:(master) ✗ docker exec -it hive-server bash
root@d16fce776021:/opt# hadoop fs -ls /tmp/hive/root
Found 2 items
drwx------   - root supergroup          0 2017-04-24 14:46 /tmp/hive/root/50d7fa4d-e731-4a2b-8ed2-bd9b2c7dbfd7
drwx------   - root supergroup          0 2017-04-24 14:35 /tmp/hive/root/69eb874b-f76d-4446-acb9-babb80211e17
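With both containers on the same HDFS, the original workflow (upload the file to HDFS first, then load it from an HDFS path) should also work. A rough sketch, with a made-up path:

# put the file into HDFS (from the namenode container, or via WebHDFS as described above)
hadoop fs -put /tmp/kv1.txt /user/root/kv1.txt
-- then in beeline: LOAD DATA without LOCAL takes an HDFS path; note that it
-- moves the file into the table's warehouse directory rather than copying it
LOAD DATA INPATH '/user/root/kv1.txt' OVERWRITE INTO TABLE pks;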

Hi @earthquakesan, thanks for the answer! I got it working if I only use the new entrypoint script, but it doesn't work in other use cases you haven't described in your previous steps when I try to use the new Hadoop and JDBC versions.

For some reason I haven't had time to identify, it is possible to run beeline inside the docker container, and it now points to the right HDFS, but if you try to hit port 10000 from outside or from other docker containers the connection is refused, which affects plain curl calls and external JDBC drivers.
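In case it helps to reproduce it, this is roughly how I am checking from the host (assuming the service is still called hive-server and the compose file publishes port 10000):

# show whether 10000/tcp is actually published to the host
docker port hive-server
# try the JDBC endpoint from the host (with a local beeline install)
beeline -u jdbc:hive2://localhost:10000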

Let me know if you cannot reproduce the problem. I ran docker-compose build --no-cache after your changes a couple of times and still got that error.

@enanablancaynumeros I need a description of how you run the application, because I have no problems running the example app from here: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC

What I did to run it:

  1. Copy Hadoop and Hive from the docker container to my dev environment:
docker cp hive-server:/opt ./
➜  hive-example-app ls opt 
hadoop-2.8.0  hive
  2. Copy/paste the code from the example into HiveJdbcClient.java and compile it:
javac HiveJdbcClient.java
  3. Download hadoop-core from Maven Central and copy it to opt/hadoop-2.8.0/ (see the download command after these steps):
# Copy it to opt/hadoop-2.8.0/
➜  hive-example-app ls opt/hadoop-2.8.0/hadoop-core-1.2.1.jar 
opt/hadoop-2.8.0/hadoop-core-1.2.1.jar
  4. Create a run.sh script (my current working directory is /home/ivan/Workspace/Apps/hive-example-app/):
#!/bin/bash
HADOOP_HOME=/home/ivan/Workspace/Apps/hive-example-app/opt/hadoop-2.8.0
HIVE_HOME=/home/ivan/Workspace/Apps/hive-example-app/opt/hive
 
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
 
HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-core*.jar)
CLASSPATH=.:$HIVE_HOME/conf:$(./opt/hadoop-2.8.0/bin/hadoop classpath)
 
for i in ${HIVE_HOME}/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
 
java -cp $CLASSPATH HiveJdbcClient
  5. Run run.sh:
chmod +x run.sh
./run.sh
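The download in step 3 can be done like this, assuming the standard Maven Central path for org.apache.hadoop:hadoop-core:1.2.1:

# fetch the jar and place it inside the copied Hadoop directory
wget -P opt/hadoop-2.8.0/ https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar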

It will fail to read a.txt, because the file is created on your host and not on the remote host (LOAD DATA LOCAL INPATH reads from the filesystem of the machine running HiveServer2). Therefore you first have to copy the file there:

docker cp /tmp/a.txt hive-server:/tmp/

Otherwise, I cannot see any problems with this sample application. Here is the output:

➜  hive-example-app ./run.sh 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ivan/Workspace/Apps/hive-example-app/opt/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ivan/Workspace/Apps/hive-example-app/opt/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/04/26 10:55:53 INFO jdbc.Utils: Supplied authorities: localhost:10000
17/04/26 10:55:53 INFO jdbc.Utils: Resolved authority: localhost:10000
Running: show tables 'testHiveDriverTable'
testhivedrivertable
Running: describe testHiveDriverTable
key	int
value	string
Running: load data local inpath '/tmp/a.txt' into table testHiveDriverTable
Running: select * from testHiveDriverTable
1	foo
2	bar
Running: select count(1) from testHiveDriverTable
2