f0cker/crackq

Unable to start Crackq with Ubuntu 20.04 and Nvidia Drivers

HardChalice opened this issue · 22 comments

Prerequisites

Enable debugging:
sudo docker exec -it crackq /bin/sed -i 's/INFO/DEBUG/g' /opt/crackq/build/crackq/log_config.ini
Unable to do this, as the crackq container fails to start.

Before reaching this point I also had to tweak the Nvidia + Ubuntu Dockerfile, because python3.7 and python3.7-dev are not available on Ubuntu 20.04 and running ./install.sh docker/nvidia/ubuntu threw an error. Changing these to python3.8 and python3.8-dev fixed the issue.

I needed to uncomment ENV DEBIAN_FRONTEND noninteractive, otherwise install.sh would hang on setting up a timezone for tzdata.

I also needed to change FROM nvidia/cuda:runtime-ubuntu20.04 to specify a version number. Using the most recent, I went with FROM nvidia/cuda:11.6.0-runtime-ubuntu20.04.

# Update & install packages for installing hashcat
RUN apt-get update && \
    apt-get install -y wget p7zip gcc g++ make build-essential git libcurl4-openssl-dev libssl-dev zlib1g-dev python3.8 \
    python3.8-dev python3-pip libldap2-dev libsasl2-dev libssl-dev xmlsec1 libxmlsec1-openssl

Describe the bug

When I run the sudo docker-compose -f docker-compose.nvidia.yml up --build command, Crackq is unable to start. After a long traceback through Python imports, the error thrown is the following:
ImportError: cannot import name 'soft_unicode' from 'markupsafe' (/usr/local/lib/python3.8/dist-packages/markupsafe/__init__.py)

See picture below for the entire traceback.

To Reproduce

Steps to reproduce the behavior:

1. Pull the Crackq Repo

2. Following the readme, install the latest Docker and Docker-compose

# Install Docker
sudo apt-get update
sudo apt-get install \
  ca-certificates \
  curl \
  gnupg \
  lsb-release

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

# Install Docker-Compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

#Verify Install
docker-compose --version

3. Download the latest Nvidia Server Drivers.

This may vary depending on your GPUs. I'm using 7 ZOTAC 1080 Ti cards and installed the recommended drivers shown by running the following.

# Check your GPU/Drivers
ubuntu-drivers devices

# If ubuntu-drivers isn't installed, install the following and run the previous command again:
sudo apt install ubuntu-drivers-common

# Based off the output, install the recommended nvidia drivers
# Using a server version prevents the GUI from being installed
sudo apt install nvidia-<driver>-server 

# Reboot the system
sudo reboot

# Check that the driver changed and the system recognizes the GPUs
nvidia-smi

# If the previous command fails (in my case because nvidia-persistenced was not running), enable persistence mode as root:
sudo -i 
nvidia-smi -pm 1
exit

# Confirm by running the previous command again and receiving the expected output
nvidia-smi

4. Install Nvidia Docker

This part took some trial and error as the Crackq readme says the following:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
sudo apt-get install nvidia-container-runtime

However, the Nvidia-Docker docs recommend installing nvidia-docker2, which caused some problems for me. Instead I followed the Crackq readme and, after installing nvidia-container-runtime, did the following per Nvidia-Container-Runtime.

# Create SystemD drop-in file 
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker

# Verify GPU's can be recognized by a container
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

5. Run Install.sh

sudo ./install.sh docker/nvidia/ubuntu

6. Configuration

I ran through the configuration portion and did everything documented in the Crackq Configuration:

# Gen Secret Key
python3 -c 'import secrets; print(secrets.token_urlsafe())'

# Copy to crackq.conf
[app]
SECRET_KEY: secret_key_generated_above

# Didn't add any additional wordlists, just used the rockyou.txt that comes with install
# Moved the config file to proper location
sudo mv crackq.conf /var/crackq/files/
sudo chown crackq:crackq /var/crackq/files/crackq.conf
sudo chmod 640 /var/crackq/files/crackq.conf
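
The key generation and copy steps above can be combined into one small script. This is only a sketch: render_app_section is a hypothetical helper, and it assumes crackq.conf is read with Python's configparser, which this thread does not confirm.

```python
import configparser
import secrets


def render_app_section(secret_key=None):
    """Return an [app] section holding SECRET_KEY (hypothetical helper)."""
    # secrets.token_urlsafe() is the same call used in the one-liner above.
    key = secret_key or secrets.token_urlsafe()
    return "[app]\nSECRET_KEY: {}\n".format(key)


section = render_app_section()

# Sanity check: the text parses as INI before it is pasted into crackq.conf.
parser = configparser.ConfigParser()
parser.read_string(section)
print(section, end="")
```

The printed section would still need to be merged into the existing crackq.conf by hand, since the file carries other settings.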

I skipped any type of authentication setup, modified the custom nginx config, and added my own certs to the proper directory:

sudo cp ./cfg/crackq_nginx.conf /var/crackq/files/nginx/
sudo cp cert.pem /var/crackq/files/nginx/conf.d/certificate.pem
sudo cp priv.pem /var/crackq/files/nginx/conf.d/private.pem

Expected behavior

After all this I should be able to run the application with either:
sudo docker-compose -f docker-compose.nvidia.yml up --build

Or
sudo docker-compose -f docker-compose.nvidia.yml up -d

Debug output

This is the output I receive from the docker logs as the containers were starting:

[Image: screenshot of the docker logs traceback]

Additional context

To add, I was able to temporarily work around this by modifying the Nvidia/Ubuntu/Dockerfile again and including this command, per this issue forum:
RUN pip3 install markupsafe==2.0.1

However, this then led to an issue where Pypal also failed to import. Unfortunately I don't have the output from the docker logs, but they showed a traceback similar to the one above, reporting that it was unable to locate pypal.
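
A likely reason the pin helps: soft_unicode was removed in MarkupSafe 2.1.0 (it was renamed soft_str), so code that still imports it only works with releases before 2.1. The check below is an illustrative sketch; has_soft_unicode is a hypothetical helper, not part of MarkupSafe or Crackq.

```python
def has_soft_unicode(version):
    """True if a MarkupSafe version string predates 2.1, where soft_unicode was removed."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (2, 1)


for candidate in ("2.0.1", "2.1.0"):
    print(candidate, "still has soft_unicode:", has_soft_unicode(candidate))
```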

If I missed anything please let me know.

Moving past the above, I'm now running into a new issue: jobs fail when they are submitted. The original error is the following:

crackq    | ERROR    app.py:1891 log_exception 2022-03-17 17:17:43,195 Exception on /api/queuing/all [GET]
crackq    | Traceback (most recent call last):
...
crackq    |   File "/usr/local/lib/python3.8/dist-packages/rq/job.py", line 517, in restore
crackq    |     self.meta = self.serializer.loads(obj.get('meta')) if obj.get('meta') else {}
crackq    | _pickle.UnpicklingError: invalid load key, '{'.
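
The "invalid load key, '{'" message is what pickle reports when it is handed bytes that start with {, for example JSON text. One possible explanation (an assumption, not confirmed in this thread) is that the job's meta field in Redis was written by a different serializer or rq version than the one reading it. The snippet below only reproduces the error message with the standard library; the sample payload is made up.

```python
import json
import pickle

# Made-up stand-in for a JSON-encoded "meta" value stored in Redis.
stored_meta = json.dumps({"progress": 10}).encode()


def try_unpickle(raw):
    """Return (value, None) on success or (None, error text) on failure."""
    try:
        return pickle.loads(raw), None
    except pickle.UnpicklingError as exc:
        return None, str(exc)


value, error = try_unpickle(stored_meta)
print("pickle says:", error)

# The same bytes decode without trouble as JSON.
print("json says:", json.loads(stored_meta))
```

If that reading is right, clearing stale job keys from Redis or pinning matching rq versions on both sides would be the things to try, but neither is verified here.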

Updated requirements.txt

configparser==5.0.1
Flask==1.1.4
redis==3.5.3
rq==1.10.1
marshmallow==3.9.1
pytest==6.1.2
pytest-cov==2.10.1
flake8==3.8.4
python-ldap==3.3.1
Flask-Sessionstore==0.4.5
SQLAlchemy==1.3.24
Flask-SQLAlchemy==2.5.1
SQLAlchemy-Utils==0.38.2
flask-talisman==0.7.0
pysaml2==6.5.1
flask-Login==0.5.0
Flask-Cors==3.0.9
Flask-SeaSurf==0.2.2
Flask-Migrate==3.0.1
bcrypt==3.2.0
Flask-Bcrypt==0.7.1
pathvalidate==2.3.1
markupsafe==2.0.1

After updating rq to version 1.10.1 the error above no longer appears, but I am receiving a new error:

crackq    | DEBUG    crackqueue.py:170 error_parser 2022-03-21 15:43:43,875 Parsing error message: Traceback (most recent call last):
crackq    |   File "/usr/local/lib/python3.8/dist-packages/rq/worker.py", line 1061, in perform_job
crackq    |     rv = job.perform()
crackq    |   File "/usr/local/lib/python3.8/dist-packages/rq/job.py", line 821, in perform
crackq    |     self._result = self._execute()
crackq    |   File "/usr/local/lib/python3.8/dist-packages/rq/job.py", line 844, in _execute
crackq    |     result = self.func(*self.args, **self.kwargs)
crackq    |   File "/opt/crackq/build/crackq/run_hashcat.py", line 688, in hc_worker
crackq    |     hcat = runner(hash_file=hash_file, mask=mask,
crackq    |   File "/opt/crackq/build/crackq/run_hashcat.py", line 192, in runner
crackq    |     raise ValueError('Aborted, speed check failed: {}'.format(err_msg))
crackq    | ValueError: Aborted, speed check failed: Work-horse was terminated unexpectedly (waitpid returned 139)
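
The number 139 is consistent with a segmentation fault in the work-horse process. Read as a raw os.waitpid status it decodes to signal 11 with the core-dump flag set, and read as a shell-style exit code it is 128 + 11. Which of the two rq reports here is an assumption, but both readings point at SIGSEGV. The describe_wait_status helper below is hypothetical and uses only the standard library (POSIX only).

```python
import os
import signal


def describe_wait_status(status):
    """Decode a raw status as returned by os.waitpid() (POSIX only)."""
    if os.WIFSIGNALED(status):
        sig = signal.Signals(os.WTERMSIG(status))
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return "killed by {}{}".format(sig.name, core)
    if os.WIFEXITED(status):
        return "exited with code {}".format(os.WEXITSTATUS(status))
    return "unknown status {}".format(status)


print(describe_wait_status(139))
```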

Thanks for the update. Have a look in /utils; there are a couple of scripts that will help you get more info on the error message for the speed_check queue, as it's a hidden job queue:
python3 rq_queryqueue.py speed_check

^ this will get you the list of jobs, copy the job id in question there

python3 rq_queryjob.py speed_check <job_id>

^ this will get a more detailed error message for that job

You may need to modify the scripts as it looks like the name resolution has changed in docker networking recently:

redis_con = Redis('redis', 6379)
to
redis_con = Redis('127.0.0.1', 6379)
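
Rather than editing the host by hand, the scripts could try the Docker service name first and fall back to loopback. This is a sketch under assumptions: pick_redis_host is a hypothetical helper, and it only checks name resolution, not whether Redis is actually listening.

```python
import socket


def pick_redis_host(candidates=("redis", "127.0.0.1"), resolve=socket.gethostbyname):
    """Return the first candidate host name that resolves, else the last one."""
    for host in candidates:
        try:
            resolve(host)
            return host
        except OSError:
            # socket.gaierror is a subclass of OSError: the name did not resolve.
            continue
    return candidates[-1]


# The result would then replace the hard-coded host in the script, e.g.:
#   redis_con = Redis(pick_redis_host(), 6379)
```

The resolve parameter exists only so the lookup can be swapped out when trying the function without a Docker network.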

Running rq_queryjob.py outputs the following for a failed speed check:

Description: crackq.run_hashcat.show_speed(attack_mode=3, brain=True, hash_file='/var/crackq/logs/1c5ce07dd02e41b89cf52e2b025f4593.hashes', hash_mode=1000, mask='?a?a?a?a?a?a', name='Test', pot_path='/var/crackq/logs/crackq.pot', session='1c5ce07dd02e41b89cf52e2b025f4593', speed_session='1c5ce07dd02e41b89cf52e2b025f4593_speed', username=True, wordlist2=None, wordlist=None)
Result: None
Status: failed
Execution info: Work-horse was terminated unexpectedly (waitpid returned 139)
Meta {}

OK. If you tick disable brain does it run the job or give more detail in the error?

Disabling brain runs the job, from what I've noticed. I still don't have a decent pool of test hash files to use since it finished that test_customer_domain.hashes.

Hello, I cannot get past the issue where Pypal failed to import. How did you solve it? Please describe the solution

Try disabling the brain and it might show a more detailed error message.

I'm getting the same invalid load key, '{' exception. It seems like the job submission is confused. It's also odd that even when I disable brain in the job, I see brain=True.

DEBUG    cq_api.py:206 get_jobdetails 2023-07-22 13:17:55,939 Parsing job details:
crackq.run_hashcat.hc_worker(attack_mode=0, brain=True, hash_file='/var/crackq/logs/51f4a7faca84400296c3c0beae784d62.hashes', hash_mode=1000, increment=False, increment_max=None, increment_min=None, mask=None, mask_file=False, name='test with disable brain', outfile='/var/crackq/logs/51f4a7faca84400296c3c0beae784d62.cracked', pot_path='/var/crackq/logs/crackq.pot', potcheck=False, restore=0, rules=['/var/crackq/files/rules/OneRuleToRuleThemAll.rule'], session='51f4a7faca84400296c3c0beae784d62', username=False, wordlist2=None, wordlist='/var/crackq/files/wordlists/rockyou.txt')
DEBUG    crackqueue.py:170 error_parser 2023-07-22 13:17:55,940 Parsing error message: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/rq/worker.py", line 1013, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.6/site-packages/rq/job.py", line 709, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.6/site-packages/rq/job.py", line 732, in _execute
    result = self.func(*self.args, **self.kwargs)
  File "/opt/crackq/build/crackq/run_hashcat.py", line 694, in hc_worker
    benchmark=benchmark, benchmark_all=benchmark_all)
  File "/opt/crackq/build/crackq/run_hashcat.py", line 192, in runner
    raise ValueError('Aborted, speed check failed: {}'.format(err_msg))
ValueError: Aborted, speed check failed:  invalid load key, '{'.

DEBUG    crackqueue.py:176 error_parser 2023-07-22 13:17:55,940 Parsed error:   invalid load key, '{'.

Regarding
python3 rq_queryqueue.py speed_check

Like @adnahmed, I had to get an older version of rq for the script to run.

pip3 install  "rq==1.13.0"

The error was

Traceback (most recent call last):
  File "rq_queryqueue.py", line 5, in <module>
    from rq import use_connection, Queue
ImportError: cannot import name 'use_connection'

Once I did get the script to run, I got the same load key error that I see in the UI.

python3 rq_queryqueue.py speed_check

Traceback (most recent call last):
  File "rq_queryqueue.py", line 27, in <module>
    cur_list = started.get_job_ids()
  File "/usr/local/lib/python3.6/site-packages/rq/registry.py", line 143, in get_job_ids
    self.cleanup()
  File "/usr/local/lib/python3.6/site-packages/rq/registry.py", line 225, in cleanup
    job = self.job_class.fetch(job_id, connection=self.connection, serializer=self.serializer)
  File "/usr/local/lib/python3.6/site-packages/rq/job.py", line 521, in fetch
    job.refresh()
  File "/usr/local/lib/python3.6/site-packages/rq/job.py", line 899, in refresh
    self.restore(data)
  File "/usr/local/lib/python3.6/site-packages/rq/job.py", line 875, in restore
    self.meta = self.serializer.loads(obj.get('meta')) if obj.get('meta') else {}
_pickle.UnpicklingError: invalid load key, '{'.

f0cker commented

Don't bother debugging this, I've got updates to push imminently with the docker container and all python libs updated. Should be available later today, I'm just cleaning it up.

f0cker commented

Check out the dev branch, this should all be fixed there now.

f0cker commented

This should be fixed in master, let me know if it's still not working.

I am running from master branch with Python 3.8 on Ubuntu. Jobs will run fine for a while and then start failing. Message below. A docker compose down+up gets jobs running again.

INFO     conf.py:18 hc_conf 2023-08-08 20:42:12,445 Reading from config file /var/crackq/files/crackq.conf
INFO     run_hashcat.py:116 runner 2023-08-08 20:42:12,463 Running hashcat
ERROR    run_hashcat.py:188 runner 2023-08-08 20:42:12,491 Speed check failed: RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/rq/worker.py", line 1418, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.8/dist-packages/rq/job.py", line 1222, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.8/dist-packages/rq/job.py", line 1259, in _execute
    result = self.func(*self.args, **self.kwargs)
  File "/opt/crackq/build/crackq/run_hashcat.py", line 988, in show_speed
    hcat = runner(hash_file=hash_file, mask=mask,
  File "/opt/crackq/build/crackq/run_hashcat.py", line 159, in runner
    hc.hashcat_session_execute()
SystemError: <method 'hashcat_session_execute' of 'pyhashcat.hashcat' objects> returned a result with an error set

f0cker commented

Try adding this to the docker-compose file:

#runtime: nvidia
        deploy:
                resources:
                        reservations:
                                devices:
                                        - driver: nvidia
                                        - capabilities: [gpu]

Note 'runtime: nvidia' is removed. I'll do some further testing when I get the chance, probably on the weekend.

With this docker-compose:

    crackq:
        build:
            context: ./build
            dockerfile: Dockerfile
        image: "nvidia-ubuntu"
        ports:
            - "127.0.0.1:8080:8080"
        depends_on:
            - redis
        networks:
            - crackq_net
        container_name: "crackq"
        hostname: "crackq"
        volumes:
            - /var/crackq/:/var/crackq
            - ./crackq:/opt/crackq/build/crackq/
        stdin_open: true
        #        runtime: nvidia
        #        Add "deploy" per https://github.com/f0cker/crackq/issues/33
        deploy:
                resources:
                        reservations:
                                devices:
                                        - driver: nvidia
                                        - capabilities: [gpu]
        user: crackq
        tty: true
        environment:
                PYTHONPATH: "/opt/crackq/build/"
                MAIL_USERNAME: ${MAIL_USERNAME}
                MAIL_PASSWORD: ${MAIL_PASSWORD}

I get this error starting:

[+] Running 3/4
 ✔ Network crackq_crackq_net  Created                                      0.1s
 ✔ Container redis            Started                                      0.4s
 ⠿ Container crackq           Starting                                     0.6s
 ✔ Container nginx            Created                                      0.0s
Error response from daemon: could not select device driver "nvidia" with capabilities: [[]]
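
One possible reading of this error, not confirmed in the thread: in the fragment as posted, "- driver: nvidia" and "- capabilities: [gpu]" are two separate list items, so the entry that names the driver carries no capabilities, which would match the empty [[]] in the message. The examples in Docker's Compose GPU documentation put both keys in a single item. The sketch below models the two layouts as plain Python data (what a YAML parser would produce) and flags entries without capabilities; entries_missing_capabilities is a hypothetical helper.

```python
# The compose fragment as posted: two separate list items.
devices_as_posted = [
    {"driver": "nvidia"},
    {"capabilities": ["gpu"]},
]

# Single-item layout: driver and capabilities belong to the same device entry.
devices_single_item = [
    {"driver": "nvidia", "capabilities": ["gpu"]},
]


def entries_missing_capabilities(devices):
    """Return the indexes of device entries that declare no capabilities."""
    return [i for i, dev in enumerate(devices) if not dev.get("capabilities")]


print("as posted:", entries_missing_capabilities(devices_as_posted))
print("single item:", entries_missing_capabilities(devices_single_item))
```

In YAML terms the single-item layout means dropping the second dash, so that capabilities lines up under driver as a key of the same list item.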

f0cker commented

Is this with the nvidia devel image (nvidia/cuda:12.2.0-devel-ubuntu20.04)?

This happened with both devel and runtime images

It took over a week of running jobs through the system, but the <method 'hashcat_session_execute' of 'pyhashcat.hashcat' objects> returned a result with an error set exceptions came back. As before, a down+up of the containers brought it back online.

Failed again today with the same <method 'hashcat_session_execute' of 'pyhashcat.hashcat' objects> returned a result with an error set. This time, I ran a small test before I restarted the containers:

# works in host - GPUs recognized
nvidia-smi 

# fails in docker
sudo docker exec -it crackq nvidia-smi

Failed to initialize NVML: Unknown Error

After restarting the containers the GPU is visible again.

sudo docker exec -it crackq nvidia-smi

Wed Aug 23 18:41:03 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    32W /  70W |      2MiB / 15360MiB |      0%      Default |
|

Obviously this isn't a CrackQ issue, but is there any way in the API we can monitor for GPU "health"? That would be cleaner than waiting for jobs to fail.
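
Until something like that exists in the API, one rough way to watch for this outside CrackQ is to probe nvidia-smi from inside the container. A minimal sketch, assuming nvidia-smi exits non-zero when NVML cannot initialise; gpu_visible is a hypothetical helper and not part of CrackQ.

```python
import subprocess


def gpu_visible(cmd=("nvidia-smi",), timeout=30):
    """Return True if the probe command exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung past the timeout.
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

Run periodically inside the crackq container (for example from a cron job or a container health check), this would flag the lost-GPU state before a job fails. It does not fix the underlying problem.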

Regarding the loss of GPU visibility in the container, 'Failed to initialize NVML: Unknown Error', I found these issues:

NVIDIA/nvidia-docker#1730
NVIDIA/nvidia-docker#1671

I will stop posting in this issue as it seems totally unrelated to CrackQ. Apologies for the distraction.

Check the v0.1.2 branch out, I believe this should be fixed now. I had no issues testing on some ec2 test boxes. I'll close this off when I merge into master if I don't hear anything, but feel free to reopen.

Closing as above