File retrieval from an index

Question

Opened this issue 9 years ago · 24 comments

Hello,

When I run this command:
python manage.py runscript -e snugglefish_service index -- -a getnew -n FOO -c 500000 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /home/wxs/FOO/%

I get the list of hashes as screen output, but the directory /home/wxs/FOO remains empty.
What might be the cause of this?
Also, the -e option does not seem to work for me: every time I run this command it asks for authentication.

Thank you.

Answer 1 · 2015-06-03T16:14:42.000Z

It sounds like the hashes are printed to something other than stdout (likely stderr). Can you try redirecting stderr to stdout in the first part of the pipeline? Something like this:

python manage.py runscript -e snugglefish_service index -- -a getnew -n FOO -c 500000 2>&1 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /home/wxs/FOO/%

The -e option to runscript uses environment variables for authentication. If they are not set then it falls back to prompting for authentication to CRITs. Please take a look at python manage.py runscript -h.
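
To illustrate why the pipe receives nothing when output goes to stderr, here is a minimal Python sketch. The child process is only a stand-in (not the actual snugglefish script) that writes one hash to stderr, and `capture` is a hypothetical helper that returns what a downstream reader such as xargs would see.

```python
import subprocess
import sys

# Stand-in child process: like the script discussed here, it writes to stderr.
CHILD = "import sys; sys.stderr.write('d2f825ecfb3d979950b9de92cbe29286\\n')"

def capture(merge_stderr):
    """Return the text a downstream pipe reader would receive from the child."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # stderr=STDOUT is the programmatic equivalent of the shell's 2>&1.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
    )
    return proc.stdout.decode()

without_redirect = capture(merge_stderr=False)
with_redirect = capture(merge_stderr=True)
print(repr(without_redirect))
print(repr(with_redirect))
```

Without the merge, the reader gets an empty string, which matches an xargs that never runs get_file.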

Answer 2 · 2015-06-03T16:45:07.000Z

After adding 2>&1 it returns error:
CommandError: No module named crits_scripts.scripts.get_file

Also I didn't mention that while running this command I've received this output:
fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set)

Answer 3 · 2015-06-04T08:11:07.000Z

I do have write permissions to the output directory, but still nothing is saved in it. Can you please advise?
Thank you.

Answer 4 · 2015-06-05T16:56:20.000Z

Have you configured CRITs to use services from wherever you put them? Most people put them in /data/crits_services and then configure CRITs to use that directory for services.

Answer 5 · 2015-06-07T10:08:37.000Z

First of all, I'd like to thank you for your assistance. I really appreciate it.

I've configured CRITs to use services, but I didn't put these services in the root directory. In the CRITs GUI all services are displayed as available, including the snugglefish service. I just can't use the snugglefish service because I can't finish indexing the sample files we've uploaded so far.

Is there any log that I can check to see why it is not saving files in the given directory?

Thanks

Answer 6 · 2015-06-07T14:47:33.000Z

I can dig into this a bit on Monday when I'm back in the office.

Answer 7 · 2015-06-08T15:03:12.000Z

Great, looking forward to hearing any news from you.
Thanks,

Answer 8 · 2015-06-08T16:17:30.000Z

Can you provide more information about the various directories involved and where you are executing things from? Everything you are doing seems more or less right but it's not clear exactly what you are doing. A direct copy/paste of the commands and the exact output would help, along with a description of where your services live. In particular the error: CommandError: No module named crits_scripts.scripts.get_file is indicative of a problem finding the crits_scripts directory in the services. This makes me think you don't have the entire services repository in place, and have selectively chosen to exclude crits_scripts.

Answer 9 · 2015-06-08T17:02:43.000Z

Hi,

The CRITs directory is /mnt/crits/crits-m
All CRITs services are at /mnt/crits/data/crits_services

The first thing that I did while starting to index for snugglefish is:

python manage.py runscript snugglefish_service index -- -a create -n CW -q "{'source.name': 'CW'}" -d /mnt/crits/data/crits_services/snugglefish_service/index

Then I run this command to check the status:
python manage.py runscript snugglefish_service index -- -a status -n CW
The output I receive:
fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Username:
Password:
Name: CW
Directory: /mnt/crits/data/crits_services/snugglefish_service/index/
Created: 2015-06-02 08:02:01.343000
Last update: 2015-06-02 08:02:01.343000
Query: {'source.name': 'CW'}
Last ID: 556f0233780f5256a4b89921
Total objects: 26
Count indexed: 0
Percent indexed: 0.000000
It looks exactly like the example on the snugglefish_service page on GitHub.

Then I try to get the first 500 files for the CW index:
python manage.py runscript -e snugglefish_service index -- -a getnew -n CW -c 500 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/%

The output that I see on the screen:
fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Username:
Password:
d2f825ecfb3d979950b9de92cbe29286
7a83f701287f7624023f2a95fe37a2e6
254ff246f7603d54990f1894b2f129e
and some more hashes...
The total number of hashes is 26, exactly the same as the total objects in the CW index.

But when I check the directory /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/, it is empty.
This is the point where I am stuck, and I can't figure out what I am doing wrong.

Answer 10 · 2015-06-08T17:08:23.000Z

So the hashes are being printed to stderr (I still need to figure out how to get scripts to properly print to stdout instead of stderr). You can get the hashes to stdout by simply redirecting stderr to stdout, as I suggested earlier. Do this by executing:

python manage.py runscript -e snugglefish_service index -- -a getnew -n CW -c 500 2>&1 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/%

Notice I am adding 2>&1 to the end of the first part of the pipeline. This is what you did before but said you were getting a CommandError: No module named crits_scripts.scripts.get_file message. What I want to make sure is that /mnt/crits/data/crits_services/crits_scripts/scripts/get_file.py exists.
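
As a rough sketch of that check in Python: the dotted module name from the CommandError maps onto a file path under the services directory. `module_to_script_path` is a hypothetical helper written for illustration, not part of CRITs.

```python
import os
import posixpath

def module_to_script_path(services_dir, module_name):
    """Map a dotted module name to the .py file expected under services_dir."""
    return posixpath.join(services_dir, *module_name.split(".")) + ".py"

expected = module_to_script_path(
    "/mnt/crits/data/crits_services", "crits_scripts.scripts.get_file"
)
print(expected)
# On the CRITs host this reports whether the script is actually present.
print(os.path.isfile(expected))
```

If the second line prints False on the CRITs host, the crits_scripts service is missing from the services directory.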

Answer 11 · 2015-06-09T09:03:29.000Z

Hi,

I've run this command:
python manage.py runscript -e snugglefish_service index -- -a getnew -n CW -c 500 2>&1 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/%

Output on the screen:
fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Username: fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Username:
Password:
I entered my username and password. It did not return any error, but it got stuck at this point.
After waiting for about 10 minutes, I cancelled it with Ctrl+C and received the following:
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/mnt/crits/crits-master/crits/core/management/commands/runscript.py", line 64, in handle
    username = getpass.getpass("Username: ")
  File "/usr/lib/python2.7/getpass.py", line 71, in unix_getpass
    passwd = _raw_input(prompt, stream, input=input)
  File "/usr/lib/python2.7/getpass.py", line 133, in _raw_input
    line = input.readline()

Thanks,

Answer 12 · 2015-06-09T16:24:18.000Z

So the -e argument to runscript is used to get CRITs authentication information from the environment. You should set CRITS_USER and CRITS_PASSWORD in your environment and it won't ask you for that information. In this particular case it looks like stdin was being gobbled up.

Read this for more information: https://github.com/crits/crits/wiki/runscript
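
The fallback described above can be modelled roughly as follows. This is only a sketch of the described behaviour, not the actual runscript code: `get_credentials` and `fake_prompt` are hypothetical names, and only the variable names CRITS_USER and CRITS_PASSWORD come from this thread.

```python
def get_credentials(environ, prompt):
    """Use CRITS_USER/CRITS_PASSWORD if both are set, otherwise call prompt()."""
    user = environ.get("CRITS_USER")
    password = environ.get("CRITS_PASSWORD")
    if user and password:
        return user, password, "environment"
    # Either variable missing: fall back to interactive input.
    user, password = prompt()
    return user, password, "prompt"

def fake_prompt():
    # Stands in for the interactive Username/Password prompts.
    return "typed_user", "typed_password"

from_env = get_credentials(
    {"CRITS_USER": "alice", "CRITS_PASSWORD": "s3cret"}, fake_prompt
)
from_prompt = get_credentials({}, fake_prompt)
print(from_env[2], from_prompt[2])
```

In a real session, pass `os.environ` as the first argument. The variables have to be exported in the same shell that launches manage.py, otherwise the prompt path is taken.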

Answer 13 · 2015-06-09T16:42:00.000Z

I added these variables with the "export" command, and when I run "printenv" I see CRITS_USER=my user CRITS_PASSWORD= my password, but it seems that it doesn't work.
Also, I created an empty .git directory to resolve this error: fatal: Not a git repository (or any parent up to mount point /mnt/crits). It now returns a new error: fatal: Needed a single revision.

Can you please advise what's wrong here as well?
Thanks

Answer 14 · 2015-06-11T09:10:51.000Z

Hello,

I have an additional question. Every time I upload new samples to CRITs, should I delete the existing index and create a new one? Is there any way to just update the existing index with the new sample files?
Also, is there any way to reset the "count" back to 0? Every time I run this command python /mnt/crits/crits-master/manage.py runscript -e snugglefish_service index -- -a update -n CW -c $(ls /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/ | wc -l) it doubles the count number. Is this normal and expected?

Thank you,

Answer 15 · 2015-06-11T13:29:32.000Z

You have a bunch of different things going on and are not providing enough information to debug.

fatal: Not a git repository (or any parent up to mount point /mnt/crits)

This sounds like you don't have a git repository cloned for your CRITs install. Can you confirm that is true? Creating an empty .git directory was the wrong thing to do, please remove that.

Do not delete the indexes or create a new one. The method described should update the existing index with the number of new files. Here's the rough outline of what should happen:

You have no snugglefish indexes defined in CRITs. So you create one as described in the documentation. This example would create an index named FOO using the query for source name of FOO and a directory of /snuggles.

python manage.py runscript snugglefish_service index -- -a create -n FOO -q "{'source.name': 'FOO'}" -d /snuggles

You then upload 500 samples to CRITs all with the source FOO and want to index them. The first thing you need to have to create snugglefish indexes are files to index. To get the files out of CRITs and on disk where you can index them you can use:

python manage.py runscript -e snugglefish_service index -- -a getnew -n FOO -c 500 2>&1| xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /home/wxs/FOO/%

This will query CRITs for the snugglefish index named FOO and ask for 500 new files since the last time the index was updated. The hashes will be passed to the get_file script which will dump each file in /home/wxs/FOO/. At this point all you have done is retrieved 500 files with the source name of FOO.
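
Here is a small Python sketch of what the pipeline does with each line it receives. Because 2>&1 also sends warnings and prompts down the pipe, the sketch keeps only lines that look like hashes. It assumes the hashes are 32-character hexadecimal strings, as in the output posted earlier; `extract_hashes` and `get_file_command` are hypothetical helpers.

```python
import re

# Assumption: a hash is exactly 32 hexadecimal characters on its own line.
HASH_LINE = re.compile(r"^[0-9a-fA-F]{32}$")

def extract_hashes(merged_output):
    """Keep only lines that look like hashes; drop prompts and git warnings."""
    stripped = (line.strip() for line in merged_output.splitlines())
    return [line for line in stripped if HASH_LINE.match(line)]

def get_file_command(md5, out_dir):
    """Build the get_file invocation the pipeline runs for a single hash."""
    return ["python", "manage.py", "runscript", "-e", "crits_scripts",
            "get_file", "--", "-m", md5, "-o", out_dir.rstrip("/") + "/" + md5]

sample = """fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Username:
d2f825ecfb3d979950b9de92cbe29286
7a83f701287f7624023f2a95fe37a2e6
"""
hashes = extract_hashes(sample)
commands = [get_file_command(h, "/home/wxs/FOO/") for h in hashes]
print(commands[0])
```

Without such a filter, xargs would also call get_file once for every warning or prompt line.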

The next step is to actually index the files:

find /home/wxs/FOO -type f | snugglefish -i -o /home/wxs/snugglefish-indexes/FOO

This will create indexes in /home/wxs/snugglefish-indexes. When the indexing is done you can move the files from /home/wxs/snugglefish-indexes to /snuggles on your webserver, which is where you told CRITs to find the index files.

The last step is to tell CRITs that you just indexed some files:

python manage.py runscript -e snugglefish_service index -- -a update -n FOO -c $(ls /home/wxs/FOO | wc -l)

At this point you can remove the files created in /home/wxs/FOO and wait until you have more files to index. It is hard to say how long you should wait. Wait too little and you end up wasting a lot of space in your indexes. Wait too long and you end up having to spend a lot of time indexing. It all depends upon how many new files arrive that you want to index.

I also don't think the number is doubling in your case. I suspect what is happening is that you are repeatedly telling CRITs that you indexed 500 files (or whatever your number is) because you are not removing them once you index them.
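
A minimal sketch of that effect, using a temporary directory in place of /home/wxs/FOO. `count_staged_files` and `clear_staged_files` are hypothetical helpers; the count mirrors what ls | wc -l reports for a directory that contains only files.

```python
import os
import tempfile

def count_staged_files(staging_dir):
    """Count the regular files currently sitting in the staging directory."""
    return sum(
        1 for name in os.listdir(staging_dir)
        if os.path.isfile(os.path.join(staging_dir, name))
    )

def clear_staged_files(staging_dir):
    """Remove the regular files so the next update does not count them again."""
    for name in os.listdir(staging_dir):
        path = os.path.join(staging_dir, name)
        if os.path.isfile(path):
            os.remove(path)

staging = tempfile.mkdtemp()
for name in ("sample_a", "sample_b"):
    with open(os.path.join(staging, name), "w") as handle:
        handle.write("x")

first_update = count_staged_files(staging)         # reported after the first run
second_update_stale = count_staged_files(staging)  # same files reported again
clear_staged_files(staging)
second_update_clean = count_staged_files(staging)  # nothing left to report
print(first_update, second_update_stale, second_update_clean)
```

If the staged files are left in place, every update reports them again, so the count in CRITs grows even though nothing new was indexed.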

If you need further help I would humbly request that you provide exact copies of what you are doing: exactly what the commands are, where you are running them from, and exactly what the output is. Please also try not to do more than one thing at once; otherwise the problems compound on themselves and become difficult for me to follow and to help with.

Answer 16 · 2015-06-11T14:05:49.000Z

Hi,

That is correct, I don't have a git repository cloned for my CRITs install. I just ran "git init", which created an empty .git directory. I have now removed it according to your instructions. Can you please advise what I should do next to resolve the "Not a git repository" error? What repository should I clone?

Regarding the new samples: in my case there is someone else who uploads new samples every single day. I indexed all samples yesterday, and I imagine that the samples he adds today will not be in that index. Am I right? If so, how can I update my index with all the new samples that have been added today? Is there any way to automate this process?

Thank you,

Answer 17 · 2015-06-11T14:10:52.000Z

The missing .git directory is not a hard error. I believe it can be ignored.

Once the files are indexed you can remove them from disk, then move the newly created indexes to the directory specified in CRITs. You can do this daily if you want, just automate it with a script.
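
As a starting point for such a script, here is a sketch that only assembles the shell command lines for one pass, using the commands already shown in this thread. `daily_cycle` is a hypothetical helper, the clean-up command is an assumption, and moving the finished index files to the directory configured in CRITs is left out because it depends on your paths.

```python
import shlex

RUNSCRIPT = "python manage.py runscript -e"

def daily_cycle(name, staging_dir, index_dir, batch_size):
    """Return the shell command lines for one fetch / index / update / clean-up pass."""
    n, s, i = shlex.quote(name), shlex.quote(staging_dir), shlex.quote(index_dir)
    return [
        # 1. Fetch new files from CRITs into the staging directory.
        "{r} snugglefish_service index -- -a getnew -n {n} -c {c} 2>&1 | "
        "xargs -I % -n 1 {r} crits_scripts get_file -- -m % -o {s}/%".format(
            r=RUNSCRIPT, n=n, c=int(batch_size), s=s),
        # 2. Index the staged files.
        "find {s} -type f | snugglefish -i -o {i}/{n}".format(s=s, i=i, n=n),
        # 3. Tell CRITs how many files were indexed.
        "{r} snugglefish_service index -- -a update -n {n} -c $(ls {s} | wc -l)".format(
            r=RUNSCRIPT, n=n, s=s),
        # 4. Assumed clean-up step: remove staged files so they are not counted twice.
        "find {s} -type f -delete".format(s=s),
    ]

steps = daily_cycle("FOO", "/home/wxs/FOO", "/home/wxs/snugglefish-indexes", 500)
for step in steps:
    print(step)
```

The printed lines can be reviewed by hand before being placed in a cron job or wrapper script.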

Answer 18 · 2015-06-11T14:11:41.000Z

And yes, you are right. The "getnew" command will get all new files since the last time you ran that command.

Answer 19 · 2015-06-11T14:38:37.000Z

So I just add the newly created index to the already existing one? If at the moment I have one index named CW.index00000000, will the next one be CW.index00000001?

Answer 20 · 2015-06-11T16:06:36.000Z

Also, regarding setting the environment variables CRITS_USER and CRITS_PASSWORD:
I have added them to /etc/environment and to .bashrc, but it still asks for credentials every time I run a command with the -e option. Can you advise?

The "Not a git repository" error makes it impossible for me to enter my username, because it looks like this every time:
fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Username: fatal: Not a git repository (or any parent up to mount point /mnt/crits)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Thanks,

Answer 21 · 2015-06-16T10:22:08.000Z

Hello,
When I run this command:
python manage.py runscript -e snugglefish_service index -- -a getnew -n CW -c 500 | xargs -I % -n 1 python manage.py runscript -e crits_scripts get_file -- -m % -o /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/%

I receive "No objects found". What might be the problem?
Also, should I rename newly created indexes before moving them to the directory specified in CRITs? I see that the new index has exactly the same name as the previous one, CW.index00000000. If I move the new index to the directory with the old index, it will replace it.

Thanks,

Answer 22 · 2015-06-17T12:55:25.000Z

Any response please?

Answer 23 · 2015-06-30T10:20:47.000Z

Hello,
When I run this command:
find /mnt/crits/data/crits_services/snugglefish_service/wxs/CW/ -type f | /mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish -i -o /mnt/crits/data/crits_services/snugglefish_service/wxs/snugglefish-indexes/CW

I receive the following errors:
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_Z7handleri+0x1c)[0x40d309]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7fa910af7d40]
/lib/x86_64-linux-gnu/libc.so.6(+0x981d8)[0x7fa910b591d8]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_ZN11snugglefish4file5writeEPhm+0x145)[0x416169]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_ZN11snugglefish8indexSet6createEj+0x2d6)[0x4169ec]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_ZN11snugglefish10nGramIndex10flushIndexEj+0x77)[0x4132d5]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_ZN11snugglefish10nGramIndex8flushAllEv+0x52)[0x413184]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(_ZN11snugglefish10nGramIndexD1Ev+0x19)[0x412e9f]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(make_index+0x2b9)[0x40e667]
/mnt/crits/data/crits_services/snugglefish_service/snugglefish-master/snugglefish(main+0x74b)[0x40da93]

There are two files I want to index. When I index one file at a time it works fine, but when I try to index these 2 files together I receive the errors above.
What might be the problem? Thanks.

Answer 24 · 2015-07-02T15:31:37.000Z

When indexing new files for an existing index, you need to have the existing index already in the output directory. That way, I believe, the new files are appended to it.

The exact error seems to be a problem when writing to a file. Are you sure permissions on the output are correct?
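
A quick way to check this from Python, as a sketch. It assumes the value given to -o is a file name prefix inside an existing directory (going by the CW.index00000000 name mentioned earlier); `can_write_index` is a hypothetical helper.

```python
import os
import tempfile

def can_write_index(output_prefix):
    """Return True if the directory that would hold the index files is writable."""
    parent = os.path.dirname(output_prefix) or "."
    # Creating files needs write and search (execute) permission on the directory.
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)

scratch = tempfile.mkdtemp()
writable = can_write_index(os.path.join(scratch, "CW"))
missing = can_write_index(os.path.join(scratch, "no_such_dir", "CW"))
print(writable, missing)
```

Run it as the same user that runs snugglefish, passing the real -o value, to see whether the output location is the problem.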