---NetWalk-----
This program was created by Russell Endicott 
for the purpose of crawling remote switches 
and routers through an SSH interface in order to 
determine network topology and various other
configurations of interest.

It's XML driven in that it expects an input XML
for targets to crawl and returns an XML based
topology.

Written by Russell Endicott (rendicott@gmail.com)
----------------

1. Limitations
2. Background
3. Usage
4. Debugging
5. Helper tools
6. Advanced debugging
7. Assistance


-=-=-=-=-=-=-=-=- LIMITATIONS -=-=-=-=-=-=-=-=-

Currently the NetWalk crawler only supports crawling one type of switch:

    --- Alcatel-Lucent 6450 Omniswitch

I started to work on support for the Alcatel-Lucent SAR-H 7705 router but
haven't finished it yet.

I tried to design the code to be vendor agnostic, with all of the vendor-specific
stuff stored in modules. That way I can write modules/parsers for each specific
type of equipment.

-=-=-=-=-=-=-=-=- BACKGROUND -=-=-=-=-=-=-=-=-

This tool was originally created due to the absence of an EMS (Element Management System)
for the small cell switches and routers in the live network. It was designed to address
the issues surrounding break-fix on remote small cell deployments and to aid in locating 
specific equipment on the remote LAN in order to reboot the correct small cell. 

In addition we decided it would be nice to collect more information from the switches and
routers while we were in there crawling the network. The goal is to reach out over a common
interface (i.e., SSH since SNMP and other protocols are not always available) and query the
network elements for as much information as possible by running 'show' commands. The data is
then parsed into an object class structure and some simple logic is run against the objects.
If the result of that analysis is that there are probably more unknown objects reachable on
the same network then another round of discovery crawls is generated and analyzed until we
run out of network elements.

At the end we churn through the full population of discovered network elements and try to build
a parent/child relationship structure. Once this structure is generated and duplicates are removed
it's all dumped into an XML file and the program exits. The ultimate goal of the 'XML in, XML out'
model is that it would eventually be turned into a SOAP API or something easier to use than files.


-=-=-=-=-=-=-=-=- USAGE -=-=-=-=-=-=-=-=-

First of all you'll need to find a Linux server that has open internet access. This was always done
from the 'Small Cell Jump Server' but it could be run from a laptop if you're out in the wild
somewhere. Essentially you just need a machine that can run Python and Expect scripts and SSH out to
public IPs.

At the minimum you'll need to copy the following files to a directory on the 'jump' server:

        -rw-r----- 1 russ grouper   27591 May  5 09:14 nwClasses.py
        -rw-r----- 1 russ grouper    6189 Apr 28 23:37 nwConfig.py
        -rw-r----- 1 russ grouper   51206 May 21 15:45 nwCrawl.py
        -rw-r----- 1 russ grouper     120 Mar 20 15:34 nw_NE_Type_SCR_7705.py
        -rw-r----- 1 russ grouper  147195 Apr 27 16:01 nw_NE_Type_SCS_6450.py
        -rw-r----- 1 russ grouper    6575 May 20 12:51 nwOrchestrator.py
        -rw-r----- 1 russ grouper    5180 Apr 27 09:48 nwOUIexplorer.py
        -rw-r----- 1 russ grouper   62468 Jun  4 14:11 nwParser_ALU_7705_6450.py
        -rw-r----- 1 russ grouper   12804 Apr 27 09:42 nwProcessInput.py
        -rw-r----- 1 russ grouper   10416 Apr 29 09:41 nw.py
        -rw-r----- 1 russ grouper 3470344 Apr 16 12:36 oui-database.txt
        -rw-r----- 1 russ grouper     967 Apr 23 12:49 primer-ALU-7705-6450.exp
        -rw-r----- 1 russ grouper    1661 Apr 23 12:50 primer-hop1-ALU-7705-6450.exp
        -rw-r----- 1 russ grouper    2756 Apr 23 12:50 primer-hop2-ALU-7705-6450.exp

So that's essentially:
        -- all of the .py files
        -- the three primer expect scripts
        -- the OUI database txt file

Now if you get to a command line in that directory on the server you can start using the crawler.

Usage is pretty straightforward. If you run the 'nw.py' script like so:

        python nw.py --help

You'll get some usage assistance like so:

        Usage: nw.py [--help] [--debug] [--printtostdout] [--logfile] [--inputfile] [--outputfile]

        Options:
          --version             show program's version number and exit
          -h, --help            show this help message and exit
          -i FILE, --inputfile=FILE
                                This XML file contains the information needed to kick
                                off the crawler. Includes target IP's and possible
                                usernames/passwords. Refer to 'sample-input.xml' for
                                example usage.
          -o FILE, --outputfile=FILE
                                This is the desired output filename that will contain
                                the output topology XML. Default is 'output.xml'

          Debug Options:
            -d DEBUG, --debug=DEBUG
                                Available levels are CRITICAL (3), ERROR (2), WARNING
                                (1), INFO (0), DEBUG (-1)
            -p, --printtostdout
                                Print all log messages to stdout
            -l FILE, --logfile=FILE
                                Desired filename of log file output. Default is
                                "nw.log"
            -f FILE, --fromfile=FILE
                                This adds option to load crawlResults from file
                                instead of doing an actual crawl. Useful for testing
                                relationship parsing. Default is "crawlresults.pkl"

So your basic usage would be:

        python nw.py -i INPUTXMLFILE.XML -o OUTPUTXMLFILE.XML

So it sounds like we need to figure out how to make an input XML file.
That's simple enough. If you look at some of the sample-input XML files that
*should* have been included in the code files you can get a feel for what
the structure should look like. Let's take a look at the 'sample-input.xml'
file. 

        <?xml version="1.0" encoding="utf-8"?>
        <root>
          <target id="1">
            <!-- REQUIRED -->
            <!-- All targets need at least one entrypoint 
                and auth possibility at a bare minimum. Currently with NETWALK 
                v0.5: The parent/child relationship logic will not work if 
                entrypoints from multiple unrelated sites are submitted for a single target.  -->
            <entrypoint id="1">
              <attributes>
                <ipaddr>11.216.26.33</ipaddr>
        ........

If you go through that XML you'll see that it has comments so you can get a feel
for what each field means. At the bare minimum you'll need to change the IP address
for at least one entrypoint in the 'entrypoint' section and you'll need to 
change the usernames and passwords in the 'possibilities' section.

    NOTE: The usernames and passwords are handled in a 'give me a list of possibilities
    and I'll try them until I find one that works' model. This was due to the way the 
    various small cell switches and routers were deployed. There is a mixture of 'standard'
    username/password combinations out on the network. When NetWalk is first running the 
    primer scripts on the IP addresses that it finds it makes a note of which uname/pword
    combination works for a particular network element. For each successful combination 
    it adds weight to that combination moving forward so that it's used first on all the following
    network elements for that crawl. 
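
If you're curious how that weighting idea shakes out in code, here's a minimal sketch of
the concept (the class and attribute names here are illustrative, not necessarily what the
real adjustAuth() in nwCrawl.py uses):

        class AuthPossibility(object):
            def __init__(self, username, password):
                self.username = username
                self.password = password
                self.weight = 0         # bumped every time this combo logs in successfully

        def adjustAuthSketch(possibilities, last_success=None):
            """Bump the weight of the combo that just worked, then sort so the
            heaviest combos get tried first on the next network element."""
            if last_success is not None:
                last_success.weight += 1
            return sorted(possibilities, key=lambda p: p.weight, reverse=True)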

So your most basic input XML would look like this:

        <?xml version="1.0" encoding="utf-8"?>
        <root>
          <target id="1">
            <entrypoint id="1">
              <attributes>
                <ipaddr>11.8.221.193</ipaddr>
                <port>22</port>
              </attributes>
            </entrypoint>
            <auth>
              <possibility id="1">
                <username>admin</username>
                <password>myP4ssW0rd</password>
              </possibility>
            </auth>
          </target>
        </root>

If you saved the above basic XML chunk as 'myfirstcrawl-input.xml' then
you could launch a crawl like this:

        python nw.py -i myfirstcrawl-input.xml -o myfirstcrawl-output.xml

If everything was successful you wouldn't see very much happening on the screen but
eventually the script would end and you'd have an output xml file sitting on disk. 
Take a look at the 'sample-output.xml' that was included in the code files package
for exhaustive structure details and commentary. 

But let's jump into the debugging section so we can start to get a feel for what the
crawler is doing when it 'crawls'.


-=-=-=-=-=-=-=-=- DEBUGGING -=-=-=-=-=-=-=-=-

The nw.py script has debugging options that will let us watch what the crawler is doing
during the crawl. Take the above example from the USAGE section of this guide--instead of
running this command:

        python nw.py -i myfirstcrawl-input.xml -o myfirstcrawl-output.xml

We can run the same with debugging enabled like this:

        python nw.py -i myfirstcrawl-input.xml -o myfirstcrawl-output.xml -d -1 -p

The '-d' means that we want to specify the debug level. The '-1' means we want the lowest
debug level (i.e., DEBUG) which will enable ALL of the debug messages in the code. The '-p'
means we want to print the output to the screen. Also, the nw.py script will always write to 
a log file no matter what level you specify. By default this is the 'nw.log' file in the 
current directory but you can change it if you want (Check the --help options).

The basic format of the debug messages is as follows:

        TIMESTAMP:DEBUGLEVEL:FUNCTION_NAME  MESSAGE

For example:

    2015-05-05 14:08:02,561:DEBUG:<function adjustAuth at 0x1b5fa28>    have list of auth possibilities length: 4

Would mean that the message was logged on May 5th, 2015 at 14:08 with 'DEBUG' level 
(vs. WARNING or CRITICAL) from the 'adjustAuth' function with a message of 
'have list of auth possibilities length: 4'.

Not all the lines in the log follow this format but most do. 
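
If you want to reproduce something close to that format yourself, a standard-library
logging setup roughly like the following would do it (this is an approximation, not the
exact setup in nw.py; the '<function ...>' part in the real logs comes from the
myfunc/giveupthefunc() trick described later in this guide):

        import logging

        logging.basicConfig(
            filename='nw.log',
            level=logging.DEBUG,
            format='%(asctime)s:%(levelname)s:%(funcName)s    %(message)s')

        def adjustAuth():
            logging.debug('have list of auth possibilities length: 4')

        adjustAuth()
        # nw.log then gets a line along the lines of:
        # 2015-05-05 14:08:02,561:DEBUG:adjustAuth    have list of auth possibilities length: 4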

-=-=-=-=-=-=-=-=- HELPER TOOLS -=-=-=-=-=-=-=-=-

In addition to the basic flow of nw.py there are a couple 
of helper scripts that were designed to make the running 
of the program easier. 

{{{{{{{{{{{{{{{{{ nwOrchestrator.py }}}}}}}}}}}}}}}}}

So I realize that running the script with the 'input file / output file' method
is bulky and annoying so I made another script called 'nwOrchestrator.py' that 
makes running large groups of crawls much easier.

When you run 'python nwOrchestrator.py' it looks in the current directory for 
any input XML files with the 'nwinput-XXXXXXXXXXXXX--' format, modifies them
to add the normal small cell SCS passwords, and runs them through the 'nw.py' script.

You'll have to modify the nwOrchestrator.py script to add your own passwords.

So let's say you had an input XML file in the directory that looked like this:

        nwinput-1-Macquarie_31_32--20150422_151105.xml

It would run the nw.py script like this:

        python nw.py -i nwinput-1-Macquarie_31_32--20150422_151105.xml -o 20150505_141403-Macquarie_31_32.xml -d -1 -p -l 20150505_141403-Macquarie_31_32.log

It assumes the part between 'nwinput-' and the '--' is the description (e.g., nwinput-<DESC>--20150422_151105.xml)
and uses it as the description section for the rest of the files involved in that crawl.

And at the end it would take all of the expect scripts, the log file, and the expect output logs and zip them up into this file:

        20150505_141403-SUCCESS-Macquarie_31_32.zip

It keeps going until it runs out of input files, so at the end you'd end up with a
directory full of output XMLs and zipped-up log files. This makes it easier to run large bulk crawls.
The only drawback is that you're storing usernames and passwords in the nwOrchestrator.py file.
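
For the curious, the core of what the orchestrator does could be sketched like this
(illustrative only -- the real nwOrchestrator.py also injects the passwords and zips
things up afterwards, and its exact filename parsing may differ):

        import glob
        import time

        for inputfile in glob.glob('nwinput-*--*.xml'):
            # treat everything between 'nwinput-' and the '--' as the description
            desc = inputfile[len('nwinput-'):].split('--')[0]
            stamp = time.strftime('%Y%m%d_%H%M%S')
            outputfile = '%s-%s.xml' % (stamp, desc)
            logfile = '%s-%s.log' % (stamp, desc)
            cmd = 'python nw.py -i %s -o %s -d -1 -p -l %s' % (inputfile, outputfile, logfile)
            print(cmd)      # the real orchestrator would actually execute this, e.g. via os.system()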

{{{{{{{{{{{{{{{{{ nw_offline.py }}}}}}}}}}}}}}}}}

Find the nw_offline.py file and copy it to another directory along with the oui-database.txt file.

Copy the expect output log files from a successful crawl into the directory like this:

        07/21/2015  11:37 AM    <DIR>          .
        07/21/2015  11:37 AM    <DIR>          ..
        05/05/2015  02:06 PM            28,271 expect---23469---GOA_P10-4.sh-output.log
        05/05/2015  02:07 PM            37,228 expect---63927---GOA_P10-4.sh-output.log
        05/05/2015  02:07 PM            37,085 expect---67911---GOA_P10-6.sh-output.log
        05/05/2015  02:07 PM            37,227 expect---71191---GOA_P10-1.sh-output.log
        05/05/2015  02:06 PM            38,997 expect---71395---SCSR--1-GOA_C-9709365834---GOA_P10-3.sh-output.log
        05/05/2015  02:04 PM            38,135 expect---95526---SCSR--1-GOA_C-9709365834---GOA_P10-2.sh-output.log
        05/05/2015  01:58 PM            71,570 expect---97115---GOA_MAIN.sh-output.log
        05/05/2015  02:08 PM            37,173 expect---99923---GOA_P10-5.sh-output.log
        07/21/2015  11:33 AM            11,645 nw_offline.py
        07/21/2015  09:19 AM         3,600,495 oui-database.txt

Run the offline script like this:

        python nw_offline.py -d -1 -p

It will churn through the expect output logs and process the data into NetworkElement objects.

You can also tweak the code to run an IDLE session on the offline files
and poke around in the currentCrawl.loo_ne list of NetworkElement objects.
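
For example, inside that IDLE session you could do something like this (attribute names
like 'hostname' and 'typestring' are assumptions based on descriptions later in this guide):

        for ne in currentCrawl.loo_ne:
            print('%s (%s)' % (ne.hostname, ne.typestring))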


-=-=-=-=-=-=-=-=- ADVANCED DEBUGGING -=-=-=-=-=-=-=-=-

In this section we'll go over how the program works so it's easier to follow 
the debug logs.

So in the nw.py file we're importing all of the various modules we need.
Then we're pulling a custom Cfg() object from the nwClasses.py module that was 
instantiated as "werd". It's just a placeholder name for what I call the global
config for the modules. If I import this object within all of the modules I can 
basically have global variables that can be passed around as needed.

This 'werd' Cfg() object stores things like the input and output file names and
holds methods for generating incremental numerical identifiers, etc. 
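
A stripped-down illustration of that idea (the real Cfg() class in nwClasses.py has a lot
more to it than this):

        class Cfg(object):
            def __init__(self):
                self.inputfile = 'sample-input.xml'
                self.outputfile = 'output.xml'
                self._next_ne_id = 0

            def neIdRequest(self):
                """Hand out a unique incremental ID for each new NetworkElement."""
                self._next_ne_id += 1
                return self._next_ne_id

        werd = Cfg()    # the modules then import and share this one instance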

After that we'll import our other NetWalk modules from the other .py files and jump
into the if __name__ == '__main__': block way at the bottom of nw.py.

In this section we're just setting up the logging. Most of this comes from the 
standard 'optparse' and 'logging' packages, which I didn't write. After all of the
options are processed and the log files are set up we'll hit the main() method.

In the main() method we're initially setting up the name of the function with 'myfunc'
and giveupthefunc(). We use this so we can log the function name in the debug log
which makes it easier to locate troubles when the log shows exceptions. Immediately
after that we're jumping into the parseXMLInput() function which lives inside the
nwProcessInput.py file. So let's go take a look and see what the parseXMLInput()
function does.

First we open the input XML file which is attached to a property of the Cfg() object
named 'werd'. That was set up during the initial program load when it was processing
input arguments. We're going to process that XML file and turn it into one big 
string. From there we initiate the ElementTree library (ET) and tell it to process
the XML string and store it in a variable called 'tree'. As long as there are no
errors we'll start looping through the various layers of the XML and assigning
temporary variables until we have enough information to create some of our custom
objects. 

So if we go all the way down to the bottom of the function we see where it's 
creating a new "nwClasses.Target" object which is defined in the nwClasses.py
file. If you look at that class and its __init__ method you'll see that it needs
three input parameters before it will let you create a Target--the target id number,
a list of Entrypoint objects, and a list of Auth objects. The Entrypoint and Auth
objects are also defined in nwClasses.py and are custom to NetWalk. The Entrypoint
and Auth objects are created within the ElementTree 'tree' walk in the above loops.
When it's all said and done you have one big "Target" class object which we can
use to start the actual crawling. We take this Target and attach it to the 'werd'
Cfg() object and call it 'werd.loo_targets' which stands for "list of objects_targets".
I try to name any list that contains custom objects with the "loo_" prefix which
helps me understand how to handle things while I'm walking through loops. 
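
Here's a boiled-down sketch of that ElementTree walk against the basic input XML from the
USAGE section (the real parseXMLInput() builds full Entrypoint/AuthPossibility/Target
objects and attaches them to werd.loo_targets rather than the plain tuples shown here):

        import xml.etree.ElementTree as ET

        with open('myfirstcrawl-input.xml') as f:
            tree = ET.fromstring(f.read())      # the whole input file read in as one big string

        loo_targets = []
        for target in tree.findall('target'):
            entrypoints = [(ep.findtext('attributes/ipaddr'), ep.findtext('attributes/port'))
                           for ep in target.findall('entrypoint')]
            auths = [(p.findtext('username'), p.findtext('password'))
                     for p in target.findall('auth/possibility')]
            loo_targets.append((target.get('id'), entrypoints, auths))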

So the main things that the nwProcessInput.parseXMLInput() function does are:
--Process the input XML
--Create various custom objects from the parsed XML:
----Entrypoint objects
----AuthPossibility objects
----Target objects
----Member objects

So now, back to the main() method of nw.py: We log some data and then we jump
in and start looping through that list of Targets that we just built. The first 
thing we do is to launch nwCrawl.burrow() and assign all results to 'crawlresults'.
The nwCrawl.burrow() function is what launches all aspects of the crawler and it's
where the real work happens. We'll go into that in a minute but we'll finish up
this main() method first. After the crawl is completed all we really do is
take the xml and write it to disk and do some logging. 

So now let's take a look at that nwCrawl.burrow() function in the nwCrawl.py file.
Within burrow() we'll first set up a flag as False so at some point we can flip it
to True and exit the while loop. Inside the while loop we'll first check to make 
sure we're not at the max crawl depth yet and then kick off nwCrawl.primerBrain().
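
As a rough skeleton, that loop looks something like this (simplified; the max depth value
and the loop body here are illustrative, not the real nwCrawl code, and primerBrain() is
the function described below):

        MAX_CRAWL_DEPTH = 8                      # assumed value, for illustration only

        def burrow(target):
            crawlfinished = False
            depth = 0
            while not crawlfinished:
                if depth >= MAX_CRAWL_DEPTH:     # safety valve so we never crawl forever
                    break
                primerBrain(target)              # run the primer scripts against the current target
                # ...store results, build NetworkElement objects, look for new targets...
                depth += 1
                crawlfinished = True             # the real code only flips this when no new targets turn up
            return target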

So the main things the nwCrawl.burrow() function does are:
--Kick off primerBrain() which does the crawling on the current target.
--Create a new CrawlResults object and store the crawl results in currentCrawl
--Kick off initializeNetworkElements() to create NetworkElement objects from crawl results
--Look for more targets to crawl
--Run the peckingOrder() function on the current crawl results to establish relationships
So let's take a look at primerBrain():
primerBrain() takes a Target object as the input parameter and starts to work
on it. It starts by looping through the list of Entrypoints attached to the 
Target. For each Entrypoint it keeps working on it until the 'entryfinished' 
flag is set. The whole time it's working on the Entrypoint it's checking 
various flags. The main flags it's checking are:
--entry.reachable: 
----This will get set if for some reason the entry is unreachable (e.g., it's a
----bogus IP address or some other situation that is impossible to work around)
--entry.directfailed:
----This will get set if the entrypoint can't be reached directly from the 
----jump server. This doesn't necessarily mean the entrypoint is unreachable
----so we want to flag it and try it again after first logging into another
----entrypoint (i.e., hopping).
--entry.primersuccess:
----This will get set if the primer script (e.g., primer-ALU-7705-6450.exp) was
----successful. This will tell us a lot of things like if the entrypoint is 
----reachable. But if the primer fails it doesn't mean we give up as long as the
----entrypoint is 'reachable'.
So if various combinations of these parameters are set the primerBrain() will
either set the 'entryfinished' flag to True and go back to burrow() or it will
keep kicking off the primerDriver() function until it can set the 'entryfinished'
flag to True. 
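
Condensed into a sketch, that per-entrypoint loop looks roughly like this (the flag names
come from the list above; the entrypoint list attribute and the rest of the structure are
simplified assumptions):

        def primerBrain(target):
            for entry in target.loo_entrypoints:        # attribute name assumed, per the 'loo_' convention
                entryfinished = False
                while not entryfinished:
                    entry = primerDriver(target, entry)     # primerDriver() is described next
                    if not entry.reachable:
                        entryfinished = True                # bogus IP or otherwise hopeless, give up
                    elif entry.primersuccess:
                        entryfinished = True                # the primer worked, we're done with this entry
                    # otherwise (e.g. directfailed) loop again and let primerDriver() retry via a hop
            return target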

Now we'll look at the primerDriver() function. This function takes a Target and
Entrypoint object and works on it until it gets a 'primersuccess' flag set to 
True or it reaches 5 cycles. It starts by feeding the current Entrypoint object
into the adjustAuth() function, which is somewhat of a learning algorithm to find
the best password based on previous successes. After it adjusts the authentication
for this round it runs the buildExpectCommand() function with the Entrypoint 
object as input. buildExpectCommand() takes the entrypoint and, based on its
authentication, whether it needs hops or a direct connection, and whether it's
building a primer or a self-generated Expect script, builds the string required
to kick off the Expect script. For example, it would build a string like this:

"/usr/bin/expect primer-ALU-7705-6450.exp 11.5.160.145 22 admin p4ssw0rd 8"

and store it in the Entrypoint.primercommand property. After building the command the
primerDriver() function kicks it off with the runoscommand() function which basically
just runs an os.popen(COMMAND) using whatever command string you fed it. The results
from that runoscommand() come back in the form of a list of result lines.
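
A rough illustration of those two pieces (the attribute names on the Entrypoint object
are assumptions, and the trailing '8' is just copied from the example command string
above; the text above says runoscommand() basically just wraps os.popen()):

        import os

        def buildExpectCommand(entry):
            # attribute names on 'entry' are assumptions; the trailing 8 is copied
            # straight from the example command string above
            return '/usr/bin/expect primer-ALU-7705-6450.exp %s %s %s %s 8' % (
                entry.ipaddr, entry.port, entry.username, entry.password)

        def runoscommand(command):
            """Run the command and hand back its output as a list of lines."""
            return os.popen(command).readlines()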

Now we have some results from the OS command we ran so that's great but now we
need to read that data and determine whether or not we can use it or if we need
to try and pull it again. So first we'll set some flags up like 'directmightfail'
and 'sleepandtryagain'. This way we can set these flags if we see something we
don't like in the results. Now that we have these flags set we can kick off the 
schemaModule.outputInspector() function. This is stored in the schemaModule which 
sounds strange but I'll explain:
---I wanted to keep vendor specific stuff out of the main body of the NetWalk code
---so that the entire thing was as vendor neutral as possible. This way the core
---NetWalk could later be improved to support different vendors. Since reading things
---like the output results from commands run against an Alcatel-Lucent switch are
---very unique to that particular type of equipment I wanted to keep the code that
---reads it separated from the core code. For example, all of the code that is
---Alcatel-Lucent specific is stored in the following files: nwParser_ALU_7705_6450.py,
---nw_NE_Type_SCR_7705.py, and nw_NE_Type_SCS_6450.py
---Rather than reference these files by name I simply import them and reference
---them as "schemaModule". This way I can write modules for different vendors and
---the core code doesn't really change. So for Alcatel-Lucent when I say 
---schemaModule.outputInspector() I'm referring to the outputInspector() function
---inside the nwParser_ALU_7705_6450.py file. The current working schemaModule
---is set within the nwConfig.py file with the following line:
---
---         schemaModule = 'nwParser_ALU_7705_6450'
---
---
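
One way that indirection could be wired up looks like this (illustrative -- the real code
may import the module differently):

        import importlib
        import nwConfig                     # contains: schemaModule = 'nwParser_ALU_7705_6450'

        schemaModule = importlib.import_module(nwConfig.schemaModule)
        # from here on the core code just calls schemaModule.outputInspector(...) and
        # friends without caring which vendor parser it actually landed on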
So anyways, back to the primerDriver() function...so what we're doing now is 
kicking off the outputInspector() function and telling it to return things like
entry.reachable, authsuccess, sleepandtryagain, and deletesshkeys. Based on the 
output of the inspector we'll set these variables. All we feed into the outputInspector
is the results from the primer output that's stored in entry.primeroutput. 

Now that that's out of the way we'll take a look at the outputInspector() function:
This function takes a list of text and examines it to pull some specific information.
Since the text it's examining is quite specific it lives in the nwParser_ALU_7705_6450
module rather than the main code base. The first thing it does is set up the
regular expression strings so it can detect patterns. Then it goes line by line
in the text looking for matches for those regular expressions. Based on what it
finds it sets certain flags like 'reachable', 'authsuccess', 'sleepandtryagain',
and 'deletesshkeys'. It returns those flags back to the calling function 
(e.g., primerDriver()). So the main objective of the outputInspector() is to 
determine whether or not the results of an Entrypoint's primer output look normal
or need to be pulled again. The primerDriver() uses this information to determine
whether or not it can flag the entry as finished or if it needs to resubmit it.
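
A toy version of that pattern-matching loop might look like this (the regular expressions
here are made up for illustration; the real ones in nwParser_ALU_7705_6450.py are tuned to
actual Alcatel-Lucent output):

        import re

        def outputInspector(primeroutput):
            reachable = True
            authsuccess = False
            sleepandtryagain = False
            deletesshkeys = False
            for line in primeroutput:
                if re.search(r'No route to host|Connection timed out', line):
                    reachable = False
                elif re.search(r'REMOTE HOST IDENTIFICATION HAS CHANGED', line):
                    deletesshkeys = True        # stale RSA key: wipe it and retry
                elif re.search(r'Connection reset|not ready', line):
                    sleepandtryagain = True     # element not ready yet, back off and retry
                elif re.search(r'Welcome', line):
                    authsuccess = True          # made it past the login, so the auth combo worked
            return reachable, authsuccess, sleepandtryagain, deletesshkeys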

So now back to nwCrawl.primerDriver() and we see what primerDriver() does with 
the results. So if the output shows that we have a bad RSA key for an SSH host
we kick off the delete_ssh_keys() function to reset them. If we detected that
the entrypoint is reachable but 'directfailed' we set that flag so that on 
the next loop primerDriver() can rebuild the Expect command to try and go through
another known working element and "hop" instead. We can also take note of the 
'authsuccess' flag and on the next loop use adjustAuth() to change which one of the
password possibilities we're using. Sometimes we just get a message back from the
primer that means the remote network element isn't ready for an SSH session as 
opposed to a straight refusal. In this case we set a flag to 'sleepandtryagain'
and run the nwCrawl.sleeper() function which waits for a configurable period of 
time before primerDriver() tries to run the primer script again. Otherwise, if
everything worked and all the flags are set properly we return the Entrypoint
object back to primerBrain() with all of its information about how reachable it
is. 
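
Boiled down, that decision logic looks roughly like this (the helper names are the ones
mentioned above, but the function and flag structure here is a simplification, not the
real primerDriver() body):

        def reactToPrimerOutput(entry, reachable, authsuccess, sleepandtryagain, deletesshkeys):
            if deletesshkeys:
                delete_ssh_keys(entry)      # bad RSA key for the SSH host, so reset the keys
            if reachable and entry.directfailed:
                entry.needshop = True       # assumed flag: rebuild the Expect command to hop next loop
            if not authsuccess:
                adjustAuth(entry)           # rotate to the next username/password possibility
            if sleepandtryagain:
                sleeper()                   # wait a configurable period, then run the primer again
            entry.primersuccess = reachable and authsuccess
            return entry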

Back in primerBrain() we either kick off another round of primerDriver() or 
set the 'entryfinished' flag to stop the while loop depending on whether or not
the primer was successful. If everything was successful we jump back to burrow().

Now back in burrow() we set up a new nwClasses.CrawlResults object to store 
the crawl results now that we actually have something to store. This object is 
what will eventually hold all of our fancy NetworkElement objects that we parse
out of the raw text data we're getting back from switches and routers. Now that 
we have a place to store our fancy objects we kick off the initializeNetworkElements()
function and the real magic starts to happen. 

Inside the initializeNetworkElements() function we receive a Target object and a
CrawlResults object as input parameters. From there we loop through the Entry 
objects attached to the Target object's list of Entrypoint objects. As long as the 
current Entry object has the 'primersuccess' flag set to True we send the Entry
object into the createBaseNE() function. 

Within the createBaseNE() function we start to look at some interesting data
coming back from the network element we ran the primer against. First we have to
initialize a brand new nwClasses.NetworkElement object. When creating a new one
we have to give it a unique ID number. Rather than guess a number we call on the
neIdRequest() method of the Cfg() class (werd.neIdRequest()) to create one for 
us so we don't have to worry about making duplicate ID numbers. Now that we have
an empty NetworkElement object we can start to set its properties based on some
of the info that came back from the primer script. However, since nwCrawl and the 
base NetWalk core have no idea how they're supposed to read the lines in the primer,
they call on a function within the schemaModule to handle the parsing for it. To
do this it calls on schemaModule.fetchBaseNEdata() and sends it the current Entry
object's 'primeroutput' text list along with the full Entry object. In return it asks
for the fetchBaseNEdata() function to just tell it what the 'hostname' is and what
the 'typestring' is supposed to be. The 'type' concept is supposed to be a 
customizable field so the NetworkElement object has some idea of a way to tag 
the object with a human readable type. For example, a small cell switch is often
referred to as an 'SCS' and a small cell router is an 'SCR'. 
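
Sketched out, that hand-off looks something like this (the NetworkElement class here is a
tiny stand-in for the real one in nwClasses.py, and the return shape of fetchBaseNEdata()
is an assumption):

        class NetworkElement(object):
            def __init__(self, neid):
                self.neid = neid
                self.hostname = None
                self.typestring = None

        def createBaseNE(entry, werd, schemaModule):
            ne = NetworkElement(werd.neIdRequest())     # unique ID handed out by the Cfg() object
            hostname, typestring = schemaModule.fetchBaseNEdata(entry.primeroutput, entry)
            ne.hostname = hostname
            ne.typestring = typestring                  # e.g. 'SCS' for a small cell switch
            return ne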

So if we go and look at the schemaModule.fetchBaseNEdata() function:
All this function does is kick off two other functions: detectNEtype() and
detectNEhostname(). All these two sub-functions do is take the text 
within the primer output and scan it to find the hostname...