conferences_cara

For Cara's work on the Nexus Mediacom project

#Background

The goal of this project is to match up attendees at a conference who have complementary interests or knowledge. There were actually three parts to this, but we trimmed it down in the end and just looked at matching knowledge. The data we've been given is the result of surveys that Nexus sent to the attendees. In the surveys, the attendees answered questions about what they had knowledge of in terms of smart cities (SC), smart homes (SM) or industrial internet (II). They also had to say what they wanted to learn, in terms of the same three areas. (It was a conference for the three areas.)

The goal is to match each attendee with the 25 other attendees who match best, pairing what each person wants to learn with who has the relevant knowledge.

#Data

I've given you a file in the "data/" folder called "Registration Back up 19 Sept.xlsx". It's the original spreadsheet from the client. When you import this data, you might find that there are special characters in the names that are giving you trouble...part of the task is to deal with that efficiently. I've also given you a .csv file in which I've overcome this problem. It might be good practice to make sure that you can get from the .xlsx file into pandas (or whatever you're going to use) on your own while preserving the special characters.

The later columns are the ones you really care about, starting with column U and going through AG. These are the attendees' answers to survey questions and form the basis of the content-based system I used. Each column is from a pull-down menu with several possible choices. The respondent could choose any or all of the choices, and they are just listed as text for you. There is also an 'other' field for each, where the attendees could put in whatever they wanted...open text. The first six columns pertain to what the person knows, the second six (after the space column) pertain to what he/she wants to learn. One of the biggest problems I had was to figure out how best to deal with this data...how to get it into a form that the recommender system could use. How would you approach the problem? (My solution lies in the functions "AddUnderscores()" and "SpreadResponses()". I know that pandas has ways of doing this that would probably work better, and R probably does as well.)
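One way to approach the reshaping question, as a minimal Python sketch and not a port of "AddUnderscores()" and "SpreadResponses()": split each multi-select cell into its choices and build one 0/1 indicator column per distinct choice. The comma separator, the "knows" prefix, the function name and the sample answers are all assumptions for illustration; the real spreadsheet may delimit choices differently.

```python
def spread_responses(cells, prefix, sep=","):
    """Turn multi-select text cells into 0/1 indicator rows.

    cells:  one string per attendee, e.g. "Sensors, Wireless connectivity"
            (None or "" for attendees who skipped the question).
    prefix: label for the question, used to build column names.
    sep:    ASSUMED delimiter between choices; check the real data.
    """
    # Split every cell into a set of cleaned-up choices.
    parsed = []
    for cell in cells:
        parts = (cell or "").split(sep)
        parsed.append({p.strip() for p in parts if p.strip()})

    # One column per distinct choice, in a stable (sorted) order.
    choices = sorted(set().union(*parsed))
    columns = [f"{prefix}_{c.replace(' ', '_')}" for c in choices]

    # 1 if the attendee ticked that choice, else 0.
    rows = [[int(c in chosen) for c in choices] for chosen in parsed]
    return columns, rows


# Hypothetical answers to a single "knows" question.
answers = ["Sensors, Wireless connectivity", "Sensors", None]
columns, rows = spread_responses(answers, prefix="knows")
```

In pandas, "Series.str.get_dummies(sep=...)" performs a similar split-and-indicator step in one call, which may be the built-in route hinted at above.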

#Scripts

I've given you five scripts. "MatchingKnowledge.R" is the main one and makes calls to the others. There is also a .Rproj file. If you're familiar with RStudio then you'll know what this is...if not, then play around and get familiar with it. (In brief, RStudio is the IDE for R. You'll need to install both for this to work on your machine. You can run R without RStudio, but the IDE is very good and I'd recommend using it. If you don't want to actually run the R code, then you can simply look at it in a text editor of course.) With everything you have in this repo, you should be able to run the code and get the output in the form the client wanted, contained in the final deliverable "KnowledgeMatchesTop25_M3.xlsx".

#Areas for improvement

The script is messy. First of all, it runs slowly. It takes a good three minutes on my machine (4GB RAM) to source it and get the output. I am sure it could go faster, so improving performance would be a great contribution. There is also a lot of old code commented out, which is very poor on my part. This was very much a work in progress, but I stopped progress as soon as I had the deliverable (as is too often the case), so it's in a form that works, but that's messy and inefficient.

An annoyingly substantial part of the work was tidying up the responses to the questions to get them to match correctly. For example, in some cases a variable might appear as "wiereless connectivity" and in another it might be "Wireless connectivity". These are the same thing, but R doesn't see that. I just went through it manually to fix this, but it was really time-consuming and does not generalise at all: I'd have to do the same thing for a different survey. Hopefully Mark (the client) will improve this next time so that there is uniformity...it was one of the take-aways from our first run of this project. But somehow streamlining this would be a huge help. I've written a function that does this to some extent, "CombineSimilarColumns()", but I still have to go through manually and make the catches first, as you'll see starting around line 69 of the "MatchingKnowledge.R" script.
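One possible way to automate part of those catches is fuzzy string matching against a list of known-good labels. The sketch below uses Python's standard "difflib" module; it is not what "CombineSimilarColumns()" does. The canonical list (assumed here to come from the pull-down menu choices), the 0.85 cutoff and the function names are assumptions for illustration, and anything below the cutoff is set aside for manual review, not guessed.

```python
import difflib


def normalise(label):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(label.lower().split())


def map_to_canonical(labels, canonical, cutoff=0.85):
    """Map each raw label to its closest canonical label.

    Returns (mapping, unmatched). Labels with no candidate scoring at
    least `cutoff` (an ASSUMED threshold) go into `unmatched`.
    """
    by_norm = {normalise(c): c for c in canonical}
    mapping, unmatched = {}, []
    for raw in labels:
        hits = difflib.get_close_matches(
            normalise(raw), list(by_norm), n=1, cutoff=cutoff
        )
        if hits:
            mapping[raw] = by_norm[hits[0]]
        else:
            unmatched.append(raw)
    return mapping, unmatched


# Hypothetical clean list and raw survey labels.
canonical = ["Wireless connectivity", "Sensors"]
raw = ["wiereless connectivity", "Wireless connectivity", "blockchain"]
mapping, unmatched = map_to_canonical(raw, canonical)
```

Here both spellings map to "Wireless connectivity", while "blockchain" lands in the unmatched list. The free-text 'other' answers would probably still need a human look, but the menu-based columns should shrink to a much shorter review list.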

The matrix work is pretty straightforward and starts on line 127. Go through it and see if you can decipher how the matching works here. This is really the meat of the algorithm and, typically, is short and took very little time to write. If you have some ideas for alternative ways to do the matching, I'd be very happy to hear them. One thing that I don't like is that the call to the function "distanceToTargets()" happens in a for-loop. This is really inefficient, and this call (lines 148-150) is a bottleneck. I have to think there is a better way to do this without a loop, and if you could improve the performance here, that would also be a big plus. The same thing goes for the call to "RankDistanceToTargets()" in lines 176-187...it's very slow: it's a for-loop containing an if-else statement that calls a function with two for-loops in series. It works, but could really use some improvement.
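As one possible alternative way to do the matching, the sketch below scores every pair by counting the topics that one attendee wants to learn and the other knows, then keeps each person's top k. The overlap count is an assumed stand-in for whatever "distanceToTargets()" computes, and the function name, attendee names and topics are invented for illustration.

```python
import heapq


def top_matches(wants, knows, k=25):
    """For each attendee, rank the others by how much of what the
    attendee wants to learn they know.

    wants, knows: dicts mapping attendee -> set of topic labels.
    The score is the size of the overlap (an ASSUMED similarity measure,
    not necessarily the one used in MatchingKnowledge.R).
    """
    result = {}
    for learner, wanted in wants.items():
        scores = [
            (len(wanted & known), teacher)
            for teacher, known in knows.items()
            if teacher != learner  # never match someone with themselves
        ]
        # Keep the k highest scores; ties are broken by name for repeatability.
        best = heapq.nsmallest(k, scores, key=lambda s: (-s[0], s[1]))
        result[learner] = [teacher for _, teacher in best]
    return result


# Hypothetical attendees and topics.
wants = {"Ana": {"sensors", "security"}, "Ben": {"wireless"}, "Cy": set()}
knows = {"Ana": {"wireless"}, "Ben": {"sensors"}, "Cy": {"sensors", "security"}}
matches = top_matches(wants, knows, k=2)
```

This version still loops in plain Python; it is meant to show the shape of the computation. With 0/1 indicator matrices, the same score table is the matrix product of the "wants" matrix with the transpose of the "knows" matrix, so a loop-free version in R or NumPy could compute all pairwise scores in one call and then sort each row.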

The final function calls to "GetMatches()" and "GetCompanyMatchesOutput()" retrieve the data into a form that Mark can use. The first takes the list of lists pertaining to the matches and retrieves the attendee data. The second puts it into a form that Mark can use. Annoyingly, he wanted to have the list of each person's matches as a vertical list within one cell of an Excel spreadsheet, as you can see in the output file "KnowledgeMatchesTop25_M3.xlsx". That was a complete pain and is one of those annoying things that (1) is easy if you know how to do it (I didn't) and is really hard if you don't and (2) is easily overlooked in a data science project until the end when you finally discuss the deliverable with the client, but is very useful to know in advance!
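A vertical list inside one cell is usually just a single string with a line break between entries. The sketch below shows that step only, in Python; the function names and sample matches are hypothetical, and it does not reproduce "GetCompanyMatchesOutput()".

```python
def matches_to_cell(names):
    """Join match names into one string, one name per line.

    Written to a spreadsheet cell, this displays as a vertical list
    once text wrapping is enabled for that cell.
    """
    return "\n".join(names)


def build_rows(matches):
    """One output row per attendee: (attendee, multi-line matches cell)."""
    return [(attendee, matches_to_cell(names)) for attendee, names in matches.items()]


# Hypothetical matches for two attendees.
rows = build_rows({"Ana": ["Cy", "Ben"], "Ben": ["Ana"]})
```

For the line breaks to display, the cell also needs wrapping switched on when the file is written. In Python's openpyxl that is "cell.alignment = Alignment(wrap_text=True)", and in R the openxlsx package offers "createStyle(wrapText = TRUE)" applied with "addStyle()". Both are offered as pointers to try, not as something the current script does.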

A few other notes. (1) For the final output file "KnowledgeMatchesTop25_M3.xlsx", I had to go to the fourth column and manually ask Excel to 'wrap text', or else it appears as a single long string. I could not figure out how to do this on the R side, and it was just easier to do it manually. If you run the script, your output will not have that format when you open it in Excel. I've already done this for the output file I've given you. (2) You'll find a lot of attendees are missing the survey data altogether. That's because they had registered for the conference before we wrote the survey and they didn't respond to follow-up emails containing the survey. There isn't anything we can do about that, so I ignored those people.

So, this should get you started. Please let me know if you have any questions about what I've done, the project in general or the R code. I'm assuming you'd want to work it up in Python, but if you want to do it in R, that's also fine by me. R is my first language, but I'm not sure which is the better tool for this job, honestly. I had hoped to put this all into an R package, but have not gotten very far on that. Anyway, have a look, let me know what you think and accept my apology in advance for the poorly written code and prevalence of for-loops.
