#Using StackOverflow to determine growth of Android Developers

StackExchange is an expansive data repository that maintains all its data in relational databases. It:

  1. Hosts data for 101 Q&A sites,
  2. Supports 3.3 million users, and
  3. Is home to 6.1 million questions and 11.3 million answers!

StackOverflow is one of the 101 Q&A sites it supports.

###For launching queries against StackExchange: https://data.stackexchange.com/stackoverflow/queries

The query results can be downloaded as a .csv file.

###Schema: https://data.stackexchange.com/stackoverflow/query/new (shown on the right-hand side of the page)

###SQL Query for gathering data:

    select a.Id as PostId,
           a.Tags as PostTags,
           d.Id as AskerId,
           d.DisplayName as AskerName,
           d.Reputation as AskerReputation,
           (select count(*) from Badges where UserId = a.OwnerUserId) as NumberOfAskerBadges,
           a.CreationDate as AskDate,
           e.Id as AnswererId,
           e.DisplayName as AnswererName,
           e.Reputation as AnswererReputation,
           (select count(*) from Badges where UserId = b.OwnerUserId) as NumberOfAnswererBadges,
           (select OwnerUserId from Posts where Id = a.AcceptedAnswerId) as BestAnswerById,
           (select CreationDate from Posts where Id = a.AcceptedAnswerId) as BestAnswerGivenAt
    from Posts a
    inner join Posts b on b.ParentId = a.Id
    inner join Users d on d.Id = a.OwnerUserId
    inner join Users e on e.Id = b.OwnerUserId
    where a.PostTypeId = 1
      and a.Tags like '%android%'
      and a.CreationDate between '2012-10-01 00:00:00' and '2013-04-01 00:00:00'
    order by a.Id, b.CreationDate

###Pain point: a single query returns at most 50,000 rows, so the data has to be accumulated manually by re-running the query over successive time frames.
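
One way to make the manual accumulation less error-prone is to generate the time frames programmatically. The sketch below is only an illustration and is not part of this repository: `month_windows` and `DATE_FILTER` are hypothetical names. It assumes that month-sized windows stay under the row limit (use narrower windows if a month does not). The windows are half-open so that consecutive runs do not return the same row twice.

```python
from datetime import date


def month_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) month by month."""
    current = start
    while current < end:
        # First day of the month that follows `current`.
        if current.month == 12:
            nxt = date(current.year + 1, 1, 1)
        else:
            nxt = date(current.year, current.month + 1, 1)
        yield current, min(nxt, end)
        current = nxt


# Hypothetical template for the date condition of the query (assumption:
# a half-open range replaces the original `between`).
DATE_FILTER = (
    "and a.CreationDate >= '{lo} 00:00:00' "
    "and a.CreationDate < '{hi} 00:00:00'"
)

if __name__ == "__main__":
    for lo, hi in month_windows(date(2012, 10, 1), date(2013, 4, 1)):
        print(DATE_FILTER.format(lo=lo.isoformat(), hi=hi.isoformat()))
```

Each printed condition can be pasted into the query in place of the original date range, and each run saved as its own .csv file.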

##Data visualization: Gephi, an excellent tool for building graphical representations

To generate a graphical representation, Gephi needs the data in a format it can import; here that is a .net (Pajek network) file.
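
For orientation, a .net file lists the vertices first and the connections between them second. The following is a minimal sketch of a writer for that layout, not the repository's own script: `write_pajek` is a hypothetical helper, and collapsing repeated pairs into a weighted arc is an assumption about what is useful here.

```python
def write_pajek(path, edges):
    """Write directed (source, target) pairs to a Pajek .net file.

    Repeated pairs are collapsed into one arc whose weight is the repeat count.
    """
    ids = {}      # node label -> 1-based Pajek vertex number
    weights = {}  # (source number, target number) -> arc weight
    for source, target in edges:
        for label in (source, target):
            ids.setdefault(label, len(ids) + 1)
        key = (ids[source], ids[target])
        weights[key] = weights.get(key, 0) + 1

    with open(path, "w", encoding="utf-8") as out:
        out.write("*Vertices %d\n" % len(ids))
        for label, number in ids.items():
            out.write('%d "%s"\n' % (number, label))
        out.write("*Arcs\n")  # *Arcs marks directed connections
        for (source, target), weight in weights.items():
            out.write("%d %d %d\n" % (source, target, weight))
```

Calling `write_pajek("android.net", [("a", "b"), ("a", "b"), ("b", "c")])` would produce three vertices and two arcs, the first with weight 2.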

###How to generate your own representation?

  1. Run python code.py followed by the paths of all the .csv files, separated by spaces.
  2. Start Gephi.
  3. Import the .net file generated by the Python script. Play around with the different network parameters to understand how the network is structured, how it has evolved over time, which contributions are significant, etc.
  4. The same approach can be used to build network structures for other technologies and to compare how they have evolved.
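
To give an idea of what step 1 involves, here is a rough sketch of turning the downloaded .csv files into graph edges. It does not reproduce code.py. It assumes that each file keeps a header row with the column aliases from the SQL query (`AskerId`, `AnswererId`) and that an edge points from answerer to asker. `read_edges` is a hypothetical name.

```python
import csv


def read_edges(csv_paths):
    """Collect (AnswererId, AskerId) pairs from the downloaded query results."""
    edges = []
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                asker, answerer = row.get("AskerId"), row.get("AnswererId")
                if asker and answerer:  # skip rows with a missing user id
                    edges.append((answerer, asker))
    return edges
```

The resulting pairs are the raw material for the vertex and arc lists that a .net file contains.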

###Tracking growth of nodes?

  1. Run python newNodes.py followed by the paths of all the .csv files, separated by spaces.

NewNodes.txt will contain the number of unique nodes in each .csv file.
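
The counting itself can be sketched as follows. This is an illustration under assumptions, not the actual newNodes.py: it treats every `AskerId` and `AnswererId` value (the column aliases from the SQL query) as a node. Besides the unique count per file, it also reports how many nodes were not seen in earlier files. `count_nodes` is a hypothetical name.

```python
import csv


def count_nodes(csv_paths):
    """Return (path, unique_nodes, new_nodes) for each file, in the given order."""
    seen = set()  # nodes encountered in earlier files
    report = []
    for path in csv_paths:
        nodes = set()
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                for value in (row.get("AskerId"), row.get("AnswererId")):
                    if value:
                        nodes.add(value)
        report.append((path, len(nodes), len(nodes - seen)))
        seen |= nodes
    return report
```

Passing the files in chronological order makes the last column a rough measure of how many users joined the network in each time frame.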