StackExchange Data Explorer is an open source tool to run SQL queries against public data from StackOverflow. Since StackOverflow is the biggest development forum of the world, there is surely a lot of information that companies can actually retrieve from their system in order to take some business decision (this is actually a brilliant place to apply BigData)
Moving now to different issues: I was discussing with some event organizers the possibility of bringing an Android event from the USA to Europe. Since I do live in Munich (and besides being a trendy mobile city, I think is a really cool place to organize such events), I was trying to convince them that Munich was the choice. They were resilient about it, so I needed to prove with some data that Munich would be a really nice election.
At this time, I was thinking how could I use Data Explorer and BigData to support my thesis. I remember two times I used it before, to display the most active developers from Barcelona (the city I lived before), Munich, and the two cities combined. Something similar could be a valid approach. In the previous SQL queries, I was clustering the top developers from each city based on their contribution to questions tagged with the token “android“. So I could possibly group all the developers contribution from a certain city with questions tagged with the same token. I come over with this script:
;WITH USER_BY_TAG AS ( SELECT ROW_NUMBER() OVER(ORDER BY COUNT(*) DESC) Rank, u.Location, COUNT(*) AS UpVotes FROM Tags t INNER JOIN PostTags pt ON pt.TagId = t.id INNER JOIN Posts p ON p.ParentId = pt.PostId INNER JOIN Votes v ON v.PostId = p.Id and VoteTypeId = 2 INNER JOIN Users u ON u.Id = p.OwnerUserId WHERE (LOWER(Location) LIKE '% germany%' OR LOWER(Location) LIKE '% spain%' OR LOWER(Location) LIKE '% holland%' OR LOWER(Location) LIKE '% france%' OR LOWER(Location) LIKE '% italy% ') OR LOWER(Location) LIKE '% netherland%' OR LOWER(Location) LIKE '% united kingdom%' OR LOWER(Location) LIKE '% poland%' OR LOWER(Location) LIKE '% sweden%' AND TagName = 'android' GROUP BY u.Location ) SELECT * FROM USER_BY_TAG WHERE rank <= 1000 ORDER BY upvotes DESC ;
I decided to search in Germany, Spain, Holland/Netherlands, France, Italy, United Kingdom, Poland and Sweden (not that Sweden is a big country in terms of population, but I do work with a bunch of Swedish colleagues ). I did some little experiments to get rid of some nomenclature errors and statistical noise (for instance, I tried also with England, thinking that some developers might registered their selves as English residents instead of UK). After some refining, I came with the following result:
There is a major preponderance of cities from Germany and the UK, with only 5 cities from different countries in the first 20 cities against 9 british cities and 6 Germans. Is not my purpose here to give a full sociologic analysis (i.e., many cities are from UK since StackOverflow is an English based community and there are other local communities in different countries – although 80% of the Internet documentation is English based), but to give a rough approach.
So, now we have a bunch of cities with some numbers. We still need to go through a normalization process (i.e., cities with more population will always have more UpVotes than cities with less population). Since StackOverflow does not provide statistics for cities population (is not either their task) I will correlate this values manually. Thus, I will assign to each city the factor that determines the relationship between UpVotes and population. To get the population value, I did use Wikipedia.
- London: 1.783.457 / 8.174.000 = 0.21
- Reading: 701.683 / 145.700 = 4.815
- Berlin: 570.379 / 3.502.000 = 0.16
Paris: 381.560 / 2.234.105 =0.17
- Amsterdam: 335.999 / 779.808 = 0.43
- Cambridge: 328.603 / 123.900 = 2.65
- Munich: 252.516 / 1.378.000 = 0.18
- Frankfurt: 139.674 / 691.518 = 0.20
- Manchester: 126.113 / 510.700 = 0.24
- Lyon: 124.057 / 474.946 = 0.26
- Warsaw: 116.670 / 1.717.000 = 0.06
- Oxford: 100.504 / 150.200 = 0.66
- Hamburg: 979.30/ 1.799.000 = 0.05
- Madrid: 96.906 / 3.234.000 = 0.02
- Karlsruhe: 95.848 /297.488 = 0.32
- Ulm: 92.675/ 123.672 = 0.74
- Edinburgh: 84.748 / 495.370 = 0.17
- Brighton: 81.612 / 155.919 = 0.52
- Ulverston: 79.056 / 11.524 = 6.86
- Barcelona: 78.130 / 162.1000 = 0.04
And if we now sort it by the coefficient:
- Ulverston (UK): 6.86
- Reading (UK): 4.815
- Cambridge (UK): 2.65
- Ulm (Germany): 0.74
- Oxford (UK): 0.66
- Brighton (UK): 0.52
- Amsterdam (Netherlands): 0.43
- Karlsruhe (Germany): 0.32
- Lyon (France): 0.26
- Manchester (UK): 0.24
- London (UK): 0.21
- Frankfurt (Germany): 0.20
- Munich (Germany): 0.18
- Paris (France): 0.17
- Edinburgh (UK): 0.17
- Berlin (Germany): 0.16
- Warsaw (Poland): 0.06
- Hambug (Germany): 0.05
- Barcelona (Spain): 0.04
- Madrid (Spain): 0.02
There is some very interesting information in this graph:
- The result of UK cities are brilliant. Reading, Cambridge and Oxford, famous for their universities, are all in the top 5
- Ulverston, a tiny city of North West England, scores first in the ranking. It would be interesting to determine the reasons, since there is no known university or industry in the town. Probably a top user from StackOverflow explains it, but it can be discarded as statistical noise.
- For most of the cities, a value between 0 and 1 seems the norm.
- Big capitals are generally in the mid part of the table, except Amsterdam
After this sampling Munich does not score that bad (although better in absolute terms). There are, however, a bunch of other different reasons to choose a place for a certain event (proximity to other places, infrastructures, communication, hosting prices, global number of possible attendants, etc). But after this little experiment, I can only suggest to organizers to move to the city of Ulverston (even if I still think that Munich offers a great beer).
Follow me on Twitter @eenriquelopez !