Naming conventions in Python def function(). A BigQuery adventure using the GitHub DB Dump.

Query complete (2.1s elapsed, 32.2 GB processed)

In June, Felipe Hoffa published a post describing the partnership between Google and GitHub that allows you to query all of the public code posted on GitHub.

Two weeks ago I did an analysis of the naming conventions used in Python import statements.

I was pleasantly surprised to see the post being shared by people from all around the world, so I decided to play some more with the GitHub DB dump.

Now I ran another query, and after Google processed 32 GB in less than 5 seconds, it returned the top 10,000 results.

def main() is the most popular function defined, appearing in 200K Python files.

One of the first interesting things is that there are 1,426 Python files that use CamelCase and define the function as def Main().
def Main(): 1,426
def Main(args): 284
def Main(argv): 205

Almost 14K Python files pass argv to the function, and 7.4K Python files pass args:
def main(argv): 13,917
def main(args): 7,444
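
Counts like the ones above can be reproduced on a small scale without BigQuery. The following is only a rough sketch in plain Python, not the query used for this post: the regular expression, the `count_signatures` helper and the sample files are all assumptions made up for illustration. It counts in how many files each `def name(params)` signature appears.

```python
import re
from collections import Counter

# Matches a one-line function definition and captures its name and raw
# parameter text. Definitions whose parameters contain parentheses
# (for example a default like x=foo()) are skipped by this simple pattern.
DEF_PATTERN = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(([^)]*)\)\s*:", re.MULTILINE)


def count_signatures(sources):
    """Count, per signature, the number of source strings that contain it."""
    counts = Counter()
    for source in sources:
        seen_in_file = set()
        for name, params in DEF_PATTERN.findall(source):
            # Normalize spacing so 'main( argv )' and 'main(argv)' are equal.
            normalized = ", ".join(p.strip() for p in params.split(",") if p.strip())
            seen_in_file.add(f"def {name}({normalized})")
        # Each file contributes at most 1 to a signature's count.
        counts.update(seen_in_file)
    return counts


# Hypothetical sample files, invented for this example only.
sample_files = [
    "def main():\n    pass\n",
    "def main( argv ):\n    pass\n",
    "def Main(args):\n    pass\n\ndef main():\n    pass\n",
]

signature_counts = count_signatures(sample_files)
for signature, n_files in signature_counts.most_common():
    print(signature, n_files)
```

Counting files rather than occurrences keeps the numbers comparable to "Python files that define def main()", which is how the results in this post are phrased.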

Functions used in at least 2,000 Python files:

Visual representation of the top 10,000 function names defined in Python

In the case of parse_options(), both functions use Java naming conventions and not the Python way of naming_a_function() with the "_" separator.

There seems to be something special about the word parse, because for parse_args() there is no consensus on whether the first and second words should start with p or P and a or A.
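
To put a label on these variants automatically, a small classifier is enough. This is a minimal sketch under my own assumptions: the `naming_style` function and its category names are hypothetical, and the example names are illustrative spellings rather than values taken from the query results.

```python
def naming_style(name):
    """Return a rough label for the naming style of a function name."""
    core = name.strip("_")  # ignore leading and trailing underscores
    if not core:
        return "other"
    if "_" in core:
        # Underscore-separated: all lowercase is the PEP 8 style.
        return "snake_case" if core == core.lower() else "mixed"
    if core == core.lower():
        return "lowercase"
    # No underscores and at least one uppercase letter.
    return "PascalCase" if core[0].isupper() else "camelCase"


# Illustrative names only.
for example in ["parse_options", "parseOptions", "ParseOptions", "Parse_Args", "main"]:
    print(example, "->", naming_style(example))
```

A name such as Parse_Args lands in the "mixed" bucket, which is the kind of spelling the parse_args() observation is about.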

Most used non-public method names.
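
By "non-public" I mean names that start with an underscore. As a hedged sketch of how such names could be filtered out of a list (the `is_non_public` helper is hypothetical, and excluding double-underscore names like `__init__` is an assumption of this example):

```python
def is_non_public(name):
    """True for names like _helper or __helper, False for dunders like __init__."""
    is_dunder = len(name) > 4 and name.startswith("__") and name.endswith("__")
    return name.startswith("_") and not is_dunder


# Illustrative names only.
names = ["_helper", "__init__", "main", "__private"]
non_public = [n for n in names if is_non_public(n)]
print(non_public)
```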

Top default arguments sent to functions.
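
Default arguments are hard to extract reliably with a regular expression, so for a local experiment the standard `ast` module is a safer choice. The sketch below is an assumption about how one could do it, not the method behind the chart: `count_defaults` is a hypothetical helper, it needs Python 3.9+ for `ast.unparse`, and it only works on code that the running interpreter can parse (so Python 2 files would fail).

```python
import ast
from collections import Counter


def count_defaults(source):
    """Count 'param=default' pairs in all function definitions of a source string."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        # Positional defaults belong to the LAST len(defaults) positional params.
        positional = args.posonlyargs + args.args
        with_defaults = positional[len(positional) - len(args.defaults):]
        for param, default in zip(with_defaults, args.defaults):
            counts[f"{param.arg}={ast.unparse(default)}"] += 1
        # Keyword-only params use None as a placeholder for "no default".
        for param, default in zip(args.kwonlyargs, args.kw_defaults):
            if default is not None:
                counts[f"{param.arg}={ast.unparse(default)}"] += 1
    return counts


# Hypothetical source, invented for this example only.
sample_source = (
    "def load(path, encoding=None, *, retries=0):\n"
    "    pass\n"
    "\n"
    "def save(path, encoding=None):\n"
    "    pass\n"
)

default_counts = count_defaults(sample_source)
for pair, n in default_counts.most_common():
    print(pair, n)
```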

Top usage of (x).

Hmm… What does (s) mean?

Final Remarks

As I work more and more with the GitHub database dump, I get more intrigued by it.

I'm fascinated by what we can obtain from this database, because of the high number of data points. What I looked at here are the top 10,000 results. Imagine what you can find out if you observe the long tail.

For the graphics I'm using Tableau. I have 4 days left on the free trial, so in the future I will not be able to make this kind of visualization. If you have a Twitter account, you can reply to @tableau and ask them to offer me a free licence, so that I can make more Python code analysis visualizations and data analysis in the future.

What other things do you think would be interesting to explore in the GitHub DB code dump?

Next steps.

I ran a query asking for all of the Python functions. I will not reveal now how many there are, but what I can say is that Google BigQuery needed 34 seconds to process the request.

Query complete (33.9s elapsed, 32.2 GB processed)

About Me

For the last 3 years I have been a collaborator with the Organized Crime and Corruption Reporting Project (OCCRP), where I do data analysis and pattern recognition to uncover patterns of corruption in unstructured datasets.

In September 2016 I moved to San Francisco to start a new life. I am searching for a job where I can apply my expertise and pay the rent in SFO.

I am currently building a tool that detects possible fake viral news before it goes viral.

You can find me online on Medium (Florin Badita), AngelList, Twitter, LinkedIn, OpenStreetMap, GitHub, Quora, Facebook

Sometimes I write on my blog http://florinbadita.com/

TEDx Speaker, Forbes 30 Under 30 Europe, European Personality of the Year 2018. Data geek, activist, social entrepreneur. I think all the time about efficiency.
