Using BigQuery & GitHub to Build a Giant Dataset of JavaScript Functions
One of the creative perks of working at BrightMinded is the opportunity to experience the future and work with hot technology. There is an exciting culture of learning and collaboration that encourages us to find bright solutions to some of the biggest challenges.
In this project, I got to explore the gigantic GitHub dataset of open source code and extract a database of JavaScript functions with the aim of exploring patterns in functional code.
How Big is BigQuery?
- Capable of a 1 petabyte scan in 1 minute
- A massively parallel data warehouse
- Used to query planet-scale datasets
- Uses column-oriented storage for optimal compression
Not only does BigQuery expose a familiar SQL API, but it is also possible to include regular expressions, run external JavaScript through user-defined functions and even call C functions using WebAssembly!
The GitHub Dataset
The GitHub dataset contains all the source code that is clearly marked with an open-source license, including commit messages and file metadata. There have been a lot of fun studies of the dataset ever since its release including: Expressions of Emotion in Commit Messages, Profanity by Programming Language and, of course, the infamous Spaces vs Tabs.
Key Stats:
- Over 2 billion file paths
- 165 million open source code file contents
- 145 million commits
- 48 thousand F-bombs
A sample query (written in BigQuery's legacy SQL) that counts the files in the sample table containing the phrase “This should never happen”:
SELECT count(*) FROM [bigquery-public-data:github_repos.sample_contents]
WHERE NOT binary
AND content CONTAINS 'This should never happen'
Extracting JavaScript Functions
Step 1 — All The JavaScript
On-demand queries are priced by the number of bytes read from the columns they reference, regardless of how many rows are returned, so it is advisable to first build a smaller table containing only the data needed.
Query to find all the JavaScript (.js) files:
SELECT contents.content AS content, files.path AS path
FROM [bigquery-public-data:github_repos.contents] contents
INNER JOIN [bigquery-public-data:github_repos.files] files
ON files.id = contents.id
WHERE files.path LIKE '%.js'
AND contents.binary = false
Step 2 — Function Extraction
The naive approach to extracting JavaScript functions would be to find the keyword “function” and then track the opening and closing of the curly braces that follow it. However, this quickly falls down because of the nature of JS: functions can be nested and passed around as values, and curly braces also appear in object literals, strings, and regular expressions.
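To make the failure concrete, here is a minimal sketch of a brace-counting extractor. The function name naiveExtract and both input strings are made up for illustration; it is not code from the project.

```javascript
// Naive approach: find the keyword "function", then count curly braces
// until they balance. It knows nothing about strings, regexes or comments.
function naiveExtract(source) {
  const start = source.indexOf("function");
  if (start === -1) return null;
  const open = source.indexOf("{", start);
  if (open === -1) return null;

  let depth = 0;
  for (let i = open; i < source.length; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}") depth--;
    if (depth === 0) return source.slice(start, i + 1);
  }
  return null; // braces never balanced
}

// Hypothetical inputs: the second one hides a closing brace in a string.
const simple = "function add(a, b) { return a + b; } var x = 1;";
const tricky = 'function close() { return "}"; } var x = 1;';

console.log(naiveExtract(simple)); // function add(a, b) { return a + b; }
console.log(naiveExtract(tricky)); // function close() { return "}
```

The second result is cut off inside the string literal, which is exactly the kind of input a real parser handles and a brace counter does not.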
The solution is to parse the code and traverse the resulting syntax tree. Luckily, there are some great resources out there that do just this, such as Esprima.
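As a rough illustration of the traversal step, the sketch below walks a small hand-written tree in the ESTree shape that JavaScript parsers commonly produce. In practice the tree would come from a parser such as Esprima; the helper name collectFunctions and the sample tree are assumptions made for this example.

```javascript
// Hand-written ESTree-style tree for:
//   function test(a) { b = function () {}; }
const ast = {
  type: "Program",
  body: [{
    type: "FunctionDeclaration",
    id: { type: "Identifier", name: "test" },
    params: [{ type: "Identifier", name: "a" }],
    body: {
      type: "BlockStatement",
      body: [{
        type: "ExpressionStatement",
        expression: {
          type: "AssignmentExpression",
          operator: "=",
          left: { type: "Identifier", name: "b" },
          right: {
            type: "FunctionExpression",
            id: null,
            params: [],
            body: { type: "BlockStatement", body: [] },
          },
        },
      }],
    },
  }],
};

const FUNCTION_TYPES = new Set([
  "FunctionDeclaration",
  "FunctionExpression",
  "ArrowFunctionExpression",
]);

// Depth-first walk over every object and array in the tree,
// recording each function node (nested ones included).
function collectFunctions(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => collectFunctions(child, found));
  } else if (node && typeof node === "object") {
    if (FUNCTION_TYPES.has(node.type)) {
      found.push({
        name: node.id ? node.id.name : null, // anonymous functions have no id
        params: node.params.map((p) => (p.type === "Identifier" ? p.name : p.type)),
      });
    }
    Object.values(node).forEach((child) => collectFunctions(child, found));
  }
  return found;
}

const found = collectFunctions(ast);
console.log(found);
```

Because the walk visits every child, the anonymous function assigned to b is found as well as the outer declaration.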
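An example using an npm module to extract functions from a string:
const functionExtractor = require("function-extractor");
// Source with two top-level functions, one of which contains a nested anonymous function.
const source = `var a=0; function test(a, b){b=function(){return 99}; return a+b} function test2(a){a=2; return a+ 4}`;
// Parse the source string and collect the functions found in it.
const functions = functionExtractor.parse(source);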
Step 3 — The Final Query
With some modifications to the syntax tree generator, I was able to write a “function extractor” module which parsed the functions out of a string of code, including the function names, parameters, code block, and surrounding comments. This was then included in the SQL query which used the BigQuery JSON functions to build a table of results.
CREATE TEMP FUNCTION extractFunctionsWithComments(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE JS AS """
return extractFunctions(str);
"""
OPTIONS (library="gs://github-data/functionExtractor.js");
-- Each array element is a JSON string describing one function;
-- UNNEST turns the array into one row per function.
SELECT JSON_EXTRACT_SCALAR(func, "$.name") AS name,
JSON_EXTRACT_SCALAR(func, "$.params") AS params,
JSON_EXTRACT_SCALAR(func, "$.comments") AS comments,
JSON_EXTRACT_SCALAR(func, "$.blockStart") AS blockStart,
JSON_EXTRACT_SCALAR(func, "$.block") AS block,
contents.path
FROM `github_js.github_contents` AS contents
CROSS JOIN UNNEST(extractFunctionsWithComments(contents.content)) AS func
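The query implies a contract for the JavaScript side: the UDF returns an array of strings, and each string must be JSON whose fields are scalars, because JSON_EXTRACT_SCALAR yields NULL for arrays and objects. The sketch below shows one way that contract could be met. It is not the author's module: toUdfRows, extractScalar, the record values and the choice to join the parameters into one string are all assumptions for illustration.

```javascript
// Serialise extracted function records into the JSON strings the UDF returns.
// Field names mirror the JSON paths used in the query.
function toUdfRows(records) {
  return records.map((r) =>
    JSON.stringify({
      name: r.name,
      params: r.params.join(", "), // flattened so it stays a scalar
      comments: r.comments,
      blockStart: r.blockStart,
      block: r.block,
    })
  );
}

// Rough local stand-in for JSON_EXTRACT_SCALAR(json, "$.key"):
// scalars come back as strings; objects, arrays and missing keys give null.
function extractScalar(json, key) {
  const value = JSON.parse(json)[key];
  if (value === null || value === undefined || typeof value === "object") {
    return null;
  }
  return String(value);
}

// Hypothetical record, for illustration only.
const rows = toUdfRows([
  { name: "test2", params: ["a"], comments: "", blockStart: 0, block: "{a=2; return a+ 4}" },
]);

console.log(extractScalar(rows[0], "name"));   // test2
console.log(extractScalar(rows[0], "params")); // a
// An unflattened array would be lost:
console.log(extractScalar(JSON.stringify({ params: ["a", "b"] }), "params")); // null
```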
Results
The query took just under 4 hours to run and returned about 8 billion functions, of which 65 million were unique, with a total of 7 million distinct function names!
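The three figures count different things, which a small in-memory sketch makes clearer. It assumes that "unique" means an identical code block and uses made-up records; the real counts were computed in BigQuery, not like this.

```javascript
// Count total functions, distinct code blocks and distinct names.
function summarize(functions) {
  const uniqueBlocks = new Set();
  const uniqueNames = new Set();
  for (const fn of functions) {
    uniqueBlocks.add(fn.block);
    if (fn.name) uniqueNames.add(fn.name); // anonymous functions have no name
  }
  return {
    total: functions.length,
    uniqueFunctions: uniqueBlocks.size,
    uniqueNames: uniqueNames.size,
  };
}

// Hypothetical sample: the same empty function copied into several files.
const summary = summarize([
  { name: "noop", block: "{}" },
  { name: "noop", block: "{}" },
  { name: "empty", block: "{}" },
  { name: null, block: "{return 1}" },
]);

console.log(summary); // { total: 4, uniqueFunctions: 2, uniqueNames: 2 }
```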
Longest JavaScript Function Names
With the results, it was easy to find the longest function names used in JavaScript. Here’s a selection of the longest function names (discarding anything which contained underscores or special characters and appeared to be machine generated).
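The automatic part of that filtering can be sketched as follows. The names in the sample are invented, and spotting machine-generated names is left as the manual step it appears to have been.

```javascript
// Keep names made only of letters and digits (no underscores, "$" or other
// special characters), then return the longest ones first.
function longestNames(names, limit = 10) {
  const plain = /^[A-Za-z][A-Za-z0-9]*$/;
  return [...new Set(names)] // drop duplicate names
    .filter((name) => plain.test(name))
    .sort((a, b) => b.length - a.length || a.localeCompare(b))
    .slice(0, limit);
}

// Hypothetical names, for illustration only.
const sample = [
  "getUserAccountSettingsFromLocalStorage",
  "a",
  "__webpack_require__",
  "init",
  "$jscomp$inherits",
  "init",
];

const top = longestNames(sample, 2);
console.log(top); // [ 'getUserAccountSettingsFromLocalStorage', 'init' ]
```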
The Future
We have only just scratched the surface of this huge dataset, and I look forward to running further experiments that could use a similar process, including:
- The use of BigQuery ML to explore relationships between functions and their comments.
- An exploration of “All the JSON” and “All the Arrays”
- Code statistics after minification using UglifyJS