Crawling NBA All-Defensive Team award data and structuring it with Danfo.js
Let's build a scraper with Node.js to crawl historical data for the NBA All-Defensive Team awards.
In another article we will use this same data to train a logistic regression machine learning model with TensorFlow.js.
1. Inspect the HTML structure of the website that contains the data, to assess how to crawl it:
https://www.nba.com/history/awards/defensive-team
2. Build the scraper script with Node.js, using Cheerio.js.
3. Import the data into a Danfo.js dataframe, do some cleanup, and export it as a CSV for further use.
If you have never used Node.js and Axios, I suggest the following article to get started:
Axios JS & Node.js a match made in heaven for API consumption
Designing a scraper
When designing a scraper, the first thing to do is determine whether the content we need to crawl is static, meaning it is delivered as part of the page HTML, or dynamic, meaning it is loaded asynchronously by JavaScript after the page loads.
The website we will be scraping is:
https://www.nba.com/history/awards/defensive-team
To determine whether the content is static or dynamic, we can simply look for it in the page source. In Chrome, right-click the page, choose "View Page Source", and search for the content:
In this case we are dealing with static content, so just grabbing the page HTML, building a virtual DOM, and doing some string operations can get us the data we want!
If the content were dynamic, we would have to either identify the API endpoint that serves the data and request it directly, or render the page with a headless browser such as Puppeteer before crawling it.
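For reference, here is a hypothetical sketch of the dynamic case (not needed for this site), rendering the page with Puppeteer and grabbing the HTML after the JavaScript has run:
const puppeteer = require("puppeteer");

// Render the page in a headless browser and return the final HTML
async function getRenderedHtml(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until the network is idle so asynchronously loaded content is in place
  await page.goto(url, { waitUntil: "networkidle0" });
  const html = await page.content();
  await browser.close();
  return html;
}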
Now we need to find the DOM element that contains the content we want. In Chrome, select part of the content, right-click it, and choose "Inspect" to see the element within the DOM tree:
Let's select the outermost element that still contains all the content we want; in this case we can use the class Article_article__2Ue3h as a unique identifier for the element:
Building our scraper
In an empty folder, run npm init in the console to initialize a new project.
Install the packages we will be using:
npm install --save danfojs-node cheerio axios
Create a file nba_defensive_team_crawler.js for our script, then start the process by:
- Getting the HTML from the site using axios.
- Building a virtual DOM with it using cheerio.
- Selecting the outer element we previously identified, using the class Article_article__2Ue3h.
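Here is a sketch of how those three steps might look inside a crawl_award function (the class name is taken from the page at the time of writing and may change if the site is redeployed):
const axios = require("axios");
const cheerio = require("cheerio");
const dfd = require("danfojs-node");

const URL = "https://www.nba.com/history/awards/defensive-team";

async function crawl_award() {
  // 1. Get the HTML from the site using axios
  const { data: html } = await axios.get(URL);
  // 2. Build a virtual DOM with cheerio
  const $ = cheerio.load(html);
  // 3. Select the outer element by its class and extract its text
  let text = $(".Article_article__2Ue3h").text();
  // ...the parsing steps below go here
}

crawl_award();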
We can take advantage of how the data is separated by season year and build a regular expression to split the text into an array.
I came up with this regular expression:
/(^\d\d\d\d-\d\d\n)+/gm
Here is a quick breakdown of the regex: ^ anchors the match to the start of a line (the m flag makes ^ match at every line, not just the start of the string), \d\d\d\d-\d\d matches a season label such as 2019-20, and \n consumes the trailing newline. Because the pattern is wrapped in a capturing group, String.split keeps each matched season in the resulting array instead of discarding it.
To add the regex to our crawler, we add the following lines to our crawl_award function:
text = text.split(/(^\d\d\d\d-\d\d\n)+/gm);
console.log(text);
Now it's time to parse it. Let's add the following code to our crawl_award function:
// Array to store the aggregated data
let award_data = [];

// We iterate over the split text
for (const [index, element] of text.entries()) {
  // The array structure is the following:
  // ['> NBA History: Award...', '2019-20\n', 'First Team\n +...']
  // Element 0 of the array contains the header,
  // element 1 contains the year, and
  // element 2 contains the data:
  // ['header', 'year', 'data', 'year', 'data', 'year', 'data', ...]
  // We only want to act when we are on a 'year' element (odd index)
  if (!(index % 2)) continue;
  // Get the year; we will use the final year of the season
  const Year = parseInt(element.split('-')[0]) + 1;
  // We get the awarded players from the next element in the array
  const awarded = text[index + 1].split('\n');
  // We calculate where the second team selection starts
  const second_team_index = awarded.indexOf('Second Team');
  const first_team = awarded.slice(1, second_team_index);
  const second_team = awarded.slice(second_team_index + 1);
  // We parse the first team and add it to our array
  award_data = award_data.concat(parseDefensiveTeam(first_team, Year, 1));
  // We parse the second team and add it to our array
  award_data = award_data.concat(parseDefensiveTeam(second_team, Year, 2));
}
console.log(award_data);
function parseDefensiveTeam(data, Year, df_number) {
  const award_data = [];
  for (const player of data) {
    // Skip empty lines left over from the split
    if (!player) continue;
    // Extract the team
    const Tm = (player.split(',')[1] || "").trim();
    // Extract the player name
    const Player = player.split(',')[0].trim();
    award_data.push({
      Year,
      Player,
      Tm,
      defensive_team: df_number,
    });
  }
  return award_data;
}
Now we run it, and we get our parsed data as an array of objects:
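The output should look something like this (the rows below are illustrative of the shape, not verbatim values):
[
  { Year: 2020, Player: 'Anthony Davis', Tm: 'Los Angeles Lakers', defensive_team: 1 },
  { Year: 2020, Player: 'Giannis Antetokounmpo', Tm: 'Milwaukee Bucks', defensive_team: 1 },
  ...
]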
Importing our data into a Danfo.js dataframe and doing some cleanup
In order to do more complex operations with our data, we will use Danfo.js.
Danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.
In the crawl_award function we now add:
// We create a new dataframe using our award_data
const award_df = new dfd.DataFrame(award_data);
//print the head of our dataframe as a table in the console
award_df.head().print();
Danfo is a really powerful tool that will let you dive deeper into your data. Remember, your data needs a lot of love and time: go through it, try to understand it and its context, but never trust your data too much.
We can do queries with Danfo in the following manner:
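For example, we might query for Shaquille O'Neal like this (a sketch using the same query syntax shown below, typed with a straight apostrophe):
award_df
  .query({ column: "Player", is: "==", to: "Shaquille O'Neal" })
  .print();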
Oops! That query came back empty, even though Shaq made multiple All-Defensive Teams. What's wrong? After exploring our data by year, down to a year Shaq had previously won it, 2000, we see the problem.
award_df
.query({ column: "Year", is: "==", to: 2000})
.head().print();
The apostrophe on the site is a typographic apostrophe (’), a different character from the straight apostrophe (') we were expecting. To fix it, we can do the following:
// Replace the apostrophe in all the players names
award_df['Player'] = award_df['Player'].str.replace('’','\'');
Now run the initial query again to see the various times Shaq made an All-Defensive Team.
After some time working with our data, I was able to detect some other issues (a cleanup sketch follows the list):
- "Metta World Peace" is still named "Ron Artest".
- There is a typo in "Patrick Beverley"'s name.
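Here is one possible cleanup sketch: map the inconsistent names to a canonical form with Series.apply. The exact Beverley misspelling below is an assumption; inspect your own crawled output to confirm it.
// Map raw (crawled) names to their canonical form.
// "Patrick Beverly" is an assumed misspelling; verify it against your data.
const name_fixes = {
  "Ron Artest": "Metta World Peace",
  "Patrick Beverly": "Patrick Beverley",
};
award_df["Player"] = award_df["Player"].apply(
  (name) => name_fixes[name] || name
);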
Now, after our cleanup, let's save our data to a CSV file so we can easily reuse it later:
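A minimal sketch of one way to do this, assuming the 0.x-era danfo.js API where to_csv() resolves to a CSV string (newer versions expose dfd.toCSV(df, { filePath }) instead):
const fs = require("fs");

// to_csv() resolves to the dataframe serialized as a CSV string,
// which we then write to disk ourselves
award_df.to_csv().then((csv) => {
  fs.writeFileSync("./nba_defensive_teams.csv", csv);
});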
That's all folks!
All the source code can be found in the following GitHub repo:
https://github.com/AoX04/all-nba-defense
The final CSV result can also be found in the same repo:
https://github.com/AoX04/all-nba-defense/blob/main/nba_defensive_teams.csv