Let's build a scraper with Node.js to crawl historical data on the NBA All-Defensive Team awards.

In another article we will use this same data to train a logistic regression machine learning model with TensorFlow.js.

1. We will inspect the HTML structure of the website that contains the data, to assess how to crawl it.

2. Build the scraper script with Node.js, using Cheerio.

3. Import our data into a Danfo.js dataframe, do some cleanup, and export it as a CSV for further use.

If you have never used Node.js and Axios, I suggest the following article to get started:

Axios JS & Node.js a match made in heaven for API consumption

Designing a scraper

When designing a scraper, the first thing to do is determine whether the website content we need to crawl is static content, meaning the content is delivered as part of the page HTML, or dynamic content that is loaded asynchronously with JavaScript after the page has loaded.

The website we will be scraping is the NBA All-Defensive Team awards history page: https://www.nba.com/history/awards/defensive-team

To determine if the content is static or dynamic, we can simply look for it in the site's HTML. In Chrome, right-click the page and select "View page source", then search for the content:

HTML source code of the page, that contains the content we are trying to crawl

In this case we are dealing with static content, so simply grabbing the page HTML, building a virtual DOM, and doing some string operations will get us the data we want!

If the content were dynamic, we would have had to either identify the API endpoint that serves the data and request it directly, or crawl the page with a headless browser that renders it, such as Puppeteer.
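As a quick sanity check, here is a minimal sketch of the static-vs-dynamic distinction using made-up HTML strings (not the real NBA page): static content is findable in the raw HTML, while a dynamic page delivers only an empty shell that JavaScript fills in later.

```javascript
// Hypothetical response for a page with static content:
const staticHtml = `
<html><body>
  <div class="Article_article__2Ue3h">2019-20 First Team Rudy Gobert, Utah Jazz</div>
</body></html>`;

// Hypothetical response for a page with dynamic content:
// just an empty mount point plus a script bundle.
const dynamicHtml = `
<html><body>
  <div id="root"></div>
  <script src="/bundle.js"></script>
</body></html>`;

// If a sample of the text we want appears in the raw HTML,
// the content is static and a plain HTTP GET plus parsing is enough.
function isStaticContent(html, sample) {
  return html.includes(sample);
}

console.log(isStaticContent(staticHtml, 'Rudy Gobert'));  // true
console.log(isStaticContent(dynamicHtml, 'Rudy Gobert')); // false
```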

Now we need to determine which DOM element contains the content we want. In Chrome, select part of the content, right-click it, and choose "Inspect" to see the element in the DOM tree:

Inspecting element in google chrome, and viewing it in the DOM tree using the chrome developer tools

Let's select the outermost element that still contains all the content we want. In this case we can use the class Article_article__2Ue3h as a unique identifier for the element:

selecting the outer element and visualizing it using google chrome developer tools

Building our scraper

In an empty folder, run npm init in the console to initialize a new project.

Install the packages we will be using:

npm install --save danfojs-node cheerio axios

Create a file nba_defensive_team_crawler.js for our script, then let's start the process by:

  1. Getting the HTML from the site using Axios.
  2. Building a virtual DOM with it using Cheerio.
  3. Selecting the outer element we previously identified, using the class Article_article__2Ue3h.
const axios = require('axios');
const cheerio = require('cheerio');
const dfd = require("danfojs-node");

async function crawl_award() {
    // We use axios to get the HTML page through an HTTP GET request
    const { data } = await axios.get('https://www.nba.com/history/awards/defensive-team');

    // Build a virtual DOM using cheerio;
    // this virtual DOM will allow us to do element selection
    const $ = cheerio.load(data);

    // We select the element using the class we identified
    // and extract its text content
    let text = $('.Article_article__2Ue3h').text();
    console.log(text);
}

crawl_award();

2019-20 First Team  Rudy Gobert, Utah Jazz  Giannis Antetokounmpo, Milwaukee Bucks  Anthony Davis, Los Angeles Lakers  Marcus Smart, Boston Celtics  Ben Simmons, Philadelphia 76ers  Second Team  Brook Lopez, Milwaukee Bucks  Kawhi Leonard, LA Clippers  Bam Adebayo, Miami Heat  Patrick Beverley, LA Clippers  Eric Bledsoe, Milwaukee Bucks  * Official Release & voting totals  2018-19 First Team  Rudy Gobert, Utah Jazz  Paul George, Oklahoma City Thunder  Giannis Antetokounmpo, Milwaukee Bucks  Marcus Smart, Boston Celtics  Eric Bledsoe, Milwaukee Bucks  Second Team
our script outputs the data in text format

We can take advantage of the way the data is separated using the year, and build a regular expression to map the text into an array.

I came up with this regular expression:

/(^\d\d\d\d-\d\d\n)+/gm
Here is a detailed explanation of the regex:

^ Beginning. Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled.
\d\d\d\d Digits. Matches any digit character (0-9), four times, for the starting year.
- Character. Matches a "-" character (char code 45).
\d\d Digits. Matches any digit character (0-9), twice, for the ending year.
\n Escaped character. Matches a LINE FEED character (char code 10).
Global search (g) and multiline (m) flags are enabled.
Explanation of the regular expression using regexr.com
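Note that because the year pattern is wrapped in a capture group, String.prototype.split keeps the matched separators in the resulting array instead of discarding them. A small sketch on a trimmed-down sample of the text:

```javascript
// A miniature version of the scraped text, two seasons only
const sample =
  'Header\n' +
  '2019-20\n' +
  'First Team\nRudy Gobert, Utah Jazz\n' +
  '2018-19\n' +
  'First Team\nPaul George, Oklahoma City Thunder\n';

// The capture group makes split() return the matched years
// alongside the chunks between them:
// ['Header\n', '2019-20\n', 'First Team\n...', '2018-19\n', 'First Team\n...']
const parts = sample.split(/(^\d\d\d\d-\d\d\n)+/gm);

console.log(parts[1]); // '2019-20\n'
console.log(parts[3]); // '2018-19\n'
```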

To add the regex to our crawler, we add the following line to our crawl_award function:

text = text.split(/(^\d\d\d\d-\d\d\n)+/gm);

[
  '> NBA History: Awards\nYear-by-year NBA All-Defensive Teams:\n',
  '2019-20\n',
  'First Team\n' +
    'Rudy Gobert, Utah Jazz\n' +
    'Giannis Antetokounmpo, Milwaukee Bucks\n' +
    'Anthony Davis, Los Angeles Lakers\n' +
    'Marcus Smart, Boston Celtics\n' +
    'Ben Simmons, Philadelphia 76ers\n' +
    'Second Team\n' +
    'Brook Lopez, Milwaukee Bucks\n' +
    'Kawhi Leonard, LA Clippers\n' +
    'Bam Adebayo, Miami Heat\n' +
    'Patrick Beverley, LA Clippers\n' +
    'Eric Bledsoe, Milwaukee Bucks\n' +
    '* Official Release & voting totals\n',
  '2018-19\n',
  'First Team\n' +
    'Rudy Gobert, Utah Jazz\n' +
    'Paul George, Oklahoma City Thunder\n' +
    'Giannis Antetokounmpo, Milwaukee Bucks\n' +
    'Marcus Smart, Boston Celtics\n' +
    'Eric Bledsoe, Milwaukee Bucks\n' +
    'Second Team\n' +
    'Jrue Holiday, New Orleans Pelicans\n' +
    'Klay Thompson, Golden State Warriors\n' +
    'Joel Embiid, Philadelphia 76ers\n' +
    'Draymond Green, Golden State Warriors\n' +
    'Kawhi Leonard, Toronto Raptors\n' +
    '* Official Release & voting totals\n',
  '2017-18\n',
Now we have an easy iterable array!

Now it's time to parse it. Let's add the following code to our crawl_award function:

// Array to store the aggregated data
let award_data = [];

// We iterate over the text
for (const [index, element] of text.entries()) {
    // The array structure is the following:
    // ['> NBA History: Award...', '2019-20\n', 'First Team\n...']
    // Element 0 of the array contains the header,
    // the odd elements contain the years,
    // and the remaining even elements contain the data:
    // ['header', 'year', 'data', 'year', 'data', 'year', 'data', ...]
    // We only want to process when on a 'year' element
    if (!(index % 2)) continue;

    // Get the year; we will use the final year of the season
    const Year = parseInt(element.split('-')[0]) + 1;

    // We get the awarded players from the next element in the array
    const awarded = text[index + 1].split('\n');

    // We calculate where the second team selection starts
    const second_team_index = awarded.indexOf('Second Team');

    const first_team = awarded.slice(1, second_team_index);
    const second_team = awarded.slice(second_team_index + 1);

    // We parse the first team and add it to our array
    award_data = award_data.concat(parsDefensiveTeam(first_team, Year, 1));

    // We parse the second team and add it to our array
    award_data = award_data.concat(parsDefensiveTeam(second_team, Year, 2));
}

function parsDefensiveTeam(data, Year, df_number) {
    const award_data = [];
    for (const player of data) {
        // Skip empty lines and footnotes such as
        // '* Official Release & voting totals'
        if (!player || player.startsWith('*')) continue;

        // Extract the player name
        const Player = player.split(',')[0].trim();
        // Extract the team
        const Tm = (player.split(',')[1] || "").trim();

        award_data.push({
            Year: parseInt(Year),
            Player: Player,
            Tm: Tm,
            defensive_team: df_number,
        });
    }
    return award_data;
}

Now we run it, and we get our parsed data as an array of JSON objects:

[
  { Year: 2020, Player: 'Rudy Gobert', Tm: 'Utah Jazz', defensive_team: 1 },
  { Year: 2020, Player: 'Giannis Antetokounmpo', Tm: 'Milwaukee Bucks', defensive_team: 1 },
  { Year: 2020, Player: 'Anthony Davis', Tm: 'Los Angeles Lakers', defensive_team: 1 },
  { Year: 2020, Player: 'Marcus Smart', Tm: 'Boston Celtics', defensive_team: 1 },
  { Year: 2020, Player: 'Ben Simmons', Tm: 'Philadelphia 76ers', defensive_team: 1 },
  { Year: 2020, Player: 'Brook Lopez', Tm: 'Milwaukee Bucks', defensive_team: 2 },
Our parsed data as JSON

Importing our data into a Danfo.js dataframe and doing some cleanup

In order to do more complex operations with our data, we will use Danfo.js.

Danfo.js is an open-source, JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.

In the crawl_award function we now add:

// We create a new dataframe using our award_data
const award_df = new dfd.DataFrame(award_data);

// Print the head of our dataframe as a table in the console
award_df.head().print();
A print in the console of a table with 5 rows and the following columns: an unnamed index column, Year, Player, Tm (team), and defensive_team
Print of our dataframe

Danfo is a really powerful tool that will help you dive deeper into your data. Remember, your data needs a lot of love and time: go through it, try to understand it and its context, but never trust your data too much.

We can do queries with Danfo in the following manner:

award_df.query({ column: "Player", is: "==", to: 'Shaquille O\'Neal'}).print();
Querying for entries of players named Shaquille O'Neal in our dataframe

Oops! That query came back empty, even though Shaq made multiple All-Defensive Teams. What's wrong? After exploring our data by year, down to 2000, a year Shaq had won it, we see the problem.

award_df.query({ column: "Year", is: "==", to: 2000}).print();

The apostrophe in the scraped data (’) is a different character from the one we were expecting ('). To fix it we can do the following:

// Replace the curly apostrophe (’) with a straight one (') in all player names
award_df['Player'] = award_df['Player'].str.replace('’','\'');
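To see why the first query failed, here is a plain-JavaScript sketch of the mismatch: the scraped names use the right single quotation mark U+2019 (’), which is a different character from the ASCII apostrophe U+0027 (') that our query used.

```javascript
const scraped = 'Shaquille O’Neal';  // as it comes from the site
const expected = "Shaquille O'Neal"; // what our query searched for

// The two strings look alike but are not equal:
console.log(scraped === expected); // false

// Their apostrophes have different code points:
console.log('’'.codePointAt(0).toString(16)); // '2019'
console.log("'".codePointAt(0).toString(16)); // '27'

// Normalizing the curly apostrophe makes the comparison succeed:
const normalized = scraped.replace(/’/g, "'");
console.log(normalized === expected); // true
```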

Now we try the initial query again and see the various times Shaq made an All-Defensive Team.

Table that shows Shaq made the second All-Defensive Team in 2003, 2001, and 2000 while on the Los Angeles Lakers
Now we can find Shaq

After some time working with our data, I was able to detect some other issues:

  1. "Metta World Peace" is still named "Ron Artest".
  2. There is a typo in "Patrick Beverley"'s name.
// Replace "Ron Artest" with "Metta World Peace"
award_df['Player'] = award_df['Player'].str.replace('Ron Artest','Metta World Peace');

// Fix the typo in Patrick Beverley's name
award_df['Player'] = award_df['Player'].str.replace('Patrick Beverly','Patrick Beverley');
Improving our data with a little bit of data wrangling

Now, after our cleanup, let's save our data to a CSV file so we can easily use it later:

And that's it, our data is saved!

That's all folks!

All the source code can be found in the following GitHub repo.

The final CSV result can also be found in the same repo