Let's build a scraper with Node.js to crawl historical data for the NBA All-Defensive Team awards.
In another article, we will use this same data to train a logistic regression machine learning model with TensorFlow.js.
1. We will inspect the HTML structure of the website that contains the data, to assess how to crawl it
https://www.nba.com/history/awards/defensive-team
2. Build the scraper script with Node, using Cheerio.js
3. Import our data into a Danfo.js dataframe, do some cleanup, and export it as a CSV for further use.
If you have never used Node.js and Axios, I would suggest the following article to get started:
Axios JS & Node.js a match made in heaven for API consumption
Designing a scraper
When designing a scraper, the first thing to do is determine whether the website content we need to crawl is static content, meaning it is loaded as part of the page HTML, or dynamic content that is loaded asynchronously with JavaScript after the site has loaded.
The website we will be scraping is:
https://www.nba.com/history/awards/defensive-team
To determine whether the content is static or dynamic, we can simply look for it in the site's HTML. In Chrome, we can right-click the page and view the source code, then search for the content:

In this case we are dealing with static content, so just grabbing the page HTML, building a virtual DOM, and doing some string operations can get us the data we want!
If the content were dynamic, we would have had to either identify the API endpoint serving the data and request it directly, or use a headless browser such as Puppeteer to render the page server-side and crawl it.
Now we need to determine which DOM element contains the content we want. In Chrome, we can select part of the content, right-click it, and choose Inspect to see the element within the DOM tree:

Let's select the outermost element that still contains all the content we want; in this case we can use the class Article_article__2Ue3h as a unique identifier of the element:

Building our scraper
In an empty folder, run npm init in the console to initialize a new project.
Install the packages we will be using:
npm install --save danfojs-node cheerio axios
Create a file nba_defensive_team_crawler.js for our script, then let's start the process by:
- Getting the HTML from the site using axios.
- Building a virtual DOM with it using cheerio.
- Selecting the outer element we previously identified, via the class identifier Article_article__2Ue3h, as in the sketch below.

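Here is a minimal sketch of those three steps (the function name crawl_award comes from later in the article; the class name can change whenever NBA.com redeploys its frontend):

const axios = require("axios");
const cheerio = require("cheerio");

const URL = "https://www.nba.com/history/awards/defensive-team";

async function crawl_award() {
  // 1. Get the HTML from the site
  const { data: html } = await axios.get(URL);

  // 2. Build a virtual DOM with it
  const $ = cheerio.load(html);

  // 3. Select the outer element via its class identifier
  const article_text = $(".Article_article__2Ue3h").text();

  // ...the steps in the rest of the article continue here
}

crawl_award();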
We can take advantage of the way the data is separated by year, and build a regular expression to map the text into an array.
I came up with this regular expression:
/(^\d\d\d\d-\d\d\n)+/gm
Here is a detailed explanation of the regex: ^\d\d\d\d-\d\d matches a season label such as 2019-20 at the start of a line (four digits, a hyphen, two digits), \n consumes the line break after it, the surrounding (...)+ lets consecutive labels match as one block, and the g and m flags match globally and make ^ anchor at every line start.
To add the regex to our crawler, we add the following lines to our crawl_award function:

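A sketch of that step, building on the article_text variable from the previous snippet:

// Inside crawl_award(), after extracting article_text:
const year_regex = /(^\d\d\d\d-\d\d\n)+/gm;

// Every season label in document order, e.g. ["2019-20", "2018-19", ...]
const years = (article_text.match(year_regex) || []).map((y) => y.trim());

// split() on a regex with a capture group also returns the captured
// labels, so we drop the intro text before the first label and then
// filter the labels back out, leaving one roster chunk per season
const rosters = article_text
  .split(year_regex)
  .slice(1)
  .map((chunk) => chunk.trim())
  .filter((chunk) => !/^\d{4}-\d{2}$/.test(chunk));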
Now it's time to parse it. Let's add the following code to our crawl_award function:
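Here is what the parsing could look like, assuming each roster chunk lists one player per line (the real page may also include team or position labels that need stripping):

// Turn each (season, roster) pair into flat { Year, Player } rows
const rows = [];
years.forEach((year, i) => {
  rosters[i]
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)
    .forEach((player) => {
      // "2019-20" -> 2020, the calendar year the season ended
      rows.push({ Year: Number(year.slice(0, 4)) + 1, Player: player });
    });
});

console.log(JSON.stringify(rows, null, 2));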
Now when we run it, we get our parsed data as a JSON object:

Importing our data into a Danfo.js dataframe and doing some cleanup
In order to do more complex operations with our data, we will use Danfo.js.
Danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.
In the crawl_award function, we now add:

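A sketch, assuming rows is the array of { Year, Player } objects built in the parsing step:

const dfd = require("danfojs-node");

// Load the parsed rows into a dataframe; declared with let so we
// can reassign it after the cleanup steps below
let award_df = new dfd.DataFrame(rows);
award_df.head().print();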
Danfo is a really powerful tool that will allow you to dive deeper into your data. Remember: your data needs a lot of love and time. Go through it, try to understand it and its context, but never trust your data too much.
We can do queries with Danfo in the following manner:
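For example, to look up Shaq (assuming the Player column name from the parsing sketch above):

award_df
  .query({ column: "Player", is: "==", to: "Shaquille O'Neal" })
  .print();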
Oops! That query came out empty, even though Shaq made multiple All-Defensive Teams. What's wrong? After exploring our data by year, down to a year Shaq is known to have won it, 2000, we see the problem:
award_df
  .query({ column: "Year", is: "==", to: 2000 })
  .head()
  .print();
The apostrophe on the page is a different character than the one we were typing in our query. To fix it, we can normalize the character:
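A sketch that sidesteps version-specific column-assignment APIs by rebuilding the dataframe from normalized rows:

// Replace the curly apostrophe (’, U+2019) with a plain one (')
const cleaned_rows = rows.map((row) => ({
  ...row,
  Player: row.Player.replace(/\u2019/g, "'"),
}));
award_df = new dfd.DataFrame(cleaned_rows);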
Now try the initial query again and see the various times Shaq made an All-Defensive Team.

After some time working with our data, I was able to detect some other issues, fixed in the sketch below:
- "Metta World Peace" is still named "Ron Artest".
- The name "Patrick Beverley" has a typo.
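A sketch of those fixes; the exact misspelling on the page is an assumption, so check your crawled output and adjust the keys:

// Map crawled names to corrected ones
const name_fixes = {
  "Ron Artest": "Metta World Peace",
  "Patrick Beverly": "Patrick Beverley", // hypothetical typo on the page
};

const final_rows = cleaned_rows.map((row) => ({
  ...row,
  Player: name_fixes[row.Player] || row.Player,
}));
award_df = new dfd.DataFrame(final_rows);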
Now, after our cleanup, let's save our data to a CSV file so that we can easily use it later:
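A sketch using the Node build of Danfo (recent danfojs-node releases expose dfd.toCSV; older ones used df.to_csv() instead):

// Write the cleaned dataframe to disk
dfd.toCSV(award_df, { filePath: "./nba_defensive_teams.csv" });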
That's all, folks!
All sources can be found in the following GitHub repo:
https://github.com/AoX04/all-nba-defense
The final CSV can also be found in the same repo:
https://github.com/AoX04/all-nba-defense/blob/main/nba_defensive_teams.csv