Instagram is one of the most popular social networking platforms today, especially among millennials. It was originally intended for sharing photos with friends and family, but it is now used in many other ways. Many brands collaborate with popular pages to promote their products, and some people maintain an Instagram page for a living. In short, Instagram is a big deal, and there are a lot of opportunities in this space. In this post, we will see how to scrape the top hashtags for a given keyword.
The Concept
The concept is pretty simple. Go to the Instagram homepage and search for something. You will see a grid of results. These are the top results on Instagram for that search, so the hashtags they use are probably well thought out. These are the hashtags that we will scrape using Node.js code.
To do this, we will start a project, install the dependencies, write the code, and test it. The dependencies of this project are the npm packages request and request-promise.
The code
Let us look at the code step by step.
- Let’s install and include the dependencies.
In the terminal:
    npm install request request-promise --save
In the js file:
    const rp = require('request-promise');
- Now, let’s divide the whole script into 3 parts: the main logic and two helper functions. The two functions are:
scrapeHashtags() – will extract all the hashtags from the html code.
removeDuplicates() – will remove all the duplicate hashtags scraped.
    // Logic here

    const scrapeHashtags = (inputText) => { }

    const removeDuplicates = (arr) => { }
- In the logic part, we need a variable URL to hold the url of the page we want to scrape. Copy the result page’s url and assign it to URL, replacing the keyword you typed earlier with ${keyWord}. Wrap the string in backticks so that ${keyWord} is substituted with the variable’s value.
    let URL = `https://www.instagram.com/explore/tags/${keyWord}/`
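As a minimal sketch of this step, the URL can also be built by a small helper. The name buildTagUrl is hypothetical (it is not part of the original script), and cleaning the keyword with trim() and encodeURIComponent() is an assumption, added so that keywords with spaces or a leading # do not produce a malformed URL.

```javascript
// Hypothetical helper (not in the original script): builds the tag page URL
// for a keyword using a template literal.
const buildTagUrl = (keyWord) => {
  // Assumption: drop surrounding spaces and a leading '#', so that
  // "#developers" and "developers" give the same URL.
  const cleaned = keyWord.trim().replace(/^#/, '');
  // encodeURIComponent escapes characters that are not safe in a URL path.
  return `https://www.instagram.com/explore/tags/${encodeURIComponent(cleaned)}/`;
};

console.log(buildTagUrl('developers'));
// prints https://www.instagram.com/explore/tags/developers/
```

With a helper like this, the URL is computed at the moment you call it, so the order in which keyWord and URL are declared matters less.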
- Now, we make the actual request to get some real data. For this, we will use the rp function and pass in URL as the argument. This function returns a promise. A promise is an object that represents the eventual result of an asynchronous operation: the callback passed to .then() runs once the operation succeeds, and the callback passed to .catch() runs if it fails. In our case, the promise resolves with the html code of the result page. We will come back to this after writing the helper functions.
    rp(URL)
        .then((html) => {
            console.log(html);
        })
        .catch((err) => {
            console.log(err);
        });
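To see how a promise behaves without touching the network, here is a small sketch. The function fakeRequest is a hypothetical stand-in for rp(URL): it is not part of request-promise and only imitates the shape of the call by resolving with a short html string after a brief delay.

```javascript
// Hypothetical stand-in for rp(URL): resolves with a small html string
// instead of making a real network request.
const fakeRequest = (url) =>
  new Promise((resolve, reject) => {
    if (!url) {
      // A failed operation rejects the promise; .catch() receives the error.
      reject(new Error('URL is required'));
      return;
    }
    // A successful operation resolves the promise; .then() receives the value.
    setTimeout(() => resolve(`<html><body>fetched ${url}</body></html>`), 10);
  });

// The call returns immediately with a promise, not with the html itself.
const pending = fakeRequest('https://example.com/');
console.log(pending instanceof Promise); // true

// The html only becomes available later, inside the .then() callback.
pending
  .then((html) => console.log(html))
  .catch((err) => console.log(err.message));
```

Note that `true` is printed first and the html second, because the .then() callback only runs after the promise has resolved.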
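- Let’s write the two helper functions one by one.
scrapeHashtags() – In this function, we will use regex to find the hashtags in the html code. The regex pattern for a hashtag is /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm. It matches a # that appears at the start of a line or right after whitespace, and captures the letters and digits that follow. We will push all the hashtags into a list and finally return the list.
    const scrapeHashtags = (html) => {
        var regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
        var matches = [];
        var match;

        // exec() returns the next match on each call, or null when there are none left
        while ((match = regex.exec(html))) {
            // match[1] is the captured hashtag text without the # sign
            matches.push(match[1]);
        }

        return matches;
    }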
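To get a feel for what this pattern does and does not match, it can be tried on a short string. The sample text below is made up for illustration, and String.prototype.matchAll is used here only as a compact alternative to the exec() loop.

```javascript
// Same pattern as in scrapeHashtags().
const hashtagRegex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;

// Made-up sample text, not real Instagram output.
const sampleText = '#travel is fun. See also #Food2020 and a#b, or #snake_case';

// matchAll() yields every match; index 1 is the captured group.
const found = [...sampleText.matchAll(hashtagRegex)].map((m) => m[1]);

console.log(found);
// prints [ 'travel', 'Food2020', 'snake' ]
```

Two things stand out: a#b is skipped because the # is not preceded by whitespace, and #snake_case is cut to snake because the underscore is not in the character class [a-zA-Z\d].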
removeDuplicates() – In this function, we will remove duplicate elements from a list. We will use a generic algorithm that works for any list of simple values. Make an empty array and start pushing elements to it from the array given as the argument. If an element already exists in the new array, skip it. This way, the new array will only have unique elements.
    const removeDuplicates = (arr) => {
        let newArr = [];

        // forEach is used because we only need to visit each element
        arr.forEach(ele => {
            // indexOf returns -1 when the element is not in newArr yet
            if (newArr.indexOf(ele) === -1) {
                newArr.push(ele);
            }
        });

        return newArr;
    }
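As an aside, the same result can be sketched more compactly with the built-in Set, which keeps only unique values and remembers the order in which they were first added. The name removeDuplicatesWithSet is hypothetical and is not used elsewhere in this post.

```javascript
// Hypothetical alternative to removeDuplicates(): a Set drops repeated values,
// and spreading it back into an array keeps first-seen order.
const removeDuplicatesWithSet = (arr) => [...new Set(arr)];

console.log(removeDuplicatesWithSet(['node', 'js', 'node', 'web', 'js']));
// prints [ 'node', 'js', 'web' ]
```

For a long list of hashtags this also avoids rescanning the new array with indexOf on every element.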
- We are done with the helper functions. It’s time to use them. Back to the promise returned by the rp function. Call the scrapeHashtags() function on the html code and store the result in a variable hashtags. Now, call the removeDuplicates() function on hashtags and store the result back in hashtags. Use the map function to add the # sign to every hashtag. And finally, log hashtags to the console.
    rp(URL)
        .then((html) => {
            let hashtags = scrapeHashtags(html);
            hashtags = removeDuplicates(hashtags);
            hashtags = hashtags.map(ele => "#" + ele)
            console.log(hashtags);
        })
        .catch((err) => {
            console.log(err);
        });
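The three steps in the callback can be tried offline by folding them into one function and feeding it a made-up string. This is only a sketch: extractTopHashtags is a hypothetical name, the sample html is invented, and filter() is used in place of the removeDuplicates() helper to keep the block self-contained.

```javascript
// Hypothetical helper: the same three steps as the .then() callback,
// written as one function that takes html and returns the final list.
const extractTopHashtags = (html) => {
  const regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
  const found = [];
  let match;

  // Step 1: collect every hashtag (without the # sign).
  while ((match = regex.exec(html))) {
    found.push(match[1]);
  }

  // Step 2: keep only the first occurrence of each hashtag.
  const unique = found.filter((tag, index) => found.indexOf(tag) === index);

  // Step 3: put the # sign back in front of every hashtag.
  return unique.map((tag) => '#' + tag);
};

// Made-up input, not real Instagram output.
const sampleHtml = 'caption: #node #js #node';
console.log(extractTopHashtags(sampleHtml));
// prints [ '#node', '#js' ]
```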
- Last but not least, make a variable called keyWord, which is expected by URL. Declare it before URL and assign any word you like to it.
    let keyWord = "developers";
Your final code should look something like this:
    const rp = require('request-promise');

    let keyWord = "developers"
    let URL = `https://www.instagram.com/explore/tags/${keyWord}/`

    // Request the page, then extract, de-duplicate and print the hashtags
    rp(URL)
        .then((html) => {
            let hashtags = scrapeHashtags(html);
            hashtags = removeDuplicates(hashtags);
            hashtags = hashtags.map(ele => "#" + ele)
            console.log(hashtags);
        })
        .catch((err) => {
            console.log(err);
        });

    // Helper functions. They are defined before the response arrives,
    // so the callback above can safely use them.
    const scrapeHashtags = (html) => {
        var regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
        var matches = [];
        var match;

        while ((match = regex.exec(html))) {
            matches.push(match[1]);
        }

        return matches;
    }

    const removeDuplicates = (arr) => {
        let newArr = [];

        arr.forEach(ele => {
            if (newArr.indexOf(ele) === -1) {
                newArr.push(ele);
            }
        });

        return newArr;
    }
Testing it
Assign any keyword you want to search to keyWord. Now, run node filename.js. This should take a few seconds and then print a giant list of hashtags. Congratulations, you have successfully written a script to scrape hashtags from Instagram.
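One practical note: the request and request-promise packages are deprecated, so you may prefer to make the request without them. The following is a rough sketch, assuming Node.js 18 or later, where fetch is available as a built-in global. The name fetchHtml is hypothetical and is not part of the script above.

```javascript
// Sketch only: assumes Node.js 18 or later, where fetch is a built-in global.
// fetchHtml is a hypothetical name, not part of the original script.
const fetchHtml = async (url) => {
  const response = await fetch(url);

  if (!response.ok) {
    // rp rejects on non-2xx status codes by default; mirror that here.
    throw new Error(`Request failed with status ${response.status}`);
  }

  // text() returns a promise that resolves with the response body as a string.
  return response.text();
};

// Usage (makes a real network request, so it is left commented out):
// fetchHtml(URL)
//   .then((html) => console.log(html.length))
//   .catch((err) => console.log(err.message));
```

Since fetchHtml(URL) returns a promise that resolves with the html, it could stand in for rp(URL) in the final code while the rest of the script stays the same.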