I asked myself how to angle "Web Scraping for Machine Learning", and I realized that web scraping should be an essential skill for Data Scientists, Data Engineers and Machine Learning Engineers.
The Full Stack AI/ML Engineer toolkit needs to include web scraping, because new, high-quality data can improve predictions. Machine Learning inherently requires data, and we are most comfortable when we have as much high-quality data as possible. But what about when the data you need is not available as a dataset? What then? Do you just ask organizations and hope that they will kindly hand it over for free?
The answer is: you collect, label and store it yourself.
I made a GitHub repository for scraping the data. I encourage you to try it out, scrape some data yourself, and even try to build an NLP or other Machine Learning project out of the scraped data.
In this article, we are going to web scrape Reddit – specifically, the /r/DataScience subreddit (and a little of /r/MachineLearning). We will not use the Reddit API, since web scraping is usually what we turn to when an API is not available. Furthermore, you are going to learn to combine knowledge of HTML, Python, databases, SQL and datasets for Machine Learning. We finish with a small NLP sample project, but this is only to showcase that you can pick up the dataset and create a model that provides predictions.
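To make the "collect, label and store" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the code from the repository: the sample posts, the `label_post` rule, and the `posts` table layout are all hypothetical, chosen only to show how the pieces fit together.

```python
import sqlite3

# Hypothetical scraped rows: (title, body). In the real project these
# would come from the scraper, not from a hard-coded list.
scraped_posts = [
    ("How do I get started with pandas?", "Looking for beginner resources."),
    ("Weekly discussion thread", "Share what you are working on."),
    ("Is a PhD needed for ML roles?", "Curious about hiring requirements."),
]


def label_post(title: str) -> str:
    """Toy labelling rule (an assumption): titles ending in '?' are questions."""
    return "question" if title.strip().endswith("?") else "other"


def store_posts(conn: sqlite3.Connection, posts) -> None:
    """Create the table if needed and insert each post with its label."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, title TEXT, body TEXT, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (title, body, label) VALUES (?, ?, ?)",
        [(title, body, label_post(title)) for title, body in posts],
    )
    conn.commit()


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
store_posts(conn, scraped_posts)

# Export (text, label) pairs, ready to be used as a small dataset.
dataset = conn.execute("SELECT title, label FROM posts ORDER BY id").fetchall()
print(dataset)
```

Swapping `":memory:"` for a file path would persist the data between runs, and the labelling rule could be replaced by manual labels or any other heuristic.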
Table of Contents
- Web Scraping in Python - Beautiful Soup and Selenium
- Labelling Scraped Data
- Storing Scraped Data
- Small Machine Learning Project on Exported Dataset
- Further Readings
Web Scraping in Python With BeautifulSoup and Selenium
The first thing we need to do is install BeautifulSoup and Selenium for scraping, but to run the whole project (i.e. also the Machine Learning part), we need more packages.
Install all the packages from the GitHub repository (linked at the start):
pip install -r requirements.txt
Alternatively, if you are using Google Colab, you can run the following to install the packages needed:
!pip install -r https://github.com/casperbh96/Web-Scrap
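Once the packages are installed, a quick way to check that BeautifulSoup works is to parse a small piece of HTML. The sketch below is only an illustration: the HTML snippet, class names, and URLs are made up, and real Reddit markup looks different and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely shaped like a list of forum posts.
# This is NOT Reddit's actual markup.
html = """
<div class="post"><a class="title" href="/r/datascience/1">First post</a></div>
<div class="post"><a class="title" href="/r/datascience/2">Second post</a></div>
"""

# "html.parser" is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect the title text and link target of every post.
posts = []
for link in soup.select("div.post a.title"):
    posts.append({"title": link.get_text(strip=True), "url": link["href"]})

print(posts)
```

The CSS selector passed to `select` is where most of the scraping work happens in practice: you inspect the page in the browser and adjust the selector to match the elements you want.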
Source - Continue Reading: https://mlfromscratch.com/web-scraping-machine-learning/