This is a very simple REST API that extracts all images and the 10 most used words from a given URL, using .NET Core 3.1 and Selenium with the Chrome WebDriver.
1 - Download and install .NET Core 3.1 for your operating system: https://dotnet.microsoft.com/download/dotnet/3.1
2 - Clone this repository
git clone https://github.com/apandrade/API.extractor.git
3 - Download the Chrome WebDriver for Selenium that matches your installed Chrome version and extract chromedriver.exe into a folder of your choice
https://sites.google.com/a/chromium.org/chromedriver/downloads
4 - Open the cloned project in Visual Studio, open the launchSettings.json file located inside the Properties folder, and set the CHROME_WEBDRIVER_PATH environment variable to the folder where you put chromedriver.exe in the previous step.
"CHROME_WEBDRIVER_PATH": "C:\\WebDriver\\bin"
Open a Windows command prompt, navigate to the project folder API.extractor\API.Extractor and run
dotnet run API.Extractor
Now open your web browser and navigate to https://localhost:5001/swagger/index.html to see the Swagger API documentation.
Open a Windows command prompt, navigate to the project root folder API.extractor and run
dotnet test UnitTests
There is just one POST method, on the route
api/v1/extractor
The expected payload is the JSON below, where Url is the website you want to scrape and download is a boolean that indicates whether the API should download all images to the server. If download is false, the API returns the original URLs of the images. If download is true, be aware that the server replaces all stored images on every request you make.
{
  "Url": "https://giphy.com/gifs/brazil-PSKAppO2LH56w",
  "download": true
}
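As an illustration of calling the route with this payload, here is a minimal client sketch in Python using only the standard library. It assumes the API is running locally at https://localhost:5001, as in the Swagger step. The helper names build_request and extract are hypothetical and not part of this project. The shape of the response is not described here, so the sketch simply returns whatever JSON the server sends back.

```python
import json
import urllib.request

# Assumption: the API is running locally on the HTTPS port used for Swagger.
BASE_URL = "https://localhost:5001"


def build_request(page_url, download=False, base_url=BASE_URL):
    """Build (but do not send) a POST request for api/v1/extractor."""
    payload = {"Url": page_url, "download": download}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/extractor",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(page_url, download=False, context=None):
    """Send the request and return the decoded JSON response as-is."""
    request = build_request(page_url, download)
    # context is an optional ssl.SSLContext, useful if the local
    # development certificate is not trusted by Python.
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, extract("https://giphy.com/gifs/brazil-PSKAppO2LH56w") would send the same payload as above with download set to false.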
You can also set the minimum length a word must have to be counted by changing MinWordSize in the appsettings.json file.
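To make the effect of a minimum word size concrete, here is a small Python sketch of counting the most used words. It is not this project's implementation: the function name top_words, the lowercasing, the \w+ tokenization and the inclusive length check are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, min_word_size=3, limit=10):
    """Return up to `limit` (word, count) pairs, most frequent first.

    Words shorter than min_word_size are ignored (assumed inclusive bound).
    """
    words = re.findall(r"\w+", text.lower())
    counted = Counter(w for w in words if len(w) >= min_word_size)
    return counted.most_common(limit)
```

Raising min_word_size filters out short, common words such as articles, so they do not crowd out more meaningful terms in the top 10.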