Web Scraping, also known as Screen Scraping, Web Data Extraction, or Web Harvesting, is a technique used to extract large amounts of data from one or more websites.
Most websites don't offer the functionality to save their data onto your computer. Typically the only option is to
Right Click > Save As, which becomes a very tedious task very quickly. Being able to scrape a site for its content certainly has its uses; perhaps you want to download
Wikipedia (which I've heard is only 14 GB without pictures), or if you're really into something like PowerShell, you could search Google for all images with PowerShell in the name and then download them to your computer. [Next Upcoming Post]
In the function below I scrape my website's homepage for all of its images; in other words, my computer searches my homepage [http://www.matthewkerfoot.com] for all image files and then downloads them to my local machine.
$Url = "http://www.TheOvernightAdmin.com"
$iwr = Invoke-WebRequest -Uri $Url
$images = ($iwr).Images | select src
$images
Output:
PS C:\> $images
src
---
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-MeZf9LB70psOUNXQR5bZdkDZCCgfohOI25DusGLDSQqVDFubnV52a-mdwqgCSa9FkvHOUh9J_UYk52AUGP9G7ciSrV-4YxSH8DVeZ2RRSn5bu8CRljguqkfBXmrG18c6bLcJKlgFJI/s1600/3spaces.JPG
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-MeZf9LB70psOUNXQR5bZdkDZCCgfohOI25DusGLDSQqVDFubnV52a-mdwqgCSa9FkvHOUh9J_UYk52AUGP9G7ciSrV-4YxSH8DVeZ2RRSn5bu8CRljguqkfBXmrG18c6bLcJKlgFJI/s1600/3spaces.JPG
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih5-DOrMYIyvCdHe5St3MIAPgjoguh4YYoPK8VqFiL3N1wRAH7VBv_KzS2suq7HV8sAtnLnvEsBUeqNVEKAMKRCUzYVvRj9SPjyiAuhRVQ_bUNweRhyphenhyphenh5DuIjoA35A2_zKLfuMkvpy8BQ/s1600/computerlists.JPG
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAEjZ1mUkKD2oFHcE1zOlFIwvbx4JQn8FZYJqgemAKJeB1oU09vFRle1VrMg19wCLdVeWgx_nUxMrnWkEc8Ued4EdhxXh9erASTtXzgmJhzp-mMMLDW8QWBCWZHnmGzA9fC30jtWGXwQg/s1600/finalproduct.JPG
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsjHq0MYuYo_iR-ztwv9_k9Jrz7uDj9TJ-SOmOkKEsynQvBYoGNJpC_BZtVTdorBG2SJdi3SfnB5Z-1lVVxtcS6GFTBXsKNMr3uUzdePb1OENcUkpngnx2f1KP2PuhWeeE7Pt4mA3rxmc/s1600/sysadminbeer.png
Continued...
PS C:\>
The Invoke-WebRequest cmdlet used above retrieves the page, and its Images property lists every image element found on it. Piping that through select src keeps only the src property, which gives us the list of image paths we will be downloading later.
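If you want the plain path strings themselves rather than objects with a src property, Select-Object -ExpandProperty unwraps them. A quick sketch, reusing the same $Url as above:

$Url = "http://www.TheOvernightAdmin.com"
$iwr = Invoke-WebRequest -Uri $Url

# -ExpandProperty returns plain strings instead of objects with a .src column,
# which is handy when passing the paths straight to a downloader.
$srcs = $iwr.Images | Select-Object -ExpandProperty src
$srcs | Select-Object -First 5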
The only difference between web scraping and web browsing is that scraping is usually automated, and you are saving the data rather than just viewing it.
Here is the full function
function Get-WebPageImages
{
<#
.Synopsis
   Downloads all available images from the specified $Url (a mandatory variable)
.DESCRIPTION
   This function will download all images from a specific web page and save them to your desktop by default.
   Requires PSv3+
.EXAMPLE
   PS C:\> Get-WebPageImages -Url http://www.matthewkerfoot.com -OutputPath C:\
.NOTES
   Created by Matt Kerfoot on 08/11/2014
#>
[CmdletBinding()]
Param ( [Parameter(Mandatory=$false,
                   ValueFromPipelineByPropertyName=$true,
                   Position=0)]
        [string]$Url = "http://www.TheOvernightAdmin.com",

        [string]$OutputPath = "$env:USERPROFILE\Desktop"
      )
Begin {
        $iwr = Invoke-WebRequest -Uri $Url
        # Keep just the image paths as plain strings
        $images = $iwr.Images | Select-Object -ExpandProperty src
      }
Process {
        $wc = New-Object System.Net.WebClient
        # Join-Path avoids doubled backslashes when building the destination file name
        $images | ForEach-Object { $wc.DownloadFile( $_, (Join-Path $OutputPath ([IO.Path]::GetFileName($_))) ) }
      }
End {
        Write-Host "Downloaded all images from $Url to $OutputPath"
    }
}
Get-WebPageImages
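Once the function is loaded, you can point it at any page and output folder. The URL and path below are just illustrative:

# Dot-source or paste the function into your session first, then:
Get-WebPageImages -Url "http://www.matthewkerfoot.com" -OutputPath "C:\Temp\Images"

# Or rely on the defaults (TheOvernightAdmin.com, saved to your desktop):
Get-WebPageImages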