Monday, August 11, 2014

Download all images from a web page with PowerShell

Web scraping, also known as screen scraping, web data extraction, or web harvesting, is a technique used to extract large amounts of data from one or more websites.

Most websites don't offer the functionality to save their data onto your computer. Typically the only option is Right Click > Save As, which becomes a very tedious task very quickly. Being able to scrape a site of its content most certainly has its uses; perhaps you want to download Wikipedia (which I heard is only 14GBs with no pictures), or, if you're really into something like PowerShell, you could search Google for all images with "powershell" in the name and then download them to your computer. [Next Upcoming Post]

In the function below I scrape my website's homepage for all of its images; that is, my computer searches my homepage [http://www.matthewkerfoot.com] for all image files and then downloads them to my local machine.

$Url = "http://www.TheOvernightAdmin.com"
$iwr = Invoke-WebRequest -Uri $Url
$images = ($iwr).Images | select src
$images

Output:

PS C:\> $images
src                              
---
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-MeZf9LB70psOUNXQR5bZdkDZCCgfohOI25DusGLDSQqVDFubnV52a-mdwqgCSa9FkvHOUh9J_UYk52AUGP9G7ciSrV-4YxSH8DVeZ2RRSn5bu8CRljguqkfBXmrG18c6bLcJKlgFJI/s1600/3spaces.JPG                                
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2-MeZf9LB70psOUNXQR5bZdkDZCCgfohOI25DusGLDSQqVDFubnV52a-mdwqgCSa9FkvHOUh9J_UYk52AUGP9G7ciSrV-4YxSH8DVeZ2RRSn5bu8CRljguqkfBXmrG18c6bLcJKlgFJI/s1600/3spaces.JPG                                
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih5-DOrMYIyvCdHe5St3MIAPgjoguh4YYoPK8VqFiL3N1wRAH7VBv_KzS2suq7HV8sAtnLnvEsBUeqNVEKAMKRCUzYVvRj9SPjyiAuhRVQ_bUNweRhyphenhyphenh5DuIjoA35A2_zKLfuMkvpy8BQ/s1600/computerlists.JPG                           
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAEjZ1mUkKD2oFHcE1zOlFIwvbx4JQn8FZYJqgemAKJeB1oU09vFRle1VrMg19wCLdVeWgx_nUxMrnWkEc8Ued4EdhxXh9erASTtXzgmJhzp-mMMLDW8QWBCWZHnmGzA9fC30jtWGXwQg/s1600/finalproduct.JPG                            
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsjHq0MYuYo_iR-ztwv9_k9Jrz7uDj9TJ-SOmOkKEsynQvBYoGNJpC_BZtVTdorBG2SJdi3SfnB5Z-1lVVxtcS6GFTBXsKNMr3uUzdePb1OENcUkpngnx2f1KP2PuhWeeE7Pt4mA3rxmc/s1600/sysadminbeer.png                                                                                                             
Continued...                                                                       
PS C:\> 

The Invoke-WebRequest cmdlet used above parses the web page, and its Images property exposes every image element on the page; piping that to select src gives us a list of all of the image paths that we will be downloading later.
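As the output above shows, the same image can appear on a page more than once, and some sites use relative src values rather than full URLs. A small sketch, assuming PSv3+, that dedupes the list and resolves each entry to an absolute URL before downloading:

```powershell
$Url = "http://www.TheOvernightAdmin.com"
$iwr = Invoke-WebRequest -Uri $Url

$images = $iwr.Images |
    Select-Object -ExpandProperty src -Unique |
    ForEach-Object {
        # Build an absolute URI whether src is absolute or relative,
        # using the page URL as the base
        (New-Object System.Uri([Uri]$Url, $_)).AbsoluteUri
    }

$images
```

This keeps the download loop from fetching the same file twice and from failing on relative paths.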

The only difference between web scraping and web browsing is that scraping is usually automated, and you are saving the data rather than just viewing it.


Here is the full function:

function Get-WebPageImages
{
<#
.SYNOPSIS
   Downloads all available images from the specified $Url (a mandatory variable)
.DESCRIPTION
   This function will download all images from a specific web page and save them to your desktop by default.
   Requires PSv3+
.EXAMPLE
   PS C:\> Get-WebPageImages -Url http://www.matthewkerfoot.com -OutputPath C:\
.NOTES
   Created by Matt Kerfoot on 08/11/2014
#>
    [CmdletBinding()]
    Param ( [Parameter(Mandatory=$false,
            ValueFromPipelineByPropertyName=$true,
            Position=0)]
            $Url = "http://www.TheOvernightAdmin.com",
            $OutputPath = "$env:USERPROFILE\Desktop"
    )

    Begin {

        $iwr = Invoke-WebRequest -Uri $Url
        $images = ($iwr).Images | select src

    }

    Process {

        $wc = New-Object System.Net.WebClient
        # Save each image under its original file name in $OutputPath
        $images | foreach { $wc.DownloadFile( $_.src, (Join-Path $OutputPath ([io.path]::GetFileName($_.src))) ) }

    }

    End {

        Write-Host "Downloaded all images from $Url to $OutputPath"

    }

}

Get-WebPageImages
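Because the parameters are optional with defaults, the function can be pointed at any page and output folder. A quick usage sketch (the folder path here is just an example; make sure it exists first):

```powershell
# Download images from the default site to the desktop
Get-WebPageImages

# Download from a specific page into a custom folder
New-Item -ItemType Directory -Path C:\Temp\Images -Force | Out-Null
Get-WebPageImages -Url "http://www.TheOvernightAdmin.com" -OutputPath "C:\Temp\Images"
```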





