Updated How To Create A Ripper for HTML websites (markdown)

2025-08-21 05:01:30 +02:00 · 2014-06-27 00:12:42 -07:00
parent ad0e35655b
commit 058b2fda4b
1 changed files with 101 additions and 47 deletions
--- a/How-To-Create-A-Ripper-for-HTML-websites.md
+++ b/How-To-Create-A-Ripper-for-HTML-websites.md
@@ -3,91 +3,133 @@ This guide explains how to rip from an unsupported website using RipMe.
 If you like to learn by example, check out the simple [`ImgboxRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImgboxRipper.java).

 ### Expectations
-* Some knowledge of `git`
 * Some knowledge of the Java programming language
+* Some knowledge of the build tool [`Maven`](http://maven.apache.org/)
+ * Maven handles dependency resolution, so you don't have to download a ton of .jar files by hand
 * Some knowledge of the HTML DOM, CSS selectors, and the like.

-## Step 0: [Fork this repo](https://help.github.com/articles/fork-a-repo)
+### Step 0: [Fork this repo](https://help.github.com/articles/fork-a-repo)

-## Step 1: Create a new .java file
+### Step 1: Create a new .java file
 Create the file within [`/ src / main / java / com / rarchives / ripme / ripper / rippers`](https://github.com/4pr0n/ripme/tree/master/src/main/java/com/rarchives/ripme/ripper/rippers)

 The file should follow the naming scheme `<Site>Ripper.java`.

-## Step 2: Extend the [`AbstractHTMLRipper` class](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/AbstractHTMLRipper.java)
+### Step 2: Extend the [`AbstractHTMLRipper` class](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/AbstractHTMLRipper.java)

 ```java
 public class YoursiteRipper extends AbstractHTMLRipper {
 ```

-## Step 3: Understand the fields available
+### Step 3: Understand the fields available

 By extending `AbstractHTMLRipper`, you have access to the `this.url` object containing the URL to be ripped.

-## Step 4: Override the required methods
+### Step 4: Constructors

-The methods below are defined in `AbstractHTMLRipper` and should be overridden in your .java file.
+We need to let the superclass know what URL we're working with.
+
+Change the constructor's class name to your ripper's class name.
+
+```java
+    public YoursiteRipper(URL url) throws IOException {
+        super(url);
+    }
+```
+
+### Step 5: Override the required methods
+
+The methods below are defined in `AbstractHTMLRipper` and must be overridden in your .java file.

 ---

-```java
-String getHost()
-```
+#### String getHost()
+
 Returns: The **name** of the website (no need for `.com`).  
 This String is used in naming the save directory.  
-Example: `imgur`
+```java
+    @Override
+    public String getHost() {
+        return "imgur";
+    }
+```

 ---

-```java
-String getDomain()
-```
+#### String getDomain()
+
 Returns: The **domain** of the website.  
 This String is used in the `canRip()` method to determine if a URL can be ripped.  
-Example: `imgur.com`
+```java
+    @Override
+    public String getDomain() {
+        return "imgur.com";
+    }
+```

 ---

-```java
-String getGID(URL)
-```
-Returns: A unique identifier for the album (*Gallery ID* or *GID*).
+#### String getGID(URL)
+
+Returns: A **unique identifier** for the album (*Gallery ID* or *GID*).

 Note: The URL to every album on the website should return a **different** GID.  
 *This is because the save directory will be named in the scheme `HOST_GID`*

 Most rippers use `regex` to strip out the GID.

-Example: `imgur.com/a/abc123` could return `abc123`  
-Example: `somesite.com/gallery.php?id=4321` could return `4321`
+Example: `imgur.com/a/abc123` could return `abc123`
+
+```java
+    @Override
+    public String getGID(URL url) throws MalformedURLException {
+        Pattern p = Pattern.compile("^https?://imgur\\.com/a/([a-zA-Z0-9]+).*$");
+        Matcher m = p.matcher(url.toExternalForm());
+        if (m.matches()) {
+            // Return the text contained between () in the regex
+            return m.group(1);
+        }
+        throw new MalformedURLException("Expected imgur.com URL format: " +
+                        "imgur.com/a/albumid - got " + url + " instead");
+    }
+```

 ---

-```java
-Document getFirstPage()
-```
+#### Document getFirstPage()

-Returns: A Jsoup `Document` object containing the contents of the first page.  
+Returns: A **Jsoup `Document`** object containing the contents of the first page.  

 Tip: Use the [`Http` class](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/utils/Http.java) for easy methods of retrieving the page.

-**Most** rippers simply `return Http.url(this.url).get()`.  
+**Most** rippers just need to get the page, and do so with:
+
+```java
+    @Override
+    public Document getFirstPage() throws IOException {
+        // "url" is an instance field of the superclass
+        return Http.url(url).get();
+    }
+```
+
 This works for the majority of websites (most sites don't require cookies, referrers, etc).

-```java
-Document getNextPage(Document) // Optional!
-```
+---
+
+#### Document getNextPage(Document) // Optional!
+
 Input: Jsoup `Document` retrieved in the `getFirstPage()` method.  
-Returns: The *next* page to retrieve.  
+Returns: The **next page** to retrieve images from.  
 Throws: `IOException` if no next page can be retrieved.

-By default, this method throws an `IOException` within `AbstractHTMLRipper`, meaning it assumes there is no **next page**. Override this method & retrieve the next page if you need to rip multiple pages. See [`ImagebamRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImagebamRipper.java#L70) for an example of how this is used.
+**Note**: By default, this method throws an `IOException` within `AbstractHTMLRipper`, meaning it assumes there is no **next page**. If you need to rip multiple pages, override this method & retrieve the next page. See [`ImagebamRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImagebamRipper.java#L70) for an example of how this is used.
+
+---
+
+#### List<String> getURLsFromPage(Document)

-```java
-List<String> getURLsFromPage(Document)
-```
 Input: Jsoup `Document` retrieved in the `getFirstPage()` method (and optionally the `getNextPage()` method).  
-Returns: List of URLs to be downloaded or retrieved.
+Returns: **List of URLs to be downloaded** or retrieved.

 This is where the URLs are *extracted* from the page Document.  
 Some rippers return a list of subpages to be ripped in separate threads (e.g. [`ImagevenueRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImagevenueRipper.java#L67)).
@@ -95,11 +137,14 @@ Some rippers return a list of subpages to be ripped in separate threads (e.g. [`
 This is when CSS-Selectors come in handy. Say you wanted to grab every image that appears on the page:

 ```java
-List<String> result = new ArrayList<String>();
-for (Element el : document.select("img")) {
-  el.add(el.attr("src"));
-}
-return result
+    @Override
+    public List<String> getURLsFromPage(Document doc) {
+        List<String> result = new ArrayList<String>();
+        for (Element el : doc.select("img")) {
+            result.add(el.attr("src"));
+        }
+        return result;
+    }
 ```

 This would return the `src` URL of every image on the page (although these will likely be thumbnails).
@@ -108,9 +153,7 @@ The URLs returned are passed into the next method...

 ---

-```java
-void downloadURL(URL url, int index)
-```
+#### void downloadURL(URL url, int index)

 Input: `URL`: One of the URLs returned by `getURLsFromPage()`  
 Input: `index`: The *number* for this URL (whether it's the 1st image, 2nd image, etc).
@@ -118,10 +161,21 @@ Input: `index`: The *number* for this URL (whether it's the 1st image, 2nd image
 This is where your ripper *downloads* the image/file.  
 Most rippers simply use the `AlbumRipper`'s method:
 ```java
-@Override
-public void downloadURL(URL url, int index) {
-  addURLToDownload(url, getPrefix(index));
-}
+    @Override
+    public void downloadURL(URL url, int index) {
+        addURLToDownload(url, getPrefix(index));
+    }
 ```

 The above will download the URL to the appropriate save directory, guessing the filename to save based on the `url` and the `index`.
+
+### Step 6: Test!
+
+RipMe **automatically detects new rippers** without any other code changes required.
+
+1. Execute the ripper.  
+2. Paste in a URL to the site you're trying to rip.
+3. Click `Rip`
+4. Look at the output for errors, warnings, Exceptions, etc.
+5. Fix any bugs.
+6. Repeat.
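
Note on the `getGID(URL)` step: the regex can be tried out in isolation before wiring it into a ripper. The following is a minimal sketch that uses only the JDK, not RipMe itself. The class name `GidSketch` and the helper `saveDirName` are hypothetical and exist only for illustration; the pattern and the `HOST_GID` naming scheme are taken from the guide text.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch: no RipMe classes are used, so it runs with only the JDK.
class GidSketch {
    // Same pattern as the guide's getGID() example; group 1 captures the album ID.
    private static final Pattern ALBUM =
            Pattern.compile("^https?://imgur\\.com/a/([a-zA-Z0-9]+).*$");

    // Mirrors the guide's getGID(URL): return the album ID or reject the URL.
    static String extractGid(URL url) throws MalformedURLException {
        Matcher m = ALBUM.matcher(url.toExternalForm());
        if (m.matches()) {
            return m.group(1);
        }
        throw new MalformedURLException(
                "Expected imgur.com/a/albumid - got " + url + " instead");
    }

    // Hypothetical helper illustrating the HOST_GID save-directory scheme.
    static String saveDirName(String host, URL url) throws MalformedURLException {
        return host + "_" + extractGid(url);
    }

    public static void main(String[] args) throws MalformedURLException {
        // prints "abc123"
        System.out.println(extractGid(new URL("http://imgur.com/a/abc123")));
        // prints "imgur_abc123" (the query string is ignored by the trailing .*)
        System.out.println(saveDirName("imgur", new URL("https://imgur.com/a/abc123?x=1")));
        try {
            extractGid(new URL("http://imgur.com/gallery/abc123"));
        } catch (MalformedURLException e) {
            // Non-album URLs do not match the pattern and are rejected.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One thing this makes visible: the pattern as written only accepts the bare `imgur.com` host, so a URL such as `http://www.imgur.com/a/abc123` would be rejected unless the regex is widened.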