mirror of https://github.com/RipMeApp/ripme.git synced 2025-08-21 05:01:30 +02:00

Updated How To Create A Ripper for HTML websites (markdown)

4_pr0n
2014-06-27 00:12:42 -07:00
parent ad0e35655b
commit 058b2fda4b

@@ -3,55 +3,75 @@ This guide explains how to rip from an unsupported website using RipMe.
If you like to learn by example, check out the simple [`ImgboxRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImgboxRipper.java).
### Expectations
* Some knowledge of `git`
* Some knowledge of the Java programming language
* Some knowledge of the build tool [`Maven`](http://maven.apache.org/)
  * Maven handles dependency resolution, so you don't have to download a ton of .jar files by hand
* Some knowledge of the HTML DOM, CSS selectors, and the like.
### Step 0: [Fork this repo](https://help.github.com/articles/fork-a-repo)
### Step 1: Create a new .java file
Create the file within [`/ src / main / java / com / rarchives / ripme / ripper / rippers`](https://github.com/4pr0n/ripme/tree/master/src/main/java/com/rarchives/ripme/ripper/rippers)
File should follow the naming scheme `<Site>Ripper.java`
### Step 2: Extend the [`AbstractHTMLRipper` class](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/AbstractHTMLRipper.java)
```java
public class YoursiteRipper extends AbstractHTMLRipper {
```
### Step 3: Understand the fields available
By extending `AbstractHTMLRipper`, you have access to the `this.url` object containing the URL to be ripped.
### Step 4: Constructors
We need to let the superclass know what URL we're working with.
Change the constructor's class name to your ripper's class name.
```java
public YoursiteRipper(URL url) throws IOException {
    super(url);
}
```
### Step 5: Override the required methods
The methods below are defined in `AbstractHTMLRipper` and must be overridden in your .java file.
---
#### String getHost()
Returns: The **name** of the website (no need for `.com`).
This String is used in naming the save directory.
```java
@Override
public String getHost() {
    return "imgur";
}
```
---
#### String getDomain()
Returns: The **domain** of the website.
This String is used in the `canRip()` method to determine if a URL can be ripped.
```java
@Override
public String getDomain() {
    return "imgur.com";
}
```
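To make the role of the domain string concrete, here is a minimal sketch of the kind of check a `canRip()` method could perform. The class name `DomainCheck` and the matching logic are assumptions for illustration only, not RipMe's actual implementation; the only thing taken from this guide is that the domain string decides whether a URL is accepted.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class DomainCheck {
    // Hypothetical stand-in for the value a ripper returns from getDomain().
    static final String DOMAIN = "imgur.com";

    // Assumed logic: accept the domain itself and any of its subdomains.
    // Matching on "." + DOMAIN avoids accepting look-alikes such as "notimgur.com".
    static boolean canRip(URL url) {
        String host = url.getHost();
        return host.equals(DOMAIN) || host.endsWith("." + DOMAIN);
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(canRip(URI.create("http://imgur.com/a/abc123").toURL()));   // true
        System.out.println(canRip(URI.create("http://i.imgur.com/abc.jpg").toURL()));  // true
        System.out.println(canRip(URI.create("http://example.com/a/abc123").toURL())); // false
    }
}
```

If your site serves albums from several unrelated domains, a single domain string may not be enough, and you would need to check how the superclass actually uses it.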
---
#### String getGID(URL)
Returns: A **unique identifier** for the album (*Gallery ID* or *GID*).
Note: The URL to every album on the website should return a **different** GID.
*This is because the save directory will be named in the scheme `HOST_GID`*
@@ -59,35 +79,57 @@ Note: The URL to every album on the website should return a **different** GID.
Most rippers use `regex` to strip out the GID.
Example: `imgur.com/a/abc123` could return `abc123`
Example: `somesite.com/gallery.php?id=4321` could return `4321`
```java
@Override
public String getGID(URL url) throws MalformedURLException {
    Pattern p = Pattern.compile("^https?://imgur\\.com/a/([a-zA-Z0-9]+).*$");
    Matcher m = p.matcher(url.toExternalForm());
    if (m.matches()) {
        // Return the text contained between () in the regex
        return m.group(1);
    }
    throw new MalformedURLException("Expected imgur.com URL format: " +
            "imgur.com/a/albumid - got " + url + " instead");
}
```
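The query-string style URL (`somesite.com/gallery.php?id=4321`) needs a slightly different pattern. The following is a rough sketch under stated assumptions: `somesite` is the placeholder host from the example, and the class `QueryGidExample` and the helper `saveDirectoryName` are hypothetical names that only illustrate the `HOST_GID` naming scheme; they are not part of RipMe.

```java
import java.net.MalformedURLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryGidExample {
    // Captures the digits of the "id" query parameter, wherever it sits in the query string.
    private static final Pattern GALLERY = Pattern.compile(
            "^https?://(?:www\\.)?somesite\\.com/gallery\\.php\\?(?:.*&)?id=(\\d+).*$");

    // Same contract as getGID(URL) in the guide, but taking a String to stay self-contained.
    static String getGID(String address) throws MalformedURLException {
        Matcher m = GALLERY.matcher(address);
        if (m.matches()) {
            return m.group(1);
        }
        throw new MalformedURLException("Expected somesite.com/gallery.php?id=NNN - got "
                + address + " instead");
    }

    // Hypothetical helper showing how the HOST_GID directory name is composed.
    static String saveDirectoryName(String host, String gid) {
        return host + "_" + gid;
    }

    public static void main(String[] args) throws MalformedURLException {
        String gid = getGID("http://somesite.com/gallery.php?id=4321");
        System.out.println(saveDirectoryName("somesite", gid)); // prints somesite_4321
    }
}
```

Because two albums must never share a GID, it is worth trying the pattern against a handful of real album URLs before relying on it.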
---
#### Document getFirstPage()
Returns: A **Jsoup `Document`** object containing the contents of the first page.
Tip: Use the [`Http` class](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/utils/Http.java) for easy methods of retrieving the page.
**Most** rippers just need to get the page, and do so with:
```java
@Override
public Document getFirstPage() throws IOException {
    // "url" is an instance field of the superclass
    return Http.url(url).get();
}
```
This works for the majority of websites (most sites don't require cookies, referrers, etc).
---
#### Document getNextPage(Document) // Optional!
Input: Jsoup `Document` retrieved in the `getFirstPage()` method.
Returns: The **next page** to retrieve images from.
Throws: `IOException` if no next page can be retrieved.
**Note**: By default, this method throws an `IOException` within `AbstractHTMLRipper`, meaning it assumes there is no **next page**. If you need to rip multiple pages, override this method & retrieve the next page. See [`ImagebamRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImagebamRipper.java#L70) for an example of how this is used.
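The "throw an `IOException` when there is no next page" contract is easier to see in isolation. Below is a small, self-contained sketch that avoids Jsoup and RipMe entirely: `Page` is a hypothetical stand-in for a fetched `Document`, and the loop in `ripAll` is only an assumption about how a caller might drive pagination, not the code in `AbstractHTMLRipper`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PagingSketch {
    // Hypothetical stand-in for a fetched page: its image URLs plus a link to the next page.
    static final class Page {
        final List<String> images;
        final Page next; // null when this is the last page

        Page(List<String> images, Page next) {
            this.images = images;
            this.next = next;
        }
    }

    // Mirrors the contract above: return the next page, or throw IOException if there is none.
    static Page getNextPage(Page current) throws IOException {
        if (current.next == null) {
            throw new IOException("No more pages");
        }
        return current.next;
    }

    // Assumed driver loop: collect URLs page by page until getNextPage() throws.
    static List<String> ripAll(Page first) {
        List<String> urls = new ArrayList<String>();
        Page page = first;
        while (page != null) {
            urls.addAll(page.images);
            try {
                page = getNextPage(page);
            } catch (IOException e) {
                page = null; // treated as "no next page", not as a failure
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        Page last = new Page(Arrays.asList("c.jpg"), null);
        Page first = new Page(Arrays.asList("a.jpg", "b.jpg"), last);
        System.out.println(ripAll(first)); // prints [a.jpg, b.jpg, c.jpg]
    }
}
```

In a real ripper the override would typically look for a "next" link in the current `Document`, fetch it, and throw `IOException` when no such link exists.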
---
#### List<String> getURLsFromPage(Document)
Input: Jsoup `Document` retrieved in the `getFirstPage()` method (and optionally the `getNextPage()` method).
Returns: **List of URLs to be downloaded** or retrieved.
This is where the URLs are *extracted* from the page Document.
Some rippers return a list of subpages to be ripped in separate threads (e.g. [`ImagevenueRipper.java`](https://github.com/4pr0n/ripme/blob/master/src/main/java/com/rarchives/ripme/ripper/rippers/ImagevenueRipper.java#L67)).
@@ -95,11 +137,14 @@ Some rippers return a list of subpages to be ripped in separate threads (e.g. [`
This is where CSS selectors come in handy. Say you wanted to grab every image that appears on the page:
```java
@Override
public List<String> getURLsFromPage(Document doc) {
    List<String> result = new ArrayList<String>();
    for (Element el : doc.select("img")) {
        // Collect each image's "src" attribute into the result list
        result.add(el.attr("src"));
    }
    return result;
}
```
This returns the `src` of every image on the page (although these will likely be thumbnails).
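Since the `src` values are often thumbnails, many sites need one more step to turn them into full-size URLs. The sketch below assumes a purely hypothetical convention where thumbnails live under `/thumbs/` and full-size images under `/images/`; the class name `ThumbnailSketch` and the `example.com` URLs are made up for illustration, so inspect your target site's HTML to find its real pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThumbnailSketch {
    // Hypothetical convention: swap the "/thumbs/" path segment for "/images/".
    static String toFullSize(String thumbUrl) {
        return thumbUrl.replace("/thumbs/", "/images/");
    }

    // Applies the rewrite to every URL extracted from a page.
    static List<String> toFullSize(List<String> thumbUrls) {
        List<String> result = new ArrayList<String>();
        for (String thumb : thumbUrls) {
            result.add(toFullSize(thumb));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> thumbs = Arrays.asList(
                "http://example.com/thumbs/1.jpg",
                "http://example.com/thumbs/2.jpg");
        // prints [http://example.com/images/1.jpg, http://example.com/images/2.jpg]
        System.out.println(toFullSize(thumbs));
    }
}
```

When the full-size URL cannot be derived from the thumbnail, the usual alternative is to return the links to each image's subpage instead, as described above.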
@@ -108,9 +153,7 @@ The URLs returned are passed into the next method...
---
#### void downloadURL(URL url, int index)
Input: `URL`: One of the URLs returned by `getURLsFromPage()`
Input: `index`: The *number* for this URL (whether it's the 1st image, 2nd image, etc).
@@ -125,3 +168,14 @@ public void downloadURL(URL url, int index) {
```
The above will download the URL to the appropriate save directory, guessing the filename to save based on the `url` and the `index`.
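To give a feel for what "guessing the filename" from the `url` and the `index` can mean, here is one possible approach. It is an assumption for illustration, not RipMe's actual naming logic: the class `FilenameSketch`, the method `guessFilename`, and the zero-padded `NNN_` prefix are all hypothetical.

```java
import java.util.Locale;

public class FilenameSketch {
    // Hypothetical scheme: zero-padded index, underscore, then the last path segment of the URL.
    static String guessFilename(String url, int index) {
        String path = url;
        // Drop the query string and fragment so they don't end up in the filename.
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return String.format(Locale.ROOT, "%03d_%s", index, name);
    }

    public static void main(String[] args) {
        // prints 007_cat.jpg
        System.out.println(guessFilename("http://example.com/images/cat.jpg?size=large", 7));
    }
}
```

Prefixing with the index keeps files in album order on disk and avoids collisions when two images share the same name.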
### Step 6: Test!
RipMe **automatically detects new rippers** without any other code changes required.
1. Run RipMe with your new ripper compiled in.
2. Paste in a URL to the site you're trying to rip.
3. Click `Rip`.
4. Look at the output for errors, warnings, Exceptions, etc.
5. Fix any bugs.
6. Repeat.