![](img/logo.png)
# JSON Machine
Very easy to use and memory-efficient drop-in replacement for inefficient iteration of big JSON files or streams
for PHP 5.6+. See [TL;DR](#tl-dr). No dependencies in production except optional `ext-json`.

[![Build Status](https://travis-ci.com/halaxa/json-machine.svg?branch=master)](https://travis-ci.com/halaxa/json-machine)
[![Latest Stable Version](https://poser.pugx.org/halaxa/json-machine/v/stable?v0.4.1)](https://packagist.org/packages/halaxa/json-machine)
[![Monthly Downloads](https://poser.pugx.org/halaxa/json-machine/d/monthly)](https://packagist.org/packages/halaxa/json-machine)

---

* [TL;DR](#tl-dr)
* [Introduction](#introduction)
* [Parsing JSON documents](#parsing-json-documents)
  + [Simple document](#simple-document)
* [Parsing streaming responses from a JSON API](#parsing-json-stream-api-responses)
  + [GuzzleHttp](#guzzlehttp)
  + [Symfony HttpClient](#symfony-httpclient)
* [Tracking the progress](#tracking-parsing-progress)
* [Parsing a subtree](#parsing-a-subtree)
  + [What is Json Pointer?](#json-pointer)
* [Custom decoders](#custom-decoder)
  + [Available decoders](#available-decoders)
* [Error handling](#error-handling)
* [Parser efficiency](#on-parser-efficiency)
  + [Streams / files](#streams-files)
  + [In-memory JSON strings](#in-memory-json-strings)
* [Troubleshooting](#troubleshooting)
  + ["I'm still getting Allowed memory size ... exhausted"](#step1)
  + ["That didn't help"](#step2)
  + ["I am still out of luck"](#step3)
* [Running tests](#running-tests)
  + [Running tests on all supported PHP platforms](#running-tests-on-all-supported-php-platforms)
* [Installation](#installation)
* [Support](#support)
* [License](#license)

---

<a name="tl-dr"></a>
## TL;DR
```diff
<?php

use \JsonMachine\JsonMachine;

// this often causes Allowed Memory Size Exhausted
- $users = json_decode(file_get_contents('500MB-users.json'));

// this usually takes a few kB of memory no matter the file size
+ $users = JsonMachine::fromFile('500MB-users.json');

foreach ($users as $id => $user) {
    // just process $user as usual
}
```

Random access like `$users[42]` or counting results like `count($users)` **is not possible** by design.
Use the above-mentioned `foreach` and find the item or count the collection there.

Requires `ext-json` if used out of the box. See [custom decoder](#custom-decoder).
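
Finding one item and counting inside the loop might look roughly like the sketch below. To keep it runnable without the library, the hypothetical `fakeUsers()` generator stands in for `JsonMachine::fromFile()`; both simply yield one key and item at a time.

```php
<?php

/**
 * Hypothetical stand-in for JsonMachine::fromFile('500MB-users.json').
 * It yields one item at a time and never builds the whole collection.
 */
function fakeUsers()
{
    for ($id = 0; $id < 100; $id++) {
        yield $id => ['name' => "user-$id"];
    }
}

$count = 0;
$wanted = null;

foreach (fakeUsers() as $id => $user) {
    $count++;            // takes the place of count($users)
    if ($id === 42) {
        $wanted = $user; // takes the place of $users[42]
    }
}
```

If only the one item is needed, a `break` after it is found avoids reading the rest of the document.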

<a name="introduction"></a>
## Introduction
JSON Machine is an efficient, easy-to-use and fast JSON stream parser based on generators
developed for unpredictably long JSON streams or documents. Main features are:
- Constant memory footprint for unpredictably large JSON documents.
- Ease of use. Just iterate JSON of any size with `foreach`. No events and callbacks.
- Efficient iteration on any subtree of the document, specified by [Json Pointer](#json-pointer).
- Speed. Performance-critical code contains no unnecessary function calls, no regular expressions
  and uses native `json_decode` to decode JSON document chunks by default. See [custom decoder](#custom-decoder).
- Thoroughly tested. More than 100 tests and 700 assertions.

<a name="parsing-json-documents"></a>
## Parsing JSON documents

<a name="simple-document"></a>
### A simple document
Let's say that `fruits.json` contains this really big JSON document:
```json
// fruits.json
{
    "apple": {
        "color": "red"
    },
    "pear": {
        "color": "yellow"
    }
}
```
It can be parsed this way:
```php
<?php

use \JsonMachine\JsonMachine;

$fruits = JsonMachine::fromFile('fruits.json');

foreach ($fruits as $name => $data) {
    // 1st iteration: $name === "apple" and $data === ["color" => "red"]
    // 2nd iteration: $name === "pear" and $data === ["color" => "yellow"]
}
```

Parsing a JSON array instead of a JSON object follows the same logic.
The key in a `foreach` will be the numeric index of an item.

If you prefer JSON Machine to return objects instead of arrays, use `new ExtJsonDecoder()` as the decoder.
By default it decodes JSON objects to PHP objects, the same as `json_decode` does:
```php
<?php

use JsonMachine\JsonDecoder\ExtJsonDecoder;
use JsonMachine\JsonMachine;

$objects = JsonMachine::fromFile('path/to.json', '', new ExtJsonDecoder);
```

2020-12-03 13:04:21 +01:00
<a name="parsing-json-stream-api-responses"></a>
2020-12-04 14:49:13 +01:00
## Parsing streaming responses from a JSON API
2020-12-03 13:04:21 +01:00
A stream API response or any other JSON stream is parsed exactly the same way as file is. The only difference
2020-12-01 20:50:51 +01:00
is, you use `JsonMachine::fromStream($streamResource)` for it, where `$streamResource` is the stream
resource with the JSON document. The rest is the same as with parsing files. Here are some examples of
popular http clients which support streaming responses:
2020-12-03 13:04:21 +01:00
<a name="guzzlehttp"></a>
### GuzzleHttp
Guzzle uses its own streams, but they can be converted back to PHP streams by calling
`\GuzzleHttp\Psr7\StreamWrapper::getResource()`. Pass the result of this function to
`JsonMachine::fromStream()`, and you're set up. See the working
[GuzzleHttp example](src/examples/guzzleHttp.php).

<a name="symfony-httpclient"></a>
### Symfony HttpClient
A stream response of Symfony HttpClient works as an iterator. And because JSON Machine is
based on iterators, the integration with Symfony HttpClient is very simple. See the
[HttpClient example](src/examples/symfonyHttpClient.php).

<a name="tracking-parsing-progress"></a>
## Tracking the progress
Big documents may take a while to parse. Call `JsonMachine::getPosition()` in your `foreach` to get the current
count of processed bytes from the beginning. The percentage is then easy to calculate as `position / total * 100`.
To find out the total size of your document in bytes you may want to check:
- `strlen($document)` if you're parsing a string
- `filesize($file)` if you're parsing a file
- `Content-Length` HTTP header if you're parsing an HTTP stream response
- ... you get the point

```php
<?php

use JsonMachine\JsonMachine;

$fileSize = filesize('fruits.json');
$fruits = JsonMachine::fromFile('fruits.json');
foreach ($fruits as $name => $data) {
    echo 'Progress: ' . intval($fruits->getPosition() / $fileSize * 100) . ' %';
}
```
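
When the total is not reliably known (for example a streamed response without a `Content-Length` header), the division needs a guard. A minimal sketch, assuming a hypothetical helper `progressPercent()` that is not part of JSON Machine:

```php
<?php

/**
 * Hypothetical helper: turns a byte position into a whole percentage.
 * Returns 0 when the total size is unknown or empty, and never exceeds 100.
 */
function progressPercent($position, $total)
{
    if ($total <= 0) {
        return 0; // e.g. filesize() failed or no Content-Length was sent
    }

    return (int) min(100, floor($position / $total * 100));
}
```

Inside the loop it could be called as `progressPercent($fruits->getPosition(), $fileSize)`.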

<a name="parsing-a-subtree"></a>
## Parsing a subtree
If you want to iterate only the `results` subtree in this `fruits.json`:
```json
// fruits.json
{
    "results": {
        "apple": {
            "color": "red"
        },
        "pear": {
            "color": "yellow"
        }
    }
}
```
use Json Pointer `"/results"` as the second argument:
```php
<?php

use \JsonMachine\JsonMachine;

$fruits = JsonMachine::fromFile("fruits.json", "/results");
foreach ($fruits as $name => $data) {
    // The same as above, which means:
    // 1st iteration: $name === "apple" and $data === ["color" => "red"]
    // 2nd iteration: $name === "pear" and $data === ["color" => "yellow"]
}
```

> Note:
>
> The value of `results` is not loaded into memory at once, but only one item in
> `results` at a time. There is always only one item in memory at a time at the level/subtree
> you are currently iterating. Thus the memory consumption is constant.

<a name="json-pointer"></a>
### What is Json Pointer?
It's a way of addressing one item in a JSON document. See the [Json Pointer RFC 6901](https://tools.ietf.org/html/rfc6901).
It's very handy, because sometimes the JSON structure goes deeper, and you want to iterate a subtree,
not the main level. So you just specify the pointer to the JSON array or object you want to iterate and off you go.
When the parser hits the collection you specified, iteration begins. It is always the second parameter in all
`JsonMachine::from*` functions. If you specify a pointer to a scalar value (which logically cannot be iterated)
or a non-existent position in the document, an exception is thrown.

Some examples:

| Json Pointer value | Will iterate through |
|--------------------|----------------------|
| `""` (empty string - default) | `["this", "array"]` or `{"a": "this", "b": "object"}` will be iterated (main level) |
| `"/result/items"` | `{"result":{"items":["this","array","will","be","iterated"]}}` |
| `"/0/items"` | `[{"items":["this","array","will","be","iterated"]}]` (supports array indexes) |
| `"/"` (gotcha! - a slash followed by an empty string, see the [spec](https://tools.ietf.org/html/rfc6901#section-5)) | `{"":["this","array","will","be","iterated"]}` |
2020-12-03 11:58:33 +01:00
2020-04-16 16:01:18 +02:00

<a name="custom-decoder"></a>
## Custom decoders
The third parameter of all the `JsonMachine::from*` functions is an optional instance of
`JsonMachine\JsonDecoder\Decoder`. If none is specified, `ExtJsonDecoder` is used by
default. It requires the `ext-json` PHP extension to be present, because it uses
`json_decode`. When `json_decode` doesn't do what you want, implement `JsonMachine\JsonDecoder\Decoder`
and make your own.

<a name="available-decoders"></a>
### Available decoders
- **`ExtJsonDecoder`** - **Default.** Uses `json_decode` to decode keys and values.
  Constructor takes the same params as `json_decode`.
- **`PassThruDecoder`** - uses `json_decode` to decode keys but returns values as pure JSON strings.
  Useful when you want to parse a JSON chunk with something else directly in the foreach
  and don't want to implement `JsonMachine\JsonDecoder\Decoder`.
  Constructor takes the same params as `json_decode`.

Example:
```php
<?php

use JsonMachine\JsonDecoder\PassThruDecoder;
use JsonMachine\JsonMachine;

$jsonMachine = JsonMachine::fromFile('path/to.json', '', new PassThruDecoder);
```

<a name="error-handling"></a>
## Error handling
Since 0.4.0 every exception extends `JsonMachineException`, so you can catch that to filter any error from the JSON Machine library.
When any part of the JSON stream is malformed, a `SyntaxError` exception is thrown. A better solution is on the way.

<a name="on-parser-efficiency"></a>
## Parser efficiency

<a name="streams-files"></a>
### Streams / files
JSON Machine reads a stream (or a file) one JSON item at a time and generates one corresponding PHP array at a time.
This is the most efficient way, because if you had, say, 10,000 users in a JSON file and wanted to parse it using
`json_decode(file_get_contents('big.json'))`, you'd have the whole string in memory as well as all the 10,000
PHP structures. The following table shows the difference:

|                        | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode()`        |                            10000 |                                 10000 | 20000 |
| `JsonMachine::from*()` |                                1 |                                     1 |     2 |

This means that `JsonMachine` is constantly efficient for any size of processed JSON. 100 GB no problem.
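
The "one string item, one decoded item" row can be illustrated with a plain PHP generator. This is only a sketch of the principle and not JSON Machine's parser: the hypothetical `jsonItemStrings()` pretends that item boundaries were already found in the stream, which is the part the library does for you.

```php
<?php

/**
 * Hypothetical source: hands out one raw JSON item string at a time,
 * the way a streaming parser would after locating item boundaries.
 */
function jsonItemStrings()
{
    yield 'apple' => '{"color":"red"}';
    yield 'pear' => '{"color":"yellow"}';
}

/**
 * Decodes lazily: at any moment only the current item string and
 * its decoded PHP array are held by this generator.
 */
function decodeOneAtATime($jsonItems)
{
    foreach ($jsonItems as $key => $jsonItem) {
        yield $key => json_decode($jsonItem, true);
    }
}

$colors = [];
foreach (decodeOneAtATime(jsonItemStrings()) as $name => $data) {
    $colors[$name] = $data['color'];
}
```

Nothing is decoded until the `foreach` asks for the next item, which is why the footprint does not grow with the size of the document.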

<a name="in-memory-json-strings"></a>
### In-memory JSON strings
There is also a method `JsonMachine::fromString()`. If you are
forced to parse a big string and the stream is not available, JSON Machine may be better than `json_decode`.
The reason is that unlike `json_decode`, JSON Machine still traverses the JSON string one item at a time and doesn't
load all resulting PHP structures into memory at once.

Let's continue with the example with 10,000 users. This time they are all in a string in memory.
When decoding that string with `json_decode`, 10,000 arrays (objects) are created in memory and then the result
is returned. JSON Machine, on the other hand, creates a single structure for each item found in the string and yields it back
to you. When you process this item and iterate to the next one, another single structure is created. This is the same
behaviour as with streams/files. The following table puts the concept into perspective:

|                             | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|-----------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode()`             |                            10000 |                                 10000 | 20000 |
| `JsonMachine::fromString()` |                            10000 |                                     1 | 10001 |

The reality is even brighter. `JsonMachine::fromString` consumes about **5x less memory** than `json_decode`. The reason is
that a PHP structure takes much more memory than its JSON string counterpart.

<a name="troubleshooting"></a>
## Troubleshooting
<a name="step1"></a>
### "I'm still getting Allowed memory size ... exhausted"
One of the reasons may be that the items you want to iterate over are in some subkey such as `"results"`
but you forgot to specify a JSON Pointer. See [Parsing a subtree](#parsing-a-subtree).

<a name="step2"></a>
### "That didn't help"
The other reason may be that one of the items you iterate is itself so huge that it cannot be decoded at once.
For example, you iterate over users and one of them has thousands of "friend" objects in it.
Use `PassThruDecoder`, which does not decode an item, get the JSON string of the user
and parse it iteratively yourself using `JsonMachine::fromString()`.
```php
<?php

use JsonMachine\JsonMachine;
use JsonMachine\JsonDecoder\PassThruDecoder;

$users = JsonMachine::fromFile('users.json', '', new PassThruDecoder);
foreach ($users as $user) {
    foreach (JsonMachine::fromString($user, "/friends") as $friend) {
        // process friends one by one
    }
}
```

<a name="step3"></a>
### "I am still out of luck"
It probably means that the JSON string `$user` itself or one of the friends is too big and does not fit in memory.
However, you can try this approach recursively. Parse `"/friends"` with `PassThruDecoder`, getting one `$friend`
JSON string at a time, and then parse that using `JsonMachine::fromString()`... If even that does not help,
there's probably no solution yet via JSON Machine. A feature is planned which will enable you to iterate
any structure fully recursively, and strings will be served as streams.

<a name="running-tests"></a>
## Running tests
```bash
tests/run.sh
```
This uses the PHP and Composer installations already present on your machine.

<a name="running-tests-on-all-supported-php-platforms"></a>
### Running tests on all supported PHP platforms
[Install Docker](https://docs.docker.com/install/) on your machine and run
```bash
tests/docker-run-all-platforms.sh
```
This needs neither PHP nor Composer installed on your machine. Only Docker.

<a name="installation"></a>
## Installation
```bash
composer require halaxa/json-machine
```
or clone or download this repository (not recommended).
<a name="support"></a>
## Support
Do you like this library? Star it, share it, show it :)
Issues and pull requests are very welcome.
<a name="license"></a>
## License
Apache 2.0

Cogwheel element: Icons made by [TutsPlus](https://www.flaticon.com/authors/tutsplus)
from [www.flaticon.com](https://www.flaticon.com/)
is licensed by [CC 3.0 BY](http://creativecommons.org/licenses/by/3.0/)

<i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i>