<img align="right" src="img/github.png" />
(README in sync with the code)

Very easy to use and memory-efficient drop-in replacement for inefficient iteration of big JSON files or streams
for PHP >=7.0. See [TL;DR](#tl-dr). No dependencies in production except optional `ext-json`.

[![Build Status](https://github.com/halaxa/json-machine/actions/workflows/makefile.yml/badge.svg)](https://github.com/halaxa/json-machine/actions)
[![codecov](https://img.shields.io/codecov/c/gh/halaxa/json-machine?label=phpunit%20%40covers)](https://codecov.io/gh/halaxa/json-machine)
[![Latest Stable Version](https://img.shields.io/github/v/release/halaxa/json-machine?color=blueviolet&include_prereleases&logoColor=white)](https://packagist.org/packages/halaxa/json-machine)
[![Monthly Downloads](https://img.shields.io/packagist/dt/halaxa/json-machine?color=%23f28d1a)](https://packagist.org/packages/halaxa/json-machine)

---
* [TL;DR](#tl-dr)
* [Introduction](#introduction)
* [Parsing JSON documents](#parsing-json-documents)
  + [Parsing a document](#simple-document)
  + [Parsing a subtree](#parsing-a-subtree)
  + [Parsing nested values in arrays](#parsing-nested-values)
  + [Parsing a single scalar value](#getting-scalar-values)
  + [Parsing multiple subtrees](#parsing-multiple-subtrees)
  + [What is JSON Pointer anyway?](#json-pointer)
* [Options](#options)
* [Parsing streaming responses from a JSON API](#parsing-json-stream-api-responses)
  + [GuzzleHttp](#guzzlehttp)
  + [Symfony HttpClient](#symfony-httpclient)
* [Tracking the progress](#tracking-parsing-progress)
* [Decoders](#decoders)
  + [Available decoders](#available-decoders)
* [Error handling](#error-handling)
  + [Catching malformed items](#malformed-items)
* [Parser efficiency](#on-parser-efficiency)
  + [Streams / files](#streams-files)
  + [In-memory JSON strings](#in-memory-json-strings)
* [Troubleshooting](#troubleshooting)
  + ["I'm still getting Allowed memory size ... exhausted"](#step1)
  + ["That didn't help"](#step2)
  + ["I am still out of luck"](#step3)
* [Installation](#installation)
* [Development](#development)
  + [Non containerized](#non-containerized)
  + [Containerized](#containerized)
* [Support](#support)
* [License](#license)

---
<a name="tl-dr"></a>
## TL;DR
```diff
<?php

use \JsonMachine\Items;

// this often causes Allowed Memory Size Exhausted
- $users = json_decode(file_get_contents('500MB-users.json'));

// this usually takes a few kB of memory no matter the file size
+ $users = Items::fromFile('500MB-users.json');

foreach ($users as $id => $user) {
    // just process $user as usual
    var_dump($user->name);
}
```
Random access like `$users[42]` is not yet possible.
Use the `foreach` shown above to find the item, or use a [JSON Pointer](#parsing-a-subtree).

Count the items via [`iterator_count($users)`](https://www.php.net/manual/en/function.iterator-count.php).
Remember that it still has to iterate the whole document internally to get the count, so it takes about as long as a full iteration.

Requires `ext-json` if used out of the box. See [Decoders](#decoders).

Follow [CHANGELOG](CHANGELOG.md).

<a name="introduction"></a>
## Introduction
JSON Machine is an efficient, easy-to-use and fast JSON stream/pull/incremental/lazy (whatever you name it) parser
based on generators, developed for unpredictably long JSON streams or documents. Main features are:

- Constant memory footprint for unpredictably large JSON documents.
- Ease of use. Just iterate JSON of any size with `foreach`. No events and callbacks.
- Efficient iteration on any subtree of the document, specified by [JSON Pointer](#json-pointer).
- Speed. Performance-critical code contains no unnecessary function calls, no regular expressions
  and uses native `json_decode` to decode JSON document items by default. See [Decoders](#decoders).
- Parses not only streams but any iterable that produces JSON chunks.
- Thoroughly tested. More than 200 tests and 1000 assertions.
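
The "any iterable that produces JSON chunks" point can be illustrated with a plain generator. This is only a sketch: `jsonChunks()` is a hypothetical helper, not part of the library, and the commented-out call assumes the library's `Items::fromIterable()` method accepts such an iterable. The library call is left as a comment so the snippet runs without JSON Machine installed.

```php
<?php

/**
 * Hypothetical helper: yields a JSON string in pieces of $chunkSize bytes.
 * Any iterable producing consecutive pieces of one JSON document would do.
 */
function jsonChunks(string $json, int $chunkSize = 8192): Generator
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be positive.');
    }
    $length = strlen($json);
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        yield substr($json, $offset, $chunkSize);
    }
}

// Split a tiny document into 10-byte chunks just to show the shape of the data.
$chunks = iterator_to_array(jsonChunks('{"apple":{"color":"red"}}', 10), false);

// With JSON Machine installed, the generator could be passed on like this
// (assumption: Items::fromIterable() takes any iterable of JSON chunks):
// foreach (\JsonMachine\Items::fromIterable(jsonChunks($json)) as $key => $value) { ... }
```

In practice the chunks would come from a source that is itself lazy (a socket, an HTTP client, a decompression stream), otherwise the whole string is in memory anyway.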
<a name="parsing-json-documents"></a>
## Parsing JSON documents
<a name="simple-document"></a>
### Parsing a document
Let's say that `fruits.json` contains this huge JSON document:
```json
// fruits.json
{
    "apple": {
        "color": "red"
    },
    "pear": {
        "color": "yellow"
    }
}
```
It can be parsed this way:
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json');

foreach ($fruits as $name => $data) {
    // 1st iteration: $name === "apple" and $data->color === "red"
    // 2nd iteration: $name === "pear" and $data->color === "yellow"
}
```
Parsing a JSON array instead of a JSON object follows the same logic.
The key in a `foreach` will be the numeric index of an item.

If you prefer JSON Machine to return arrays instead of objects, use `new ExtJsonDecoder(true)` as a decoder.
```php
<?php

use JsonMachine\JsonDecoder\ExtJsonDecoder;
use JsonMachine\Items;

$objects = Items::fromFile('path/to.json', ['decoder' => new ExtJsonDecoder(true)]);
```
<a name="parsing-a-subtree"></a>
### Parsing a subtree
If you want to iterate only the `results` subtree in this `fruits.json`:
```json
// fruits.json
{
    "results": {
        "apple": {
            "color": "red"
        },
        "pear": {
            "color": "yellow"
        }
    }
}
```
use the JSON Pointer `/results` as the `pointer` option:
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/results']);
foreach ($fruits as $name => $data) {
    // The same as above, which means:
    // 1st iteration: $name === "apple" and $data->color === "red"
    // 2nd iteration: $name === "pear" and $data->color === "yellow"
}
```
> Note:
>
> The value of `results` is not loaded into memory at once, but only one item in
> `results` at a time. There is always only one item in memory at a time at the level/subtree
> you are currently iterating. Thus, the memory consumption is constant.
<a name="parsing-nested-values"></a>
### Parsing nested values in arrays
The JSON Pointer spec also allows a hyphen (`-`) to be used instead of a specific array index. JSON Machine interprets
it as a wildcard which matches any **array index** (not any object key). This enables you to iterate nested values in
arrays without loading the whole item.

Example:
```json
// fruitsArray.json
{
    "results": [
        {
            "name": "apple",
            "color": "red"
        },
        {
            "name": "pear",
            "color": "yellow"
        }
    ]
}
```
To iterate over all colors of the fruits, use the JSON Pointer `"/results/-/color"`.
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruitsArray.json', ['pointer' => '/results/-/color']);

foreach ($fruits as $key => $value) {
    // 1st iteration:
    $key == 'color';
    $value == 'red';
    $fruits->getMatchedJsonPointer() == '/results/-/color';
    $fruits->getCurrentJsonPointer() == '/results/0/color';

    // 2nd iteration:
    $key == 'color';
    $value == 'yellow';
    $fruits->getMatchedJsonPointer() == '/results/-/color';
    $fruits->getCurrentJsonPointer() == '/results/1/color';
}
```
<a name="getting-scalar-values"></a>
### Parsing a single scalar value
You can parse a single scalar value anywhere in the document the same way as a collection. Consider this example:
```json
// fruits.json
{
    "lastModified": "2012-12-12",
    "apple": {
        "color": "red"
    },
    "pear": {
        "color": "yellow"
    },
    // ... gigabytes follow ...
}
```
Get the scalar value of the `lastModified` key like this:
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/lastModified']);
foreach ($fruits as $key => $value) {
    // 1st and final iteration:
    // $key === 'lastModified'
    // $value === '2012-12-12'
}
```
When the parser finds the value and yields it to you, it stops parsing. So when a single scalar value is at the beginning
of a gigabytes-sized file or stream, it just gets the value from the beginning in no time and with almost no memory
consumed.

The obvious shortcut is:
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/lastModified']);
$lastModified = iterator_to_array($fruits)['lastModified'];
```
Single scalar value access supports array indices in JSON Pointer as well.
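
If you do not want to repeat the key name when reading the result, a small helper that returns the first yielded value works too. The following is a minimal sketch: `firstValue()` is a hypothetical helper, and a generator stands in for `Items::fromFile(...)` so the snippet runs without the library.

```php
<?php

/**
 * Hypothetical helper: returns the first value an iterable yields and
 * stops iterating right away. Returns $default if nothing is yielded.
 */
function firstValue($items, $default = null)
{
    foreach ($items as $value) {
        return $value;
    }

    return $default;
}

// Stand-in for Items::fromFile('fruits.json', ['pointer' => '/lastModified']).
$fakeItems = (function () {
    yield 'lastModified' => '2012-12-12';
})();

$lastModified = firstValue($fakeItems);
```

Because the helper returns inside the first iteration, it never asks the underlying iterator for a second item.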
<a name="parsing-multiple-subtrees"></a>
### Parsing multiple subtrees
It is also possible to parse multiple subtrees using multiple JSON Pointers. Consider this example:
```json
// fruits.json
{
    "lastModified": "2012-12-12",
    "berries": [
        {
            "name": "strawberry", // not a berry, but whatever ...
            "color": "red"
        },
        {
            "name": "raspberry", // the same ...
            "color": "red"
        }
    ],
    "citruses": [
        {
            "name": "orange",
            "color": "orange"
        },
        {
            "name": "lime",
            "color": "green"
        }
    ]
}
```
To iterate over all berries and citrus fruits, use the JSON Pointers `["/berries", "/citruses"]`. The order of pointers
does not matter. The items will be iterated in the order of appearance in the document.
```php
<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', [
    'pointer' => ['/berries', '/citruses']
]);

foreach ($fruits as $key => $value) {
    // 1st iteration:
    $value == ["name" => "strawberry", "color" => "red"];
    $fruits->getCurrentJsonPointer() == '/berries';

    // 2nd iteration:
    $value == ["name" => "raspberry", "color" => "red"];
    $fruits->getCurrentJsonPointer() == '/berries';

    // 3rd iteration:
    $value == ["name" => "orange", "color" => "orange"];
    $fruits->getCurrentJsonPointer() == '/citruses';

    // 4th iteration:
    $value == ["name" => "lime", "color" => "green"];
    $fruits->getCurrentJsonPointer() == '/citruses';
}
```
<a name="json-pointer"></a>
### What is JSON Pointer anyway?
It's a way of addressing one item in a JSON document. See the [JSON Pointer RFC 6901](https://tools.ietf.org/html/rfc6901).
It's very handy, because sometimes the JSON structure goes deeper, and you want to iterate a subtree,
not the main level. So you just specify the pointer to the JSON array or object (or even to a scalar value) you want to iterate and off you go.
When the parser hits the collection you specified, iteration begins. You can pass it as the `pointer` option in all
`Items::from*` functions. If you specify a pointer to a non-existent position in the document, an exception is thrown.
It can be used to access scalar values as well. **JSON Pointer itself must be a valid JSON string**. Literal comparison
of reference tokens (the parts between slashes) is performed against the JSON document keys/member names.

Some examples:

| JSON Pointer value       | Will iterate through                                                                                      |
|--------------------------|-----------------------------------------------------------------------------------------------------------|
| (empty string - default) | `["this", "array"]` or `{"a": "this", "b": "object"}` will be iterated (main level)                       |
| `/result/items`          | `{"result": {"items": ["this", "array", "will", "be", "iterated"]}}`                                      |
| `/0/items`               | `[{"items": ["this", "array", "will", "be", "iterated"]}]` (supports array indices)                       |
| `/results/-/status`      | `{"results": [{"status": "iterated"}, {"status": "also iterated"}]}` (a hyphen as an array index wildcard)|
| `/` (gotcha! - a slash followed by an empty string, see the [spec](https://tools.ietf.org/html/rfc6901#section-5)) | `{"":["this","array","will","be","iterated"]}` |
| `/quotes\"`              | `{"quotes\"": ["this", "array", "will", "be", "iterated"]}`                                               |
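
When a pointer is assembled from key names at runtime, the two RFC 6901 escape sequences matter: `~` is written as `~0` and `/` as `~1`. The sketch below shows that escaping only. `buildJsonPointer()` is a hypothetical helper, not part of JSON Machine; it does not add the JSON-string escaping (such as `\"`) mentioned above, and it is worth checking that your library version decodes `~0` and `~1` as the RFC describes.

```php
<?php

/**
 * Hypothetical helper: builds a JSON Pointer from a list of reference tokens,
 * escaping "~" as "~0" and "/" as "~1" as described in RFC 6901.
 */
function buildJsonPointer(array $tokens): string
{
    $pointer = '';
    foreach ($tokens as $token) {
        // "~" is replaced first, so the "~" introduced by "~1" is not escaped again.
        $pointer .= '/' . str_replace(['~', '/'], ['~0', '~1'], (string) $token);
    }

    return $pointer;
}

$colors = buildJsonPointer(['results', '-', 'color']);
$escaped = buildJsonPointer(['a/b', 'm~n']);
```

An empty token list gives the empty string (the main level), and a single empty token gives `/`, matching the "gotcha" row in the table.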
<a name="options"></a>
## Options
Options may change how a JSON document is parsed. An array of options is the second parameter of all `Items::from*` functions.
Available options are:

- `pointer` - A JSON Pointer string that tells which part of the document you want to iterate, or an array of such strings (see [Parsing multiple subtrees](#parsing-multiple-subtrees)).
- `decoder` - An instance of the `ItemDecoder` interface.
- `debug` - `true` or `false` to enable or disable the debug mode. When the debug mode is enabled, data such as line,
  column and position in the document are available during parsing or in exceptions. Keeping debug disabled gives a slight
  performance advantage.
<a name="parsing-json-stream-api-responses"></a>
## Parsing streaming responses from a JSON API
A stream API response or any other JSON stream is parsed exactly the same way as a file is. The only difference
is that you use `Items::fromStream($streamResource)` for it, where `$streamResource` is the stream
resource with the JSON document. The rest is the same as with parsing files. Here are some examples of
popular HTTP clients which support streaming responses:

<a name="guzzlehttp"></a>
### GuzzleHttp
Guzzle uses its own streams, but they can be converted back to PHP streams by calling
`\GuzzleHttp\Psr7\StreamWrapper::getResource()`. Pass the result of this function to the
`Items::fromStream` function, and you're set up. See the working
[GuzzleHttp example](examples/guzzleHttp.php).
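
The steps above might look roughly like this. It is a sketch under assumptions, not the content of the linked example file: `itemsFromUrl()` is a hypothetical wrapper, the URL in the usage comment is a placeholder, and both `guzzlehttp/guzzle` and `halaxa/json-machine` must be installed for the function to be callable.

```php
<?php

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\StreamWrapper;
use JsonMachine\Items;

/**
 * Hypothetical wrapper: requests $url with Guzzle and returns lazily parsed items.
 */
function itemsFromUrl(string $url, array $options = [])
{
    $client = new Client();
    // 'stream' => true asks Guzzle not to download the whole body up front.
    $response = $client->request('GET', $url, ['stream' => true]);
    // Convert the PSR-7 body stream to a native PHP stream resource.
    $phpStream = StreamWrapper::getResource($response->getBody());

    return Items::fromStream($phpStream, $options);
}

// Usage (placeholder URL):
// foreach (itemsFromUrl('https://example.com/users.json') as $key => $user) { ... }
```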
<a name="symfony-httpclient"></a>
### Symfony HttpClient
A stream response of Symfony HttpClient works as an iterator. And because JSON Machine is
based on iterators, the integration with Symfony HttpClient is very simple. See the
[HttpClient example](examples/symfonyHttpClient.php).
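
The glue code is essentially a generator that unwraps each response chunk into a string. The following is a sketch under assumptions rather than a copy of the linked example: `httpClientChunks()` is a hypothetical helper, the chunk objects are stand-ins exposing `getContent()` like Symfony's chunk objects, and the commented-out usage assumes `Items::fromIterable()`.

```php
<?php

/**
 * Hypothetical helper: turns a stream of response chunks into a generator of strings.
 * Each chunk is expected to offer a getContent() method.
 */
function httpClientChunks($responseStream): Generator
{
    foreach ($responseStream as $chunk) {
        yield $chunk->getContent();
    }
}

// Stand-in chunk factory so the sketch runs without Symfony installed.
$fakeChunk = function (string $content) {
    return new class($content) {
        private $content;

        public function __construct(string $content)
        {
            $this->content = $content;
        }

        public function getContent(): string
        {
            return $this->content;
        }
    };
};

$json = implode('', iterator_to_array(
    httpClientChunks([$fakeChunk('{"a":'), $fakeChunk('1}')]),
    false
));

// With Symfony HttpClient and JSON Machine installed (assumed usage):
// $client = \Symfony\Component\HttpClient\HttpClient::create();
// $response = $client->request('GET', $url);
// $items = \JsonMachine\Items::fromIterable(httpClientChunks($client->stream($response)));
```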
<a name="tracking-parsing-progress"></a>
## Tracking the progress (with `debug` enabled)
Big documents may take a while to parse. Call `Items::getPosition()` in your `foreach` to get the current
count of bytes processed from the beginning. The percentage is then easy to calculate as `position / total * 100`.
To find out the total size of your document in bytes you may want to check:

- `strlen($document)` if you parse a string
- `filesize($file)` if you parse a file
- the `Content-Length` HTTP header if you parse an HTTP stream response
- ... you get the point

If `debug` is disabled, `getPosition()` always returns `0`.
```php
<?php

use JsonMachine\Items;

$fileSize = filesize('fruits.json');
$fruits = Items::fromFile('fruits.json', ['debug' => true]);
foreach ($fruits as $name => $data) {
    echo 'Progress: ' . intval($fruits->getPosition() / $fileSize * 100) . ' %';
}
```
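
When the total size is not guaranteed to be known (for instance a response without a `Content-Length` header), the division needs a guard. The helper below is a hypothetical sketch of that calculation, independent of the library:

```php
<?php

/**
 * Hypothetical helper: converts a byte position into a whole percentage.
 * Returns null when the total is unknown or zero.
 */
function progressPercent(int $position, int $total)
{
    if ($total <= 0) {
        return null; // avoid division by zero
    }

    // Clamp to 100 in case the reported total is smaller than the real size.
    return min(100, intdiv($position * 100, $total));
}

$quarter = progressPercent(50, 200);
$unknown = progressPercent(10, 0);
```

Inside the `foreach`, it would be called with `$fruits->getPosition()` as the first argument and the total size as the second.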
<a name="decoders"></a>
## Decoders
`Items::from*` functions also accept a `decoder` option. It must be an instance of
`JsonMachine\JsonDecoder\ItemDecoder`. If none is specified, `ExtJsonDecoder` is used by
default. It requires the `ext-json` PHP extension to be present, because it uses
`json_decode`. When `json_decode` doesn't do what you want, implement `JsonMachine\JsonDecoder\ItemDecoder`
and make your own.
<a name="available-decoders"></a>
### Available decoders
- **`ExtJsonDecoder`** - **Default.** Uses `json_decode` to decode keys and values.
  The constructor takes the same parameters as `json_decode`, apart from the JSON string itself.
- **`PassThruDecoder`** - Does no decoding. Both keys and values are produced as pure JSON strings.
  Useful when you want to parse a JSON item with something else directly in the `foreach`
  and don't want to implement `JsonMachine\JsonDecoder\ItemDecoder`. Since `1.0.0` it does not use `json_decode`.

Example:
```php
<?php

use JsonMachine\JsonDecoder\PassThruDecoder;
use JsonMachine\Items;

$items = Items::fromFile('path/to.json', ['decoder' => new PassThruDecoder]);
```

- **`ErrorWrappingDecoder`** - A decorator which wraps decoding errors inside a `DecodingError` object,
  thus enabling you to skip malformed items instead of dying on a `SyntaxError` exception.

Example:
```php
<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\DecodingError;
use JsonMachine\JsonDecoder\ErrorWrappingDecoder;
use JsonMachine\JsonDecoder\ExtJsonDecoder;

$items = Items::fromFile('path/to.json', ['decoder' => new ErrorWrappingDecoder(new ExtJsonDecoder())]);
foreach ($items as $key => $item) {
    if ($key instanceof DecodingError || $item instanceof DecodingError) {
        // handle error of this malformed json item
        continue;
    }
    var_dump($key, $item);
}
```
<a name="error-handling"></a>
## Error handling
Since 0.4.0 every exception extends `JsonMachineException`, so you can catch that to handle any error coming from the JSON Machine library.
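
A catch-all handler could be sketched as follows. Two things here are assumptions: the fully qualified class name `JsonMachine\Exception\JsonMachineException` (check it against your installed version) and the hypothetical `tryParse()` wrapper itself. A no-op closure stands in for real parsing so the snippet runs without the library.

```php
<?php

/**
 * Hypothetical wrapper: runs $consume and reports any JSON Machine failure
 * instead of letting it bubble up. Returns null on success, else the message.
 */
function tryParse(callable $consume)
{
    try {
        $consume();

        return null;
    } catch (\JsonMachine\Exception\JsonMachineException $e) {
        // Assumed namespace; per the text above, all library exceptions extend this class.
        return $e->getMessage();
    }
}

// Usage with the library installed (assumed file name):
// $error = tryParse(function () {
//     foreach (\JsonMachine\Items::fromFile('fruits.json') as $name => $data) {
//         // process $data
//     }
// });

$error = tryParse(function () {
    // Stand-in for real parsing so the sketch runs on its own.
});
```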
<a name="malformed-items"></a>
### Skipping malformed items
If there's an error anywhere in a JSON stream, a `SyntaxError` exception is thrown. That's very inconvenient,
because a single malformed item prevents you from parsing the rest of the document.
`ErrorWrappingDecoder` is a decoder decorator which can help you with that.
Wrap a decoder with it, and all malformed items you are iterating will be given to you in the `foreach` via
`DecodingError`. This way you can skip them and continue further with the document. See the example in
[Available decoders](#available-decoders). Syntax errors in the structure of a JSON stream between the iterated
items will still throw a `SyntaxError` exception though.
<a name="on-parser-efficiency"></a>
## Parser efficiency
The time complexity is always `O(n)`.
<a name="streams-files"></a>
### Streams / files
TL;DR: The memory complexity is `O(2)`, i.e. constant.

JSON Machine reads a stream (or a file) one JSON item at a time and generates one corresponding PHP item at a time.
This is the most efficient way, because if you had, say, 10,000 users in a JSON file and wanted to parse it using
`json_decode(file_get_contents('big.json'))`, you'd have the whole string in memory as well as all the 10,000
PHP structures. The following table shows the difference:

|                        | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode()`        |                            10000 |                                 10000 | 20000 |
| `Items::from*()`       |                                1 |                                     1 |     2 |

This means that JSON Machine is constantly efficient for any size of processed JSON. 100 GB no problem.
<a name="in-memory-json-strings"></a>
### In-memory JSON strings
TL;DR: The memory complexity is `O(n+1)`, i.e. linear in the size of the string.

There is also a method `Items::fromString()`. If you are
forced to parse a big string, and the stream is not available, JSON Machine may be better than `json_decode`.
The reason is that unlike `json_decode`, JSON Machine still traverses the JSON string one item at a time and doesn't
load all resulting PHP structures into memory at once.

Let's continue with the example with 10,000 users. This time they are all in a string in memory.
When decoding that string with `json_decode`, 10,000 arrays (objects) are created in memory and then the result
is returned. JSON Machine, on the other hand, creates a single structure for each item found in the string and yields it back
to you. When you process this item and iterate to the next one, another single structure is created. This is the same
behaviour as with streams/files. The following table puts the concept into perspective:

|                             | String items in memory at a time | Decoded PHP items in memory at a time | Total |
|-----------------------------|---------------------------------:|--------------------------------------:|------:|
| `json_decode()`             |                            10000 |                                 10000 | 20000 |
| `Items::fromString()`       |                            10000 |                                     1 | 10001 |

The reality is even better. `Items::fromString` consumes about **5x less memory** than `json_decode`. The reason is
that a PHP structure takes much more memory than its corresponding JSON representation.
<a name="troubleshooting"></a>
## Troubleshooting
<a name="step1"></a>
### "I'm still getting Allowed memory size ... exhausted"
One of the reasons may be that the items you want to iterate over are in some sub-key such as `"results"`
but you forgot to specify a JSON Pointer. See [Parsing a subtree](#parsing-a-subtree).
<a name="step2"></a>
### "That didn't help"
The other reason may be that one of the items you iterate is itself so huge it cannot be decoded at once.
For example, you iterate over users and one of them has thousands of "friend" objects in it.
Use `PassThruDecoder`, which does not decode an item, get the JSON string of the user
and parse it iteratively yourself using `Items::fromString()`.
```php
<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\PassThruDecoder;

$users = Items::fromFile('users.json', ['decoder' => new PassThruDecoder]);
foreach ($users as $user) {
    foreach (Items::fromString($user, ['pointer' => "/friends"]) as $friend) {
        // process friends one by one
    }
}
```
<a name="step3"></a>
### "I am still out of luck"
It probably means that the JSON string `$user` itself or one of the friends is too big to fit in memory.
However, you can try this approach recursively. Parse `"/friends"` with `PassThruDecoder`, getting one `$friend`
JSON string at a time, and then parse that using `Items::fromString()`... If even that does not help,
there's probably no solution yet via JSON Machine. A feature is planned which will enable you to iterate
any structure fully recursively, with strings served as streams.
<a name="installation"></a>
## Installation
### Using Composer
```bash
composer require halaxa/json-machine
```
### Without Composer
Clone or download this repository and add the following to your bootstrap file:
```php
spl_autoload_register(require '/path/to/json-machine/src/autoloader.php');
```
<a name="development"></a>
## Development
Clone this repository. This library supports two development approaches:
1. non containerized (PHP and composer already installed on your machine)
1. containerized (Docker on your machine)
<a name="non-containerized"></a>
### Non containerized
Run `composer run -l` in the project dir to see available dev scripts. This way you can run some steps
of the build process such as tests.
<a name="containerized"></a>
### Containerized
[Install Docker](https://docs.docker.com/install/) and run `make` in the project dir on your host machine
to see available dev tools/commands. You can run all the steps of the build process separately as well
as the whole build process at once. Make basically runs composer dev scripts inside containers in the background.
`make build`: Runs complete build. The same command is run via GitHub Actions CI.
<a name="support"></a>
## Support
Do you like this library? Star it, share it, show it :)
Issues and pull requests are very welcome.

[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/G2G57KTE4)
<a name="license"></a>
## License
Apache 2.0

Cogwheel element: Icons made by [TutsPlus](https://www.flaticon.com/authors/tutsplus)
from [www.flaticon.com](https://www.flaticon.com/)
is licensed by [CC 3.0 BY](http://creativecommons.org/licenses/by/3.0/)

<i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i>