Chapter 8: Working with CSV, JSON, and XML
"Data formats are like opinions - everyone has one, and they all stink except yours." - DevOps Proverb
In a perfect world, all data would be in one format. In reality, you're juggling CSV exports from Excel, JSON from REST APIs, and XML from that enterprise system installed in 2003. Perl doesn't judge—it parses them all. This chapter shows you how to read, write, and transform between these formats without losing your sanity.
CSV: The Deceptively Simple Format
CSV looks simple. It's just commas and values, right? Wrong. There are quotes, escaped quotes, embedded newlines, different delimiters, encodings, and Excel's creative interpretations. Don't write your own CSV parser. Use Text::CSV.
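To see why, here is a quick sketch (the sample line is made up): a single quoted field with embedded commas and an escaped quote defeats a naive split, while Text::CSV parses it correctly.

#!/usr/bin/env perl
use Modern::Perl '2018';
use Text::CSV;

# One record, three fields; the first field contains commas,
# the third contains an escaped quote
my $line = '"Smith, Jr., John",42,"He said ""hi"""';

my @naive = split /,/, $line;
say scalar @naive;    # 5 -- wrong, the quoted commas split too

my $csv = Text::CSV->new({ binary => 1 });
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;
say scalar @fields;   # 3
say $fields[2];       # He said "hi"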
Reading CSV Files
#!/usr/bin/env perl
use Modern::Perl '2018';
use Text::CSV;

# Basic CSV reading
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<:encoding(utf8)', 'data.csv' or die $!;

# Read header row
my $headers = $csv->getline($fh);
$csv->column_names(@$headers);

# Read data rows as hashrefs
while (my $row = $csv->getline_hr($fh)) {
    say "Processing user: $row->{username}";
    say "  Email: $row->{email}";
    say "  Status: $row->{status}";
}
close $fh;
# Or slurp all at once
sub read_csv {
    my ($filename) = @_;
    my $csv = Text::CSV->new({
        binary              => 1,
        auto_diag           => 1,
        sep_char            => ',',    # could be ';' or "\t" for TSV
        quote_char          => '"',
        escape_char         => '"',
        allow_loose_quotes  => 1,
        allow_loose_escapes => 1,
    });
    open my $fh, '<:encoding(utf8)', $filename or die $!;

    # Read all rows
    my $rows = $csv->getline_all($fh);
    close $fh;

    # Convert to array of hashrefs
    my $headers = shift @$rows;
    my @data = map {
        my %hash;
        @hash{@$headers} = @$_;
        \%hash;
    } @$rows;
    return \@data;
}
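A quick usage sketch (the filename and column name are assumptions):

my $users = read_csv('users.csv');
say "Loaded ", scalar @$users, " rows";
say $users->[0]{username} if @$users;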
Writing CSV Files
# Write CSV with proper escaping
sub write_csv {
    my ($filename, $data, $headers) = @_;
    my $csv = Text::CSV->new({
        binary    => 1,
        auto_diag => 1,
        eol       => "\n",    # append a line ending to each row
    });
    open my $fh, '>:encoding(utf8)', $filename or die $!;

    if (@$data && ref $data->[0] eq 'HASH') {
        # Hashref rows: use the given column order, or derive it from the first row
        my @headers = $headers ? @$headers : sort keys %{$data->[0]};
        $csv->say($fh, \@headers);
        $csv->say($fh, [@$_{@headers}]) for @$data;
    } else {
        # Array-of-arrays rows: optional header row, then the data
        $csv->say($fh, $headers) if $headers;
        $csv->say($fh, $_) for @$data;
    }
    close $fh;
}
# Example: Generate report
my @report_data = (
    { date => '2024-01-15', sales => 1500, region => 'North' },
    { date => '2024-01-15', sales => 2300, region => 'South' },
    { date => '2024-01-16', sales => 1800, region => 'North' },
);
write_csv('sales_report.csv', \@report_data);
Handling Problematic CSV
# Deal with Excel's CSV quirks
sub parse_excel_csv {
    my ($filename) = @_;

    # Excel likes to prepend a BOM; peek at the first line to choose a delimiter
    open my $fh, '<:encoding(utf8)', $filename or die $!;
    my $first_line = <$fh> // '';
    $first_line =~ s/^\x{FEFF}//;    # ignore the BOM for detection purposes
    seek($fh, 0, 0);    # rewind (note: the BOM is still in the stream, so
                        # strip it from the first header field after parsing)

    my $csv = Text::CSV->new({
        binary    => 1,
        auto_diag => 1,
        # Excel uses semicolons as the separator in some locales
        sep_char => index($first_line, ';') > -1 ? ';' : ',',
        allow_whitespace => 1,    # handle extra spaces around fields
        blank_is_undef   => 1,    # empty cells become undef
    });
    # Process file...
}
# Handle malformed CSV
sub parse_dirty_csv {
    my ($filename) = @_;
    my @clean_data;
    my $csv = Text::CSV->new({ binary => 1 });    # build once, reuse for each line

    open my $fh, '<', $filename or die $!;
    while (my $line = <$fh>) {
        chomp $line;
        # Try Text::CSV first
        if ($csv->parse($line)) {
            push @clean_data, [$csv->fields];
        } else {
            # Fall back to manual parsing for broken lines
            warn "Failed to parse line $.: " . $csv->error_input;
            # Simple split (dangerous: breaks on quoted commas, but sometimes necessary)
            my @fields = split /,/, $line;
            s/^\s+|\s+$//g for @fields;    # trim
            push @clean_data, \@fields;
        }
    }
    close $fh;
    return \@clean_data;
}
JSON: The Modern Standard
JSON is everywhere. REST APIs speak it, configuration files use it, and JavaScript loves it. Perl handles JSON beautifully with JSON::XS (fast) or JSON::PP (pure Perl).
Reading and Writing JSON
use JSON::XS;

# Create JSON encoder/decoder
my $json = JSON::XS->new->utf8->pretty->canonical;

# Decode JSON
my $json_text = '{"name":"Alice","age":30,"active":true}';
my $data = $json->decode($json_text);
say "Name: $data->{name}";

# Encode to JSON
my $perl_data = {
    servers => [
        { name => 'web01', ip => '10.0.1.1', status => 'running' },
        { name => 'web02', ip => '10.0.1.2', status => 'running' },
        { name => 'db01',  ip => '10.0.2.1', status => 'stopped' },
    ],
    updated => time(),
    version => '1.0',
};
my $json_output = $json->encode($perl_data);
print $json_output;
# File operations
sub read_json {
    my ($filename) = @_;
    # decode_json expects raw UTF-8 octets, so read bytes, not characters
    open my $fh, '<:raw', $filename or die $!;
    local $/;    # slurp mode
    my $json_text = <$fh>;
    close $fh;
    return decode_json($json_text);    # using the functional interface
}

sub write_json {
    my ($filename, $data) = @_;
    my $json = JSON::XS->new->utf8->pretty->canonical;
    # ->utf8 makes encode() emit octets, so write bytes, not characters
    open my $fh, '>:raw', $filename or die $!;
    print $fh $json->encode($data);
    close $fh;
}
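A quick round-trip sketch using these helpers (the filename and keys are hypothetical):

my $config = read_json('config.json');
$config->{version} = '1.1';
$config->{updated} = time();
write_json('config.json', $config);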
Advanced JSON Handling
# Custom JSON encoder settings
my $json = JSON::XS->new
    ->utf8(1)               # encode to / decode from UTF-8 octets
    ->pretty(1)             # human-readable output
    ->canonical(1)          # sort keys for consistent output
    ->allow_nonref(1)       # allow non-reference values
    ->allow_blessed(1)      # allow blessed references
    ->convert_blessed(1)    # call TO_JSON method on objects
    ->max_depth(512)        # prevent runaway recursion
    ->max_size(10_000_000); # refuse to decode documents over 10MB

# Handle special values
my $data = {
    string  => "Hello",
    number  => 42,
    float   => 3.14,
    true    => \1,       # scalar refs \1 and \0 encode as JSON true/false
    false   => \0,
    null    => undef,    # undef encodes as JSON null
    unicode => "Hello 世界 🌍",
};
# Streaming JSON parser for large files (concatenated JSON or NDJSON)
sub process_json_stream {
    my ($filename, $callback) = @_;
    # Read raw octets; the ->utf8 parser does its own UTF-8 decoding
    open my $fh, '<:raw', $filename or die $!;
    my $parser = JSON::XS->new->utf8;    # incr_parse provides incremental parsing

    while (my $chunk = <$fh>) {
        $parser->incr_parse($chunk);    # feed the parser
        # Extract complete JSON values as they become available
        while (my $obj = $parser->incr_parse) {
            $callback->($obj);
        }
    }
    close $fh;
}
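Hypothetical usage, assuming events.log holds one JSON object per line (NDJSON) with a type field:

process_json_stream('events.log', sub {
    my ($event) = @_;
    say "Event: $event->{type}" if ref $event eq 'HASH';
});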
# JSON with schema validation
use JSON::Validator;

my $validator = JSON::Validator->new;
$validator->schema({
    type     => 'object',
    required => ['name', 'email'],
    properties => {
        name  => { type => 'string', minLength => 1 },
        email => { type => 'string', format => 'email' },
        age   => { type => 'integer', minimum => 0, maximum => 150 },
    },
});

my @errors = $validator->validate({
    name  => 'Alice',
    email => 'alice@example.com',
    age   => 30,
});
die "Validation failed: @errors" if @errors;
JSON Transformations
# Convert between JSON and other formats
sub csv_to_json {
    my ($csv_file, $json_file) = @_;

    # Read CSV
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<:encoding(utf8)', $csv_file or die $!;
    my $headers = $csv->getline($fh);
    $csv->column_names(@$headers);
    my @data;
    while (my $row = $csv->getline_hr($fh)) {
        push @data, $row;
    }
    close $fh;

    # Write JSON (write_json defined above)
    write_json($json_file, \@data);
}
# Transform JSON structure
sub transform_json {
    my ($data) = @_;

    # Flatten the nested users/orders structure into one row per order
    my @flattened;
    for my $user (@{$data->{users}}) {
        for my $order (@{$user->{orders}}) {
            push @flattened, {
                user_id     => $user->{id},
                user_name   => $user->{name},
                order_id    => $order->{id},
                order_total => $order->{total},
                order_date  => $order->{date},
            };
        }
    }
    return \@flattened;
}
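A tiny worked example with made-up data shows the flattening:

my $nested = {
    users => [
        { id => 1, name => 'Alice', orders => [
            { id => 101, total => 99.50, date => '2024-01-15' },
            { id => 102, total => 12.00, date => '2024-01-16' },
        ] },
    ],
};
my $rows = transform_json($nested);
say scalar @$rows;            # 2 -- one row per order
say $rows->[0]{user_name};    # Alice
# Flat rows like these convert cleanly to CSV; the nested original does not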
XML: The Enterprise Format
XML is verbose, complex, and everywhere in enterprise systems. Perl has excellent XML support through various modules. We'll focus on XML::LibXML (fast and standards-compliant) and XML::Simple (for simple cases).
XML::Simple for Basic Tasks
use XML::Simple;
use Data::Dumper;
# Read simple XML
my $xml_text = <<'XML';
<config>
<server name="web01" ip="10.0.1.1" status="active"/>
<server name="web02" ip="10.0.1.2" status="active"/>
<server name="db01" ip="10.0.2.1" status="maintenance"/>
</config>
XML
my $data = XMLin($xml_text,
ForceArray => ['server'], # Always make array
KeyAttr => { server => 'name' }, # Use 'name' as hash key
);
print Dumper($data);
# Write XML
my $output = XMLout($data,
RootName => 'config',
NoAttr => 0, # Use attributes
XMLDecl => '<?xml version="1.0" encoding="UTF-8"?>',
);
print $output;
XML::LibXML for Serious XML Work
use XML::LibXML;

# Parse XML
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml_text);
# ...or parse from a file instead:
# my $doc = $parser->parse_file('config.xml');

# XPath queries
my @servers = $doc->findnodes('//server[@status="active"]');
for my $server (@servers) {
    say "Active server: " . $server->getAttribute('name');
    say "  IP: " . $server->getAttribute('ip');
}

# Modify XML
my ($db_server) = $doc->findnodes('//server[@name="db01"]');
$db_server->setAttribute('status', 'active');
$db_server->appendTextChild('note', 'Maintenance completed');

# Create new XML (sample inventory data for illustration)
my @inventory = (
    { id => 1, quantity => 10, name => 'Widget', price => '9.99' },
);
my $new_doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $new_doc->createElement('inventory');
$new_doc->setDocumentElement($root);

for my $item (@inventory) {
    my $elem = $new_doc->createElement('item');
    $elem->setAttribute('id',       $item->{id});
    $elem->setAttribute('quantity', $item->{quantity});
    $elem->appendTextChild('name',  $item->{name});
    $elem->appendTextChild('price', $item->{price});
    $root->appendChild($elem);
}
print $new_doc->toString(1);    # pretty print
Parsing Complex XML
# Parse XML with namespaces
my $xml_with_ns = <<'XML';
<root xmlns:app="http://example.com/app">
    <app:user id="1">
        <app:name>Alice</app:name>
        <app:email>alice@example.com</app:email>
    </app:user>
</root>
XML

my $doc = $parser->parse_string($xml_with_ns);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('app', 'http://example.com/app');

my @users = $xpc->findnodes('//app:user');
for my $user (@users) {
    my $name  = $xpc->findvalue('app:name',  $user);
    my $email = $xpc->findvalue('app:email', $user);
    say "User: $name <$email>";
}
# Stream parsing for large XML
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants

sub parse_large_xml {
    my ($filename, $callback) = @_;
    my $reader = XML::LibXML::Reader->new(location => $filename)
        or die "Cannot read $filename";

    while ($reader->read) {
        if ($reader->nodeType == XML_READER_TYPE_ELEMENT
            && $reader->name eq 'record') {
            my $node = $reader->copyCurrentNode(1);    # 1 = deep copy with children
            $callback->($node);
        }
    }
}
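Hypothetical usage, assuming each record element carries an id child:

parse_large_xml('huge_export.xml', sub {
    my ($record) = @_;
    say "Record id: " . $record->findvalue('./id');
});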
# Validate against XSD
sub validate_xml {
    my ($xml_file, $xsd_file) = @_;
    my $schema = XML::LibXML::Schema->new(location => $xsd_file);
    my $doc    = $parser->parse_file($xml_file);

    eval { $schema->validate($doc) };
    die "Validation failed: $@" if $@;
    say "XML is valid!";
}
Format Conversion Utilities
Universal Data Converter
#!/usr/bin/env perl
use Modern::Perl '2018';
use feature 'signatures';
no warnings 'experimental::signatures';

package DataConverter;

use Text::CSV;
use JSON::XS;
use XML::Simple;
use YAML::XS;

sub new($class) {
    return bless {}, $class;
}

sub convert($self, $input_file, $output_file, $from_format, $to_format) {
    my $data = $self->read($input_file, $from_format);
    $self->write($output_file, $data, $to_format);
}
sub read($self, $file, $format) {
    my $method = "read_$format";
    die "Unknown format: $format" unless $self->can($method);
    return $self->$method($file);
}

sub write($self, $file, $data, $format) {
    my $method = "write_$format";
    die "Unknown format: $format" unless $self->can($method);
    return $self->$method($file, $data);
}

sub read_json($self, $file) {
    # decode_json wants raw UTF-8 octets
    open my $fh, '<:raw', $file or die $!;
    local $/;
    my $json = <$fh>;
    close $fh;
    return decode_json($json);
}

sub write_json($self, $file, $data) {
    my $json = JSON::XS->new->utf8->pretty->canonical;
    # ->utf8 already produces octets, so write bytes
    open my $fh, '>:raw', $file or die $!;
    print $fh $json->encode($data);
    close $fh;
}
sub read_csv($self, $file) {
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<:encoding(utf8)', $file or die $!;
    my $headers = $csv->getline($fh);
    $csv->column_names(@$headers);
    my @data;
    while (my $row = $csv->getline_hr($fh)) {
        push @data, $row;
    }
    close $fh;
    return \@data;
}

sub write_csv($self, $file, $data) {
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n" });
    open my $fh, '>:encoding(utf8)', $file or die $!;
    # Assume an array of hashrefs
    if (@$data && ref $data->[0] eq 'HASH') {
        my @headers = sort keys %{$data->[0]};
        $csv->say($fh, \@headers);
        for my $row (@$data) {
            $csv->say($fh, [@$row{@headers}]);
        }
    }
    close $fh;
}
sub read_xml($self, $file) {
    return XMLin($file,
        ForceArray => 1,
        KeyAttr    => [],
    );
}

sub write_xml($self, $file, $data) {
    my $xml = XMLout($data,
        RootName => 'data',
        XMLDecl  => '<?xml version="1.0" encoding="UTF-8"?>',
    );
    open my $fh, '>:encoding(utf8)', $file or die $!;
    print $fh $xml;
    close $fh;
}

sub read_yaml($self, $file) {
    return YAML::XS::LoadFile($file);
}

sub write_yaml($self, $file, $data) {
    YAML::XS::DumpFile($file, $data);
}
package main;
# Use the converter
my $converter = DataConverter->new();
# Convert CSV to JSON
$converter->convert('data.csv', 'data.json', 'csv', 'json');
# Convert JSON to XML
$converter->convert('data.json', 'data.xml', 'json', 'xml');
# Convert XML to YAML
$converter->convert('data.xml', 'data.yaml', 'xml', 'yaml');
Real-World Example: API Data Pipeline
#!/usr/bin/env perl
use Modern::Perl '2018';
use LWP::UserAgent;
use JSON::XS;
use Text::CSV;
use Try::Tiny;

# Fetch data from API, process, and export
sub api_data_pipeline {
    my ($api_url, $output_file) = @_;

    # Fetch from API
    my $ua = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->get($api_url);
    die "API request failed: " . $response->status_line
        unless $response->is_success;

    # Parse JSON response (->content returns the raw octets decode_json wants)
    my $data = decode_json($response->content);

    # Transform data
    my $processed = process_api_data($data);

    # Export to multiple formats
    export_data($processed, $output_file);
}
sub process_api_data {
    my ($data) = @_;
    my @processed;
    for my $item (@{$data->{results}}) {
        push @processed, {
            id        => $item->{id},
            name      => clean_text($item->{name}),
            value     => sprintf("%.2f", $item->{value} || 0),
            timestamp => format_timestamp($item->{created_at}),
            status    => normalize_status($item->{status}),
        };
    }
    return \@processed;
}

sub clean_text {
    my ($text) = @_;
    $text //= '';
    $text =~ s/^\s+|\s+$//g;    # trim
    $text =~ s/\s+/ /g;         # normalize internal whitespace
    return $text;
}

sub format_timestamp {
    my ($ts) = @_;
    # Convert various timestamp formats to ISO 8601;
    # the implementation depends on the input format
    return $ts;
}
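The stub above is where real parsing would go. Here is a minimal sketch, assuming the API sends ISO-8601-like strings; Time::Piece ships with core Perl:

use Time::Piece;

sub format_timestamp_iso {
    my ($ts) = @_;
    return '' unless defined $ts;
    (my $norm = $ts) =~ s/ /T/;    # "2024-01-15 10:30:00" -> ISO form
    $norm =~ s/Z\z//;              # drop a trailing UTC marker
    my $t = eval { Time::Piece->strptime($norm, '%Y-%m-%dT%H:%M:%S') };
    return $t ? $t->datetime : $ts;    # fall back to the raw value
}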
sub normalize_status {
    my ($status) = @_;
    my %status_map = (
        active   => 'active',
        enabled  => 'active',
        disabled => 'inactive',
        deleted  => 'archived',
    );
    return $status_map{lc($status // '')} // 'unknown';
}
sub export_data {
    my ($data, $base_filename) = @_;

    # Export as JSON (write_json from earlier in this chapter)
    my $json_file = "$base_filename.json";
    write_json($json_file, $data);
    say "Exported to $json_file";

    # Export as CSV (write_csv from earlier in this chapter)
    my $csv_file = "$base_filename.csv";
    write_csv($csv_file, $data);
    say "Exported to $csv_file";

    # Generate summary
    generate_summary($data, "$base_filename.summary.txt");
}
sub generate_summary {
    my ($data, $filename) = @_;
    open my $fh, '>:encoding(utf8)', $filename or die $!;

    my %stats = (
        total     => scalar(@$data),
        by_status => {},
    );
    for my $item (@$data) {
        $stats{by_status}{$item->{status}}++;
    }

    print $fh "Data Export Summary\n";
    print $fh "=" x 40 . "\n";
    print $fh "Total Records: $stats{total}\n\n";
    print $fh "By Status:\n";
    for my $status (sort keys %{$stats{by_status}}) {
        printf $fh "  %-15s: %d\n", $status, $stats{by_status}{$status};
    }
    close $fh;
    say "Summary written to $filename";
}
Performance Considerations
- Choose the right parser - XML::LibXML is faster than XML::Simple
- Use streaming for large files - Don't load 1GB XML into memory
- Prefer JSON::XS over JSON::PP - 10-100x faster
- Cache parsed data - Parse once, use many times
- Validate incrementally - Don't wait until the end to find errors
- Use binary mode for CSV - binary => 1 lets fields carry newlines and non-ASCII data
- Consider format limitations - CSV can't represent nested data well
Best Practices
- Always validate input - Never trust external data
- Handle encoding explicitly - UTF-8 everywhere
- Use appropriate error handling - Malformed data is common
- Document format expectations - Include sample files
- Test with real-world data - Edge cases are everywhere
- Provide format detection - Auto-detect when possible (a sketch follows this list)
- Keep original data - Transform copies, not originals
- Version your schemas - Data formats evolve
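For the format-detection bullet above, here is a minimal sketch: extension first, then a peek at the first bytes. The heuristics are assumptions, not a complete detector; the returned names match the DataConverter methods shown earlier.

sub detect_format {
    my ($filename) = @_;
    if ($filename =~ /\.(json|csv|xml|ya?ml)\z/i) {
        my $ext = lc $1;
        return $ext eq 'yml' ? 'yaml' : $ext;
    }
    open my $fh, '<:raw', $filename or die $!;
    read $fh, my $head, 512;
    close $fh;
    $head =~ s/\A\xEF\xBB\xBF//;    # strip a UTF-8 BOM if present
    return 'xml'  if $head =~ /\A\s*</;
    return 'json' if $head =~ /\A\s*[\{\[]/;
    return 'yaml' if $head =~ /\A---/;
    return 'csv';    # last-resort guess
}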
Conclusion
Data comes in many formats, but Perl handles them all with aplomb. Whether you're parsing gigabytes of CSV, streaming JSON from APIs, or wrestling with enterprise XML, Perl has the tools you need. The key is choosing the right tool for each format and understanding its quirks.
Remember: every data format has edge cases. Plan for them, test for them, and handle them gracefully. Your future self will thank you when that "perfectly formatted" data file turns out to be anything but.
Next: Log file analysis and monitoring. We'll build real-time log processors that can handle anything from Apache access logs to custom application outputs.