The Dukascopy Downloader Rewrite: Extracting Node.js from a Java Trading Monorepo
Why I replaced a Python/Node hybrid data pipeline with 166 lines of pure Java 21, removing shell dependencies, fixing cross-platform crashes, and cutting build complexity in a Maven monorepo.
Every trading system needs market data. For forex, Dukascopy offers free historical tick data going back to 2003. The catch: it ships in a proprietary binary format (BI5) compressed with LZMA, served hour-by-hour from a REST API. Getting that into a Java backtest engine used to require three runtime dependencies: Python 3, Node.js, and a shell. This is the story of pulling all three out.
The Original Stack: Python Plus Node Plus Shell
When the trading-bridge monorepo first needed Dukascopy data, the pipeline was a 157-line Python script (dukascopy-downloader.py) that did the following:
- Construct hourly BI5 URLs and fetch them via urllib
- Decompress the GZIP-wrapped response (Dukascopy wraps BI5 in GZIP)
- Parse the binary BI5 format using Python struct.unpack
- Output CSV files consumable by the Java DataLoader
This worked for months. But it had three structural problems.
First, the shell coupling. The download was invoked from Java via ProcessBuilder running python3, which meant the build had an undeclared dependency on the host Python environment. Different Python versions, missing modules, PATH resolution failures on Windows. Every CI runner and every developer workstation needed an identical Python setup just to download data.
Second, the Node.js decompression. The original prototype also used a Node.js helper for LZMA decompression via the lzma-native npm package. That meant npm install in a Maven project. Two package managers, two dependency trees, two sets of cross-platform headaches. The Python script eventually moved to gzip.decompress, but the LZMA variant still surfaced on some systems.
Third, the testability ceiling. An integration test that shells out to Python is not a unit test. It cannot run offline. It cannot run without the full Python toolchain installed. It adds seconds of overhead per invocation. The DukascopyDownloaderTest had to be rewritten to use mock ticks just to get deterministic results.
The Pure Java Rewrite: 166 Lines, No External Runtimes
The rewrite targeted a single goal: read BI5 files from Dukascopy using only the JDK standard library plus one permissively licensed decompression library (XZ for Java, which provides LZMAInputStream).
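    import java.io.IOException;
    import java.net.http.HttpClient;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.time.LocalDate;

    public class DukascopyDownloader {
        private final HttpClient httpClient;

        public DukascopyDownloader() {
            // One shared client: follows redirects, gives up connecting after 10 seconds
            this.httpClient = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.ALWAYS)
                    .connectTimeout(Duration.ofSeconds(10))
                    .build();
        }

        public Path downloadRange(String pair, LocalDate start, LocalDate end,
                                  String timeframe, Path outputDir) throws IOException {
            // ...
        }
    }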
Java 21's HttpClient (introduced as a standard client in Java 11, matured by 21) handles the HTTP layer. No Apache HttpComponents, no OkHttp, no RestTemplate. The JDK's built-in client supports redirects, timeouts, and async body handlers out of the box.
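To make the HTTP layer concrete, here is a minimal sketch of how a request for one hour of ticks might be built. The base URL, the zero-based month in the path, and the names Bi5Urls, hourlyUrl, and hourlyRequest are assumptions for illustration; the article does not show how DukascopyDownloader actually constructs its URLs.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Locale;

public class Bi5Urls {
    // Assumption: public datafeed base URL, not stated in the article.
    static final String BASE = "https://datafeed.dukascopy.com/datafeed";

    /** Builds the URL for one hour of ticks. Assumes the path uses a zero-based month. */
    static String hourlyUrl(String pair, LocalDateTime hourUtc) {
        return String.format(Locale.ROOT, "%s/%s/%04d/%02d/%02d/%02dh_ticks.bi5",
                BASE,
                pair.toUpperCase(Locale.ROOT),
                hourUtc.getYear(),
                hourUtc.getMonthValue() - 1, // assumption: January is "00"
                hourUtc.getDayOfMonth(),
                hourUtc.getHour());
    }

    /** Wraps the URL in a GET request with a per-request timeout. */
    static HttpRequest hourlyRequest(String pair, LocalDateTime hourUtc) {
        return HttpRequest.newBuilder(URI.create(hourlyUrl(pair, hourUtc)))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }
}
```

A request built this way would be sent with httpClient.send(request, HttpResponse.BodyHandlers.ofByteArray()), which yields the raw compressed bytes for the decompression step.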
Parsing the BI5 Format
Each Dukascopy hourly file contains raw ticks in a compact binary format: 20 bytes per tick, structured as five big-endian 32-bit fields (three integers followed by two floats).
| Offset | Field | Description |
|---|---|---|
| 0-3 | timeOffsetMs | Milliseconds since the start of the hour |
| 4-7 | ask | Ask price multiplied by 100,000 |
| 8-11 | bid | Bid price multiplied by 100,000 |
| 12-15 | askVol | Ask volume (32-bit float) |
| 16-19 | bidVol | Bid volume (32-bit float) |
Parsing this in Java requires a ByteBuffer in big-endian byte order, which is the default:
    ByteBuffer wrap = ByteBuffer.wrap(decompressed); // big-endian by default
    while (wrap.remaining() >= 20) {
        int timeOffsetMs = wrap.getInt();        // millis since hour start
        double ask = wrap.getInt() / 100000.0;   // price to 5 decimal places
        double bid = wrap.getInt() / 100000.0;
        float askVol = wrap.getFloat();
        float bidVol = wrap.getFloat();
        // aggregate ticks into bars
    }
The trick: Dukascopy stores forex prices as integers scaled by 100,000, so each price fits in a 32-bit field without floating-point rounding error. Dividing by 100,000 after reading gives the standard 5-decimal forex price.
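The record layout can be checked offline with a small round trip: encode one synthetic tick into 20 bytes, then parse it back. This is a sketch under the layout in the table above; Bi5Parser, Tick, and encode are hypothetical names, and the fixed scale of 100,000 assumes a pair quoted to five decimals.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class Bi5Parser {
    static final int RECORD_BYTES = 20;
    // Assumption: 5-decimal pair. Pairs quoted to fewer decimals would need a different scale.
    static final double PRICE_SCALE = 100_000.0;

    record Tick(int timeOffsetMs, double ask, double bid, float askVol, float bidVol) {}

    /** Parses every complete 20-byte record; trailing partial bytes are ignored. */
    static List<Tick> parse(byte[] decompressed) {
        ByteBuffer buf = ByteBuffer.wrap(decompressed); // big-endian by default
        List<Tick> ticks = new ArrayList<>();
        while (buf.remaining() >= RECORD_BYTES) {
            int timeOffsetMs = buf.getInt();
            double ask = buf.getInt() / PRICE_SCALE;
            double bid = buf.getInt() / PRICE_SCALE;
            float askVol = buf.getFloat();
            float bidVol = buf.getFloat();
            ticks.add(new Tick(timeOffsetMs, ask, bid, askVol, bidVol));
        }
        return ticks;
    }

    /** Builds one synthetic record, useful as a mock tick in offline tests. */
    static byte[] encode(int timeOffsetMs, int askPoints, int bidPoints,
                         float askVol, float bidVol) {
        return ByteBuffer.allocate(RECORD_BYTES)
                .putInt(timeOffsetMs)
                .putInt(askPoints)
                .putInt(bidPoints)
                .putFloat(askVol)
                .putFloat(bidVol)
                .array();
    }
}
```

For example, Bi5Parser.parse(Bi5Parser.encode(1500, 110523, 110510, 1.5f, 2.25f)) returns a single tick at 1,500 ms into the hour with an ask near 1.10523.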
LZMA Decompression via XZ for Java
Dukascopy serves hourly BI5 files wrapped in LZMA compression. The XZ for Java library (org.tukaani:xz) provides a zero-dependency LZMAInputStream that integrates directly with ByteArrayInputStream:
    try (LZMAInputStream lzma = new LZMAInputStream(new ByteArrayInputStream(data))) {
        byte[] decompressed = lzma.readAllBytes();
        // now parse BI5 binary
    }
This eliminated the Node.js lzma-native dependency entirely. One Maven dependency replaced an entire npm toolchain.
Tick-to-Bar Aggregation
The raw Dukascopy API returns ticks. The downloader aggregates them into bars (H1, M1, or custom timeframes) on the fly:
    long barDurationMs = tfLower.equals("m1") ? 60 * 1000L : 3600 * 1000L;
    long barStartMs = (tickTimeMs / barDurationMs) * barDurationMs;
    if (currentBarStartMs == -1) {
        // first tick: open the first bar
        currentBarStartMs = barStartMs;
        open = high = low = close = bid;
    } else if (barStartMs == currentBarStartMs) {
        // same bar: extend the range and move the close
        high = Math.max(high, bid);
        low = Math.min(low, bid);
        close = bid;
    } else {
        // bar boundary crossed: emit the finished bar, then reset for the new one
        allCandles.add(new Candle(currentBarStartMs, open, high, low, close));
        currentBarStartMs = barStartMs;
        open = high = low = close = bid;
    }
No database, no intermediate file, no streaming framework. Just integer arithmetic and a running state machine.
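The excerpt above shows the per-tick step only. A self-contained sketch of the whole loop might look like the following, including the final flush that emits the last open bar once the ticks run out. BarAggregator and its Tick and Candle records are hypothetical stand-ins, and the sketch assumes ticks arrive in ascending time order with non-negative timestamps.

```java
import java.util.ArrayList;
import java.util.List;

public class BarAggregator {
    record Tick(long timeMs, double bid) {}
    record Candle(long startMs, double open, double high, double low, double close) {}

    /** Groups bid ticks into fixed-duration bars keyed by the bar's start time. */
    static List<Candle> aggregate(List<Tick> ticks, long barDurationMs) {
        List<Candle> candles = new ArrayList<>();
        long currentStart = -1;
        double open = 0, high = 0, low = 0, close = 0;

        for (Tick t : ticks) {
            // Integer division floors the timestamp to the start of its bar
            long barStart = (t.timeMs() / barDurationMs) * barDurationMs;
            if (barStart != currentStart) {
                if (currentStart != -1) {
                    candles.add(new Candle(currentStart, open, high, low, close));
                }
                currentStart = barStart;
                open = high = low = close = t.bid();
            } else {
                high = Math.max(high, t.bid());
                low = Math.min(low, t.bid());
                close = t.bid();
            }
        }
        // Flush the bar that was still open when the input ended
        if (currentStart != -1) {
            candles.add(new Candle(currentStart, open, high, low, close));
        }
        return candles;
    }
}
```

Without the final flush, the last bar of every download would be silently dropped, which is an easy bug to miss when the input ends mid-bar.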
Error Handling: The 404 Case
Weekends and holidays return HTTP 404 for hours with no trading activity. The original Python script caught exceptions silently and returned empty lists. The Java version distinguishes three cases:
- HTTP 200: decompress and parse
- HTTP 404: return null (no data, not an error)
- Any other code: throw IOException
This lets the caller differentiate between a quiet market hour and a real connectivity failure.
    if (response.statusCode() == 200) {
        return response.body();
    } else if (response.statusCode() == 404) {
        return null; // no data for this hour
    } else {
        throw new IOException("HTTP " + response.statusCode() + " for URL: " + url);
    }
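On the calling side, the three-way split might be consumed as in the sketch below: a null body is counted as a quiet hour and skipped, while an IOException aborts the day. HourlyLoop, HourFetcher, and Summary are hypothetical names for illustration, not part of the actual downloader.

```java
import java.io.IOException;

public class HourlyLoop {
    /** Hypothetical stand-in for the per-hour download: returns null on HTTP 404. */
    interface HourFetcher {
        byte[] fetch(int hour) throws IOException;
    }

    record Summary(int hoursWithData, int emptyHours) {}

    /** Walks the 24 hours of a day, skipping hours that have no data. */
    static Summary collect(HourFetcher fetcher) throws IOException {
        int withData = 0;
        int empty = 0;
        for (int hour = 0; hour < 24; hour++) {
            byte[] body = fetcher.fetch(hour);
            if (body == null) {
                empty++; // quiet market hour, not an error
                continue;
            }
            withData++; // decompression and parsing would happen here
        }
        return new Summary(withData, empty);
    }

    /** Turns the outcome into a message so both paths are easy to see. */
    static String describe(HourFetcher fetcher) {
        try {
            Summary s = collect(fetcher);
            return s.hoursWithData() + " with data, " + s.emptyHours() + " empty";
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }
}
```

Because a missing hour and a failed request take different paths, a weekend download reports empty hours instead of looking like an outage, and a real outage stops the run instead of producing a CSV with silent gaps.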
What Changed
Before the rewrite, a data download required:
- Python 3.10+ with struct and urllib
- Node.js 18+ with lzma-native
- A shell (bash on Linux, PowerShell on Windows, each with different PATH semantics)
- A 157-line Python script
After the rewrite:
- One Maven dependency: org.tukaani:xz
- 166 lines of Java 21
- Zero external runtimes: no Python, no Node.js, no shell
The CI pipeline lost its Python and Node setup steps. Developer onboarding lost the "install Python, install Node, pip install, npm install" checklist item. The download runs as a regular Maven exec:java command, no different from running a backtest.
The Testability Win
The old test topology was: Java test shells out to Python, Python calls HTTP, response parsed in Python, output read back in Java. Every layer added a failure point. The new test uses mock ticks injected directly into the parser, runs offline, and completes in milliseconds. The network call is tested separately via an integration test with a 5-second timeout and a reachability check. A simplified version of that integration test:
    @Test
    void downloadRange_downloadsAndParsesData(@TempDir Path dir) throws Exception {
        DukascopyDownloader downloader = new DukascopyDownloader();
        LocalDate testDate = LocalDate.of(2026, 6, 1);
        Path csvPath = downloader.downloadRange("eurusd", testDate, testDate, "H1", dir);
        assertTrue(Files.exists(csvPath));
        List<String> lines = Files.readAllLines(csvPath);
        assertTrue(lines.size() > 10); // 24 hours of H1 data
    }
Lessons
- The JDK standard library is underrated for data ingestion tasks. HttpClient + ByteBuffer + nio.FileChannel cover most of what you might otherwise reach for Apache Commons or a Python library to do.
- Binary format parsing is simpler than CSV parsing. No escaping, no quoting, no locale-sensitive decimal separators. Fixed-width fields at known offsets. The complexity is in the documentation gap, not the code.
- Removing a shell dependency is worth more than the code savings suggest. Every process boundary is a surface for environment mismatch, path resolution failure, and silent version drift. If a task can be done in the same language and process as the rest of the application, it should be.
- Test determinism follows from eliminating external processes. The mock-based test for the parser never flakes. The integration test for the network layer has a 5-second timeout and a quick reachability precheck. Both were impossible when the downloader depended on Python's urllib and the OS shell.
The monorepo still has Python scripts for other tasks. But the data ingestion pipeline, the critical path that every backtest depends on, now lives entirely inside the Java build. One less shell script, one less thing to break at 3 AM when a CI runner has a different Python version.