Bulk data processing and postgresql thingy
Challenges
- Volume (scalability): Handling millions of records across multiple datasets.
- Velocity: High write throughput optimized data fetching.
- Variety: Managing diverse data formats (JSON, CSV, XML) with pluggable pipelines
Hibernate 6+
- Native support for postgresql JSON / JSONB and direct mapping from column to a java String/POJO.
@Column
@JdbcTypeCode(SqlTypes.JSON)
private MyJsonData jsonData;
postgres advanced indexing optimized data retrieval
GIN
- Optimized for: JSONB filtering, fill-text search.
- Best use cases: product metadata, documents.
- Index size: Large.
GiST
- Optimized for: ranged-based queries.
- Best use cases: prices, time intervals, geometric searches.
- Index size: Medium.
BRIN
- Optimized for: bulk read of ordered data.
- Best use cases: time-series data, financial transactions.
- Index size: Small.
A double-edged sword: postgresql’s Multi-Version Concurrency Control (MVCC)
- Strengths:
- high concurrency
- non-blocking reads
- data integrity
- performance
- flaws
- table bloat
- autovaccum overhead
- transaction id (XID) wraparound
- write amplification
VACUUM to maintain postgresql performance
- WHY? Cleans up old versions of rows left by MVCC.
- WHAT? Reclaims disk space and prevents table bloat.
- HOW? Runs automatically in the background via Autovaccum / Manual
VACUUM
- Impact: ensures performance but requires careful tuning to avoid I/O overhead.
jdbc flow
- open connection
- prepare statement
- execute query
- handle results
- commit/rollback
- close connection
Batching with hibernate
- SessionFactory-level:
hibernate.jdbc.batch_size=5
- Session-level:
session.setJdbcBatchSize(10)
Use hibernate Stateless Session for faster batching operations, as a stateless session:
- doesn’t have first-level cache, nor does it interact with any second-level caches
- doesn’t implement transactional write-behind or automatic dirty checking, so all operations are executed immediately when the’re explicitly called
try (var statelessSession = sessionFactory.openStatelessSession()) {
statelessSession.setJdbcBatchSize(batchSize);
Transaction transaction = statelessSession.begingTransaction();
// ...
transaction.commit();
}
Warning
postgresql default fetch size is
0
! So it will try to fetch everything and put in memory!
hibernate traditional session:
- first-level cache (session-scoped)
- supports caching, dirty checking, lazy loading
- manages entity lifecycle
- best for interactive, transactional use cases