Bulk data processing and postgresql thingy

Challenges

  • Volume (scalability): Handling millions of records across multiple datasets.
  • Velocity: High write throughput optimized data fetching.
  • Variety: Managing diverse data formats (JSON, CSV, XML) with pluggable pipelines

Hibernate 6+

  • Native support for postgresql JSON / JSONB and direct mapping from column to a java String/POJO.
@Column
@JdbcTypeCode(SqlTypes.JSON)
private MyJsonData jsonData;

postgres advanced indexing optimized data retrieval

GIN

  • Optimized for: JSONB filtering, fill-text search.
  • Best use cases: product metadata, documents.
  • Index size: Large.

GiST

  • Optimized for: ranged-based queries.
  • Best use cases: prices, time intervals, geometric searches.
  • Index size: Medium.

BRIN

  • Optimized for: bulk read of ordered data.
  • Best use cases: time-series data, financial transactions.
  • Index size: Small.

A double-edged sword: postgresql’s Multi-Version Concurrency Control (MVCC)

  • Strengths:
    • high concurrency
    • non-blocking reads
    • data integrity
    • performance
  • flaws
    • table bloat
    • autovaccum overhead
    • transaction id (XID) wraparound
    • write amplification

VACUUM to maintain postgresql performance

  • WHY? Cleans up old versions of rows left by MVCC.
  • WHAT? Reclaims disk space and prevents table bloat.
  • HOW? Runs automatically in the background via Autovaccum / Manual VACUUM
  • Impact: ensures performance but requires careful tuning to avoid I/O overhead.

jdbc flow

  1. open connection
  2. prepare statement
  3. execute query
  4. handle results
  5. commit/rollback
  6. close connection

Batching with hibernate

  • SessionFactory-level:
hibernate.jdbc.batch_size=5
  • Session-level:
session.setJdbcBatchSize(10)

Use hibernate Stateless Session for faster batching operations, as a stateless session:

  • doesn’t have first-level cache, nor does it interact with any second-level caches
  • doesn’t implement transactional write-behind or automatic dirty checking, so all operations are executed immediately when the’re explicitly called
try (var statelessSession = sessionFactory.openStatelessSession()) {
  statelessSession.setJdbcBatchSize(batchSize);
  Transaction transaction = statelessSession.begingTransaction();
  // ...
  transaction.commit();
}

Warning

postgresql default fetch size is 0! So it will try to fetch everything and put in memory!

hibernate traditional session:

  • first-level cache (session-scoped)
  • supports caching, dirty checking, lazy loading
  • manages entity lifecycle
  • best for interactive, transactional use cases