hihala-dashboard/docs/superpowers/specs/2026-03-26-etl-pipeline-design.md
fahed 9c0ffa5721 docs: add ETL pipeline design spec
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 17:26:23 +03:00


ETL Pipeline: ERP → NocoDB Daily Sales

Goal

Replace the current client-side ERP fetching (which downloads hundreds of MBs of raw transactions to the browser) with a server-side ETL pipeline that aggregates ERP data into NocoDB. The dashboard reads pre-aggregated data from NocoDB — fast and lightweight.

Data Flow

```
Daily (2am cron):
  ERP API → Server (fetch + aggregate) → NocoDB "DailySales" table

On page load:
  NocoDB "DailySales" → Dashboard client (small payload, fast)
```

NocoDB "DailySales" Table

One row per date/museum/channel combination. Flat — no lookup tables needed.

| Column | Type | Example |
|---|---|---|
| Date | string | 2025-03-01 |
| MuseumName | string | Revelation Exhibition |
| Channel | string | HiHala Website/App |
| Visits | number | 702 |
| Tickets | number | 71 |
| GrossRevenue | number | 12049.00 |
| NetRevenue | number | 10477.40 |

Museums are derived from product descriptions using a priority-ordered keyword mapping (46 products → 6 museums). Channels are derived from OperatingAreaName with display labels (e.g. B2C → "HiHala Website/App").
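As an illustration, the priority-ordered mapping could look like the sketch below. The rule entries, museum names, and the "Unknown" fallback are placeholders, not the real 46-product mapping in museumMapping.ts:

```typescript
// Illustrative sketch of museumMapping.ts — entries are placeholders,
// not the real 46-product mapping.
interface MuseumRule {
  keyword: string; // matched case-insensitively against ProductDescription
  museum: string;
}

// Order matters: the first matching rule wins.
const MUSEUM_RULES: MuseumRule[] = [
  { keyword: "revelation", museum: "Revelation Exhibition" },
  // ... remaining rules in priority order
];

const CHANNEL_LABELS: Record<string, string> = {
  B2C: "HiHala Website/App",
  // ... other OperatingAreaName → display-label entries
};

export function getMuseumFromProduct(description: string): string {
  const lower = description.toLowerCase();
  const rule = MUSEUM_RULES.find(r => lower.includes(r.keyword));
  return rule ? rule.museum : "Unknown"; // fallback label is an assumption
}

export function getChannelLabel(operatingAreaName: string): string {
  // Unmapped areas pass through unchanged
  return CHANNEL_LABELS[operatingAreaName] ?? operatingAreaName;
}
```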

Server Architecture

New files

| File | Responsibility |
|---|---|
| server/src/config/museumMapping.ts | Product → museum mapping, channel labels (moved from client) |
| server/src/types.ts | Server-side ERP types (ERPSaleRecord, ERPProduct, ERPPayment, AggregatedRecord) |
| server/src/services/nocodbClient.ts | NocoDB table discovery (via process.env, NOT import.meta.env) + paginated read/write |
| server/src/services/etlSync.ts | Orchestrate: fetch ERP → aggregate → write NocoDB |
| server/src/routes/etl.ts | POST /api/etl/sync endpoint (protected by secret token) |

Modified files

| File | Change |
|---|---|
| server/src/config.ts | Add NocoDB config (process.env.NOCODB_*) |
| server/src/index.ts | Mount ETL route |
| server/.env | Add NOCODB_* and ETL_SECRET vars |
| server/.env.example | Add NOCODB_* and ETL_SECRET placeholders |
| src/services/dataService.ts | Revert to NocoDB fetch with paginated reads for DailySales |

Removed files

| File | Reason |
|---|---|
| server/src/routes/erp.ts | Client no longer calls ERP directly |
| src/services/erpService.ts | Client no longer aggregates transactions |
| src/config/museumMapping.ts | Moved to server |

ETL Sync Endpoint

```
POST /api/etl/sync?mode=full|incremental
Authorization: Bearer <ETL_SECRET>
```

Protected by a secret token (ETL_SECRET env var). Requests without a valid token get 401. The cron passes it: curl -H "Authorization: Bearer $ETL_SECRET" -X POST ....

  • incremental (default): fetch current month from ERP, aggregate, upsert into NocoDB. Used by daily cron.
  • full: fetch all months from 2024-01 to now, clear and replace all NocoDB DailySales data. Used for initial setup or recovery.
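The token check and mode selection can be sketched as pure helpers. The function names are illustrative; the actual route in server/src/routes/etl.ts wires this logic into the request handler:

```typescript
// Hypothetical helpers for the /api/etl/sync route — names are
// illustrative, not taken from the actual routes/etl.ts.
export function isAuthorized(
  authorizationHeader: string | undefined,
  etlSecret: string
): boolean {
  // Plain string compare for brevity; a timing-safe compare would be stricter.
  return authorizationHeader === `Bearer ${etlSecret}`;
}

export function resolveMode(modeParam: unknown): "full" | "incremental" {
  // incremental is the default; anything other than "full" falls back to it
  return modeParam === "full" ? "full" : "incremental";
}
```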

Incremental date range

The current month is defined as:

  • startDate: YYYY-MM-01T00:00:00 (first of current month)
  • endDate: YYYY-{MM+1}-01T00:00:00 (first of next month, exclusive)

This matches the convention already used in erpService.ts month boundary generation.
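A sketch of that computation (UTC is assumed; the actual erpService.ts convention may use a different timezone):

```typescript
// Sketch of the incremental date-range computation described above.
// Assumes UTC month boundaries.
export function currentMonthRange(now: Date): { startDate: string; endDate: string } {
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth(); // 0-based
  const fmt = (year: number, month0: number) =>
    `${year}-${String(month0 + 1).padStart(2, "0")}-01T00:00:00`;
  // endDate is the first of the next month, exclusive; December rolls into January
  return m === 11
    ? { startDate: fmt(y, 11), endDate: fmt(y + 1, 0) }
    : { startDate: fmt(y, m), endDate: fmt(y, m + 1) };
}
```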

Response:

```json
{
  "status": "ok",
  "mode": "incremental",
  "transactionsFetched": 12744,
  "recordsWritten": 342,
  "duration": "8.2s"
}
```

Aggregation Logic

For each ERP transaction:

  1. Extract date from TransactionDate (split on space, take first part)
  2. Map OperatingAreaName → channel label via getChannelLabel()
  3. For each product in Products[]:
    • Map ProductDescription → museum name via getMuseumFromProduct() (priority-ordered keyword matching)
    • Accumulate into composite key date|museum|channel:
      • visits += PeopleCount
      • tickets += UnitQuantity
      • GrossRevenue += TotalPrice
      • NetRevenue += TotalPrice - TaxAmount

Negative quantities (refunds) need no special handling: they simply subtract during accumulation.
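The per-transaction loop above can be sketched as follows. The interfaces mirror the server-side types declared later in this spec; the museum/channel helpers are inlined as trivial stand-ins so the sketch is self-contained:

```typescript
// Sketch of the aggregation loop. getMuseumFromProduct / getChannelLabel
// live in museumMapping.ts; identity stand-ins are used here.
interface ERPProduct {
  ProductDescription: string;
  UnitQuantity: number;
  PeopleCount: number;
  TaxAmount: number;
  TotalPrice: number;
}
interface ERPSaleRecord {
  TransactionDate: string; // e.g. "2025-03-01 14:03:22"
  OperatingAreaName: string;
  Products: ERPProduct[];
}
interface AggregatedRecord {
  Date: string;
  MuseumName: string;
  Channel: string;
  Visits: number;
  Tickets: number;
  GrossRevenue: number;
  NetRevenue: number;
}

const getMuseumFromProduct = (d: string) => d; // stand-in
const getChannelLabel = (a: string) => a;      // stand-in

export function aggregate(transactions: ERPSaleRecord[]): AggregatedRecord[] {
  const buckets = new Map<string, AggregatedRecord>();
  for (const tx of transactions) {
    const date = tx.TransactionDate.split(" ")[0]; // date part only
    const channel = getChannelLabel(tx.OperatingAreaName);
    for (const p of tx.Products) {
      const museum = getMuseumFromProduct(p.ProductDescription);
      const key = `${date}|${museum}|${channel}`;
      const rec = buckets.get(key) ?? {
        Date: date, MuseumName: museum, Channel: channel,
        Visits: 0, Tickets: 0, GrossRevenue: 0, NetRevenue: 0,
      };
      rec.Visits += p.PeopleCount;
      rec.Tickets += p.UnitQuantity; // negative quantities (refunds) subtract naturally
      rec.GrossRevenue += p.TotalPrice;
      rec.NetRevenue += p.TotalPrice - p.TaxAmount;
      buckets.set(key, rec);
    }
  }
  return [...buckets.values()];
}
```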

NocoDB Upsert Strategy

For incremental sync:

  1. Delete all rows in DailySales where Date falls within the fetched month range
  2. Insert the newly aggregated rows

For full sync:

  1. Delete all rows in DailySales
  2. Insert all aggregated rows

This avoids duplicate detection complexity — just replace the month's data.
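A minimal sketch of the month-replace step, with the NocoDB calls abstracted behind a small interface so the logic is testable. The interface and function names are illustrative; the real nocodbClient.ts would implement them against NocoDB's REST API:

```typescript
// Hypothetical abstraction over the DailySales table — nocodbClient.ts
// would implement this against the NocoDB REST API.
interface DailySalesStore {
  deleteWhere(predicate: (row: { Date: string }) => boolean): Promise<void>;
  insert(rows: object[]): Promise<void>;
}

export async function replaceMonth(
  store: DailySalesStore,
  rows: { Date: string }[],
  monthPrefix: string // e.g. "2026-03"
): Promise<void> {
  // 1. Delete every existing row for the fetched month
  await store.deleteWhere(row => row.Date.startsWith(monthPrefix));
  // 2. Insert the freshly aggregated rows
  await store.insert(rows);
}
```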

Race condition note

During the delete/insert window, dashboard reads may see incomplete data. Mitigations:

  • The sync runs at 2am when traffic is minimal
  • The client's localStorage cache (7-day TTL) means most page loads never hit NocoDB
  • The client checks if fetched data is suspiciously small (< 10 rows) and prefers cached data over a likely-incomplete NocoDB read
  • For full syncs, the window is larger (~2-5 minutes). If this becomes a problem, a shadow-table swap pattern can be added later.

Client Changes

dataService.ts

Revert to reading from NocoDB. The DailySales table is flat, so no joins needed. Must use paginated fetch (NocoDB defaults to 25 rows per page, max 1000). The existing fetchNocoDBTable() helper already handles pagination — reintroduce it.
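The pagination loop behind fetchNocoDBTable() can be sketched with the page fetcher injected; the 1000-row cap follows the description above, while the Page shape and offset-based paging are assumptions to keep the sketch self-contained:

```typescript
// Sketch of the pagination loop inside fetchNocoDBTable(). The page
// fetcher is injected so the loop is testable; the real helper calls
// the NocoDB records API with limit/offset paging.
type Page<T> = { list: T[]; isLastPage: boolean };

export async function fetchAllPages<T>(
  fetchPage: (offset: number, limit: number) => Promise<Page<T>>,
  limit = 1000 // NocoDB caps page size at 1000 rows
): Promise<T[]> {
  const rows: T[] = [];
  let offset = 0;
  for (;;) {
    const page = await fetchPage(offset, limit);
    rows.push(...page.list);
    if (page.isLastPage || page.list.length === 0) break;
    offset += page.list.length;
  }
  return rows;
}
```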

```typescript
async function fetchFromNocoDB(): Promise<MuseumRecord[]> {
  const tables = await discoverTableIds();
  const rows = await fetchNocoDBTable<NocoDBDailySale>(tables['DailySales']);
  return rows.map(row => ({
    date: row.Date,
    museum_name: row.MuseumName,
    channel: row.Channel,
    visits: row.Visits,
    tickets: row.Tickets,
    revenue_gross: row.GrossRevenue,
    revenue_net: row.NetRevenue,
    year: row.Date.substring(0, 4),
    quarter: computeQuarter(row.Date),
  }));
}
```
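computeQuarter is referenced but not defined in this spec; a plausible implementation for YYYY-MM-DD dates (the "Q1"…"Q4" return format is an assumption):

```typescript
// Hypothetical computeQuarter helper — the "Q1".."Q4" string format
// is an assumption, not confirmed by this spec.
export function computeQuarter(date: string): string {
  const month = Number(date.substring(5, 7)); // 1-12 from "YYYY-MM-DD"
  return `Q${Math.ceil(month / 3)}`;
}
```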

Add a NocoDBDailySale type to src/types/index.ts:

```typescript
export interface NocoDBDailySale {
  Id: number;
  Date: string;
  MuseumName: string;
  Channel: string;
  Visits: number;
  Tickets: number;
  GrossRevenue: number;
  NetRevenue: number;
}
```

No Districts, Museums, or DailyStats tables needed — just DailySales and PilgrimStats.

Suspicious data check

In fetchData(), if NocoDB returns fewer than 10 rows and a cache exists, prefer the cache:

```typescript
if (data.length < 10 && cached) {
  console.warn('NocoDB returned suspiciously few rows, using cache');
  return { data: cached.data, fromCache: true, cacheTimestamp: cached.timestamp };
}
```

Server Environment

Add to server/.env:

```
NOCODB_URL=http://localhost:8090
NOCODB_TOKEN=<token>
NOCODB_BASE_ID=<base_id>
ETL_SECRET=<random-secret-for-cron>
```

Note: Client .env.local retains its existing VITE_NOCODB_* vars — the client still reads NocoDB directly for both DailySales and PilgrimStats.

Update server/.env.example with the same keys (placeholder values).

Server-Side Types

ERP types are re-declared in server/src/types.ts (not imported from the client src/types/index.ts):

```typescript
export interface ERPProduct {
  ProductDescription: string;
  SiteDescription: string | null;
  UnitQuantity: number;
  PeopleCount: number;
  TaxAmount: number;
  TotalPrice: number;
}

export interface ERPSaleRecord {
  SaleId: number;
  TransactionDate: string;
  CustIdentification: string;
  OperatingAreaName: string;
  Payments: Array<{ PaymentMethodDescription: string }>;
  Products: ERPProduct[];
}

export interface AggregatedRecord {
  Date: string;
  MuseumName: string;
  Channel: string;
  Visits: number;
  Tickets: number;
  GrossRevenue: number;
  NetRevenue: number;
}
```

Cron

```
0 2 * * * curl -s -H "Authorization: Bearer $ETL_SECRET" -X POST http://localhost:3002/api/etl/sync
```

Runs daily at 2am. The incremental mode fetches only the current month (~15-25K transactions), aggregates server-side, and writes ~300-500 rows to NocoDB.

What's NOT Changing

  • PilgrimStats still fetched from NocoDB by the client (unchanged)
  • Client .env.local retains VITE_NOCODB_* vars (still needed for client reads)
  • All dashboard UI components (Dashboard, Comparison) stay as-is
  • Channel and museum filters stay as-is
  • Cache/offline fallback logic stays as-is (enhanced with suspicious-data check)
  • Dark mode, i18n, accessibility — all unchanged