Data Migration Using Hadoop

One of the most popular Hadoop use cases within banks is migrating trade data from traditional relational data sources to Hadoop. This is also known as online data archiving.

Transaction Data Archival

When any financial organization acquires a new data warehouse, the fundamental design is driven by one question: "What do we need to store, and for how long?"
Ask the business and the answer will be simple: everything, and forever. Regulatory requirements such as Sarbanes-Oxley stipulate that records be kept for at least five to seven years and that the data remain accessible within a reasonable amount of time. Today, that reasonable amount of time is no longer weeks, as it once was, but on the order of days, or even a single business day for certain types of data. In banks, data warehouses are mostly built on high-performance enterprise databases with very expensive storage, so they retain only recent data (say, the last year) at the detail level and summarize the remaining 5 to 10 years of transactions, positions, and events. Older data is moved to tape or optical media to save cost. The big problem is that this detailed data is then inaccessible unless it is restored to the database, which again costs time and money.

The storage problem is made worse by the fact that the trade, event, and position tables in data warehouses typically have hundreds of columns. On a day-to-day basis, the business is mostly interested in only 10 to 15 of those columns, and the rest are rarely queried. Yet the business still wants the flexibility to query the less frequently used columns when it needs to.

Solution

HDFS provides low-cost storage with almost unlimited scalability, which makes Hadoop an excellent solution for this use case. Historical data can be archived from expensive, high-performance databases to low-cost HDFS and still be processed there. Because Hadoop scales horizontally simply by adding more data nodes, the business can store as much as it likes. Archiving data on HDFS instead of tape or optical media keeps it quickly accessible, and it also provides the flexibility to store the less frequently used columns on HDFS.

Once data is on HDFS, it can be accessed using Hive or Pig queries. Developers can also write MapReduce jobs for slightly more complicated data access operations.
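
To make the second point concrete, the following is a minimal sketch, in Java, of the kind of map-only MapReduce job a developer might write against the archive. It assumes the archived trades are stored as pipe-delimited text files on HDFS with the counterparty in the fourth column; the class name, the filter.counterparty configuration key, and the file layout are illustrative assumptions rather than anything fixed by the archive itself.

    // Minimal map-only job: scan archived trade files on HDFS and keep only the
    // rows for a given counterparty. Input path, delimiter, and column position
    // are illustrative assumptions.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TradeFilterJob {

        public static class TradeFilterMapper
                extends Mapper<Object, Text, NullWritable, Text> {

            private String counterparty;

            @Override
            protected void setup(Context context) {
                // Counterparty to filter on, passed in through the job configuration.
                counterparty = context.getConfiguration().get("filter.counterparty");
            }

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assume pipe-delimited archive rows with the counterparty in column 3.
                String[] fields = value.toString().split("\\|");
                if (fields.length > 3 && fields[3].equals(counterparty)) {
                    context.write(NullWritable.get(), value);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("filter.counterparty", args[2]);

            Job job = Job.getInstance(conf, "archived trade filter");
            job.setJarByClass(TradeFilterJob.class);
            job.setMapperClass(TradeFilterMapper.class);
            job.setNumReduceTasks(0);                       // map-only filter
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /archive/trades
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // result directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Such a job would be packaged into a JAR and submitted with the standard hadoop jar command, passing the archive directory, an output directory, and the counterparty code as arguments; Hive or Pig remains the simpler choice whenever the access pattern can be expressed as a query.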

A low-cost online data warehouse archive is one of the simplest Hadoop projects for a bank to implement, and it delivers an almost immediate return on investment. Hadoop can be put to use in either of the following ways:

  • Option 1: While loading transactions from the source systems into the relational data warehouse, load the frequently used columns into the data warehouse and all columns into HDFS
  • Option 2: Migrate all transactions that are older than a year from the relational data warehouse to HDFS (a minimal sketch of this approach follows the list)
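
For Option 2, teams would typically use an ingestion tool such as Apache Sqoop, but the idea can be sketched in plain Java: read the detail rows older than a year over JDBC and write them to a file on HDFS using the Hadoop FileSystem API. The table and column names, the JDBC URL, the credentials, and the target path below are all illustrative assumptions.

    // Hand-rolled sketch of Option 2: copy transactions older than a year from a
    // relational warehouse table to a pipe-delimited file on HDFS.
    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.time.LocalDate;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TradeArchiver {

        public static void main(String[] args) throws Exception {
            // HDFS target for the archived rows (assumed layout).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/archive/trades/older_than_one_year.psv");

            // Everything older than one year is a candidate for archiving.
            Date cutoff = Date.valueOf(LocalDate.now().minusYears(1));

            try (Connection db = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//warehouse-host:1521/DWH", "etl_user", "etl_pwd");
                 PreparedStatement stmt = db.prepareStatement(
                         "SELECT trade_id, trade_date, counterparty, notional, currency "
                       + "FROM trades WHERE trade_date < ?")) {
                stmt.setDate(1, cutoff);

                try (ResultSet rs = stmt.executeQuery();
                     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                             fs.create(target, true), StandardCharsets.UTF_8))) {
                    while (rs.next()) {
                        // Write pipe-delimited rows, matching the assumed archive layout.
                        out.write(rs.getString("trade_id") + "|" + rs.getDate("trade_date") + "|"
                                + rs.getString("counterparty") + "|" + rs.getBigDecimal("notional")
                                + "|" + rs.getString("currency"));
                        out.newLine();
                    }
                }
            }
        }
    }

Once the copy has been verified, the corresponding rows can be purged from the warehouse, leaving only the recent, frequently queried data on the expensive storage.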
The target world
