Data Migration Using Hadoop

One of the most popular use cases within banks is migrating trade data from traditional relational data sources to Hadoop. This is also known as online data archiving.

Transaction data archival

When we build a new data warehouse in any financial organization, the fundamental design is driven by one question: “What do we need to store, and for how long?”
Ask businesses this question and their answer will be simple: everything, and forever. Even regulatory requirements such as Sarbanes-Oxley stipulate that records must be kept for at least five to seven years and remain accessible within a reasonable amount of time. Today, that reasonable amount of time is not weeks, as it once was, but days, or even one business day for certain types of data. In banks, data warehouses are mostly built on high-performance enterprise databases with very expensive storage, so they retain only recent data (say, the last year) at the detailed level and summarize the remaining 5 to 10 years of transactions, positions, and events. Older data is moved to tapes or optical media to save cost. The big problem, however, is that the detailed data is not accessible unless it is restored back to the database, which again costs time and money.

The storage problem is made worse by the fact that trade, event, and position tables in data warehouses generally have hundreds of columns. In practice, the business is mostly interested in 10 to 15 of these columns on a day-to-day basis, and all the other columns are rarely queried. But the business still wants the flexibility to query the less frequently used columns if it needs to.

Solution

Hadoop HDFS is low-cost storage with almost unlimited scalability, which makes it an excellent solution for this use case. Historical data can be archived from expensive high-performance databases to low-cost HDFS and still be processed. Because Hadoop scales horizontally simply by adding more data nodes, the business can store as much as it likes. Archiving data on HDFS instead of tapes or optical media keeps the data quickly accessible, and it also offers the flexibility to store the less frequently used columns on HDFS.

Once data is on HDFS, it can be accessed using Hive or Pig queries. Developers can also write MapReduce jobs for slightly more complicated data access operations.
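
For example, a minimal sketch of querying an archived table from Java through the HiveServer2 JDBC driver is shown below; the host name, credentials, table, and column names are illustrative assumptions, not part of any particular installation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ArchiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (ships with Hive); host, user, and table are hypothetical
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // Query detailed trades archived on HDFS; this runs as a batch job, not in real time
            ResultSet rs = stmt.executeQuery(
                    "SELECT trade_id, trade_date, notional " +
                    "FROM trade_archive WHERE trade_date >= '2012-01-01'");
            while (rs.next()) {
                System.out.printf("%s %s %s%n",
                        rs.getString(1), rs.getString(2), rs.getString(3));
            }
        }
    }
}
```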

A low-cost online archive for the data warehouse is one of the simplest Hadoop projects for banks to implement and has an almost immediate return on investment. The following are the different ways Hadoop can be used:

  • Option 1: While loading transactions from source systems into a relational data warehouse, load frequently used columns into the data warehouse and all columns into HDFS
  • Option 2: Migrate all transactions that are older than a year from the relational data warehouse to HDFS (a minimal sketch of this option follows the list)
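
As a rough sketch of Option 2, the following Java program reads transactions older than a year from the warehouse over JDBC and writes them as delimited text to HDFS using the Hadoop FileSystem API. The warehouse URL, table, columns, and HDFS paths are all illustrative assumptions, and in practice a bulk-transfer tool such as Sqoop is often used instead.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseToHdfsArchiver {
    public static void main(String[] args) throws Exception {
        // Source warehouse connection; the JDBC URL, table, and columns are hypothetical
        Connection dwh = DriverManager.getConnection(
                "jdbc:oracle:thin:@dwh-host:1521/DWH", "archiver", "secret");

        // Target HDFS file; the namenode address and path are hypothetical
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/archive/trades/older_than_one_year.csv");

        try (Statement stmt = dwh.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT trade_id, trade_date, notional, currency FROM trades " +
                     "WHERE trade_date < ADD_MONTHS(SYSDATE, -12)");
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(fs.create(target, true)))) {

            // Write each old transaction as a delimited line on HDFS
            while (rs.next()) {
                out.write(String.join(",",
                        rs.getString("trade_id"),
                        rs.getString("trade_date"),
                        rs.getString("notional"),
                        rs.getString("currency")));
                out.newLine();
            }
        }
        // Once verified on HDFS, the same rows would be deleted from the warehouse table
    }
}
```

Once the files are on HDFS, they can be exposed to Hive as an external table so that the query-based access described above keeps working against the archive.
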
The target world

Big data use cases in the financial sector

Financial organizations have been actively using big data platforms for the last few years and their key objectives are:

  • Complying with regulatory requirements
  • Better risk analytics
  • Understanding customer behavior and improving services
  • Understanding transaction patterns and monetizing using cross-selling of products

Data archival on HDFS

Archiving data on HDFS is one of the basic use cases for Hadoop in financial organizations and is a quick win. It is likely to provide a very high return on investment. The data is archived on Hadoop and is still available to query (although not in real time), which is far more efficient than archiving on tape and far less expensive than keeping it on databases. Some of the use cases are:

  • Migrate expensive and inefficient legacy mainframe data and load jobs to the Hadoop platform
  • Migrate expensive older transaction data from high-end expensive databases to Hadoop HDFS
  • Migrate unstructured legal, compliance, and onboarding documents to Hadoop HDFS

Regulatory

Financial organizations must comply with regulatory requirements. In order to meet these requirements, the use of traditional data processing platforms is becoming increasingly expensive and unsustainable.

A couple of such use cases are:

  • Checking customer names against a sanctions blacklist is very complicated because of identical or similar names. It is even more complicated when financial organizations hold different names or aliases for the same customer across different systems. With Hadoop, we can apply complex fuzzy matching on name and contact information across massive data sets at a much lower cost (a minimal matching sketch follows this list).
  • The BCBS 239 regulation states that financial organizations must be able to aggregate risk exposures across the whole group quickly and accurately. With Hadoop, financial organizations can consolidate and aggregate data on a single platform in an efficient and cost-effective way.
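
As an illustration of the name-screening point above, here is a minimal sketch of a normalized edit-distance check of the kind a matching job might apply to each candidate pair; the names and the 0.8 threshold are illustrative assumptions, and production systems typically use far more sophisticated matching.

```java
public class NameScreening {

    // Classic dynamic-programming Levenshtein edit distance
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical after normalization
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase();
        String y = b.trim().toLowerCase();
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / max;
    }

    public static void main(String[] args) {
        // Hypothetical customer name versus a sanctions-list entry
        String customer = "Jon Smith";
        String listed = "John Smith";
        // Flag for manual review above an illustrative threshold
        if (similarity(customer, listed) > 0.8) {
            System.out.println("Potential match - route for review: " + customer);
        }
    }
}
```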

Fraud detection

Fraud is estimated to cost the financial industry billions of US dollars per year. Financial organizations have invested in Hadoop platforms to identify fraudulent transactions by picking up unusual behavior patterns.

Complex algorithms that need to run on large volumes of transaction data to identify outliers are now feasible on the Hadoop platform at a much lower cost.
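
As a minimal sketch of such an outlier check, the snippet below flags a transaction amount that is more than three standard deviations above a customer's historical mean; the amounts and the three-sigma rule are illustrative assumptions, and in a real deployment the per-customer statistics would be computed across the cluster.

```java
import java.util.List;

public class OutlierCheck {

    // Flag an amount that sits more than three standard deviations above the historical mean
    static boolean isOutlier(double amount, List<Double> history) {
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = history.stream()
                .mapToDouble(x -> (x - mean) * (x - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        return stdDev > 0 && amount > mean + 3 * stdDev;
    }

    public static void main(String[] args) {
        // Hypothetical history of card transaction amounts for one customer
        List<Double> history = List.of(25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 27.0);
        double incoming = 950.0;
        if (isOutlier(incoming, history)) {
            System.out.println("Suspicious transaction amount: " + incoming);
        }
    }
}
```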

Tick data

Stock market tick data is generated in real time and on a massive scale. Live data streams can be processed using real-time streaming technology on the Hadoop infrastructure for quick trading decisions, while older tick data can be used for trending and forecasting using batch Hadoop tools.
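
As a small illustration of the streaming side, the sketch below maintains a rolling average over the most recent ticks for one symbol, the kind of per-symbol state a streaming job would keep; the prices and window size are illustrative assumptions, and the choice of streaming engine is left open.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RollingTickAverage {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    RollingTickAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    // Consume one tick price and return the current rolling average
    double onTick(double price) {
        window.addLast(price);
        sum += price;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        RollingTickAverage avg = new RollingTickAverage(3);
        // Hypothetical tick prices for a single symbol
        for (double price : new double[]{101.2, 101.4, 101.1, 101.9, 102.3}) {
            System.out.printf("price=%.2f rollingAvg=%.2f%n", price, avg.onTick(price));
        }
    }
}
```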

Risk management

Financial organizations must be able to measure the risk exposure of each customer and effectively aggregate it across entire business divisions. They should be able to score the credit risk of each customer using internal rules, and they need to build risk models that run intensive calculations on the underlying massive data.

All these risk management requirements have two things in common—massive data and intensive calculation. Hadoop can handle both, given its inexpensive commodity hardware and parallel execution of jobs.
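
To make the parallel-execution point concrete, the following is a minimal MapReduce sketch that sums risk exposure per customer from comma-delimited input records; the input layout and job arguments are illustrative assumptions rather than a prescribed format.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExposureAggregation {

    // Emits (customerId, exposure) for each input record "customerId,exposure,..."
    public static class ExposureMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }
    }

    // Sums all exposures seen for one customer
    public static class ExposureReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "exposure aggregation");
        job.setJarByClass(ExposureAggregation.class);
        job.setMapperClass(ExposureMapper.class);
        job.setReducerClass(ExposureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```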

Customer behavior prediction

Once the customer data has been consolidated from a variety of sources on a Hadoop platform, it is possible to analyze data and:

  • Predict mortgage defaults
  • Predict spending for retail customers
  • Analyze patterns that lead to customer attrition and dissatisfaction

Sentiment analysis – unstructured

Sentiment analysis is one of the best use cases to test the power of unstructured data analysis using Hadoop. Here are a few use cases:

  • Analyze all e-mail text and call recordings from customers, which indicate whether they feel positive or negative about the products offered to them (a toy scoring sketch follows this list)
  • Analyze Facebook and Twitter comments to make buy or sell recommendations, that is, assess market sentiment on which sectors or organizations would be a better buy for stock investments
  • Analyze Facebook and Twitter comments to assess the feedback on new products
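
As a toy illustration of the first bullet, the snippet below scores a piece of customer text against small positive and negative word lists; the lexicons and the sample e-mail text are illustrative assumptions, and real sentiment pipelines would use proper NLP tooling over the full corpus on the cluster.

```java
import java.util.Set;

public class SentimentScore {

    // Tiny illustrative lexicons; a real system would use far richer ones
    private static final Set<String> POSITIVE =
            Set.of("great", "happy", "excellent", "helpful", "satisfied");
    private static final Set<String> NEGATIVE =
            Set.of("poor", "unhappy", "terrible", "slow", "complaint");

    // Positive hits minus negative hits over the words in the text
    static int score(String text) {
        int score = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) score++;
            if (NEGATIVE.contains(word)) score--;
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical customer e-mail snippet
        String email = "The new mobile app is excellent and the staff were very helpful.";
        int s = score(email);
        System.out.println(s > 0 ? "positive" : s < 0 ? "negative" : "neutral");
    }
}
```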

Data lake

Data-driven decision making is changing how we work and live. From data science, machine learning, and advanced analytics to real-time dashboards, decision makers are demanding data to help them make decisions. With so much variety, volume, and velocity, the old systems and processes are no longer able to support the data needs of the enterprise.

To support these endeavors and address these challenges, a revolution is occurring in data management around how data is stored, processed, managed, and provided to decision makers. Big data technology is enabling scalability and cost efficiency orders of magnitude greater than what’s possible with traditional data management infrastructure.

The data lake is a daring new approach that harnesses the power of big data technology and marries it with the agility of self-service. Most large enterprises today either have deployed or are in the process of deploying data lakes. The term was coined and first described by James Dixon, CTO of Pentaho: “If you think of a datamart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”