Data-intensive computing

Content sourced from Wikipedia, licensed under CC BY-SA 3.0.

Data-intensive computing refers to programs that process very large amounts of data by using many computers at once. These programs spend most of their time moving, organizing, and analyzing data rather than performing heavy calculations. The explosion of the internet and business data has created terabytes or even petabytes of information, much of it unstructured, which must be searched, analyzed, and visualized quickly.

To handle this volume, data is split into pieces and processed in parallel on many machines. This data-parallel approach scales with the amount of data. It differs from compute-intensive tasks, where the bottleneck is CPU time and data size is smaller. Data-intensive work is often I/O bound and can be solved with clusters of machines, data grids, or cloud systems.

One popular model for data-intensive processing is MapReduce. You write small programs that map data to key-value pairs and then reduce values by key to produce results. The system handles partitioning, scheduling, and inter-node communication, so developers don’t need deep parallel-programming skills. Hadoop is a leading open-source implementation of MapReduce and includes a distributed file system (HDFS) plus tools like HBase, Hive, Pig, and Chukwa to support different data tasks.

Other platforms also support data-intensive work. LexisNexis developed HPCC, using the ECL language and two clusters: Thor for batch-style data processing and Roxie for fast online queries. Both run on commodity hardware and are designed for scalable performance on large data tasks.

Data-intensive computing faces challenges such as keeping up with growing data volumes, reducing analysis time, and creating algorithms that scale. Its goal is to turn massive data into useful information quickly, enabling better search, data mining, and visualization.

This page was last edited on 2 February 2026, at 15:09 (CET).