{"id":3275,"date":"2025-04-23T07:02:30","date_gmt":"2025-04-23T07:02:30","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/23\/mapreduce-how-it-powers-scalable-data-processing\/"},"modified":"2025-04-23T07:02:30","modified_gmt":"2025-04-23T07:02:30","slug":"mapreduce-how-it-powers-scalable-data-processing","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/23\/mapreduce-how-it-powers-scalable-data-processing\/","title":{"rendered":"MapReduce: How It Powers Scalable Data Processing"},"content":{"rendered":"<p>    MapReduce: How It Powers Scalable Data Processing<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\"><mdspan datatext=\"el1745350024281\" class=\"mdspan-comment\">In this article<\/mdspan>, I\u2019ll give a brief introduction to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/MapReduce\" target=\"_blank\" rel=\"noreferrer noopener\">MapReduce<\/a> programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational task.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Contents:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#terminology\">Terminology &amp; Useful Background<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#what-is-mapreduce\">What is MapReduce?<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#motivation\">Motivation &amp; Simple Example<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#mapreduce-walkthrough\">MapReduce Walkthrough<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#mapreduce-in-code\">Expressing a MapReduce Job in Code<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#contributions-current-state\">MapReduce Contributions &amp; Current State<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#wrap-up\">Wrap-up<\/a><\/strong><\/li>\n<li class=\"wp-block-list-item\"><strong><a href=\"https:\/\/towardsdatascience.com\/#sources\">Sources<\/a><\/strong><\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"terminology\">Terminology &amp; Useful Background:<\/h3>\n<p class=\"wp-block-paragraph\">Below are some terms\/concepts that may be useful to know before reading the rest of this article.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Distributed_computing\" target=\"_blank\" rel=\"noreferrer noopener\">Distributed computing fundamentals<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/web.mit.edu\/6.005\/www\/fa15\/classes\/25-map-filter-reduce\/\" target=\"_blank\" rel=\"noreferrer noopener\">Map &amp; reduce operations<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Name%E2%80%93value_pair\" target=\"_blank\" rel=\"noreferrer noopener\">Key-Value data representation<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Associative_array\" target=\"_blank\" rel=\"noreferrer noopener\">Map data structure<\/a><\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"what-is-mapreduce\">What is MapReduce?<\/h3>\n<p class=\"wp-block-paragraph\">Introduced by a couple of developers at Google in the early 2000s, MapReduce is a programming model that enables large-scale data processing to be carried out in a parallel and distributed manner across a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Computer_cluster\" rel=\"noreferrer noopener\" target=\"_blank\">compute cluster<\/a> consisting of many <a href=\"https:\/\/www.techtarget.com\/whatis\/definition\/commodity-hardware#:~:text=Commodity%20hardware%20in%20computing%20is,all%20PCs%20use%20commodity%20hardware.\" rel=\"noreferrer noopener\" target=\"_blank\">commodity machines<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The MapReduce programming model is ideal for optimizing compute tasks that can be broken down into independent transformations on distinct partitions of the input data. These transformations are typically followed by <a href=\"https:\/\/pandas.pydata.org\/docs\/user_guide\/groupby.html\" rel=\"noreferrer noopener\" target=\"_blank\">grouped aggregation<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The programming model breaks up the computation into the following two primitives:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Map<\/strong>: given a partition of the input data to process, parse the input data for each of its individual records. For each record, apply some user-defined data transformation to extract a set of intermediate key-value pairs.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Reduce<\/strong>: for each distinct key in the set of intermediate key-value pairs, aggregate the values in some manner to produce a smaller set of key-value pairs. Typically, the output of the reduce phase is a single key-value pair for each distinct key.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">In this MapReduce framework, computation is distributed among a compute cluster of <em>N<\/em> machines with homogenous commodity hardware, where <em>N<\/em> may be in the hundreds or thousands, in practice. One of these machines is designated as the <strong>master<\/strong>, and all the other machines are designated as <strong>workers<\/strong>.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Master<\/strong>: handles task scheduling by assigning map and reduce tasks to available workers.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Worker<\/strong>: handle the map and reduce tasks it\u2019s assigned by the master.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"566\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/mr_cluster_setup-1024x566.jpg?resize=1024%2C566&#038;ssl=1\" alt=\"\" class=\"wp-image-602021\"><figcaption class=\"wp-element-caption\">MapReduce cluster setup. Solid arrows symbolize a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Fork_(system_call)#\" target=\"_blank\" rel=\"noreferrer noopener\">fork()<\/a>, and the dashed arrows symbolize task assignment.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Each of the tasks within the map or reduce phase may be executed in a parallel and distributed manner across the available workers in the compute cluster. However, the map and reduce phases are executed <em>sequentially<\/em>\u200a\u2014\u200athat is, all map tasks must complete before kicking off the reduce phase.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"345\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/mr_dataflow-1024x345.jpg?resize=1024%2C345&#038;ssl=1\" alt=\"\" class=\"wp-image-602022\"><figcaption class=\"wp-element-caption\">Rough dataflow of the execution process for a single MapReduce job.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">That all probably sounds pretty abstract, so let\u2019s go through some motivation and a concrete example of how the MapReduce framework can be applied to optimize common data processing tasks.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"motivation\">Motivation &amp; Simple Example<\/h3>\n<p class=\"wp-block-paragraph\" id=\"f56c\">The MapReduce programming model is typically best for large\u00a0<a href=\"https:\/\/aws.amazon.com\/what-is\/batch-processing\/#:~:text=Batch%20processing%20is%20the%20method,run%20on%20individual%20data%20transactions.\" rel=\"noreferrer noopener\" target=\"_blank\">batch processing<\/a>\u00a0tasks that require executing independent data transformations on distinct groups of the input data, where each group is typically identified by a unique value of a keyed attribute.<\/p>\n<p class=\"wp-block-paragraph\" id=\"cb61\">You can think of this framework as an extension to the\u00a0<a href=\"https:\/\/www.jstatsoft.org\/article\/view\/v040i01\" rel=\"noreferrer noopener\" target=\"_blank\">split-apply-combine<\/a>\u00a0pattern in the context of data analysis, where map encapsulates the split-apply logic and reduce corresponds with the combine. The critical difference is that MapReduce can be applied to achieve parallel and distributed implementations for generic computational tasks outside of data wrangling and statistical computing.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c03a\">One of the motivating data processing tasks that inspired Google to create the MapReduce framework was to build\u00a0<a href=\"https:\/\/www.cockroachlabs.com\/blog\/inverted-indexes\/\" rel=\"noreferrer noopener\" target=\"_blank\">indexes<\/a>\u00a0for its search engine.<\/p>\n<p class=\"wp-block-paragraph\" id=\"77af\">We can express this task as a MapReduce job using the following logic:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Divide the corpus to search through into separate partitions\/documents.<\/li>\n<li class=\"wp-block-list-item\">Define a\u00a0<strong>map()<\/strong>\u00a0function to apply to each document of the corpus, which will emit &lt;word, documentID&gt; pairs for every word that is parsed in the partition.<\/li>\n<li class=\"wp-block-list-item\">For each distinct key in the set of intermediate &lt;word, documentID&gt; pairs produced by the mappers, apply a user-defined\u00a0<strong>reduce()\u00a0<\/strong>function that will combine the document IDs associated with each word to produce &lt;word, list(documentIDs)&gt; pairs.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"600\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/mr_inverted_index-1024x600.jpg?resize=1024%2C600&#038;ssl=1\" alt=\"\" class=\"wp-image-602023\"><figcaption class=\"wp-element-caption\">MapReduce workflow for constructing an inverted index.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">For additional examples of data processing tasks that fit well with the MapReduce framework, check out\u00a0<a href=\"https:\/\/research.google\/pubs\/mapreduce-simplified-data-processing-on-large-clusters\/\" rel=\"noreferrer noopener\" target=\"_blank\">the original paper<\/a>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"mapreduce-walkthrough\">MapReduce Walkthrough<\/h3>\n<p class=\"wp-block-paragraph\" id=\"d056\">There are numerous other great resources that walkthrough how the MapReduce algorithm works. However, I don\u2019t feel that this article would be complete without one. Of course, refer to the\u00a0<a href=\"https:\/\/research.google\/pubs\/mapreduce-simplified-data-processing-on-large-clusters\/\" rel=\"noreferrer noopener\" target=\"_blank\">original paper<\/a>\u00a0for the \u201csource of truth\u201d of how the algorithm works.<\/p>\n<p class=\"wp-block-paragraph\" id=\"65d6\">First, some basic configuration is required to prepare for execution of a MapReduce job.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Implement\u00a0<strong>map()<\/strong>\u00a0and\u00a0<strong>reduce()<\/strong>\u00a0to handle the data transformation and aggregation logic specific to the computational task.<\/li>\n<li class=\"wp-block-list-item\">Configure the block size of the input partition passed to each map task. The MapReduce library will then establish the number of map tasks accordingly,\u00a0<em>M<\/em>, that will be created and executed.<\/li>\n<li class=\"wp-block-list-item\">Configure the number of reduce tasks,\u00a0<em>R<\/em>, that will be executed. Additionally, the user may specify a deterministic partitioning function to specify how key-value pairs are assigned to partitions. In practice, this partitioning function is typically a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hash_function#:~:text=A%20hash%20function%20is%20any,%29%20digests%2C%20or%20simply%20hashes.\" target=\"_blank\" rel=\"noreferrer noopener\">hash<\/a>\u00a0of the key (i.e. hash(key) mod\u00a0<em>R<\/em>).<\/li>\n<li class=\"wp-block-list-item\">Typically, it\u2019s desirable to have\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Granularity_%28parallel_computing%29#Fine-grained_parallelism\" target=\"_blank\" rel=\"noreferrer noopener\">fine task granularity<\/a>. In other words,\u00a0<em>M<\/em>\u00a0and\u00a0<em>R<\/em>\u00a0should be much larger than the number of machines in the compute cluster. Since the master node in a MapReduce cluster assigns tasks to workers based on availability, partitioning the processing workload into many tasks decreases the chances that any single worker node will be overloaded.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"761\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/mr_job_execution-1024x761.jpg?resize=1024%2C761&#038;ssl=1\" alt=\"\" class=\"wp-image-602024\"><figcaption class=\"wp-element-caption\">MapReduce Job Execution (M = 6, R = 2).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"fa4d\">Once the required configuration steps are completed, the MapReduce job can be executed. The execution process of a MapReduce job can be broken down into the following steps:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Partition the input data into\u00a0<em>M<\/em>\u00a0partitions, where each partition is associated with a map worker.<\/li>\n<li class=\"wp-block-list-item\">Each map worker applies the user-defined\u00a0<strong>map()<\/strong>\u00a0function to its partition of the data. The execution of each of these\u00a0<strong>map()<\/strong>\u00a0functions on each map worker may be carried out in parallel. The\u00a0<strong>map()<\/strong>\u00a0function will parse the input records from its data partition and extract all key-value pairs from each input record.<\/li>\n<li class=\"wp-block-list-item\">The map worker will sort these key-value pairs in increasing key order. Optionally, if there are multiple key-value pairs for a single key, the values for the key may be\u00a0<a href=\"https:\/\/data-flair.training\/blogs\/hadoop-combiner-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">combined<\/a>\u00a0into a single key-value pair, if desired.<\/li>\n<li class=\"wp-block-list-item\">These key-value pairs are then written to\u00a0<em>R<\/em>\u00a0separate files stored on the local disk of the worker. Each file corresponds to a single reduce task. The locations of these files are registered with the master.<\/li>\n<li class=\"wp-block-list-item\">When all the map tasks have finished, the master notifies the reducer workers the locations of the intermediate files associated with the reduce task.<\/li>\n<li class=\"wp-block-list-item\">Each reduce task uses\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Remote_procedure_call\" target=\"_blank\" rel=\"noreferrer noopener\">remote procedure calls<\/a>\u00a0to read the intermediate files associated with the task stored on the local disks of the mapper workers.<\/li>\n<li class=\"wp-block-list-item\">The reduce task then iterates over each of the keys in the intermediate output, and then applies the user-defined\u00a0<strong>reduce()<\/strong>\u00a0function to each distinct key in the intermediate output, along with its associated set of values.<\/li>\n<li class=\"wp-block-list-item\">Once all the reduce workers have completed, the master worker notifies the user program that the MapReduce job is complete. The output of the MapReduce job will be available in the\u00a0<em>R<\/em>\u00a0output files stored in the distributed file system. The users may access these files directly, or pass them as input files to another MapReduce job for further processing.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"mapreduce-in-code\">Expressing a MapReduce Job in Code<\/h3>\n<p class=\"wp-block-paragraph\" id=\"8bfa\">Now let\u2019s look at how we can use the MapReduce framework to optimize a common data engineering workload\u2014 cleaning\/standardizing large amounts of raw data, or the transform stage of a typical\u00a0<a href=\"https:\/\/www.bigdataframework.org\/knowledge\/etl-in-data-engineering\/#:~:text=in%20Data%20Engineering-,Introduction%20to%20ETL,Adding%20additional%20data%20or%20context.\" rel=\"noreferrer noopener\" target=\"_blank\">ETL workflow<\/a>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a34e\">Suppose that we are in charge of managing data related to a user registration system. Our data schema may contain the following information:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Name of user<\/li>\n<li class=\"wp-block-list-item\">Date they joined<\/li>\n<li class=\"wp-block-list-item\">State of residence<\/li>\n<li class=\"wp-block-list-item\">Email address<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"42a7\">A sample dump of raw data may look like this:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">John Doe , 04\/09\/25, il, john@gmail.com\n jane SMITH, 2025\/04\/08, CA, jane.smith@yahoo.com\n JOHN  DOE, 2025-04-09, IL, john@gmail.com\n Mary  Jane, 09-04-2025, Ny, maryj@hotmail.com\n    Alice Walker, 2025.04.07, tx, alicew@outlook.com\n   Bob Stone  , 04\/08\/2025, CA, bobstone@aol.com\n BOB  STONE , 2025\/04\/08, CA, bobstone@aol.com<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"e1e7\">Before making this data accessible for analysis, we probably want to transform the data to a clean, standard format.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ec3b\">We\u2019ll want to fix the following:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Names and states have inconsistent case.<\/li>\n<li class=\"wp-block-list-item\">Dates vary in format.<\/li>\n<li class=\"wp-block-list-item\">Some fields contain redundant whitespace.<\/li>\n<li class=\"wp-block-list-item\">There are duplicate entries for certain users (ex: John Doe, Bob Stone).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"05e0\">We may want the final output to look like this.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">alice walker,2025-04-07,TX,alicew@outlook.com\nbob stone,2025-04-08,CA,bobstone@aol.com\njane smith,2025-04-08,CA,jane.smith@yahoo.com\njohn doe,2025-09-04,IL,john@gmail.com\nmary jane,2025-09-04,NY,maryj@hotmail.net<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"406e\">The data transformations we want to carry out are straightforward, and we could write a simple program that parses the raw data and applies the desired transformation steps to each individual line in a serial manner. However, if we\u2019re dealing with millions or billions of records, this approach may be quite time consuming.<\/p>\n<p class=\"wp-block-paragraph\" id=\"ca57\">Instead, we can use the MapReduce model to apply our data transformations to distinct partitions of the raw data, and then \u201caggregate\u201d these transformed outputs by discarding any duplicate entries that appear in the intermediate result.<\/p>\n<p class=\"wp-block-paragraph\" id=\"710b\">There are many libraries\/frameworks available for expressing programs as MapReduce jobs. For our example, we\u2019ll use the\u00a0<a href=\"https:\/\/mrjob.readthedocs.io\/en\/latest\/\" rel=\"noreferrer noopener\" target=\"_blank\">mrjob<\/a>\u00a0library to express our data transformation program as a MapReduce job in python.<\/p>\n<p class=\"wp-block-paragraph\" id=\"839a\">mrjob simplifies the process of writing MapReduce as the developer simply needs to provide implementations for the mapper and reducer logic in a single python class. Although it\u2019s no longer under active development and may not achieve the same level of performance as other options that allow deployment of jobs on Hadoop (as its a python wrapper around the Hadoop API), it\u2019s a great way for anybody familiar with python to start learning how to write MapReduce jobs and recognizing how to break up computation into map and reduce tasks.<\/p>\n<p class=\"wp-block-paragraph\" id=\"7a49\">Using mrjob, we can write a simple MapReduce job by subclassing the MRJob class and overriding the\u00a0<strong>mapper()<\/strong>\u00a0and\u00a0<strong>reducer()<\/strong>\u00a0methods.<\/p>\n<p class=\"wp-block-paragraph\" id=\"19d8\">Our\u00a0<strong>mapper()<\/strong>\u00a0will contain the data transformation\/cleaning logic we want to apply to each record of input:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Standardize names and states to lowercase and uppercase, respectively.<\/li>\n<li class=\"wp-block-list-item\">Standardize dates to %Y-%m-%d format.<\/li>\n<li class=\"wp-block-list-item\">Strip unnecessary whitespace around fields.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"26a4\">After applying these data transformations to each record, it\u2019s possible that we may end up with duplicate entries for some users. Our\u00a0<strong>reducer()<\/strong>\u00a0implementation will eliminate such duplicate entries that appear.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from mrjob.job import MRJob\nfrom mrjob.step import MRStep\nfrom datetime import datetime\nimport csv\nimport re\n\nclass UserDataCleaner(MRJob):\n\n   def mapper(self, _, line):\n       \"\"\"\n       Given a record of input data (i.e. a line of csv input),\n       parse the record for &lt;Name, (Date, State, Email)&gt; pairs and emit them.\n       \n       If this function is not implemented,\n       by default, &lt;None, line&gt; will be emitted.\n       \"\"\"\n       try:\n           row = next(csv.reader([line])) # returns row contents as a list of strings (\",\" delimited by default)\n           \n           # if row contents don't follow schema, don't extract KV pairs\n           if len(row) != 4:\n               return\n           \n           name, date_str, state, email = row\n\n           # clean data\n           name = re.sub(r's+', ' ', name).strip().lower() # replace 2+ whitespaces with a single space, then strip leading\/trailing whitespace\n           state = state.strip().upper()\n           email = email.strip().lower()\n           date = self.normalize_date(date_str)\n\n           # emit cleaned KV pair\n           if name and date and state and email:\n               yield name, (date, state, email)\n       except: \n           pass # skip bad records\n\n   def reducer(self, key, values):\n       \"\"\"\n       Given a Name and an iterator of (Date, State, Email) values associated with that key,\n       return a set of (Date, State, Email) values for that Name.\n\n       This will eliminate all duplicate &lt;Name, (Date, State, Email)&gt; entries.\n       \"\"\"\n       seen = set()\n       for value in values:\n           value = tuple(value)\n           if value not in seen:\n               seen.add(value)\n               yield key, value\n          \n   def normalize_date(self, date_str):\n       formats = [\"%Y-%m-%d\", \"%m-%d-%Y\", \"%d-%m-%Y\", \"%d\/%m\/%y\", \"%m\/%d\/%Y\", \"%Y\/%m\/%d\", \"%Y.%m.%d\"]\n       for fmt in formats:\n           try:\n               return datetime.strptime(date_str.strip(), fmt).strftime(\"%Y-%m-%d\")\n           except ValueError:\n               continue\n       return \"\"\n\n\nif __name__ == '__main__':\n   UserDataCleaner.run()<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"5836\">This is just one example of a simple data transformation task that can be expressed using the mrjob framework. For more complex data-processing tasks that cannot be expressed with a single MapReduce job,\u00a0<a href=\"https:\/\/mrjob.readthedocs.io\/en\/latest\/guides\/quickstart.html#writing-your-second-job\" rel=\"noreferrer noopener\" target=\"_blank\">mrjob supports this<\/a>\u00a0by allowing developers to write multiple\u00a0<strong>mapper()<\/strong>\u00a0and\u00a0<strong>producer()<\/strong>\u00a0methods, and define a pipeline of mapper\/producer steps that result in the desired output.<\/p>\n<p class=\"wp-block-paragraph\" id=\"2ab2\">By default, mrjob executes your job in a single process, as this allows for friendly development, testing, and debugging. Of course, mrjob supports the execution of MapReduce jobs on various platforms (Hadoop, Google Dataproc, Amazon EMR). It\u2019s good to be aware that the overhead of initial cluster setup can be fairly significant (~5+ min, depending on the platform and various factors), but when executing MapReduce jobs on truly large datasets (10+ GB), job deployment on one of these platforms would save significant amounts of time as the initial setup overhead would be fairly small relative to the execution time on a single machine.<\/p>\n<p class=\"wp-block-paragraph\" id=\"0e40\">Check out the\u00a0<a href=\"https:\/\/mrjob.readthedocs.io\/en\/latest\/\" rel=\"noreferrer noopener\" target=\"_blank\">mrjob documentation<\/a>\u00a0if you want to explore its capabilities further <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f642.png?ssl=1\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"contributions-current-state\">MapReduce: Contributions &amp; Current State<\/h3>\n<p class=\"wp-block-paragraph\" id=\"12cf\">MapReduce was a significant contribution to the development of scalable, data-intensive applications primarily for the following two reasons:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The authors recognized that primitive operations originating from functional programming,\u00a0<a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs3110\/2014sp\/lectures\/5\/map-fold-map-reduce.html\" target=\"_blank\" rel=\"noreferrer noopener\">map and reduce<\/a>, can be pipelined together to accomplish many <a href=\"https:\/\/towardsdatascience.com\/tag\/big-data\/\" title=\"Big Data\">Big Data<\/a> tasks.<\/li>\n<li class=\"wp-block-list-item\">It abstracted away the\u00a0<a href=\"https:\/\/aws.amazon.com\/builders-library\/challenges-with-distributed-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">difficulties<\/a>\u00a0that come with executing those operations on a distributed system.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"11b3\"><a href=\"https:\/\/towardsdatascience.com\/tag\/mapreduce\/\" title=\"Mapreduce\">Mapreduce<\/a> was not significant because it introduced new primitive concepts. Rather, MapReduce was so influential because it encapsulated these map and reduce primitives into a single library, which automatically handled challenges that come from managing distributed systems, such as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Scheduling_%28computing%29\" rel=\"noreferrer noopener\" target=\"_blank\">task scheduling<\/a>\u00a0and\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Fault_tolerance\" rel=\"noreferrer noopener\" target=\"_blank\">fault tolerance<\/a>. These abstractions allowed developers with little distributed programming experience to write parallel programs efficiently.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c921\">There were\u00a0<a href=\"https:\/\/www.cs.utexas.edu\/~rossbach\/cs380p\/papers\/dewitt08blog-mapreduce-backwards.pdf\" rel=\"noreferrer noopener\" target=\"_blank\">opponents from the database community<\/a>\u00a0who were skeptical about the novelty of the MapReduce framework \u2014 prior to MapReduce, there was existing research on\u00a0<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/129888.129894\" rel=\"noreferrer noopener\" target=\"_blank\">parallel database systems<\/a>\u00a0investigating how to enable parallel and distributed execution of analytical SQL queries. However, MapReduce is typically integrated with a\u00a0<a href=\"https:\/\/research.google.com\/archive\/gfs.html\" rel=\"noreferrer noopener\" target=\"_blank\">distributed file system<\/a>\u00a0with no requirements to impose a schema on the data, and it provides developers the freedom to implement custom data processing logic (ex: machine learning workloads, image processing, network analysis) in\u00a0<strong>map()<\/strong>\u00a0and\u00a0<strong>reduce()<\/strong>\u00a0that may be impossible to express through SQL queries alone. These characteristics enable MapReduce to orchestrate parallel and distributed execution of general purpose programs, instead of being limited to declarative SQL queries.<\/p>\n<p class=\"wp-block-paragraph\" id=\"4594\">All that being said, the MapReduce framework is no longer the go-to model for most modern large-scale data processing tasks.<\/p>\n<p class=\"wp-block-paragraph\" id=\"12dd\">It has been criticized for its somewhat\u00a0<a href=\"https:\/\/www.the-paper-trail.org\/post\/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google\/\" rel=\"noreferrer noopener\" target=\"_blank\">restrictive nature<\/a>\u00a0of requiring computations to be translated into map and reduce phases, and requiring intermediate data to be materialized before transmitting it between mappers and reducers. Materializing intermediate results may result in I\/O bottlenecks, as all mappers must complete their processing before the reduce phase starts. Additionally, complex data processing tasks may require many MapReduce jobs to be chained together and executed sequentially.<\/p>\n<p class=\"wp-block-paragraph\" id=\"88d0\">Modern frameworks, such as Apache Spark, have extended upon the original MapReduce design by opting for a more flexible\u00a0<a href=\"https:\/\/blog.devgenius.io\/mastering-spark-dags-the-ultimate-guide-to-understanding-execution-ce6683ae785b\" rel=\"noreferrer noopener\" target=\"_blank\">DAG execution model<\/a>. This DAG execution model allows the entire sequence of transformations to be optimized, so that dependencies between stages can be recognized and exploited to execute data transformations in memory and pipeline intermediate results, when appropriate.<\/p>\n<p class=\"wp-block-paragraph\" id=\"6add\">However, MapReduce has had a significant influence on modern data processing frameworks (Apache Spark, Flink, Google Cloud Dataflow) due to fundamental distributed programming concepts that it introduced, such as\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/cs\/0403019\" rel=\"noreferrer noopener\" target=\"_blank\">locality-aware scheduling<\/a>, fault tolerance by re-execution, and scalability.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"wrap-up\">Wrap Up<\/h3>\n<p class=\"wp-block-paragraph\" id=\"6c5b\">If you made it this far, thanks for reading! There was a lot of content here, so let\u2019s quickly flesh out what we discussed.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">MapReduce is a programming model used to orchestrate the parallel and distributed execution of programs across a large compute cluster of commodity hardware. Developers can write parallel programs using the MapReduce framework by simply defining the mapper and reducer logic specific for their task.<\/li>\n<li class=\"wp-block-list-item\">Tasks that consist of applying transformations on independent partitions of the data followed by grouped aggregation are ideal fits to be optimized by MapReduce.<\/li>\n<li class=\"wp-block-list-item\">We walked through how to express a common data engineering workload as a MapReduce task using the MRJob library.<\/li>\n<li class=\"wp-block-list-item\">MapReduce as it was originally designed is no longer used for modern big data tasks, but its core components have played a signifcant role in the design of modern distributed programming frameworks.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"260a\">If there are any important details about the MapReduce framework that are missing or deserve more attention here, I\u2019d love to hear it in the comments. Additionally, I did my best to include all of the great resources that I read while writing this article, and I highly recommend checking them out if you\u2019re interested in learning further!<\/p>\n<p class=\"wp-block-paragraph\"><em>The author has created all images in this article.<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\" id=\"sources\"><strong>Sources<\/strong><\/h3>\n<p class=\"wp-block-paragraph\" id=\"3acd\">MapReduce Fundamentals:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/research.google\/pubs\/mapreduce-simplified-data-processing-on-large-clusters\/\" target=\"_blank\" rel=\"noreferrer noopener\">MapReduce: Simplified Data Processing on Large Clusters<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/dataintensive.net\/\" target=\"_blank\" rel=\"noreferrer noopener\">Martin Kleppmann, Chapter 10, Designing Data Intensive Applications<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"14d7\">mrjob:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/mrjob.readthedocs.io\/en\/latest\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">Documentation<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"b885\">Related Background:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs3110\/2014sp\/lectures\/5\/map-fold-map-reduce.html\" target=\"_blank\" rel=\"noreferrer noopener\">Map, Fold, Reduce<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/www.jstatsoft.org\/article\/view\/v040i01%20and%20https:\/\/pandas.pydata.org\/docs\/user_guide\/groupby.html\" target=\"_blank\" rel=\"noreferrer noopener\">Split-Apply-Combine<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"2064\">MapReduce Limitations &amp; Extensions:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/www.cs.utexas.edu\/~rossbach\/cs380p\/papers\/dewitt08blog-mapreduce-backwards.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">MapReduce: A major step backwards<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/www.the-paper-trail.org\/post\/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google\/\" target=\"_blank\" rel=\"noreferrer noopener\">The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/blog.devgenius.io\/mastering-spark-dags-the-ultimate-guide-to-understanding-execution-ce6683ae785b\" target=\"_blank\" rel=\"noreferrer noopener\">Spark DAG Execution Model<\/a><\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/mapreduce-how-it-powers-scalable-data-processing\/\">MapReduce: How It Powers Scalable Data Processing<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Jimin Kang<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/mapreduce-how-it-powers-scalable-data-processing\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>MapReduce: How It Powers Scalable Data Processing In this article, I\u2019ll give a brief introduction to the MapReduce programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1067,401,67,2434,2435,2436],"tags":[84,649,2049],"class_list":["post-3275","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-big-data","category-data-engineering","category-deep-dives","category-distributed-computing","category-mapreduce","category-parallel-computing","tag-data","tag-key","tag-mapreduce"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3275"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3275"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3275\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3275"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3275"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3275"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}