MapReduce [[library (software)|libraries]] have been written in many programming languages, with different levels of optimization. A popular [[open-source software|open-source]] implementation that has support for distributed shuffles is part of [[Apache Hadoop]]. The name MapReduce originally referred to the proprietary [[Google]] technology, but has since been [[generic trademark|genericized]]. By 2014, Google was no longer using MapReduce as their primary ''[[big data]]'' processing model,<ref>{{cite web|url=http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/ | title=Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System |last1=Sverdlik |first1=Yevgeniy |date=2014-06-25 |website=Data Center Knowledge |access-date=2015-10-25 |quote="We don't really use MapReduce anymore" [Urs Hölzle, senior vice president of technical infrastructure at Google]}}</ref> and development on [[Apache Mahout]] had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.<ref>{{cite web |url=https://gigaom.com/2014/03/27/apache-mahout-hadoops-original-machine-learning-project-is-moving-on-from-mapreduce/ |title=Apache Mahout, Hadoop's original machine learning project, is moving on from MapReduce |last1=Harris |first1=Derrick |date=2014-03-27 |website=[[Gigaom]] |access-date=2015-09-24 |quote=Apache Mahout [...] is joining the exodus away from MapReduce.}}</ref>
 
Some [[vector processor]]s such as the [[NEC SX-Aurora TSUBASA]] have extensive MapReduce support built into their vector instruction sets,<ref>[https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf NEC SX-Aurora ISA guide]</ref> as does the RISC-V vector extension.<ref>{{Cite web|url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-reduction-operations|title = Riscv-v-spec|website = [[GitHub]]|date = 29 October 2021}}</ref>
 
==Overview==
The ''Reduce'' function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
 
<code>Reduce(k2, list(v2))</code> → <code>list((k3, v3))</code><ref>{{Cite web|url=https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Inputs+and+Outputs|title = MapReduce Tutorial}}</ref>
 
Each ''Reduce'' call typically produces either one key-value pair or an empty return, though one call is allowed to return more than one key-value pair. The returns of all calls are collected as the desired result list.
 
Thus the MapReduce framework transforms a list of (key, value) pairs into another list of (key, value) pairs.<ref>{{Cite web|url=https://github.com/apache/hadoop-mapreduce/blob/307cb5b316e10defdbbc228d8cdcdb627191ea15/src/java/org/apache/hadoop/mapreduce/Reducer.java#L148|title = Apache/Hadoop-mapreduce|website = [[GitHub]]|date = 31 August 2021}}</ref> This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines ''all'' the values returned by map.
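
The following is a minimal, framework-independent sketch in Python of how these signatures fit together for the canonical word-count example; the function names and the single-process "shuffle" are illustrative only and are not part of any particular MapReduce API.

<syntaxhighlight lang="python">
from collections import defaultdict

def map_fn(document_name, document_text):
    # Map(k1, v1) -> list((k2, v2)): emit (word, 1) for every word in the document.
    return [(word, 1) for word in document_text.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list((k3, v3)): sum the counts collected for one word.
    return [(word, sum(counts))]

def run_mapreduce(inputs):
    # Shuffle: group all intermediate values by their key before reducing.
    grouped = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Apply Reduce to each group and collect all returned pairs into the result list.
    result = []
    for k2, values in grouped.items():
        result.extend(reduce_fn(k2, values))
    return result

print(run_mapreduce([("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
</syntaxhighlight>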
 
It is [[Necessity and sufficiency|necessary but not sufficient]] to have implementations of the map and reduce abstractions in order to implement MapReduce. Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases. This may be a [[distributed file system]]. Other options are possible, such as direct streaming from mappers to reducers, or for the mapping processors to serve up their results to reducers that query them.
==Performance considerations==
MapReduce programs are not guaranteed to be fast. The main benefits of this programming model are exploiting the platform's optimized shuffle operation and only having to write the ''Map'' and ''Reduce'' parts of the program.
In practice, however, the author of a MapReduce program has to take the shuffle step into consideration; in particular, the partition function and the amount of data written by the ''Map'' function can have a large impact on performance and scalability. Additional modules such as the ''Combiner'' function can help to reduce the amount of data written to disk and transmitted over the network. MapReduce applications can achieve sub-linear speedups under specific circumstances.<ref name=":0">{{Cite journal|title = BSP cost and scalability analysis for MapReduce operations|journal = Concurrency and Computation: Practice and Experience|date = 2015-01-01|issn = 1532-0634|pages = 2503–2527|doi = 10.1002/cpe.3628|first1 = Hermes|last1 = Senger|first2 = Veronica|last2 = Gil-Costa|first3 = Luciana|last3 = Arantes|first4 = Cesar A. C.|last4 = Marcondes|first5 = Mauricio|last5 = Marín|first6 = Liria M.|last6 = Sato|first7 = Fabrício A.B.|last7 = da Silva|volume=28|issue = 8|hdl = 10533/147670|s2cid = 33645927|hdl-access = free}}</ref>
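
As an illustration of the ''Combiner'' idea (a sketch only, continuing the word-count example and not tied to any particular framework's API), the output of a single mapper can be pre-aggregated locally so that at most one pair per distinct key is shuffled instead of one pair per occurrence:

<syntaxhighlight lang="python">
from collections import defaultdict

def combine(single_mapper_output):
    # Locally sum the counts per key on the node that ran the mapper.
    # The signature mirrors Reduce, but it sees only one mapper's output,
    # so it shrinks the data before it is written to disk and shuffled.
    partial_sums = defaultdict(int)
    for word, count in single_mapper_output:
        partial_sums[word] += count
    return list(partial_sums.items())

mapper_output = [("the", 1), ("quick", 1), ("the", 1), ("the", 1)]
print(combine(mapper_output))  # [('the', 3), ('quick', 1)]: two pairs shuffled instead of four
</syntaxhighlight>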
 
When designing a MapReduce algorithm, the author needs to choose a good tradeoff<ref name="ullman">{{Cite journal | doi = 10.1145/2331042.2331053 | title = Designing good MapReduce algorithms| journal = XRDS: Crossroads, the ACM Magazine for Students| volume = 19| pages = 30–34| year = 2012| last1 = Ullman | first1 = J. D. | s2cid = 26498063| author-link1 = Jeffrey Ullman| url = http://xrds.acm.org/article.cfm?aid=2331053 |url-access=subscription}}</ref> between the computation and the communication costs. Communication cost often dominates the computation cost,<ref name="ullman"/><ref name=":0"/> and many MapReduce implementations are designed to write all communication to distributed storage for crash recovery.
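
As a rough illustration (the numbers here are chosen for the example, not taken from the cited sources): if every input record is replicated to <math>r</math> reducers, the shuffle moves about <math>r</math> times the input size across the network, so a 1&nbsp;TB input processed with a replication rate of <math>r = 3</math> generates roughly 3&nbsp;TB of communication, typically far more costly than the CPU time spent inside the ''Map'' and ''Reduce'' functions themselves.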
 
In tuning performance of MapReduce, the complexity of mapping, shuffle, sorting (grouping by the key), and reducing has to be taken into account. The amount of data produced by the mappers is a key parameter that shifts the bulk of the computation cost between mapping and reducing. Reducing includes sorting (grouping of the keys), which has nonlinear complexity. Hence, small partition sizes reduce sorting time, but there is a trade-off because having a large number of reducers may be impractical. The influence of split unit size is marginal (unless chosen particularly badly, say less than 1&nbsp;MB). The gains from having some mappers read their input from local disks are, on average, minor.<ref>{{Cite journal|title = Scheduling divisible MapReduce computations|last1 = Berlińska|first1 = Joanna|date = 2010-12-01|journal = Journal of Parallel and Distributed Computing|doi = 10.1016/j.jpdc.2010.12.004|last2 = Drozdowski|first2 = Maciej|volume=71|issue = 3|pages=450–459}}</ref>
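
A back-of-the-envelope model (an illustration under simplifying assumptions, not a result from the cited paper): if <math>N</math> intermediate records are spread evenly over <math>R</math> reducers, comparison-based sorting of each partition costs on the order of <math>\tfrac{N}{R}\log\tfrac{N}{R}</math>, or about <math>N\log\tfrac{N}{R}</math> in total across all reducers. Increasing <math>R</math> therefore shrinks each partition's sort time, while per-reducer startup and output overhead grows roughly linearly in <math>R</math>, which is why a very large number of reducers becomes impractical.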
 
For processes that complete quickly, and where the data fits into the main memory of a single machine or a small cluster, using a MapReduce framework is usually not effective. Since these frameworks are designed to recover from the loss of whole nodes during the computation, they write interim results to distributed storage. This crash recovery is expensive and only pays off when the computation involves many computers and a long runtime. A task that completes in seconds can simply be restarted in the case of an error, and the likelihood of at least one machine failing grows quickly with the cluster size. On such problems, implementations that keep all data in memory and simply restart a computation on node failures, or, when the data is small enough, non-distributed solutions, will often be faster than a MapReduce system.
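
For a sense of scale (an illustrative calculation, not drawn from a cited source): if each machine independently fails during a given job with probability <math>p</math>, then at least one of <math>n</math> machines fails with probability <math>1-(1-p)^n</math>; with <math>p = 0.001</math> this is 0.1% for a single machine but already about 63% for a 1,000-machine cluster, which is why long-running jobs on large clusters need the crash recovery that makes MapReduce unattractive for short, small jobs.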
 
==Uses==
MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, Singular Value Decomposition,<ref>{{cite journal|last1=Bosagh Zadeh|first1=Reza|last2=Carlsson|first2=Gunnar|title=Dimension Independent Matrix Square Using MapReduce|url=https://stanford.edu/~rezab/papers/dimsum.pdf|access-date=12 July 2014|bibcode=2013arXiv1304.1467B|year=2013|arxiv=1304.1467}}</ref> web access log stats, [[inverted index]] construction, [[document clustering]], [[machine learning]],<ref name="mrml">{{cite web| url=http://www.willowgarage.com/map-reduce-machine-learning-multicore| title=Map-Reduce for Machine Learning on Multicore|first1=Andrew Y. |last1=Ng |first2=Gary |last2=Bradski |first3=Cheng-Tao |last3=Chu |first4=Kunle |last4=Olukotun |first5=Sang Kyun |last5=Kim |first6=Yi-An |last6=Lin |first7=YuanYuan |last7=Yu| publisher=NIPS 2006| year=2006}}</ref> and [[statistical machine translation]]. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems,<ref name="evalMR">{{Cite book | doi = 10.1109/HPCA.2007.346181| chapter = Evaluating MapReduce for Multi-core and Multiprocessor Systems| title = 2007 IEEE 13th International Symposium on High Performance Computer Architecture| pages = 13| year = 2007| last1 = Ranger | first1 = C. | last2 = Raghuraman | first2 = R. | last3 = Penmetsa | first3 = A. | last4 = Bradski | first4 = G. | last5 = Kozyrakis | first5 = C. | isbn = 978-1-4244-0804-7| citeseerx = 10.1.1.220.8210| s2cid = 12563671}}</ref><ref name="graphicsMR">{{Cite book | doi = 10.1145/1454115.1454152| chapter = Mars: a MapReduce framework on graphics processors| title = Proceedings of the 17th international conference on Parallel architectures and compilation techniques – PACT '08| pages = 260| year = 2008| last1 = He | first1 = B. | last2 = Fang | first2 = W. | last3 = Luo | first3 = Q. | last4 = Govindaraju | first4 = N. K. | chapter-url = http://wenbin.org/doc/papers/Wenbin08PACT.pdf| last5 = Wang | first5 = T. | isbn = 9781605582825| s2cid = 207169888}}</ref><ref name="tiledMR">{{Cite book | doi = 10.1145/1854273.1854337| chapter = Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling| title = Proceedings of the 19th international conference on Parallel architectures and compilation techniques – PACT '10| pages = 523| year = 2010| last1 = Chen | first1 = R. | last2 = Chen | first2 = H. | last3 = Zang | first3 = B. | isbn = 9781450301787| s2cid = 2082196}}</ref> desktop grids,<ref name="gridMR">{{Cite book | doi = 10.1109/3PGCIC.2010.33| chapter = Towards MapReduce for Desktop Grid Computing| title = 2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing| pages = 193| year = 2010| last1 = Tang | first1 = B. | last2 = Moca | first2 = M. | last3 = Chevalier | first3 = S. | last4 = He | first4 = H. | last5 = Fedak | first5 = G. | chapter-url = http://graal.ens-lyon.fr/~gfedak/papers/xtremmapreduce.3pgcic10.pdf| isbn = 978-1-4244-8538-3| citeseerx = 10.1.1.671.2763| s2cid = 15044391}}</ref>
multi-cluster,<ref name="HMR">{{Cite book | doi = 10.1145/1996023.1996026| chapter = A Hierarchical Framework for Cross-Domain MapReduce Execution|chapter-url = http://yuanluo.net/publications/LUO_ECMLS2011.pdf| title = Proceedings of the second international workshop on Emerging computational methods for the life sciences (ECMLS '11)| year = 2011| last1 = Luo | first1 = Y. | last2 = Guo | first2 = Z. | last3 = Sun | first3 = Y.| last4 = Plale | first4 = B. | last5 = Qiu | first5 = J. | last6=Li|
first6=W. |isbn = 978-1-4503-0702-4| citeseerx = 10.1.1.364.9898| s2cid = 15179363}}</ref> volunteer computing environments,<ref name="volunteerMR">{{Cite book | doi = 10.1145/1851476.1851489| chapter = MOON: MapReduce On Opportunistic eNvironments| title = Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing – HPDC '10| pages = 95| year = 2010| last1 = Lin | first1 = H. | last2 = Ma | first2 = X. | last3 = Archuleta | first3 = J. | last4 = Feng | first4 = W. C. | last5 = Gardner | first5 = M. | last6 = Zhang | first6 = Z. | chapter-url = http://eprints.cs.vt.edu/archive/00001089/01/moon.pdf| isbn = 9781605589428| s2cid = 2351790}}</ref> dynamic cloud environments,<ref name="dynCloudMR">{{Cite journal| doi = 10.1016/j.jcss.2011.12.021| title = P2P-MapReduce: Parallel data processing in dynamic Cloud environments| journal = [[Journal of Computer and System Sciences]]| volume = 78| issue = 5| pages = 1382–1402| year = 2012| url = http://grid.deis.unical.it/papers/pdf/MarozzoTaliaTrunfioJCSS2012.pdf| last1 = Marozzo| first1 = F.| last2 = Talia| first2 = D.| last3 = Trunfio| first3 = P.| access-date = 2015-09-04| archive-url = https://web.archive.org/web/20160303211309/http://grid.deis.unical.it/papers/pdf/MarozzoTaliaTrunfioJCSS2012.pdf| archive-date = 2016-03-03| url-status = dead| doi-access = free}}</ref> mobile environments,<ref name="mobileMR">{{Cite book | doi = 10.1145/1839294.1839332| chapter = Misco: a MapReduce framework for mobile systems| title = Proceedings of the 3rd International Conference on PErvasive Technologies Related to Assistive Environments – PETRA '10| pages = 1| year = 2010| last1 = Dou | first1 = A. | last2 = Kalogeraki | first2 = V. | last3 = Gunopulos | first3 = D. | last4 = Mielikainen | first4 = T. | last5 = Tuulos | first5 = V. H. | isbn = 9781450300711| s2cid = 14517696}}</ref> and high-performance computing environments.<ref>{{cite book|chapter=Characterization and Optimization of Memory-Resident MapReduce on HPC Systems|publisher=IEEE|date=May 2014|doi=10.1109/IPDPS.2014.87|isbn=978-1-4799-3800-1|title=2014 IEEE 28th International Parallel and Distributed Processing Symposium|last1=Wang|first1=Yandong|last2=Goldstone|first2=Robin|last3=Yu|first3=Weikuan|last4=Wang|first4=Teng|pages=799–808|s2cid=11157612}}</ref>
 
At Google, MapReduce was used to completely regenerate Google's index of the [[World Wide Web]]. It replaced the old ''ad hoc'' programs that updated the index and ran the various analyses.<ref name="usage">{{cite web| quote=As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.| url=http://www.baselinemag.com/c/a/Infrastructure/How-Google-Works-1/5| title=How Google Works| publisher=baselinemag.com}}</ref> Development at Google has since moved on to technologies such as Percolator, FlumeJava<ref name="Chambers2010">{{cite book |last1=Chambers |first1=Craig |last2=Raniwala |first2=Ashish |last3=Perry |first3=Frances |last4=Adams |first4=Stephen |last5=Henry |first5=Robert R. |last6=Bradshaw |first6=Robert |last7=Weizenbaum |first7=Nathan |title=FlumeJava: Easy, Efficient Data-parallel Pipelines |journal=Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation |date=1 January 2010 |pages=363–375 |doi=10.1145/1806596.1806638 |url=https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf |access-date=4 August 2016 |isbn=9781450300193 |s2cid=14888571 |archive-url=https://web.archive.org/web/20160923141630/https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35650.pdf |archive-date=23 September 2016 |url-status=dead }}</ref> and [[Google MillWheel|MillWheel]] that offer streaming operation and updates instead of batch processing, to allow integrating "live" search results without rebuilding the complete index.<ref>Peng, D., & Dabek, F. (2010, October). Large-scale Incremental Processing Using Distributed Transactions and Notifications. In OSDI (Vol. 10, pp. 1-15).</ref>
 
MapReduce's stable inputs and outputs are usually stored in a [[distributed file system]]. The transient data are usually stored on local disk and fetched remotely by the reducers.