Wednesday, September 4, 2019
Novel Clockwise Task Migration in Many-Core Chip
A Novel Clockwise Task Migration in Many-Core Chip Multiprocessors

Abstract- The industry trend for Chip Multiprocessors (CMPs) is moving from multi-core to many-core designs to obtain higher computing performance, flexibility, and scalability. Moreover, transistor sizes are constantly shrinking, and more and more transistors are integrated on a single chip, which makes it possible to design more powerful and more complex systems. However, higher computing performance requires higher power consumption, which increases the number of on-chip hotspots and the overall chip temperature. High peak temperatures degrade performance, reduce reliability, shorten the chip life span, and can eventually damage the system. Therefore, Runtime Thermal Management (RTM) for CMPs has become crucial to minimize temperature without any performance degradation. In this paper, a new clockwise task migration technique is proposed for many-core CMPs. The proposed technique migrates heavily loaded tasks from the central cores to the surrounding cores. It performs the migrations in a clockwise manner to spread out the hotspots that form in the central cores of the chip. Moreover, the proposed migration algorithm estimates core temperatures from performance counters and the proposed equations instead of thermal sensors, and this approach shows efficient results. Simulation results indicate up to a 15% reduction in the maximum temperature of the whole many-core CMP. The efficiency of the proposed technique is shown by the temperature values of the many-core CMP, which stay below the maximum temperature limit.

Keywords- chip multiprocessors; many-core; task migration; performance counter; runtime thermal management.
Chip multiprocessors (CMPs) continue to increase their transistor counts to meet the growing demand for reliability and high computing performance. At the same time, transistor sizes are constantly shrinking, and more and more transistors are integrated on a single chip, which makes it possible to design more powerful and more complex CMP architectures. These advantages increase the number of cores on a CMP, so CMPs are shifting from the multi-core to the many-core era, in which tens or hundreds of cores are integrated on a single chip and connected via a network-on-chip (NoC) [4-5]. Many-core CMPs provide higher computing performance by executing heavily loaded tasks, which consume more power. However, heavily loaded tasks increase the overall chip temperature and create on-chip hotspots. Hotspots are the main obstacle to wide adoption of many-core CMP architectures, because they lead to performance degradation, reduced reliability, increased cooling costs, shorter chip life span, and eventually system failure. Therefore, to achieve better computing performance with higher scalability while maintaining reliability, efficient Runtime Thermal Management (RTM) techniques are imperative [6-8]. RTM not only aims to balance and distribute the temperature of the chip but also enables many-core CMPs to operate at a favorable performance level while staying below a temperature threshold [1-2]. Therefore, to maintain efficient performance on many-core CMPs, the authors propose a clockwise task migration technique that serves as an alternative way to control core temperatures. The proposed technique migrates heavily loaded tasks from the central part of the core layer to the surrounding part.
In other words, the proposed method performs clockwise task migrations to spread out the hotspots that form in the central cores of the chip. The proposed method aims to maximize the throughput of many-core CMPs while satisfying the peak temperature constraint [5-6]. With the development of many-core CMPs, using expensive, high-overhead thermal sensors to measure core temperatures is neither effective nor appropriate for meeting thermal challenges. Therefore, this work provides a new technique that measures core temperatures without thermal sensors. The proposed migration algorithm obtains each core's temperature from the performance counters placed in that core. In this way, the high-temperature cores are spread across the chip without any performance degradation [1-3], [11-13]. This paper makes the following contributions: (1) it develops a novel runtime task migration technique that balances hotspots in many-core systems; (2) instead of using expensive, high-overhead sensors to measure core temperatures, the proposed task migration technique uses performance counters; (3) experimental results show that the proposed algorithm can significantly outperform the conventional approach.

The rest of the paper is organized as follows. Section II gives a summary of related work. Section III introduces the proposed technique. Section IV presents the experimental evaluation. Finally, Section V gives the conclusion.

The industry trend for CMPs is for transistor counts to grow exponentially, following Moore's law, which helps to achieve more powerful systems and better computing performance by executing heavily loaded tasks [1-3]. However, heavily loaded tasks increase on-chip thermal hotspots and the overall peak temperature of the CMP. Thus, when hundreds of processors are integrated on a single chip, as in many-core CMPs, off-line methods are not efficient.
Therefore, RTM becomes crucial to balance on-chip thermal hotspots and the overall peak temperature of the CMP [1-3], [8-10]. To this end, many theoretical works have addressed the dissipation and elimination of thermal hotspots using different techniques. For instance, the Dynamic Voltage and Frequency Scaling (DVFS) technique aims to control the temperature by dynamically adjusting the processor speed based on the workload. However, because DVFS lowers the processor speed to cool the chip, it sacrifices performance. Another approach, task migration, aims to manage the on-chip temperature by balancing the task load among CMP tiles without slowing down processing. In [1-3], [10-11], the proposed algorithms are in some cases unable to find a suitable destination core because of the thermal constraints, so the authors fall back on DVFS, which has proved inefficient as far as performance is concerned. Other authors implemented several thermal-aware algorithms that migrate tasks between processor cores to reduce thermal variation in a 3D architecture with stacked DRAM memory. However, some of these techniques perform static task migration, which in some cases can migrate a task from a cold core to a hotspot core. The same authors also proposed other techniques that rely on expensive, high-overhead thermal sensors to detect on-chip hotspots. Moreover, in [2-3], the authors proposed techniques that always assign a new job to the coolest core to balance the thermal hotspots across the chip; however, this approach rapidly increases the number of hotspots in the system. Therefore, when hundreds of processors are integrated on a single chip, as in many-core CMPs, off-line methods are not efficient at distributing and balancing thermal hotspots.
In this work, a novel runtime task migration technique is proposed that offers an effective solution to the thermal challenges in many-core CMPs. Furthermore, instead of using expensive, high-overhead sensors to measure core temperatures, the proposed migration technique uses performance counters to measure the temperature of the many-core CMP tiles.

Fig. 1: Many-core CMP with 64 cores and the TCU connection with a tile of the many-core CMP.

Fig. 2: The components of a tile in the 64-core many-core CMP.

The CMP industry trend is moving from multi-core to many-core architectures to achieve better computing performance and higher reliability. Many-core CMP architectures therefore run heavily loaded tasks so that the system operates at high computing performance. However, heavy tasks increase the peak temperature of the chip and create on-chip hotspots. Thus, RTM is crucial to keep the system below its temperature threshold while executing tasks efficiently. Figure 1 shows a many-core CMP with 64 tiles. Each tile includes a core, a private L1 cache bank, and a shared L2 cache bank, as shown in Figure 2. The technique proposed in this work aims to balance the thermal distribution in order to combat thermal issues and temperature-related reliability problems. It migrates tasks between cores at runtime and repeats the migration periodically at a predefined time interval, which is 100 ms in this work. At the end of each interval, each core uses its instructions per cycle (IPC) to calculate its power consumption. IPC is a critical factor in the power consumption calculation. Notably, cores with higher power consumption execute tasks at higher performance and therefore reach higher temperatures than cores with lower power consumption. The power consumption of each core is calculated according to Equation 1.
In Equation 1, P is the core power consumption, IPC is the instructions per cycle, which represents the core activity, f is the core frequency, CL is the average capacitance, and VDD is the supply voltage. The frequency of each core in the many-core CMP is constant: because DVFS is expensive and degrades performance, the system is assumed not to change core frequencies dynamically. As Equation 1 shows, the IPC plays a key role in calculating and predicting the power consumption of each core in the system. The IPC is obtained from performance counters, which are widely available in modern processors. Each core has a performance counter for counting IPC. At the end of each time interval, the IPC of each core is read from its performance counter, and the power consumption is then calculated from Equation 1. The calculated power consumption is used to fill a look-up table in the Thermal Control Unit (TCU). An example of the look-up table is illustrated in Figure 3. In the target many-core system, the TCU is assumed to be placed close to all of the cores, as shown in Figure 1. Based on the filled table in the TCU, we divide the many-core floor plan into two parts: the central part, with one region, and the surrounding part, with four regions, as shown in Figure 4. Based on the thermal distribution of the central part and the surrounding part, we try to balance the temperature in the system. Using the activity of each core recorded in the look-up table of Figure 3, hot and cold cores are identified according to the thresholds shown in Figure 5, where th1=5, th2=10, th3=15, and th4=20.

Fig. 3: A sample of a look-up table in the TCU used at the end of each time interval.

Fig. 4: The central part and the surrounding part of the 64-tile many-core CMP.

Based on this map of hot and cold cores, the proposed technique sorts the cores in both the central part and the surrounding part from hottest to coldest.
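As a minimal sketch of the bookkeeping described above, the Python below estimates per-core power, bins an activity value against th1 to th4, and marks the central part of the mesh. Several details are assumptions rather than the authors' definitions: the power formula takes the conventional activity-scaled dynamic-power form P = IPC · CL · VDD² · f built from the symbols listed for Equation 1; the band labels and the handling of values that fall exactly on a threshold are hypothetical; and the central part is taken to be the inner 4×4 block of the 8×8 mesh.

```python
from bisect import bisect_right

# th1..th4 as given in the text; the text does not state their units.
THRESHOLDS = (5, 10, 15, 20)
# Hypothetical labels for the five bands that four thresholds delimit.
BANDS = ("coldest", "cold", "warm", "hot", "hottest")


def estimate_power(ipc, c_load, v_dd, freq_hz):
    """Assumed form of Equation 1: P = IPC * C_L * V_DD^2 * f."""
    return ipc * c_load * v_dd ** 2 * freq_hz


def classify(activity):
    """Map an activity value to a band; a value equal to a threshold
    is placed in the upper band (an assumption)."""
    return BANDS[bisect_right(THRESHOLDS, activity)]


def is_central(row, col, size=8):
    """Assumption: the central part is the inner block covering the
    middle half of the rows and columns (4x4 on an 8x8 mesh)."""
    lo, hi = size // 4, size - size // 4
    return lo <= row < hi and lo <= col < hi


if __name__ == "__main__":
    central = [(r, c) for r in range(8) for c in range(8) if is_central(r, c)]
    print(len(central), "central cores,", 64 - len(central), "surrounding cores")
    print(classify(12))
```

With these assumptions the mesh splits into 16 central and 48 surrounding cores, so each of the four surrounding regions would hold 12 cores if the regions are of equal size.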
The proposed technique then exchanges the hottest core in the central part with the coldest core in the surrounding part. In this way, heavily loaded tasks are migrated to the edges of the chip and lightly loaded tasks are migrated to the central part. The edges of the chip are a better place for hot cores than the central part because neighboring cores strongly affect each other's temperature, and a core at the edge has fewer neighbors than a core in the center. Since the surrounding part contains three times as many cores as the central part, a hot core in the central part has several cold cores to choose from for migration. At the end of each time interval, each core sends its IPC information (core activity), obtained from its performance counter, to the TCU. The TCU then uses the core activities in the look-up table to form two sets of activities, one for the central part and one for the surrounding part, and sorts each set separately from the hottest to the coldest core. Next, as shown in Figure 1, the TCU exchanges the hottest core in the central part with the coldest core in the surrounding part, region by region, as explained in the next subsection. The TCU migrates the hot cores in the central part to the cold cores in the surrounding part in a clockwise manner.

Fig.5: The thresholds used for determining the temperature ranges of the cores.

Fig. 6: The proposed clockwise task migration algorithm.

A. Clockwise Migration Algorithm

To avoid gathering all of the hot cores in one region of the surrounding part instead of spreading them over all of its regions, a novel clockwise algorithm is proposed. This clockwise migration algorithm divides the surrounding part into four regions, as shown in Figure 4.
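The clockwise exchange introduced here can be sketched as follows: the hottest remaining central core is paired with the coldest core of region one, the next hottest with the coldest core of region two, and so on around the four regions. This is an illustration under stated assumptions, not the authors' implementation. The function name `clockwise_exchange`, the dictionary layout, and the core identifiers are hypothetical, and the guard that skips a swap when the surrounding core is not cooler than the central core is an assumption that the text does not specify.

```python
def clockwise_exchange(central, regions):
    """Plan task swaps for one time interval.

    central: dict mapping central core id -> activity.
    regions: list of dicts (core id -> activity), one per surrounding
             region, already listed in clockwise order.
    Returns a list of (central_core, surrounding_core) pairs to swap.
    """
    # Central cores from hottest to coldest.
    hot_first = sorted(central, key=central.get, reverse=True)
    # Each surrounding region from coldest to hottest.
    cold_first = [sorted(region, key=region.get) for region in regions]
    next_cold = [0] * len(regions)  # next unused cold core per region

    swaps = []
    for i, hot in enumerate(hot_first):
        r = i % len(regions)  # visit the regions in clockwise order
        if next_cold[r] >= len(cold_first[r]):
            continue  # this region has no unused core left
        cold = cold_first[r][next_cold[r]]
        if regions[r][cold] >= central[hot]:
            continue  # assumed guard: never move a task to a hotter core
        swaps.append((hot, cold))
        next_cold[r] += 1
    return swaps


if __name__ == "__main__":
    central = {"c0": 22, "c1": 18, "c2": 9, "c3": 4}
    regions = [
        {"n0": 3, "n1": 12},   # region one
        {"e0": 6, "e1": 2},    # region two
        {"s0": 10, "s1": 11},  # region three
        {"w0": 1, "w1": 8},    # region four
    ]
    print(clockwise_exchange(central, regions))
```

Rotating through the regions, rather than always taking the globally coldest surrounding core, is what keeps the migrated hot tasks from collecting in a single region.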
After the TCU has sorted the cores in both the central part and the surrounding part from high temperature to low temperature, the proposed clockwise algorithm exchanges the hottest core in the central part with the coldest core in region one of the surrounding part. After that, it exchanges the hottest core in the central part with the coldest core in region two of the surrounding part, and so on. The system repeats this procedure periodically at the end of each time interval to migrate the hot cores in the central part to the cold cores in the four regions of the surrounding part. Phase 1 and Phase 2 of the proposed clockwise task migration technique are summarized in Figure 6.

As shown in Figure 1, a 64-tile many-core CMP architecture with multithreaded workloads is used to evaluate the proposed clockwise task migration technique.

a) Platform Setup

To validate the efficiency of the many-core CMP architecture in this paper, the authors use traffic traces extracted from the GEM5 full-system simulator to set up the basic system platform. The areas of the cores and cache banks are estimated with CACTI and McPAT. We use multithreaded applications from the PARSEC benchmarks in our experimental evaluation. The detailed system configuration is given in Table 1. For these benchmarks, one billion instructions are executed for the simlarge input set, starting from the Region of Interest (ROI). HotSpot version 5.0 is employed as a grid-based thermal modeling tool for chip temperature estimation. For the experimental evaluation, the maximum temperature limit Tmax and the dark silicon peak power budget Pbudget are assumed to be 80 °C and 100 W, respectively.

Table 1. Specification of the target CMP architecture.
Component: Description
Number of Cores: 64, 8×8 mesh
Core Configuration: Alpha21164, 3GHz, 65nm
Private Cache per each Core: SRAM, 4 way, 32 line, size 32KB per core
On-chip Memory: Baseline: Static random mapping; Proposed: Proposed migration technique

b) Experimental Results

In this sub-section, we evaluate a many-core CMP in two different cases: first, the many-core CMP without any migration policy (Baseline), and second, the many-core CMP with the proposed clockwise migration policy (Proposed). Figure 7 shows the normalized throughput for PARSEC and SPEC workloads, where throughput is the number of executed instructions per second (IPS). As shown in Figure 7, the Proposed architecture yields on average a 31% throughput improvement compared with the Baseline. Moreover, Figure 8 shows the normalized energy consumption for PARSEC and SPEC workloads. As shown in Figure 8, the Proposed architecture yields on average a 69% improvement in energy consumption compared with the Baseline. In addition, Figure 9 (a) and (b) show the temperature distribution for canneal from the PARSEC workloads for the Baseline and Proposed architectures, respectively. As Figure 9 shows, applying the proposed clockwise task migration technique (Proposed) ensures that all cores of the many-core CMP stay below the maximum temperature of 80 °C, whereas the Baseline spends up to 19% of the time above the maximum temperature, which indicates the presence of hotspots. In other words, applying the proposed clockwise task migration technique to the many-core CMP architecture distributes the temperature without creating hotspots.

Fig.7. Comparison results of IPC.

Fig.8. Comparison results of energy consumption.

Many-core CMPs provide higher system performance, more flexibility, and more scalability. Since these advantages require increased power consumption in the system, peak temperature becomes a serious concern.
Thus, Runtime Thermal Management (RTM) of many-core CMPs becomes crucial for minimizing thermal hotspots without any performance degradation. In this paper, the proposed clockwise task migration technique migrates heavily loaded tasks from the central cores to the surrounding cores. The system estimates core temperatures from the performance counters placed in each core instead of using thermal sensors. This is possible because cores with higher power consumption execute tasks at higher performance and therefore reach higher temperatures. Experimental results for the 64-tile many-core CMP show significant average improvements in normalized throughput and energy consumption. The Proposed architecture yields on average a 31% throughput improvement compared with the same system without the proposed technique. Moreover, the Proposed architecture yields on average a 69% improvement in energy consumption compared with the same system without the proposed technique. Furthermore, the results show a reduction of up to 15% in the maximum temperature, and all tiles of the 64-tile many-core CMP stay below the maximum temperature limit of 80 °C.

Fig.9. Comparison results of temperature, panels (a) and (b).
Posted by Unknown at 6:35 PM