MARTES Publications
- A. Hovsepyan, S. Van Baelen, Y. Berbers, W. Joosen,
Generic Reusable Concern Compositions (GReCCo): Description and Case Study,
Report CW 508, Department of Computer Science, K.U.Leuven, Leuven, Belgium, January 2008.
(PDF format)
Abstract
This report presents the GReCCo approach to Aspect Oriented Modeling (AOM) using Generic Reusable Concern Compositions. GReCCo offers an AOM-based framework to promote and enhance the reuse of oblivious concern models. We focus on software design patterns, which represent complete solutions to recurring concern-specific problems. We have developed a prototype generic transformation engine written in ATL that can be used to compose two concern models specified in UML. We first describe the GReCCo approach and the offered composition types. In the second part, we illustrate the GReCCo approach on a case study in the domain of Electronic Health Information and Privacy (EHIP). We start from a description of the base part of the application. On top of this application, we apply several reusable concerns using the GReCCo methodology.
- ITEA-MARTES consortium,
MARTES Posters,
ITEA symposium Berlin, 18-19 October 2007
(PDF format)
Abstract
MARTES Solving the Embedded Systems Industry Product development Challenges through Model based Thinking, MARTES Technical Solutions, MARTES Results and Impact
- B. Vanhooff, D. Ayed, S. Van Baelen, W. Joosen, Y. Berbers,
UniTI: A Unified Transformation Infrastructure,
ACM/IEEE 10th International Conference on Model Driven Engineering Languages and Systems (MODELS 2007), Nashville, TN, USA, 30 September-5 October 2007
(PDF format)
Abstract
A model transformation can be decomposed into a sequence of subtransformations, i.e. a transformation chain, each addressing a limited set of concerns. However, with current transformation technologies it is hard to (re)use and compose subtransformations without being very familiar with their implementation details. Furthermore, the difficulty of combining different transformation technologies often thwarts choosing the most appropriate technology for each subtransformation. In this paper we propose a model-based approach to reuse and compose subtransformations in a technology-independent fashion. This is accomplished by developing a unified representation of transformations and facilitating detailed transformation specifications. We have implemented our approach in a tool called UniTI, which also provides a transformation chain editor. We have evaluated our approach by comparing it to alternative approaches.
- A. Hovsepyan, S. Van Baelen, K. Yskout, Y. Berbers, W. Joosen,
Composing Application Models and Security Models: on the Value of Aspect-Oriented Technologies,
11th International Workshop on Aspect-Oriented Modeling (AOM), ACM/IEEE 10th International Conference on Model Driven Engineering Languages and Systems (MODELS 2007), Nashville, TN, USA, 30 September-5 October 2007
(PDF format)
Abstract
The increasing complexity and size of software applications require improved development techniques. The introduction of aspect-oriented software development (AOSD) and the support for model-driven development (MDD) are two important and promising evolutions in this context. In this paper, we report on our exploration to identify and evaluate the synergies between both trends. We have created a domain-specific model that supports the description of an essential part of the security concern, i.e. access control. We have prototyped a model transformer that composes the security concern with an existing object-oriented application. We have developed solutions that generate source code for two different platforms: plain Java and (AO) CaesarJ respectively. We argue that AO platforms offer significant advantages over the OO platforms, even though we compose our aspects at the design/modeling level. The knowledge acquired in this paper allows us to formulate a number of important challenges in order to successfully combine AOSD and MDD.
- Andersson, P., and Host, M.,
UML and SystemC - a Comparison and Mapping Rules for Automatic Code Generation,
Proceedings of the Forum on specification and Design Languages conference (FDL), Barcelona, Spain, September 18 - 20, 2007
(PDF format)
Abstract
Today embedded system development is a complex task. To aid engineers, new methodologies and languages are emerging. During development the system is modelled using different tools and languages. Transformations between the models are traditionally done manually. We investigate the automation of this process; specifically, we look at automatic UML to SystemC transformation. In this paper we compare UML and SystemC, focusing on communication modelling. We also present mapping rules for automatic SystemC code generation from UML. The mapping has been implemented in our UML to SystemC code generator.
- Jari Kreku, Mika Hoppari, Kari Tiensyrjä and Per Andersson,
SystemC workload model generation from UML for performance simulation,
the Forum on specification and Design Languages (FDL'07), Barcelona, Spain, September 18-20, 2007
Abstract
An extension to a workload-based performance simulation approach is presented, which enables modelling applications using a UML tool and transforming them to SystemC with a code generator. Partial reuse of existing UML application models therefore becomes feasible, removing the need for separate workload modelling in SystemC. The UML to SystemC transformation is applied to a mobile video player case study to gain experience on the strengths and weaknesses of the method.
- J. Cano, N. Martinez, R. Seepold, F. López Aguilar,
Model-driven development of embedded system on heterogeneous platforms,
Forum on Specification and Design Languages (FDL'07), Barcelona, Spain, September 18-20, 2007.
(PDF format)
Abstract
The design of large and complex systems remains a challenge, and an even bigger one when developing embedded, distributed or real-time systems. OSGi is a platform created to reduce some software design problems by increasing reusability, modularity, etc. This paper describes a methodology based on MDA that targets real-time embedded systems. The approach is based on a target platform using OSGi, thus reducing application development time and complexity.
- Tero Arpinen, Mikko Setälä, Petri Kukkala, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen,
Modeling Embedded Software Platforms with a UML Profile,
Forum on specification and Design Languages (FDL'07), Barcelona, Spain, September 18-20, 2007
Abstract
This paper presents TUT-Profile, a Unified Modeling Language (UML) extension targeted at embedded system design. TUT-Profile defines a set of stereotypes and design rules to enable the modeling of application, platform, and mapping in UML, and it is supported by an automated design flow. TUT-Profile has been successfully utilized in designing System-on-Chips and implementing them on an FPGA. This paper concentrates on the modeling of software platform components, detailed modeling of hardware platform, and modeling of memory mappings of processors. Benefits of the new modeling features are illustrated with an example design of a WLAN terminal on multiprocessor System-on-Chip. In the example, the software platform and memory mapping models are modified in order to minimize the medium access delay in WLAN transmissions.
- Andersson, P., and Host, M.,
UML to SystemC Transformation in the MARTES Project,
Proceedings of the Work in Progress Session at Euromicro SEAA/DSD, Cavtat/Dubrovnik, Croatia, August 30th - September 1st, 2006
(PDF format)
Abstract
Today there is a never-ending demand for new functionality to be included in embedded systems such as mobile phones. This leads to increased design complexity. To overcome the increased system complexity, new design methodologies, such as model driven architecture, have been introduced. New languages for system level modelling and simulation, such as SystemC, have also emerged. Combining new methodologies and new languages is a promising approach to manage the increasing system complexity. This is the focus of the MARTES (Model-Based Approach for Real-Time Embedded Systems development) project. In the project we investigate how UML and SystemC can be used together when the ideas of Model Driven Architecture are applied. One of the tasks of the project is to investigate how transformations from UML to SystemC can be automated and supported by tools. During this research a prototype tool, which manages the UML to SystemC transformations and code generation, is under development as an add-in to the Telelogic-TAU UML2 modelling tool. This part of the MARTES project is done in close cooperation between Lund University and Telelogic.
- Erno Salminen, Ari Kulmala, Timo D. Hämäläinen,
On Network-on-chip comparison,
Euromicro conf. on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 503-510.
(PDF format)
Abstract
This paper presents the state of the art in the field of network-on-chip (NoC) benchmarking and comparison. The study identifies the mainstream approaches and how NoCs are currently evaluated, and shows which aspects have been covered and which need more research effort. No single article can cover all the aspects, and therefore the possibility of comparing results from various sources must be ensured by proper scientific reporting. Basic guidelines for achieving that are given.
- Kalle Holma, Mikko Setälä, Erno Salminen, Timo D. Hämäläinen,
Evaluating the Model Accuracy in Automated Design Space Exploration,
10th Euromicro Conference on Digital System Design, Lübeck, Germany, August 27-31, 2007, pp. 173-180.
(PDF format)
Abstract
This paper introduces a new cost factor to System-on-Chip (SoC) design space exploration, a multi-level communication cost. Furthermore, the accuracy of three system abstraction models using the presented communication cost in automated design space exploration is evaluated. During the simulation, one of three different communication costs is applied for each inter-process communication event based on the mappings of the communicating tasks. The accuracy of the model for the exploration including the cost is evaluated using a Motion-JPEG (M-JPEG) application described in Unified Modeling Language (UML). According to the results, the average error in Frames Per Second (FPS) is 3.8% for the trace model, 4.3% for the modulo model, and 12.8% for the probabilistic model compared to FPGA execution. The simulation speed-up is 230 compared to cycle-accurate RTL-level execution.
- D. Ayed, D. Delanote, and Y. Berbers,
MDD approach for the development of context-aware applications,
Proceedings of Sixth International and Interdisciplinary Conference on Modeling and Using Context (Roth-Berghofer, T. and Vieu, L. and Richardson, D., eds.), Lecture Notes in Artificial Intelligence, pp. 1-14, Roskilde, Denmark, 20-24 August 2007
(PDF format)
Abstract
Context-aware systems offer entirely new opportunities for application developers and for end users by gathering context information and adapting system behavior accordingly. Several context models have been defined and various context-aware middleware has been developed in order to simplify the development of context-aware applications. Unfortunately, developing an application with these middleware products introduces several technical details into the application. These technical details are specific to a given middleware and reduce the possibility of reusing the application on other middleware. In this paper, we propose an MDD (Model Driven Development) approach that makes it possible to design context-aware applications independently of the platform. This approach is based on several phases that approach the context platform step by step and allow designers to automatically map their models to several platforms through the definition of automatic and modular transformations. To be able to apply this approach we define a new UML profile for context-aware applications, which we use to explore our approach.
- Erno Salminen, Tero Kangas, Vesa Lahtinen, Jouni Riihimäki, Kimmo Kuusilinna, Timo D. Hämäläinen,
Benchmarking Mesh and Hierarchical Bus Networks in System-on-Chip Context,
Journal of Systems Architecture, August 1, 2007, Vol.53, Issue 8, pp. 477-488, Elsevier.
(PDF format)
Abstract
The performance and area of a System-on-Chip depend on the utilized communication method. This paper presents simulation-based comparison of generic, synthesizable single bus, hierarchical bus, and 2-dimensional mesh on-chip networks. Performance of the network depends heavily on the application and therefore six test cases with multiple parameter values are used. Furthermore, two versions of each network topology are compared. The results show that hierarchical bus scales well to large number of agents and offers a good performance and area trade-off although it has smaller aggregate bandwidth and area than mesh. Hierarchical HIBI bus achieves runtimes comparable to 2-dimensional cut-through mesh with about 50% smaller network logic. However, depending on the test case, the runtime can be reduced by 20-50% when wider bus links are utilized.
- Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Review of Hardware Architectures for Advanced Encryption Standard Implementations Considering Wireless Sensor Networks,
International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS VII), Samos, Greece, July 16-19, 2007, pp. 443-453.
(PDF format)
Abstract
Wireless Sensor Networks (WSN) are seen as attractive solutions for various monitoring and controlling applications, a large part of which require cryptographic protection. Due to the strict cost and power consumption requirements, their cryptographic implementations should be compact and energy-efficient. In this paper, we survey hardware architectures proposed for Advanced Encryption Standard (AES) implementations in low-cost and low-power devices. The survey considers both dedicated hardware and specialized processor designs. According to our review, currently 8-bit dedicated hardware designs seem to be the most feasible solutions for embedded, low-power WSN nodes. Alternatively, compact special functional units can be used for extending the instruction sets of WSN node processors for efficient AES execution.
- Ari Kulmala, Erno Salminen, Timo D. Hämäläinen,
Prototyping and Evaluating Large System-on-Chips on Multi-FPGA Platform,
International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS) 2007, Samos, Greece, July 16-19, 2007, pp. 179-189.
(PDF format)
Abstract
This paper presents a base architecture that allows a simple and rapid way to evaluate and prototype large Multi-Processor System-on-Chips on multiple FPGAs with support for an arbitrary number of clock domains. It enables early hardware/software co-verification and optimization and is configurable for different applications. The architecture abstracts the underlying hardware details from the processors so that the exact location of individual components is not required for communication. The implemented example architecture contains 58 IP blocks, including 35 soft processors. As a proof of concept, an MPEG-4 video encoder is run on the example architecture prototype, which requires three FPGA boards.
- A. Hovsepyan, D. Delanote, S. Van Baelen, Y. Berbers, W. Joosen,
A Model-Based Transformation Approach for Embedded Systems Development,
S. Graf, S. Gerard, K. Larsen, J. Madsen, and M. Torngren, editors, ARTIST International Workshop on Tool Platforms for Modeling, Analysis and Validation of Embedded Systems, 19th International Conference on Computer Aided Verification (CAV) 2007, July 1-7, 2007, Berlin, Germany
(PDF format)
Abstract
We have devised an approach, called Generic Upsilon Transformations (GUT), which supports and promotes reusable model transformations. As a proof of concept we have developed an ATL-based tool which realizes the concepts proposed in our framework. Even though the main focus of GUT is reusable design patterns, our approach is generic enough to support model transformations for embedded systems. We have successfully applied our approach to an embedded case study.
- Timo Alho, Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Compact Modular Exponentiation Accelerator for Modern FPGA Devices,
Special Issue of Computers and Electrical Engineering, July 5, 2007, Issue doi:10.1016/j.compeleceng.2007.05.007, 9 pages, Elsevier.
Abstract
We present a compact FPGA implementation of a modular exponentiation accelerator suited for cryptographic applications. The implementation efficiently exploits the properties of modern FPGAs. The accelerator consumes 434 logic elements, four 9-bit DSP elements, and 13604 memory bits in Altera Stratix EP1S40. It performs modular exponentiations with up to 2250-bit integers and scales easily to larger exponentiations. Excluding pre and post processing time, 1024-bit and 2048-bit exponentiations are performed in 26.39 ms and 199.11 ms, respectively. Due to its compactness, standard interface, and support for different clock domains, the accelerator can effortlessly be integrated into a larger system in the same FPGA. The accelerator and its performance are demonstrated in practice with a fully functional prototype implementation consisting of software and hardware components.
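The abstract does not say which exponentiation algorithm the accelerator implements. Purely as a software illustration of the operation being accelerated, the following Python sketch uses the textbook left-to-right square-and-multiply method. The function name `mod_exp` is hypothetical, and nothing here reflects the actual hardware design.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by left-to-right square-and-multiply.

    Illustrative only: an assumed textbook method, not the accelerator's algorithm.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")

    result = 1 % modulus          # also covers the corner case modulus == 1
    base %= modulus

    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus       # square once per bit
        if bit == "1":
            result = (result * base) % modulus     # multiply when the bit is set
    return result
```

The loop performs one modular squaring per exponent bit plus one modular multiplication per set bit, which is why the cost grows with the operand width.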
- B. Vanhooff, S. Van Baelen, W. Joosen, Y. Berbers,
Traceability as Input for Model Transformations,
J. Oldevik, G.K. Olsen, and T. Neple, editors, ECMDA Traceability Workshop (ECMDA-TW) 2007 Proceedings, European Conference on Model Driven Architecture (ECMDA), 11-15 June 2007. SINTEF, ISBN 978-82-14-04056-2, pages 37-46, 2007
(PDF format)
Abstract
Some model transformations require more information than can be derived from their source model(s) in order to generate a meaningful target model. For example, a transformation with two source models needs to know how their respective model elements relate; these relations often exist only implicitly as part of the transformation developer's knowledge. In this paper we show that traceability models, which can be automatically generated as part of any model transformation, contain explicit inter- and intra-model relations that are valuable to subsequent transformations. We explain how to extract this information and propose a number of additions to current transformation techniques that are needed to completely open up traceability information to transformation developers.
- Tibboel, Walter, Reyes, Victor, Klompstra, Martin, Alders, Dennis,
System-Level Design Flow Based on a Functional Reference for HW and SW,
Design Automation Conference, 2007. DAC '07. 44th ACM/IEEE, 4-8 June 2007, Pages 23-28
Abstract
Heterogeneous MPSoC design where flexible programmable cores are combined with optimized HW co-processors is a quite complex and challenging task. In this paper, we present a system-level design flow that uses a single functional reference for modeling both HW and SW. The models follow an interface-centric design approach based on the TTL interface (Task Transaction Level). TTL models are applied at all three abstraction levels of the design flow: functional, architecture and implementation level. The TTL model at the functional level serves as the functional reference. HW implementations are generated from refined TTL models by behavioral synthesis tooling. Likewise, SW implementations are supported by source code transformations. Both the HW and SW implementations are verified against the functional reference. Details of the complete flow are presented in the paper through an MP3 case study.
- Antti Rasmus,
Integration of Hardware Accelerators into a System-on-Chip Video Encoder,
M.Sc. Thesis, Tampere University of Technology, 2007, 74 pages.
(PDF format)
Abstract
This Thesis studies how hardware accelerators are integrated into a system on chip. It covers the approaches and phases of integration. In addition, the Thesis studies the impact on performance resulting from accelerator integration. Two hardware accelerators are integrated into multiprocessor system-on-chip video encoder as a case study. The one hardware accelerator performs motion estimation (ME) and the other combination of discrete cosine transform, quantization, inverse quantization, and inverse discrete cosine transform (DQ). Integrating the hardware accelerators into the system is complex operation because the functionality and interfaces of the hardware accelerators differ from the rest of the system on both system and signal level. In addition, one must design how application software sees the accelerators through drivers. The work analyzes different integration strategies, one of which is implemented by manually creating wrapper components for the accelerators. In addition, the integration procedure and the wrapper block implementations are introduced. Moreover, this Thesis presents a solution for measuring the quality of integration and for managing the shared resources on multiprocessor system on chips. The benefit of hardware acceleration to system performance is measured and analyzed in the video encoder. Furthermore, measurements cover the performance of hardware accelerators before and after the integration. The hardware-accelerated system encodes 18 frames per second, which is 40% more than in the system with only processors. It was discovered that the accelerators perform worse as a part of the full video encoder than alone in simulation environment. In addition, the hardware accelerator that performed better in the simulation environment performs worse in the video encoder. The reason is in integration overheads, which reduce the frame rate from theoretical 21 frames per second. The integration overheads were analyzed and divided into factors. 
The proportions of integration overhead were 96% and 55% of the total execution time of the accelerated ME and DQ functions, respectively. Shared resource contention was the largest factor and it consumed 56% of ME and 33% of DQ execution time. The two other factors are low-level driver software delays and data delivery expenses. The integration overhead factor proportions of execution time depend on the accelerator and the system. They cannot be fully generalized. However, identifying and analyzing them is the most important achievement in this Thesis. These integration overhead factors can be used to analyze and evaluate any hardware-accelerated system.
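The frame-rate figures in the abstract imply a processor-only baseline that is not stated explicitly. The short Python calculation below recovers it, assuming that "40% more" means the accelerated rate is 1.4 times the processor-only rate. It also gives the gap to the theoretical 21 frames per second. Variable names are illustrative.

```python
accelerated_fps = 18.0   # hardware-accelerated system (from the abstract)
theoretical_fps = 21.0   # theoretical rate before integration overheads (from the abstract)
speedup = 1.40           # assumption: "40% more" means 1.4x the processor-only rate

# Processor-only frame rate implied by the stated improvement.
baseline_fps = accelerated_fps / speedup

# Fraction of the theoretical frame rate lost to integration overheads.
shortfall = 1.0 - accelerated_fps / theoretical_fps

print(f"implied processor-only rate: {baseline_fps:.1f} fps")   # 12.9 fps
print(f"shortfall from theoretical rate: {shortfall:.1%}")      # 14.3%
```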
- Tero Arpinen,
Configurable SoC Platform for UML Designed Applications,
M.Sc. Thesis, 2007, 56 pages, Tampere University of Technology.
Abstract
The increasing complexity of modern digital embedded systems requires advanced design methodologies and tools to implement new products within the demanded time frame and at minimum cost. To be efficient in designing such systems, substantial reuse of hardware and software components is a must. In addition, the design abstraction level must be higher to manage the overall system complexity. Other key enablers of efficient design are fast analysis, design-space exploration, and verification. Based on these assumptions, a fully automated design flow for System-on-Chips (SoC), called Koski, has been developed at Tampere University of Technology. Koski is the first Unified Modeling Language (UML) 2.0 based system design flow that allows real multiprocessor SoC (MP-SoC) implementations from UML models. Typical design flows utilize UML only for specification purposes, but Koski uses the same models for automatic software and hardware configuration. The flow consists of several design automation tools that are exploited seamlessly to automate the design phases. To enable fast system prototyping in Koski, it is crucial that the path from UML system model to the real prototype is kept automated. Prototypes based on reprogrammable logic devices, such as Field Programmable Gate Arrays (FPGAs), are well suited for enabling this due to their flexibility. This thesis focuses on completing the path from UML system model to prototyping on MP-SoC by proposing an FPGA-based multiprocessor platform that supports distributed execution of UML designed applications. The thesis also evaluates the applicability of the modeling methodology used in Koski for automatic platform configuration. In practice, this is carried out by introducing a new Koski subtool for automatic architecture configuration. The tool configures the architecture based on the UML system model using library components and governs the synthesis process for an FPGA.
- Yaprakov, D.,
MDD Transformations of OCL Expressions to Source Code,
M.Sc. Thesis, K.U.Leuven Department of Computer Science, 2007
(PDF format)
Abstract
This thesis studies one particular aspect of model transformations: the transformation of OCL expressions into working code. Until now, OCL expressions have mainly been used for model validation and documentation; this is referred to as static OCL. In this thesis we go further by imposing constraints on instantiations of UML models. OCL expressions of this kind are called run-time OCL. There are two basic kinds of OCL expressions or constraints: class invariants, and pre- and postconditions. The goal of this thesis is to transform exactly these two basic kinds of OCL expressions through dedicated code generation, with Java as the target language. The transformation (code generation) follows two approaches. The first is the naive approach, in which both basic kinds of constraints are checked at every stable moment of a system. A second, more advanced approach is also studied, in which only the relevant OCL expressions are checked, so that the overhead can drop drastically and the efficiency of the system increases. We work out these transformations with the HAT tool of the software engineering company E2S from Ghent. In addition, we evaluate a number of existing tools that support run-time OCL.
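As a rough illustration of what run-time checking of class invariants and pre-/postconditions can look like, here is a minimal hand-written Python sketch of the naive strategy, in which the full invariant is re-checked after every call. The thesis generates Java with the HAT tool; Python is used here only for brevity. `ContractViolation`, `checked`, `Account`, and the OCL constraints in the comments are hypothetical and not taken from the thesis.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition, postcondition or class invariant fails."""


def checked(pre=None, post=None):
    """Wrap a method with run-time checks (naive strategy: always check everything)."""
    def decorate(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            if pre is not None and not pre(self, *args, **kwargs):
                raise ContractViolation(f"precondition of {method.__name__} violated")
            result = method(self, *args, **kwargs)
            if post is not None and not post(self, result):
                raise ContractViolation(f"postcondition of {method.__name__} violated")
            # Naive approach: re-check the whole class invariant after every call.
            if not self.invariant():
                raise ContractViolation("class invariant violated")
            return result
        return wrapper
    return decorate


class Account:
    # Hypothetical OCL:  context Account inv: balance >= 0
    def __init__(self, balance=0):
        self.balance = balance
        if not self.invariant():
            raise ContractViolation("class invariant violated")

    def invariant(self):
        return self.balance >= 0

    # Hypothetical OCL:  context Account::withdraw(amount) pre: amount > 0
    @checked(pre=lambda self, amount: amount > 0)
    def withdraw(self, amount):
        self.balance -= amount
        return self.balance
```

A more selective strategy, in the spirit of the second approach described in the abstract, would check only the constraints that depend on the attributes a method can modify.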
- Hoppari M.,
Transforming a service-oriented application model to a workload model,
M.Sc. Thesis, University of Oulu, Department of Electrical and Information Engineering, 2007, 65 p.
Abstract
The thesis describes a method for transforming a service-oriented application model to a workload model. The focus is on real-time embedded systems interacting directly with end-users, e.g. terminals. In this work the application modelling is done with the LYRA method, which is a service-oriented and application-specific modelling method. The work describes how a workload model is obtained from a service-oriented application model. The idea of the workload model is to present the load that an application causes on a platform when it is executed. The benefit of using workload models is increased simulation speed as the functionality is not simulated and they can be easily modified to quickly evaluate various versions of use cases. The workload has a layered hierarchical structure consisting of four layers: main, application, process, and function layer. The higher layers are built on top of the lower layers. The workload model and the execution platform model are combined by defining the interface between them in the UML model. This requires that the platform skeleton model is included in the UML model. Before the integration of models can be done and a simulation model compiled the workload model must be transformed into SystemC. This is done automatically with the SystemC code generator. The transformation method developed in this thesis was validated by using it in the Virtual Network Computing (VNC) case study. The application model was modelled from the requirement specification, the model was transformed to the workload model by using the transformation method, the workload was transformed into SystemC, and the combination of workload and platform was compiled and simulated. Based on the case study the transformation method was discovered to be valid.
- Yang Qu, Kari Tiensyrjä, Juha-Pekka Soininen and Jari Nurmi,
System-Level Design for Partially Reconfigurable Hardware,
IEEE International Symposium on Circuits and Systems (ISCAS'07), 27-30 May 2007, pp. 2738-2741, 2007
Abstract
In this paper, we present a SystemC-based approach for system-level design of partially reconfigurable hardware. The main focuses are resource estimation to support system analysis, reconfiguration modeling for fast performance simulation, automatic generation of reconfigurable components and a static prefetch scheduler. The approach was applied in a real design case of a part of a WCDMA decoding algorithm on a commercial reconfigurable platform.
- Yang Qu, Juha-Pekka Soininen and Jari Nurmi,
A Genetic Algorithm for Scheduling Tasks Onto Dynamically Reconfigurable Hardware,
IEEE International Symposium on Circuits and Systems (ISCAS'07), 27-30 May 2007, pp. 161-164, 2007
Abstract
In this paper, a genetic algorithm (GA) for scheduling tasks onto dynamically reconfigurable devices is presented. The scheduling problem is NP-hard and more complicated than multiprocessor scheduling, because both the task allocation and the configurations need to be carefully managed. The approach has been validated with a number of random task graphs. The results show that the GA approach has good convergence and is on average 8.6% better than a list-based scheduler for large task graphs of various sizes.
- S. Van Baelen,
A Constraint-Centric Approach for Object-Oriented Conceptual Modelling,
Ph.D. Dissertation, K.U.Leuven, Department of Computer Science, Leuven, Belgium, ISBN 978-90-5682-820-2, 275 pages, 2007
(PDF format)
Abstract
Object-oriented analysis, and more specifically conceptual modelling, is a software engineering activity that aims at studying, analysing, and capturing the knowledge about the universe of discourse for a system to be developed. This should result in the specification of a consistent and unambiguous model that describes all domain knowledge, facts, and rules, in which every element from the universe of discourse has a transparent one-to-one correspondence to an entity in the conceptual model. We propose in this dissertation a constraint-centric approach towards object-oriented conceptual modelling. This is achieved by the usage of high-level constraint specifications as the core model structure for conceptual modelling. In particular, our approach enriches the conceptual model structure on two levels: by the definition of new structural concepts to express model constraints implicitly in the model structure, and by the introduction of constraints with supporting resolution mechanisms as a first-class model concept. Concerning the definition of structural concepts, we developed new concepts with a dedicated applicability context attached in order to specify constraints implicitly in the model structure. The incorporation of model constraints in each methodological concept, the usage of existential dependency as the key modelling criterion, the introduction of explicit class archives, and the formal specification of model events and queries enrich the expressive power of a conceptual model structure. Concerning the introduction of constraints as a first-class model concept, we developed a mechanism to specify model constraints using many-sorted first order logic. The constraint trigger concept attached to a constraint defines a generic constraint solver that can resolve constraint violations by injecting additional behaviour into an event or by firing an event due to progress of time. 
Our approach has converged into the EROOS methodology of which two versions are proposed. A core version, the EROOS kernel, uses a constructional modelling approach in which information can only be added to a conceptual model instance. An extended version, the EROOS universe, provides additional support for recurrent EROOS kernel analysis patterns through advanced and more practical concepts using the core version as the underlying base.
- Cristian Grecu, André Ivanov, Axel Jantsch, Partha Pratim Pande, Erno Salminen, Umit Ogras, Radu Marculescu,
Towards Open Network-on-Chip Benchmarks,
First International Symposium on Networks-on-Chip (NOCS'07), Princeton, New Jersey, USA, May 7-9, 2007, pp. 205-205, IEEE.
(PDF format)
Abstract
Measuring and comparing performance, cost, and other features of advanced communication architectures for complex multi-core/multiprocessor systems on chip is a significant challenge which has hardly been addressed so far. This document outlines the top-level view on a system of benchmarks for Networks on Chip (NoC), which intends to cover a wide spectrum of NoC design aspects, from application modeling to performance evaluation and post-manufacturing test and reliability. For performance benchmarking, requirements and features are described for application programs, synthetic micro-benchmarks, and abstract benchmark applications. Then, it proposes ways to measure and benchmark reliability, fault tolerance and testability of the on-chip communication fabric. This paper introduces the main concepts and ideas for benchmarking NoCs in a systematic and comparable way. It will be followed up by a report that will define a benchmark framework and the syntax of interfaces for benchmark programs that will allow the community to build up a benchmark suite.
- D. Ayed, D. Delanote, and Y. Berbers,
MDD approach and evaluation of the development of context-aware applications,
Report CW 495, K.U.Leuven, Department of Computer Science, Leuven, Belgium, May, 2007
(PDF format)
Abstract
Context-aware systems offer entirely new opportunities for application developers and for end users by gathering context information and adapting system behavior accordingly. Several context models have been defined and various context-aware middleware has been developed in order to simplify the development of context-aware applications. Unfortunately, developing an application with these middleware products introduces several technical details into the application. These technical details are specific to a given middleware and reduce the possibility of reusing the application on other middleware. In this paper, we propose an MDD (Model Driven Development) approach that makes it possible to design context-aware applications independently of the platform. This approach consists of several phases that move step by step towards the context platform and allow designers to automatically map their models to several platforms through the definition of automatic and modular transformations. To be able to apply this approach, we define a new UML profile for context-aware applications that we use to experiment with our approach.
- Yang Qu, Juha-Pekka Soininen and Jari Nurmi,
Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable Devices,
Proceedings of the Design Automation and Test in Europe 2007 Conference (DATE'07), 16-20 April 2007, pp. 147-152
Abstract
In this paper, an approach that uses dynamic voltage scaling (DVS) to reduce the configuration energy of runtime reconfigurable devices is proposed. The basic idea is to use configuration prefetching and parallelism to create excessive system idle time and apply DVS on the configuration process when such idle time can be utilized. A genetic algorithm is developed to solve the task scheduling and voltage assignment problem. With real applications, the results show that up to 19.3% of configuration energy can be reduced. When considering the reduction of the configuration energy, the results show that using more computation resources is more favourable when the configuration latency is relatively small, and using more configuration controllers is more favourable for relatively large latency.
- Timo Alho, Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Compact Hardware Design of Whirlpool Hashing Core,
Design, Automation and Test in Europe (DATE 2007), Nice, France, April 16-20, 2007, 6 pages
(PDF format)
Abstract
Weaknesses have recently been found in the widely used cryptographic hash functions SHA-1 and MD5. Therefore, a need for new hash algorithms has arisen. One potential alternative to the traditional choices is the Whirlpool hash algorithm, which has been standardized by ISO/IEC and evaluated in the European research project NESSIE. In this paper we present a Whirlpool hashing hardware core suited for devices in which low cost is desired. The core consists of a novel 8-bit architecture that allows compact realizations of the algorithm. In the Xilinx Virtex-II Pro XC2VP40 FPGA, our implementation consumes 376 slices and achieves a throughput of 81.5 Mbit/s. Compared to previous Whirlpool implementations, we achieve considerably lower resource consumption while still maintaining a reasonable throughput level.
- Ari Kulmala, Erno Salminen, Timo D. Hämäläinen,
Instruction Memory Architecture Evaluation on Multiprocessor FPGA MPEG-4 Encoder,
IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS) 2007, Krakow, Poland, April 11-13, 2007, pp. 105-110.
(PDF format)
Abstract
This paper shows how a bus topology performs as a System-on-Chip (SoC) interconnection. We measure and analyze the Heterogeneous IP Block Interconnection (HIBI) bus for a multiple clock domain Multiprocessor System-on-Chip (MPSoC) with an MPEG-4 video encoding application on FPGA. The studied MPSoC contains up to 22 IP blocks: 11 soft processors, 8 hardware accelerators, and three other components. A novel approach of frequency scaling is used to isolate the impact of various architecture components. The system is benchmarked in various configurations. For example, HIBI is run at 100x speed with respect to the processors to resemble an ideal interconnection. Based on the measurements with up to 16.9 frames/second CIF (352x288) encoding speed, an estimate for an HDTV resolution video encoder is presented. The required optimizations are discussed. Finally, it is shown that a 25 frames/second 1280x720 video encoder needs a 55 MHz HIBI compared to 670 MHz general-purpose soft RISC processors. In practice, the processing performance has to be boosted by implementing hardware acceleration and improving the memory hierarchy. Clearly, HIBI is not the limiting factor.
- Antti Rasmus, Ari Kulmala, Erno Salminen, Timo D. Hämäläinen,
IP Integration Overhead Analysis in System-on-Chip Video Encoder,
IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS) 2007, Krakow, Poland, April 11-13, 2007, pp. 333-336.
(PDF format)
Abstract
Current system-on-chip implementations integrate IP blocks from different vendors. Typical problems are incompatibility and extra overheads. This paper presents the integration of two black-box hardware accelerators into a highly scalable and modular multiprocessor system-on-chip architecture. The integration was implemented by creating two wrapper components that adapt the interfaces of the hardware accelerators to the architecture and on-chip network used. The benefit of the accelerators was measured in three different configurations, and especially the overheads caused by the software, data delivery, and shared resource contention were extracted and analyzed. The operation execution time using the accelerator with overheads included is up to twenty-fold compared to the ideal accelerator computation time. In addition, the accelerator that seemed to be more efficient performed worse in practice. In conclusion, it is pointed out that the integration might add great overhead to the execution time, rendering a-few-clock-cycle optimizations meaningless.
- Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Security Enhancement Layer for Bluetooth,
Wireless Security and Cryptography: Specifications and Implementations, N. Sklavos and X. Zhang, Eds., March 1, 2007, pp. 249-274, CRC Press, Taylor and Francis Group.
Abstract
Bluetooth has become the default technology for low-cost and low-power wireless personal area communications. However, a number of vulnerabilities have been identified in its security design. This chapter presents a novel Enhanced Security Layer (ESL) for Bluetooth. In ESL the security level is increased by replacing the Bluetooth encryption algorithm with Advanced Encryption Standard (AES) and adding cryptographic integrity protection to transmissions. Moreover, ESL supports a public-key and a secret-key authentication protocol for exchanging ESL keys as well as standard Bluetooth PIN codes. When the original Bluetooth security design is used, ESL improves it by allowing only safer parameter combinations. As ESL is placed on the top of the standard controller interface, it can be integrated into any standard Bluetooth implementation. A full-scale embedded prototype implementation of ESL is also presented. AES and its operation modes are implemented in hardware for high performance. Excluding the public-key authentication, the ESL features use significantly fewer resources and imply higher performance than a standard Bluetooth security implementation. The easy-to-use programming interface supports straightforward application development.
- Yang Qu, Juha-Pekka Soininen and Jari Nurmi,
Static scheduling techniques for dependent tasks on dynamically reconfigurable devices,
Journal of Systems Architecture. (2007), doi:10.1016/j.sysarc.2007.02.004
Abstract
Dynamically reconfigurable hardware not only has high silicon reusability, but it can also deliver high performance for computation-intensive tasks. Advanced features such as run-time reconfiguration allow multiple tasks to be mapped onto the same device either simultaneously or multiplexed in the time domain. These tasks need to be scheduled optimally or near optimally in order to efficiently utilize the device. It is an NP-hard problem, because task scheduling, allocation and configuration prefetching all need to be considered. In this paper, we target dependent task models and propose three static schedulers that use different problem solving strategies. The first is a heuristic approach developed from traditional list-based schedulers. It presents high efficiency but the least accuracy. The second is based on a full-domain search using constraint programming. It can guarantee to produce optimal solutions but requires significant searching effort. The last is a guided random search technique based on a genetic algorithm, which shows reasonable efficiency and much better accuracy than the heuristic approach.
- Petri Kukkala, Tero Arpinen, Mikko Setälä, Marko Hännikäinen, Timo D. Hämäläinen,
Dynamic Power Management with Runtime Process Remapping for UML 2.0 Applications,
IST/SPIE 19th Annual Symposium on Electronic Imaging, San Jose, USA, February 26, 2007, Vol.6507.
(PDF format)
Abstract
This paper presents dynamic power management for distributed UML applications on a multiprocessor System-on-Chip (SoC). It exploits observation of processor utilization and runtime process remapping. The presented method has been evaluated with a WLAN medium access control protocol as a test case application on a multiprocessor SoC platform. The platform is implemented on an Altera Stratix II FPGA and contains up to five Nios II processors. Measurements on FPGA showed 5 to 21% savings in the power consumption of the whole FPGA board.
- Cristian Grecu, André Ivanov, Partha Pratim Pande, Axel Jantsch, Erno Salminen, Umit Ogras, Radu Marculescu,
An Initiative towards Open Network-on-Chip,
white paper, February 20, 2007, 16 pages, OCP-IP.
(PDF format)
Abstract
Measuring and comparing performance, cost, and other features of advanced communication architectures for complex multi-core/multiprocessor systems on chip is a significant challenge which has hardly been addressed so far. This document outlines the top-level view on a system of benchmarks for Networks on Chip (NoC), which intends to cover a wide spectrum of NoC design aspects, from application modeling to performance evaluation and post-manufacturing test and reliability. For performance benchmarking it describes requirements and features for application programs, synthetic micro-benchmarks, and abstract benchmark applications. Then, it proposes ways to measure and benchmark reliability, fault tolerance and testability of the on-chip communication fabric. This paper introduces the main concepts and ideas for benchmarking NoCs in a systematic and comparable way. It will be followed up by a report that will define a benchmark framework and the syntax of interfaces for benchmark programs that will allow the community to build up a benchmark suite.
- Heikki Orsila, Tero Kangas, Erno Salminen, Timo D. Hämäläinen, Marko Hännikäinen,
Automated Memory-Aware Application Distribution for Multi-Processor System-On-Chips,
Journal of Systems Architecture, February 14, 2007, Elsevier.
(PDF format)
Abstract
Mapping of applications on a Multiprocessor System-on-Chip (MP-SoC) is a crucial step to optimize performance, energy and memory constraints at the same time. The problem is formulated as finding solutions to a cost function of the algorithm performing mapping and scheduling under strict constraints. Our solution is based on simultaneous optimization of execution time and memory consumption, whereas traditional methods concentrate only on execution time. Applications are modeled as static acyclic task graphs that are mapped on MP-SoC with customized simulated annealing. The automated mapping in this paper is intended especially for MP-SoC architecture exploration, which typically requires a large number of trials without human interaction. For this reason, a new parameter selection scheme for simulated annealing is proposed that sets task mapping specific optimization parameters automatically. The scheme bounds optimization iterations to a reasonable limit and defines an annealing schedule that scales up with application and architecture complexity. The presented parameter selection scheme compared to extensive optimization achieves 90% goodness in results with only 5% optimization time, which helps large-scale architecture exploration where optimization time is important. The optimization procedure is analyzed with simulated annealing, group migration and random mapping algorithms using test graphs from the Standard Task Graph Set. Simulated annealing is found to be better than the other algorithms in terms of both optimization time and the result. The simultaneous time and memory optimization method with simulated annealing is shown to speed up execution by 63% without memory buffer size increase. As a comparison, optimizing only execution time yields 112% speedup, but also increases memory buffers by 49%.
- Petri Kukkala, Mikko Setälä, Tero Arpinen, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen,
Implementing a WLAN Video Terminal Using UML and Fully-Automated Design Flow,
EURASIP Journal on Embedded Systems, January 10, 2007, Issue Embedded Digital Signal Processing Systems, edited by Jarmo Henrik Takala, Shuvra Bhattacharyya, and Gang Qu, 15 pages.
(PDF format)
Abstract
This case study presents the UML-based design and implementation of a wireless video terminal on a multiprocessor System-on-Chip (SoC). The terminal comprises a video encoder and WLAN communications subsystems. In this paper, we present the UML models used in designing the functionality of the subsystems and the architecture of the terminal hardware. Further, we use the Koski design flow and its tools for fully automated implementation of the terminal on FPGA. Measurements were performed to evaluate the performance of the FPGA implementation. Currently, the fully software encoder achieves a frame rate of 3.0 fps with three 50 MHz processors, which is one half of a reference C implementation. Thus, using UML and design automation reduces performance, but we argue that this is highly acceptable as we gain a significant improvement in design efficiency. The experiments with the UML-based design flow proved its suitability and competence in designing complex embedded multimedia terminals.
- Jari Kreku, Yang Qu, Juha-Pekka Soininen and Kari Tiensyrjä,
Layered UML Workload and SystemC Platform Models for Performance Simulation,
VTT's Scientific Research Review 2006. pp. 29-31, VTT 2007
Abstract
Designers of future mobile devices need abstract application and platform models for feasibility and performance evaluation and for defining platform computation and communication capacities. We propose a layered modelling approach that allows application and platform to be modelled at several levels of abstraction to enable early performance evaluation. The approach has been tested by applying it to an MPEG-4 encoder case study.
- Yang Qu, Juha-Pekka Soininen and Kari Tiensyrjä,
Minimizing the Configuration Overhead of Run-Time Reconfigurable Logic by Loading Tasks in Parallel,
VTT's Scientific Research Review 2006. pp. 32-34, VTT 2007
Abstract
Multitasking on reconfigurable logic can achieve very high silicon reusability. However, configuration latency is a major limitation and it can largely degrade the system performance. One reason is that tasks can run in parallel but configurations of the tasks can be done only in sequence. This work presents a novel configuration model to enable configuration parallelism. It consists of multiple homogeneous tiles and each tile has its own configuration SRAM that can be individually accessed. Thus multiple configuration controllers can load tasks in parallel. The experimental results reveal that on average using multiple controllers can reduce the configuration overheads by 21%. Compared to the best cases of using multiple tiles with a single controller, an additional 40% speedup can be achieved using multiple controllers.
- Yang Qu, Juha-Pekka Soininen and Jari Nurmi,
Using Constraint Programming to Achieve Optimal Prefetch Scheduling for Dependent Tasks on Run-Time Reconfigurable Devices,
Proceedings of the IEEE International Symposium on System-on-Chip 2006 (SOC 2006), 14-16 November 2006, Tampere, Finland, pp. 83-86
Abstract
Dynamically reconfigurable hardware not only has high silicon reusability, but it can also deliver high performance for computation-intensive tasks. However, the reconfiguration process usually has long configuration latency, which contributes only negatively to the system performance. Prefetching is a very effective technique to hide such latency, but there is no scheduler that is capable of optimally scheduling tasks while considering prefetching. In this work, we use constraint programming, an approach with a strong theoretical foundation, to perform offline scheduling for dependent tasks. Our approach can find an optimal schedule that has minimal schedule length. Experiments on randomly generated task graphs have been carried out. In 2/5 of the cases, the optimal solutions can be found within 1 second.
- Erno Salminen, Tero Kangas, Timo D. Hämäläinen,
The impact of communication on the scalability of the data-parallel video encoder on MPSoC,
International Symposium on System-on-Chip, Tampere, Finland, November 14-16, 2006, pp. 191-194
(PDF format)
Abstract
This paper presents the design space exploration of a data-parallel video encoder running on a Multiprocessor System-on-Chip (MPSoC). The impact of communication on the scalability is analyzed. The exploration is carried out with abstract models for application and architecture by using Transaction Generator (TG), which enables rapid modeling and simulation of complex applications regarding the dependencies between application tasks. The application model is based on profiled execution times and data sizes from real application code. Results show that the chosen parallelization method keeps communication between processing elements at a low level. It allows good scalability in terms of communication bandwidth, image size, and number of processing elements, but unequal load restricts the scalability in some cases. Moreover, the impact of overlapping the communication and computation is analyzed.
- Heikki Orsila, Tero Kangas, Erno Salminen, Timo D. Hämäläinen,
Parameterizing Simulated Annealing for Distributing Task Graphs on Multiprocessor SoCs,
International Symposium on System-on-Chip, Tampere, Finland, November 14-16, 2006, pp. 73-76.
(PDF format)
Abstract
Mapping an application on Multiprocessor System-on-Chip (MPSoC) is a crucial step in architecture exploration. The problem is to minimize optimization effort and application execution time. Simulated annealing is a versatile algorithm for hard optimization problems, such as task distribution on MPSoCs. We propose a new method of automatically selecting parameters for a modified simulated annealing algorithm to save optimization effort. The method determines a proper annealing schedule and transition probabilities for simulated annealing, which makes the algorithm scalable with respect to application and platform size. Applications are modeled as static acyclic task graphs which are mapped to an MPSoC. The parameter selection method is validated by extensive simulations with 50 and 300 node graphs from the Standard Graph Set.
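As a rough illustration of the kind of technique the abstract above describes, the following Python sketch maps independent tasks onto processing elements with plain simulated annealing. It is not the authors' algorithm: the toy cost function (load of the busiest processing element, ignoring task dependencies and communication), the geometric cooling schedule, the rule that ties the number of moves per temperature to problem size, and all names and constants are assumptions made for this example.

```python
import math
import random


def schedule_length(mapping, task_cost, num_pes):
    """Toy cost: load of the busiest processing element (no dependencies)."""
    load = [0.0] * num_pes
    for task, pe in enumerate(mapping):
        load[pe] += task_cost[task]
    return max(load)


def anneal_mapping(task_cost, num_pes, t0=1.0, t_end=1e-3, alpha=0.95, seed=0):
    """Map tasks to PEs with simulated annealing; returns (mapping, cost)."""
    rng = random.Random(seed)
    n = len(task_cost)
    # Hypothetical rule: moves per temperature level grow with problem size.
    moves_per_temp = n * (num_pes - 1)
    mapping = [0] * n  # start with every task on PE 0
    cost = schedule_length(mapping, task_cost, num_pes)
    scale = cost or 1.0  # normalise cost changes so t0 = 1.0 is meaningful
    best, best_cost = list(mapping), cost
    t = t0
    while t > t_end:
        for _ in range(moves_per_temp):
            task = rng.randrange(n)
            old_pe = mapping[task]
            new_pe = rng.randrange(num_pes)
            if new_pe == old_pe:
                continue
            mapping[task] = new_pe
            new_cost = schedule_length(mapping, task_cost, num_pes)
            delta = (new_cost - cost) / scale
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = list(mapping), cost
            else:
                mapping[task] = old_pe  # undo the rejected move
        t *= alpha  # geometric cooling (an assumption)
    return best, best_cost


if __name__ == "__main__":
    costs = [4, 3, 3, 2, 2, 2]
    mapping, length = anneal_mapping(costs, num_pes=2)
    print(mapping, length)
```

The fixed seed only makes runs repeatable; a real exploration flow would evaluate mappings against a task graph with dependencies and communication costs rather than this simplified load measure.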
- ITEA-MARTES consortium,
MARTES Posters,
ITEA symposium Paris, 5-6 October 2006
(PDF format)
Abstract
MARTES Vision: Improving the productivity and managing the complexity of embedded systems development through Elevated level of abstraction, Automation through model transformations and Reuse of models
- Timo Alho,
Cryptographic Hardware Implementations for Embedded Devices,
M.Sc. Thesis, 2006, 63 pages, Tampere University of Technology.
Abstract
Implementing security processing for embedded devices is a challenging task due to their non-functional requirements, such as cost and power consumption. Compared to software implementations on general-purpose processors, cryptographic algorithms can be executed significantly more efficiently with tailored hardware designs. Cryptographic primitives can be classified into symmetric-key, public-key, and unkeyed primitives. This thesis presents a hardware design for each of these classes. The first design is an Advanced Encryption Standard (AES) encryption core. AES has become the default choice for an encryption algorithm in many standard networking technologies, protocols, and applications. To achieve low area and low power consumption, the highly parallel AES algorithm is carefully serialized for an 8-bit data path. When synthesized to a 0.13 µm Complementary Metal Oxide Semiconductor (CMOS) technology, the implemented encryption core consumes silicon area equivalent to 3100 NAND gates and achieves a throughput of 121 Mbit/s at the 152 MHz maximum clock frequency. The results show that the core is well suited for the cost and energy sensitive target applications. The second design is a modular exponentiation accelerator. The computation of exponentiation over a modulus is widely used in many public-key cryptosystems. The operation is computationally very expensive when the operands are large. The designed hardware exploits the functionalities of the target Field Programmable Gate Array (FPGA) device, Altera Stratix, to achieve low resource utilization and high performance. Compared to the previously reported FPGA implementations, the implementation of this work consumes considerably fewer resources while achieving comparable performance. Due to its compactness and support for different clock domains, the accelerator can effortlessly be integrated into a larger system in the same FPGA.
The performance of the accelerator is demonstrated in practice with a System-on-Chip (SoC) prototype implementation. The third design is a Whirlpool hashing core. The Whirlpool hash algorithm is a potential alternative to the widely used hash algorithms SHA-1 and MD5 that have been found to contain weaknesses. The core consists of a novel 8-bit architecture that allows a compact realization of the algorithm. In the Xilinx Virtex-II Pro FPGA, the implementation consumes 376 slices and achieves a throughput of 81.5 Mbit/s at the 214 MHz maximum clock frequency. The resource consumption of the implementation is one fourth of the smallest Whirlpool implementation presented to date.
- Tero Kangas,
Methods and Implementations for Automated System on Chip Architecture Exploration,
Ph.D. Thesis, Tampere University of Technology, Issue 616, 29 September 2006
(PDF format)
Abstract
Contemporary design methods are not able to meet the requirements of the increasing complexity of digital embedded systems. In particular, system-level performance analysis and design space exploration must be performed much earlier than during the cycle-accurate simulation or prototyping phase. In this way, the error-prone and time-consuming path from system modeling to implementation can be shortened, as the crucial architectural decisions can be reasoned about and verified earlier in the design flow. The challenge with early design methods is to define an abstraction for the system model that accelerates the design space exploration, facilitates the design flow automation, and still preserves adequate accuracy of performance estimations. Overly detailed system models in many contemporary proposals lead to extensive system-level analysis and exploration times. This thesis describes a new design flow, named Koski, for multiprocessor Systems-on-Chips (SoCs) covering the design phases from system-level modeling to FPGA prototyping. The emphasis is on the automation of early architecture exploration, which optimizes the component selection and configuration, application mapping, and scheduling. To automate the exploration, abstract system models, their transformations, and intermediate formats and interfaces of the related exploration tools are examined. The architecture exploration flow is an integral part of the Koski framework, which provides the modeling, simulation, verification, synthesis, and prototyping environment for the SoC design. The system is modeled in Unified Modeling Language (UML) following a generic UML profile that specifies the practices for orthogonal application and architecture modeling. The distinctive property of Koski is that the system models for architecture exploration are abstracted automatically from the UML models.
In addition, the UML models are automatically updated according to the results of the architecture exploration, including optimized architecture configuration, application mapping, and timing estimates. There is no manual work for architecture exploration, which is often the case in current exploration methods. The capabilities of Koski are shown in a case study, which integrates state-of-the-art technology approaches including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing RTOS in an SoC. The details of the architecture exploration are presented with two applications. First, a central part of a WLAN terminal, a medium access control protocol, is modeled and optimized for parallel architecture. Second, the exploration of a data parallel video encoder is carried out to analyze the suitability of the exploration framework for a dataflow type of application. Although the Koski methodology is illustrated with a specific tool set and applications, the results are applicable in the general context as well. The tools are the realization of the methodology and can also be implemented in other ways than presented. The system model abstraction, design flow automation, and back-annotation are not specific to any language or tool.
- J. Kreku, Y. Qu, J.-P. Soininen, K. Tiensyrjä,
Layered UML Workload and SystemC Platform Models for Performance Simulation,
Proceedings of Forum on Specification and Design Languages (FDL'06), Darmstadt, Germany, September 19-22 2006, ECSI, pp.223-228
(PDF format)
Abstract
Future mobile devices will be based on heterogeneous multiprocessing platforms accommodating several currently stand-alone applications. Increasing complexity of both application and platform development requires coordinated separation of concerns so that interoperability can be preserved. An application designer needs abstract platform models to check rapidly whether a new feature or application is feasible on a platform and how it will impact the performance of other coexisting applications. A platform designer needs abstract application models for defining platform computation and communication capacities. We propose a layered UML workload and SystemC platform modelling approach that allows application and platform to be modelled at several levels of abstraction to enable early performance evaluation of the resulting system. Platform services are presented to workload models through APIs that allow Y-chart-like specify-explore-refine performance modelling and simulation. The approach has been tested by applying it to MPEG-4 encoder, Quake2 3D game and MP3 decoder case studies that validate the approach.
- Petri Kukkala, Marko Hännikäinen, Timo D. Hämäläinen,
Configurable Protocol Engine for Runtime-Configurable Communication Subsystems on Multiprocessor SoC,
17th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC 2006), Helsinki, Finland, September 11-14, 2006, 5 pages.
(PDF format)
Abstract
This paper presents a Configurable Protocol Engine (CPE) to implement runtime-configurable communication subsystems, which are able to adapt their protocol stacks to varying service requirements. The communication subsystems with CPE are designed and implemented using a UML-based design methodology and automated design flow. CPE has been applied to implementing wireless protocol stacks on multiprocessor System-on-Chip (SoC) platforms. As a design case study, we present the implementation of a WSN-to-WLAN bridge on a multiprocessor SoC on FPGA. Experiences with CPE proved its feasibility in rapid implementation of communication subsystems with very decent performance.
- Panu Hämäläinen, Timo Alho, Marko Hännikäinen, Timo D. Hämäläinen,
Design and Implementation of Low-area and Low-power AES Encryption Hardware Core,
9th Euromicro Conference on Digital System Design - Architectures, Methods and Tools (DSD 2006), Cavtat, Croatia, August 30, 2006 - September 1, 2006, pp. 577-583.
(PDF format)
Abstract
The Advanced Encryption Standard (AES) algorithm has become the default choice for various security services in numerous applications. In this paper we present an AES encryption hardware core suited for devices in which low cost and low power consumption are desired. The core consists of a novel 8-bit architecture and supports encryption with 128-bit keys. In a 0.13 µm CMOS technology our area-optimized implementation consumes 3.1 kgates. The throughput at the maximum clock frequency of 153 MHz is 121 Mbps, also in feedback encryption modes. Compared to previous 8-bit implementations, we achieve significantly higher throughput with corresponding area. The energy consumption per processed block is also lower.
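The throughput and clock figures quoted in the abstract above imply an approximate cycle count per encrypted block. The short Python sketch below only performs that arithmetic; the 128-bit AES block size is standard, while the function name is made up for this example and the resulting cycle count is an estimate derived from the rounded figures, not a number stated in the abstract.

```python
def cycles_per_block(clock_hz, throughput_bps, block_bits=128):
    """Estimate clock cycles spent per block from clock rate and throughput."""
    blocks_per_second = throughput_bps / block_bits
    return clock_hz / blocks_per_second


if __name__ == "__main__":
    # Figures quoted in the abstract: 153 MHz clock, 121 Mbps throughput.
    estimate = cycles_per_block(153e6, 121e6)
    print(f"approximately {estimate:.1f} cycles per 128-bit block")
```

Because both inputs are rounded in the abstract, the result should be read as an order-of-magnitude check rather than the exact latency of the core.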
- Kostas Masselos, Nikos Voros, Yang Qu, Kari Tiensyrjä, Miroslav Cupak, Luc Rijnders and Marko Pettissalo,
System Level Architecture Exploration for Reconfigurable Systems on Chip,
Proceedings of FPL06 conference, August 28-30 2006, Madrid, Spain
Abstract
In recent years, a new type of Systems-on-Chip, called Reconfigurable Systems-on-Chip (RSoCs), has appeared. The design of such systems is a complex task and requires innovative methods to support the development process. In this paper, we present two alternative approaches for the efficient architecture exploration of RSoCs, based on the SystemC language and on the OCAPI-xl environment. The approaches introduced allow early evaluation of alternative mappings of the system's functionality onto different architectures. As a result, the time-consuming iterations from lower design stages are eliminated, and reduced design time is achieved. The paper demonstrates the effectiveness of the proposed approaches through three different case studies, borrowed from complementary domains.
- Petri Kukkala, Mikko Setälä, Tero Arpinen, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen,
Implementing a WLAN Video Terminal Using UML and Fully-Automated Design Flow,
EURASIP Journal on Embedded Systems, July 28, 2006, 16 pages.
(PDF format)
Abstract
This case study presents the UML-based design and implementation of a wireless video terminal on a multiprocessor System-on-Chip (SoC). The terminal comprises a video encoder and WLAN communications subsystems. In this paper, we present the UML models used in designing the functionality of the subsystems and the architecture of the terminal hardware. Further, we use the Koski design flow and its tools for fully-automated implementation of the terminal on FPGA. Measurements were performed to evaluate the performance of the FPGA implementation. Currently, the fully-software encoder achieves a frame rate of 3.0 fps with three 50 MHz processors, which is one half of that of a reference C implementation. Thus, using UML and design automation reduces performance, but we argue that this is highly acceptable as we gain a significant improvement in design efficiency. The experiments with the UML-based design flow proved its suitability and competence in designing complex embedded multimedia terminals.
- Timo Alho, Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Design of a Compact Modular Exponentiation Accelerator for Modern FPGA Devices,
World Automation Congress 2006 (WAC 2006) - Special Session on Information Security and Hardware Implementations, Budapest, Hungary, July 24-26, 2006, 7 pages.
(PDF format)
Abstract
We present a compact FPGA implementation of a modular exponentiation accelerator suited for cryptographic applications. The implementation efficiently exploits the properties of modern FPGAs. The accelerator consumes 341 logic elements, 1 DSP block, and 13 604 memory bits in Altera Stratix EP1S40. It performs modular exponentiations with up to 2250-bit integers and scales easily to larger exponentiations. Excluding pre- and post-processing time, 1024-bit and 2048-bit exponentiations are performed in 28.03 ms and 212.09 ms, respectively. Due to its compactness, standard interface, and support for different clock domains, the accelerator can effortlessly be integrated into a larger system in the same FPGA.
- B. Vanhooff, S. Van Baelen, A. Hovsepyan, W. Joosen, Y. Berbers,
Towards a Transformation Chain Modeling Language,
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), Samos, Greece, 17-20 July 2006
(PDF format)
Abstract
The Model Driven Development (MDD) paradigm stimulates the use of models as the main artifacts for software development. These models can be situated at high levels of abstraction, close to the application's business domain. Many consecutive automatic transformations (a transformation chain) can be applied to these models to add the necessary details in order to generate a concrete implementation. This means that a large part of the total development effort is relocated to the development of transformations and hence we should have the necessary tooling support for designing transformation chains. In this paper we propose a metamodel for a transformation chain specification language that enables gluing together many different transformation components in an implementation-independent fashion. The concrete syntax for this language is based on UML activity diagrams with an appropriate profile.
- A. Hovsepyan, S. Van Baelen, B. Vanhooff, W. Joosen, Y. Berbers,
Key Research Challenges for Successfully Applying MDD within Real-Time Embedded Software Development,
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), Samos, Greece, 17-20 July 2006
(PDF format)
Abstract
Model-Driven Development (MDD) is a software development paradigm that promotes using models at different levels of abstraction and performing transformations between them to derive one or more concrete application implementations. In this paper we analyze the current status of MDD regarding its applicability for the development of Real-Time Embedded Software (RTES). We discuss different modeling framework approaches used to specify the various models, and compare OMG/MDA-based approaches (MOF, UML Profiles and executable UML) with a generic MDD-based approach (GME). Finally, we identify the key challenges for future MDD research in order to successfully apply MDD within RTES Development. These challenges are mainly situated in the field of modeling and standardization of abstraction levels, model transformations and code generation, traceability, and integration of existing software within the MDD development process.
- Mikko Setälä, Petri Kukkala, Tero Arpinen, Marko Hännikäinen, Timo D. Hämäläinen,
Automated Distribution of UML 2.0 Designed Applications to a Configurable Multiprocessor Platform,
Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 17-20, 2006, pp. 27-38.
(PDF format)
Abstract
This paper presents automated distribution of embedded real-time applications modeled in Unified Modeling Language version 2.0 (UML 2.0). The automated distribution requires methods and tools for design automation, as well as the run-time environment for the distributed execution on the target platform. Executable application code is generated from UML models, and UML with a custom profile is used to abstract hardware architecture and configure application mapping. For experimenting, a full-featured WLAN terminal was designed in UML and implemented as a distributed multiprocessor system-on-chip (SoC) on an FPGA prototype platform. Measurements show that a 50-70% reduction in protocol delays is achieved with distribution, and delay variations are reduced by 45-85%.
- B. Vanhooff, D. Ayed, Y. Berbers,
A framework for transformation chain design processes,
First European Workshop on Composition of Model Transformations (CMT 2006), Kleppe, A.G. (ed.), vol TR-CTIT-06-34, CTIT Technical Report, pp. 3-8, 2006
(PDF format)
Abstract
Model Driven Development (MDD) promotes the use of abstract models in software development. A key ingredient of MDD is the application of transformations to these models, which means that part of the development effort is relocated to the transformations. Currently there is almost no guidance available to help design a suitable, project-specific transformation chain. We propose a framework of four concern layers to organize transformations, which facilitates better separation of concerns and offers opportunities for transformation reuse and replacement. We use this framework as a foundation to build an incremental transformation chain design process.
- Delanote, D.,
Study of Transformation Languages for the Description of MDA Transformation and Code Generation Rules,
M.Sc. Thesis, K.U.Leuven Department of Computer Science, June 2006
(PDF format)
Abstract
The OMG's standardization of MDA still lacks a description of a suitable language for transformation rules. As a result, MDA cannot yet be readily implemented in software tools. This thesis studies the search for a suitable transformation language. To this end, the current standards on which MDA and a transformation language rely are studied. In addition, the usability of a number of transformation languages is examined and these languages are compared with one another. From this comparison a language is chosen, and a trial implementation of transformation rules is worked out in this language. These transformation rules are derived from the existing software development needs of E2S. In this way it is investigated how the current capabilities of this software suite can be realized with MDA. Finally, it is investigated how the transformation language can be integrated with the application suite. The objectives of this thesis can be summarized as follows: a study of the code generation mechanisms of the current E2S software; identifying code generation patterns in them; finding a suitable model transformation language by means of a literature study; converting the identified code generation patterns into a trial implementation of transformation rules in the chosen transformation language; investigating an integration of the existing E2S software with the transformation language. Each of these points is explained later in this thesis.
- Ari Kulmala, Olli Lehtoranta, Timo D. Hämäläinen, Marko Hännikäinen,
Scalable MPEG-4 Encoder on FPGA Multiprocessor SoC,
EURASIP Journal on Embedded Systems, June 27, 2006, Issue Field-Programmable Gate Arrays in Embedded Systems, Hindawi Publishing Corporation, 15 pages
(PDF format)
Abstract
High computational requirements combined with rapidly evolving video coding algorithms and standards are a great challenge for contemporary encoder implementations. Rapid design changes favor full programmability and configurability both for software and hardware. This paper presents a novel scalable MPEG-4 video encoder on an FPGA-based Multiprocessor System-on-Chip (MPSOC). The MPSOC architecture is truly scalable and is based on a vendor-independent Intellectual Property (IP) block interconnection network. The scalability in video encoding is achieved by spatial parallelization where images are divided into horizontal slices. A case design is presented with up to four synthesized processors on an Altera Stratix 1S40 device. A truly portable ANSI-C implementation that supports an arbitrary number of processors gives 11 QCIF frames/s at 50 MHz without processor-specific optimizations. The parallelization efficiency is 97% for two processors and 93% with three. The FPGA utilization is 70%, requiring 28 797 logic elements. The effort of implementation is significantly reduced compared to a traditional multiprocessor implementation.
- Heikki Orsila, Tero Kangas, Erno Salminen, Timo D. Hämäläinen, Marko Hännikäinen,
Automated Memory-Aware Application Distribution for Multi-Processor System-On-Chips,
Journal of Systems Architecture, June 21, 2006, Elsevier
Abstract
Mapping of applications on a Multiprocessor System-on-Chip (MP-SoC) is a crucial step to optimize performance, energy and memory constraints at the same time. The problem is formulated as finding solutions to a cost function of the algorithm performing mapping and scheduling under strict constraints. Our solution is based on simultaneous optimization of execution time and memory consumption whereas traditional methods only concentrate on execution time. Applications are modeled as static acyclic task graphs that are mapped on MP-SoC with customized simulated annealing. The automated mapping in this paper is intended especially for MP-SoC architecture exploration, which typically requires a large number of trials without human interaction. For this reason, a new parameter selection scheme for simulated annealing is proposed that sets task mapping specific optimization parameters automatically. The scheme bounds optimization iterations to a reasonable limit and defines an annealing schedule that scales up with application and architecture complexity. The optimization procedure is analyzed with simulated annealing, group migration and random mapping algorithms using test graphs from the Standard Task Graph Set. The simultaneous time and memory optimization method with simulated annealing is shown to speed up execution by 63% without a memory buffer size increase. As a comparison, optimizing only the execution time is shown to speed up execution by 112% while also increasing memory buffer size by 49%.
- J. Pauty, S. Van Baelen, and Y. Berbers,
Adapting Model-Driven Architecture to Ubiquitous Computing,
In: G. Kortuem, editor, Workshop on Software Engineering Challenges for Ubiquitous Computing, Lancaster University, pages 42-43, 2006.
(PDF format)
Abstract
- Erno Salminen, Tero Kangas, Jouni Riihimäki, Vesa Lahtinen, Kimmo Kuusilinna, Timo D. Hämäläinen,
HIBI Communication Network for System-on-Chip,
Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology, Heidelberg, Berlin, June 1, 2006, Vol.43, Issue 2-3, pp. 185-205, Springer-Verlag.
(PDF format)
Abstract
This paper presents a communication network targeted for complex system-on-chip (SoC) and network-on-chip (NoC) designs. The Heterogeneous IP Block Interconnection (HIBI) aims at maximum efficiency and energy saving per transmitted bit combined with guaranteed quality-of-service (QoS) in transfers. Other features include support for arbitrary topologies with several clock domains, flexible scalability, and run-time reconfiguration of network parameters. HIBI is intended for integrating coarse-grain components such as intellectual property (IP) blocks with a size of thousands of gates. HIBI is not just a communication network; it is accompanied by a design framework with tools for optimizing the network both at design and run time. HIBI has been implemented in VHDL and SystemC and synthesized on several CMOS technologies and FPGA. The area results are well comparable to other NoC proposals, which shows that only minimal area overhead is paid for the advanced features. Furthermore, data transfers are shown to approach the maximum theoretical performance for protocol efficiency.
- Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen, Jouni Riihimäki, Kimmo Kuusilinna,
UML-based Multi-Processor SoC Design Framework,
Transactions on Embedded Computing Systems, May 1, 2006, Vol.5, Issue 2, pp. 281-320, ACM.
(PDF format)
Abstract
This paper describes a complete design flow for multi-processor Systems-on-Chips (SoCs) covering the design phases from system-level modeling to FPGA prototyping. The design of complex heterogeneous systems is enabled by raising the abstraction level and providing several system-level design automation tools. The system is modeled in a UML design environment following a new UML profile that specifies the practices for orthogonal application and architecture modeling. The design flow tools are governed in a single framework that combines the sub-tools into a seamless flow and visualizes the design process. Novel features also include an automated architecture exploration based on the system models in UML as well as the automatic back and forward annotation of information in the design flow. The architecture exploration is based on the global optimization of systems that are composed of sub-systems, which are then locally optimized for their particular purposes. As a result, the design flow produces an optimized component allocation, task mapping, and scheduling for the described application. In addition, it implements the entire system for an FPGA prototyping board. As a case study, the design flow is utilized in the integration of state-of-the-art technology approaches including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing RTOS in a SoC. In this study, a central part of a WLAN terminal is modeled, verified, optimized, and prototyped with the presented framework.
- Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Security Enhancement Layer for Bluetooth,
Wireless Security and Cryptography: Specifications and Implementations, N. Sklavos and X. Zhang, Eds., April 25, 2006, 51 pages, CRC Press, Taylor and Francis
Abstract
Bluetooth has become the default technology for low-cost and low-power wireless personal area communications. However, a number of vulnerabilities have been identified in its security design. This chapter presents a novel Enhanced Security Layer (ESL) for Bluetooth. In ESL the security level is increased by replacing the Bluetooth encryption algorithm with Advanced Encryption Standard (AES) and adding cryptographic integrity protection to transmissions. Moreover, ESL supports a public-key and a secret-key authentication protocol for exchanging ESL keys as well as standard Bluetooth PIN codes. When the original Bluetooth security design is used, ESL improves it by allowing only safer parameter combinations. As ESL is placed on the top of the standard controller interface, it can be integrated into any standard Bluetooth implementation. A full-scale embedded prototype implementation of ESL is also presented. AES and its operation modes are implemented in hardware for high performance. Excluding the public-key authentication, the ESL features use significantly fewer resources and imply higher performance than a standard Bluetooth security implementation. The easy-to-use programming interface supports straightforward application development.
- Eero Aho, Jarno Vanne, Timo D. Hämäläinen,
Parallel Memory Architecture for Arbitrary Stride Accesses,
IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS), Prague, Czech Republic, April 18-21, 2006, pp. 65-70.
(PDF format)
Abstract
Parallel memory modules can be used to increase memory bandwidth and feed a processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes that are adapted to parallel memories. The proposed novel parallel memory architecture allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted and each data pattern accesses as many data elements as there are memory modules. The complexity is evaluated with resource counts.
- Ari Kulmala, Erno Salminen, Olli Lehtoranta, Timo D. Hämäläinen, Marko Hännikäinen,
Impact of Shared Instruction Memory on Performance of FPGA-based MP-SoC Video Encoder,
The IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems 2006 (DDECS'06), Prague, Czech Republic, April 18-21, 2006, pp. 59-64.
(PDF format)
Abstract
The impact of shared instruction memory on performance is measured and analyzed for an FPGA-based Multiprocessor System-on-Chip (MP-SoC) with an MPEG-4 video encoding application. Our MP-SoC architecture allows arbitrary scaling of the number of synthesized processors and includes a monitoring unit for memory transfers. Based on the measurements with up to four processors on Altera Stratix 1S40, an estimate of the effect of shared memory for larger configurations is presented. Shared instruction memory is shown to be area-efficient and sufficient in performance for configurations up to five processors, as the drop in encoded video frame rate stays below one compared to distributed instruction memory organization.
- Tero Kangas, Kimmo Kuusilinna, Timo D. Hämäläinen,
Scalable Architecture for SoC Video Encoders,
Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology, March 27, 2006, Vol.44, Issue 1-2, pp. 79-95, Springer.
(PDF format)
Abstract
Evolving video coding standards demand functional flexibility for implementations, not only at design time but also after fabrication. This paper presents a System-on-Chip design approach with a feasible combination of performance, scalability, programmability, area efficiency, and design time effort for a video encoder. The encoder is based on a homogeneous master-slave processor architecture. Each slave encodes a part of the frame in the Single Program Multiple Data (SPMD) data parallel model. Both shared and distributed memory architectures are presented. Design effort is reduced by identical program codes, automated assembly of software and hardware modules independent of the number and type of processors, as well as our flexible on-chip communication network called Heterogeneous IP Block Interconnection (HIBI). A case study implementation with two to ten simple ARM7 processors, 32-bit HIBI bus and non-optimized processor-independent software gives the performance from 6 to 53 fps for QCIF. The whole encoder area ranges from 173 to 770 kgates excluding the memories. The relation scales reasonably well to systems with more powerful processors and optimized code. The optimization of the communication network shows that with more than six slaves even a serial HIBI connection with 100 MHz speed is feasible. HIBI and the parallelization approach allow exploration and optimization of the communication both at the application and architecture layers.
- Y. Qu, J.-P. Soininen, J. Nurmi,
A Parallel Configuration Model for Reducing the Run-Time Reconfiguration Overhead,
Design Automation and Test in Europe 2006 Conference (DATE06), 6-10 March 2006, Munich, Germany
(PDF format)
Abstract
Multitasking on reconfigurable logic can achieve very high silicon reusability. However, configuration latency is a major limitation and it can largely degrade the system performance. One reason is that tasks can run in parallel but configurations of the tasks can be done only in sequence. This work presents a novel configuration model to enable configuration parallelism. It consists of multiple homogeneous tiles and each tile has its own configuration SRAM that can be individually accessed. Thus multiple configuration controllers can load tasks in parallel and greater speedups can be achieved. We used a prefetch scheduling technique to evaluate the model with randomly generated tasks. The experimental results reveal that on average using multiple controllers can reduce the configuration overheads by 21%. Compared to the best cases of using multiple tiles with a single controller, an additional 40% speedup can be achieved using multiple controllers.
- Tero Arpinen, Petri Kukkala, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen,
Configurable Multiprocessor Platform with RTOS for Distributed Execution of UML 2.0 Designed Applications,
9th Design, Automation and Test in Europe Conference (DATE 2006), Munich, Germany, March 6-10, 2006, pp. 1324-1329.
(PDF format)
Abstract
This paper presents the design and full prototype implementation of a configurable multiprocessor platform that supports distributed execution of applications described in UML 2.0. The platform is comprised of multiple Altera Nios II softcore processors and custom hardware accelerators connected by the Heterogeneous IP Block Interconnection (HIBI) communication architecture. Each processor has a local copy of eCos real-time operating system for the scheduling of multiple application threads. The mapping of a UML application into the proposed platform is presented by distributing a WLAN medium access control protocol onto multiple CPUs. The experiments performed on FPGA show that our approach raises system design to a new level. To our knowledge, this is the first real implementation combining a high-level design flow with a synthesizable platform.
- Y. Qu, J.-P. Soininen and J. Nurmi,
Using Multiple Configuration Controllers to Reduce the Reconfiguration Overhead,
The 23rd IEEE Norchip Conference (Norchip2005), 20-22 November 2005, Oulu, Finland
(PDF format)
Abstract
This work presents a novel run-time reconfiguration model. It uses multiple configuration controllers instead of only one in traditional devices. The configuration SRAM is divided into several individual sections, and controllers can reconfigure different sections in parallel. Therefore, multiple tasks can be loaded simultaneously. Two static task schedulers are developed to evaluate the device. Already with two controllers, the overall configuration overhead can be reduced by about 40% using non-prefetch scheduling. When using prefetch scheduling, another 16.2% reduction can be achieved on average of all results.
- Erno Salminen, Tero Kangas, Timo D. Hämäläinen, Jouni Riihimäki,
Requirements for Network-on-Chip Benchmarking,
Norchip, Oulu, Finland, November 20-22, 2005, pp. 82-85.
(PDF format)
Abstract
This work presents the motivation, basic concepts, and requirements for benchmarking a Network-on-Chip (NoC). Currently there are practically no benchmark sets for NoC or the presented tools do not meet the requirements. The presented benchmarking method utilizes a traffic generator with dataflow models of the applications. Combined with transaction-level NoC, the abstract application model allows approximately 200x speedup and on average 10% error in estimated runtime w.r.t. cycle-accurate HW/SW co-simulation without exposing the exact internal functionality of the application.
- J. Kreku, M. Eteläperä and J.-P. Soininen,
Exploitation of UML 2.0-Based Platform Service Model and SystemC Workload Simulation in MPEG-4 Partitioning,
International Symposium on System-on-Chip (SoC2005), 15-17 November 2005, Tampere, Finland
(PDF format)
Abstract
Performance evaluation in an early phase of system design is a crucial part of system optimisation and validation. We present a method for combining UML-based application workload models with hardware models written using the SystemC language, and introduce a layer of platform service models between the application and hardware architecture models. The modelling approach is validated with a case study consisting of MPEG-4 encoder partitioning for OMAP 5912 architecture. The average error between the simulations and the measurements is about 12%.
- Heikki Orsila, Tero Kangas, Timo D. Hämäläinen,
Hybrid Algorithm for Mapping Static Task Graphs on Multiprocessor SoCs,
International Symposium on System-on-Chip (SoC 2005), Tampere, Finland, November 15-17, 2005.
(PDF format)
Abstract
Mapping of applications on multiprocessor System-on-Chip is a crucial step in the system design to optimize the performance, energy and memory constraints at the same time. The problem is formulated as finding solutions to an objective function of the algorithm performing the mapping and scheduling under strict constraints. Our solution is a new hybrid algorithm that distributes the computational tasks modeled as static acyclic task graphs. The algorithm uses simulated annealing and group migration algorithms consecutively, combining non-greedy global and greedy local optimization techniques to obtain the good properties of both. The algorithm begins as coarse-grained optimization and moves towards fine-grained optimization. As a case study we used ten 50-node graphs from the Standard Task Graph Set and averaged results over 100 optimization runs. The hybrid algorithm gives 8% better execution time on a system with four processing elements compared to simulated annealing. In addition, the number of iterations increased only moderately, which justifies the new algorithm in SoC design.
- Jouni Riihimäki, Petri Kukkala, Tero Kangas, Marko Hännikäinen, Timo D. Hämäläinen,
Interfacing UML 2.0 for Multiprocessor System-on-Chip Design Flow,
International Symposium of System-on-Chip (SoC 2005), Tampere, Finland, November 15-17, 2005, pp. 108-111.
(PDF format)
Abstract
UML 2.0 can be extended for embedded system design. Our solution is a well-defined modeling approach, known as TUT-Profile, for UML 2.0 together with our System-on-Chip architecture exploration tools. The two major novel features are an explicit control of real-time constraints at UML level and the transformation of the original UML model using back-annotated results of SoC architecture exploration. In this way, all information is kept up to date in a single UML model, in contrast to other flows that use UML only as a front end. This paper focuses on the interface between the UML model and the architecture exploration and presents the conversions, tools, and intermediate format required for the flow.
- Panu Hämäläinen, Ning Liu, Marko Hännikäinen, Timo D. Hämäläinen,
Acceleration of Modular Exponentiation on System-on-a-Programmable-Chip,
2005 IEEE International Symposium of System-on-Chip (SoC 2005), Tampere, Finland, November 15-17, 2005, pp. 14-17.
(PDF format)
Abstract
Computing modular exponentiations with long integers is required in a number of security protocols. Since security procedures typically consume a large amount of processing capacity in network devices, efficient implementations are needed. As a solution, this paper presents an exponentiation accelerator suited for efficient processing in security protocols using public key schemes, such as TLS and IPsec. The accelerator is implemented on a System-on-a-Programmable-Chip, partitioned into software control and hardware processing. Compared to previous radix-2 designs, significantly higher performance is achieved. The design computes a full exponentiation in (n+k)(n+4) clock cycles, in which n is the bit length of the modulus and the exponent and k is the number of ones in the binary representation of the exponent. In the average case, the design executes the exponentiation 25% faster than the previous hardware designs at equal clock speeds. The proposed exponentiation control and 1-cycle processing mode can also be utilized for improving higher radix designs.
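The cycle count quoted in the abstract above can be turned into a quick estimate. The sketch below is only an illustration and not the authors' code: it evaluates (n+k)(n+4) for a given bit length n and number of ones k, and the function names `hamming_weight` and `exponentiation_cycles` are hypothetical. The "average case" of k = n/2 is an assumption made here for the example.

```python
def hamming_weight(exponent: int) -> int:
    """Return k, the number of ones in the binary representation of the exponent."""
    return bin(exponent).count("1")


def exponentiation_cycles(n: int, k: int) -> int:
    """Clock cycles for one full exponentiation, using the abstract's formula (n + k)(n + 4)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    return (n + k) * (n + 4)


if __name__ == "__main__":
    n = 1024
    # Assumption for illustration: about half of the exponent bits are ones.
    average = exponentiation_cycles(n, n // 2)
    # Upper bound: every exponent bit is one.
    worst = exponentiation_cycles(n, n)
    print(f"n={n}: average {average} cycles, worst case {worst} cycles")
```

Dividing the resulting cycle count by a clock frequency gives a rough time per exponentiation; the abstract does not state a clock frequency, so none is assumed here.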
- Petri Kukkala, Marko Hännikäinen, Timo D. Hämäläinen,
Performance Modeling and Reporting for the UML 2.0 Design of Embedded Systems,
2005 IEEE International Symposium on System-on-Chip, Tampere, Finland, November 15-17, 2005, pp. 50-53.
(PDF format)
Abstract
This paper presents a new performance modeling approach for the design of embedded real-time systems using UML 2.0. The approach responds to the lack of specific semantics for the performance modeling. The existing UML metamodel is extended by defining stereotypes to include the message latency and execution time in UML statecharts. The information may contain both the real-time constraints and measured values that are back-annotated to the UML model. Further, fully automated model transformation is used to visualize this information with sequence diagrams. The modeling approach has been prototyped with the UML implementation of a WLAN medium access control protocol. The experiences proved the approach to be practical and intuitive.
- Vanhooff, B., Berbers, Y.,
Supporting Modular Transformation Units with Precise Transformation Traceability Metadata,
ECMDA Traceability Workshop, 7-10 November 2005, Nuernberg, Germany, SINTEF, 2005, pp. 15-27.
(PDF format)
Abstract
The Model Driven Architecture (MDA) initiative of the OMG heavily promotes the use of abstract models for software development. A key ingredient of the MDA is the automation of transformations on these models. Inserting semantically rich transformation traceability links into our models allows us to better comprehend the exact effects of the applied transformations. Empowering each transformation unit to insert its own specific traceability links provides subsequent transformation units with the ability to make use of that traceability information to improve their own actions. In addition, this permits us to better modularize transformations into smaller and more reusable units that, to a certain extent, depend on each other. In this paper, we define a UML transformation traceability profile that allows the addition of semantically rich traceability links into UML models.
- Vanhooff, B., Berbers, Y.,
Breaking up the transformation chain,
Proceedings of the Best Practices for Model-Driven Software Development at OOPSLA 2005, 16-20 October 2005, San Diego, California, USA
(PDF format)
Abstract
Both Model-Driven Software Development (MDSD) and Model Driven Architecture (MDA) emphasize the importance of precise machine-readable models and automatic transformations on these models. In this paper we identify the need to externally specify transformation units in terms of required and provided model properties. We also briefly present how one can use semantically rich transformation traceability information as a specific kind of externally quantifiable property. Using precise specifications allows us to break up monolithic transformation implementations into modular chains of transformation units that depend only on each other's externally specified output model characteristics.
- ITEA-MARTES consortium,
MARTES Leaflet,
ITEA symposium Helsinki, 13-14 October 2005
(PDF format)
Abstract
The main focus of MARTES is on how to use the standard modelling languages UML and SystemC efficiently in combination for systematic model-based development of real-time embedded systems in an era of digital convergence. The project adopts ideas from MDA (Model Driven Architecture), particularly the separation of application functionality and platform. A number of special techniques developed around UML and SystemC will also be integrated, to form a coherent methodology.
- Tuomas Järvinen, Perttu Salmela, Panu Hämäläinen, Jarmo Takala,
Efficient Byte Permutation Realizations for Compact AES Implementations,
13th European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, September 4-8, 2005, 4 pages.
(PDF format)
Abstract
The Advanced Encryption Standard (AES) algorithm incorporates a byte permutation operation which reorders the bytes within a 128-bit data block. This permutation can be described by reading the input data bytes column-wise into a 4×4 matrix called the state and shifting the rows by one, two, or three bytes to the left. In decryption, the shifting is reversed, i.e., the rows are shifted to the right. While such shifting operations are straightforward if the computation is done on a full 128-bit data block at a time, they become more complex in area-efficient folded implementations where data blocks smaller than 128 bits are used. In such cases, data storage is required, either in the form of registers or memories. In this paper, efficient realizations of the byte permutations in the AES algorithm, where the size of simultaneously computed data can be 1, 2, 4, or 8 bytes, are presented. All the realizations use the minimum number of storage elements, implying area-efficiency.
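The byte permutation described in this abstract can be illustrated with a minimal Python sketch. It covers only the straightforward case in which the whole 128-bit block is available at once, not the folded 1-, 2-, 4- or 8-byte realizations the paper presents. The function names are hypothetical and chosen for illustration.

```python
def bytes_to_state(block):
    """Read 16 bytes column-wise into a 4x4 state: state[r][c] = block[r + 4*c]."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte (128-bit) blocks")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Write the 4x4 state back out column-wise."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


def shift_rows(block, inverse=False):
    """Rotate row r of the state left by r bytes (right by r bytes if inverse)."""
    state = bytes_to_state(block)
    shifted = []
    for r, row in enumerate(state):
        k = (-r if inverse else r) % 4  # a right rotation by r is a left rotation by 4 - r
        shifted.append(row[k:] + row[:k])
    return state_to_bytes(shifted)


block = bytes(range(16))
permuted = shift_rows(block)
print(list(permuted))
print(shift_rows(permuted, inverse=True) == block)
```

Row 0 is left unshifted, and applying the inverse permutation after the forward one restores the original block.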
- Y. Qu, K. Tiensyrjä and J.-P. Soininen,
SystemC-based Design Methodology for Reconfigurable System-on-Chip,
The 8th Euromicro Conference on Digital System Design (DSD 2005), 30 August-3 September 2005, Porto, Portugal
(PDF format)
Abstract
Reconfigurable systems are a promising alternative for delivering both flexibility and performance at the same time. New reconfigurable technologies and technology-dependent tools have been developed, but a system-level design methodology to support system analysis and fast design space exploration is missing. In this paper, we present a SystemC-based system-level design approach. The main focuses are resource estimation to support system analysis and reconfiguration modeling for fast performance simulation. The approach was applied in a real design case of a WCDMA detector on a commercially available reconfigurable platform. Runtime reconfiguration was used, and the design showed a 40% area saving compared to a functionally equivalent fixed system and a 30 times shorter processing time compared to a functionally equivalent pure software design.
- Panu Hämäläinen, Jari Heikkinen, Marko Hännikäinen, Timo D. Hämäläinen,
Design of Transport Triggered Architecture Processors for Wireless Encryption,
8th Euromicro Conference on Digital System Design (DSD 2005), Porto, Portugal, August 30, 2005 - September 3, 2005, pp. 144-152.
(PDF format)
Abstract
Transport Triggered Architecture (TTA) offers a cost-effective trade-off between the size and performance of ASICs and the programmability of general-purpose processors. In this paper TTA processors for the RC4 and AES encryption algorithms of the new IEEE 802.11i WLAN security standard are designed. Special operations efficiently supporting the ciphers are developed. The TTA design flow is utilized for finding configurations with the best performance-size ratios. The size of the configuration supporting both the algorithms is 69.4 kgates and the throughput 100 Mb/s for RC4 and 68.5 Mb/s for AES at 100 MHz in the 0.13 µm CMOS technology. Compared to commercial processors of the same wireless application domain, higher throughputs are achieved at significantly smaller area and lower clock speed, which also results in decreased energy consumption.
- Petri Kukkala, Marko Hännikäinen, Timo D. Hämäläinen,
Co-simulation of Wireless Local Area Network Terminals with Protocol Software Implemented in SDL,
8th Euromicro Conference on Digital System Design (DSD'2005), Porto, Portugal, August 30, 2005 - September 3, 2005, pp. 161-164.
(PDF format)
Abstract
The increasing complexity of modern embedded systems requires high-level design languages to meet the challenges in design. Combining high-level languages with fast and accurate hardware/software verification of System-on-Chip (SoC) architectures enhances the quality of the design process. This paper presents the verification of our WLAN terminal (TUTWLAN), with its Medium Access Control (MAC) protocol and test applications, using hardware/software cycle-accurate co-simulation. The protocol has been designed using Specification and Description Language (SDL), with automatic C code generation from SDL for implementation. The hardware implementation of the terminal contains hardware accelerators for time-critical protocol functions. Full system co-simulations were used for both the functional verification and performance evaluation of a single TUTWLAN terminal, as well as a network of terminals. With simulations, the performance bottlenecks were identified, and the results enable implementing the next-generation TUTWLAN terminal as a single chip.
- Olli Lehtoranta, Erno Salminen, Ari Kulmala, Marko Hännikäinen, Timo D. Hämäläinen,
A Parallel MPEG-4 Encoder for FPGA Based Multiprocessor SoC,
15th International Conference on Field Programmable Logic and Applications (FPL 2005), Tampere, Finland, August 24-26, 2005, pp. 380-385, Springer LNCS.
(PDF format)
Abstract
A parallel MPEG-4 Simple Profile encoder for FPGA based multiprocessor System-on-Chip is presented. The goal is a computationally scalable framework independent of platform. The scalability is achieved by spatial parallelization where images are divided into horizontal slices. Slice coding tasks are mapped to the multiprocessor consisting of four soft-cores arranged in a master-slave configuration. Also, the shared memory model is adopted where large images are stored in shared external memory while small on-chip buffers are used for processing. The interconnections between memories and processors are realized with our HIBI network. Our main contributions are the scalable encoder framework as well as methods for coping with the limited memory of FPGAs. The current software-only implementation processes 6 QCIF frames/s with three encoding slaves. In practice, speed-ups of 1.7 and 2.3 have been measured with two and three slaves, respectively. FPGA utilization of the current implementation is 59%, requiring 24 207 logic elements on Altera Stratix EP1S40.
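As a rough reading aid for the speed-up figures quoted in this abstract, the sketch below computes parallel efficiency, i.e. speed-up divided by the number of slaves. This metric and the function name are not taken from the paper. The sketch also assumes that the speed-ups are measured relative to a single encoding slave, which the abstract does not state.

```python
def parallel_efficiency(speedup, workers):
    """Return speed-up per worker; 1.0 would mean ideal linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Speed-ups quoted in the abstract, keyed by number of encoding slaves.
measured_speedups = {2: 1.7, 3: 2.3}

for slaves, speedup in measured_speedups.items():
    efficiency = parallel_efficiency(speedup, slaves)
    print(f"{slaves} slaves: speed-up {speedup}, efficiency {efficiency:.2f}")
```

Under these assumptions the efficiency is about 0.85 with two slaves and about 0.77 with three, so each added slave contributes somewhat less than the previous one.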
- Y. Qu, J.-P. Soininen, J. Nurmi,
An Efficient Approach to Hide the Run-Time Reconfiguration from SW Applications,
The 15th International Conference on Field Programmable Logic and Applications (FPL2005), 24-26 August 2005, Tampere, Finland
(PDF format)
Abstract
Dynamically reconfigurable logic is becoming an important design unit in SoC systems. A method to make the reconfiguration management transparent to software applications is required in order to ease design with such devices. In this paper, we present an efficient approach to this task, similar to cache-miss handling and data replacement in modern computer systems. The main advantage is that the reconfiguration can be correctly issued without extra instructions inserted either manually by SW application programmers or automatically by compilers. The approach was validated in a real case design. In the Virtex2P20 implementation platform, the resource overhead was 2.45% in terms of the number of LUTs. Performance is measured in a cycle-accurate simulation environment. The overhead is about equal when compared with an OS-based equivalent design that uses system calls and critical section code to manage the reconfiguration.
- Panu Hämäläinen, Marko Hännikäinen, Timo D. Hämäläinen,
Efficient Hardware Implementation of Security Processing for IEEE 802.15.4 Wireless Networks,
2005 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2005), Cincinnati, Ohio, USA, August 7-10, 2005, pp. 484-487.
(PDF format)
Abstract
The IEEE 802.15.4 standard defines the medium access control and physical layer for low-rate, low-power Wireless Personal Area Networks (WPAN). As a number of WPAN applications require protected communications, the standard defines security procedures. Since the procedures typically consume most processing capacity in the limited 802.15.4 devices, efficient implementations are needed. As a solution, this paper presents a compact and energy-efficient hardware design, supporting all the security suites of the standard. Compared to typical WPAN processors, the presented FPGA prototype and the estimated ASIC implementation offer significantly higher performance and lower energy consumption. The FPGA throughput at the highest security level is 90 Mb/s and the energy consumption is 1/190 of an 8-bit microcontroller and 1/5 of an ARM9. The estimated energy consumption for the equivalent ASIC implementation is 1/10 of the FPGA prototype. In addition to 802.15.4, the hardware design supports all wireless technologies derived from the IEEE 802.11i security specification.
- Erno Salminen, Tero Kangas, Jouni Riihimäki, Vesa Lahtinen, Kimmo Kuusilinna, Timo D. Hämäläinen,
Benchmarking Mesh and Hierarchical Bus Networks in System-on-Chip Context,
Embedded Computer Systems: Architectures, MOdeling, and Simulation (SAMOS V), Samos, Greece, July 18-20, 2005, LNCS Vol. 3553, pp. 354-363, Springer.
(PDF format)
Abstract
A simulation-based comparison scheme for on-chip communication networks is presented. Performance of the network depends heavily on the application and therefore several test cases are required. In this paper, a generic synthesizable 2-dimensional mesh and a hierarchical bus, which is an extended version of a single bus, are benchmarked in a SoC context with five parameterizable test cases. The results show that the hierarchical bus offers a good performance and area trade-off. In the presented test cases, a 2-dimensional mesh offers a speedup of 1.1x - 3.3x over the hierarchical bus, but at an area overhead of 2.3x - 3.4x, which is larger than the performance improvement.
- Petri Kukkala, Marko Hännikäinen, Timo D. Hämäläinen,
Design and Implementation of a WLAN Terminal Using UML 2.0 Based Design Flow,
Embedded Computer Systems: Architectures, MOdeling, and Simulation (SAMOS V), Samos, Greece, July 18-20, 2005, pp. 404-413.
(PDF format)
Abstract
This paper presents a UML 2.0 based design flow for real-time embedded systems. The flow starts with UML 2.0 application, architecture and mapping models for our TUTWLAN terminal with its medium access control protocol. As a result, the hardware/software implementation on Altera Excalibur FPGA is achieved. Implementation utilizes eCos real-time operating system, and hardware accelerators for time-critical protocol functions. The design flow is prototyped in practice showing rapid UML 2.0 application model modification, real-time protocol processing in an image transfer application, and execution monitoring.
- Erno Salminen, Ari Kulmala, Timo D. Hämäläinen,
HIBI-Based Multiprocessor SoC on FPGA,
IEEE International Symposium on Circuits and Systems, ISCAS'05, Kobe, Japan, May 23-26, 2005, pp. 3351-3354, IEEE.
(PDF format)
Abstract
FPGAs offer an excellent platform for Systems-on-Chip consisting of Intellectual Property (IP) blocks. The problem is that IP blocks and their interconnections are often FPGA vendor dependent. Our HIBI Network-on-Chip (NoC) scheme solves the problem by providing a flexible interconnection network and IP block integration with an Open Core Protocol (OCP) interface. Therefore, IP components can be of any type: processors, hardware accelerators, communication interfaces, or memories. As a proof of concept, a multiprocessor system with eight soft processor cores and HIBI is prototyped on FPGA. The whole system uses 36402 logic elements and 2.9 Mbits of RAM, and operates at 78 MHz on Altera Stratix 1S40, which is comparable to other FPGA multiprocessors. The most important benefit is a significant reduction of the design effort compared to system-specific interconnection networks. HIBI also presents the first OCP-compliant IP-block integration in FPGA.
- Petri Kukkala, Jouni Riihimäki, Marko Hännikäinen, Timo D. Hämäläinen, Klaus Kronlöf,
UML 2.0 Profile for Embedded System Design,
8th Design, Automation and Test in Europe Conference (DATE 2005), Munich, Germany, March 7-11, 2005, Vol.2, pp. 710-715.
(PDF format)
Abstract
Unified Modeling Language (UML) 2.0 is emerging in the area of embedded system design. This paper presents a new UML 2.0 profile - called TUT-Profile - that introduces a set of stereotypes and design rules for an application, platform, and mapping. The profile classifies different application and platform components, and enables their parameterization. TUT-Profile concentrates on the structure of an application and platform, and utilizes standard UML 2.0 for the behavioral modeling. The application is seen as a set of active classes with an internal behavior. Correspondingly, the platform is seen as a component library with a parameterized presentation in UML 2.0 for each library component.
- Olli Lehtoranta, Timo D. Hämäläinen,
Feasibility study of a real-time operating system for a multi-channel MPEG-4 encoder,
IS&T/SPIE's 17th Annual Symposium on Electronic Imaging Science and Technology, San Jose, California, USA, January 17-18, 2005, Vol. 5684, pp. 292-299, SPIE.
Abstract
The feasibility of the DSP/BIOS real-time operating system for a multi-channel MPEG-4 encoder is studied. The performance of two MPEG-4 encoder implementations, with and without the operating system, is compared in terms of encoding frame rate and memory requirements. The effects of task switching frequency and the number of parallel video channels on the encoding frame rate are measured. The research is carried out on a 200 MHz TMS320C6201 fixed point DSP using QCIF (176x144 pixels) video format. Compared to a traditional DSP implementation without an operating system, inclusion of DSP/BIOS reduces total system throughput by only 1 QCIF frame/s. The operating system has a 6 KB data memory overhead and a program memory requirement of 15.7 KB. Hence, the overhead is considered low enough for resource critical mobile video applications.
- Berbers, Y., Rigole, P., Vandewoude, Y., Van Baelen, S.,
CoConES: An Approach for Components and Contracts in Embedded Systems,
chapter of Component-based Software Development of Embedded Systems; An Overview of Current Research Trends, C. Atkinson, C. Bunse, H.-G. Gross, and C. Peper (eds.), Lecture Notes in Computer Science (LNCS), Vol. 3778, Springer, Berlin, Germany, 2005
(PDF format)
Abstract
This paper presents CoConES (Components and Contracts for Embedded Software), a methodology for the development of embedded software, supported by a tool chain. The methodology is based on the composition of reusable components with the addition of a contract principle for modeling non-functional constraints. Non-functional constraints are an important aspect of embedded systems, and need to be modeled explicitly. The tool chain contains CCOM, a tool used for the design phase of software development, coupled to Draco, a middleware layer that supports the methodology at run-time.
- J.-P. Soininen,
Multicore architecture based product platforms,
The annual Finnish ELMO technology programme seminar (ELMO2005)
(PDF format)
Abstract
Why to design and use multiprocessor platforms; What are multiprocessor platforms; Basic idea of platform-based design; Most important platform-based design challenges; Conclusions

