Subrangshu Das

Bengaluru, Karnataka, India
4K followers 500+ connections

About

Currently I am working as a Director, Silicon at Google.

Prior to this, as…

Experience

  • Google

    Bengaluru, Karnataka, India

Publications

  • A 28nm Programmable and Low Power Ultra-HD Video Codec Engine

    IEEE International Symposium on Circuits and Systems (ISCAS)

  • A Monolithic Programmable Ultra-HD Video Codec Engine

    IEEE International Conference on Acoustics, Speech and Signal Processing

  • Cross Bar Message Network Protocol Bridge Verification using Formal Methodology

    Design Automation Conference (DAC)

  • Verification of Hardware Scheduler in a Multimedia accelerator through Formal Methodology

    Design Automation Conference (DAC)

  • A true multistandard, programmable, low-power, full HD video-codec engine for smartphone SoC

    International Solid State Circuits Conference (ISSCC)

    In this paper, we present IVA-HD, a true multistandard, programmable, full HD video coding engine which adopts optimal hardware-software partitioning to achieve the low-power and area requirements of the OMAP 4 processor. Unlike the approach of using separate IPs for encoder and decoder, IVA-HD uses an integrated codec engine which is area efficient, as most of the decoder logic is reused for the encoder. IVA-HD is architected to perform stream-rate and pixel-rate processing in a single pipeline (that processes one 16x16 macroblock at a time), so as to support the latency requirements of video conferencing.

  • PowerAdviser: An RTL power platform for interactive sequential optimizations

    Proceedings of the Conference on Design, Automation and Test in Europe (DATE)

    Power has become the overriding concern for most modern electronic applications today. To reduce clock power, sequential clock gating is increasingly used over and above combinational clock gating. Given the complexity of manually identifying sequential clock gating changes, automatic tools are becoming popular. However, since these tools always work within the scope of the design and the constraints provided, they do not provide any insight into additional power savings that might still be possible. In this paper we present an interactive sequential analysis flow, PowerAdviser, which besides performing automatic sequential changes also provides information for additional power savings that the user can realize through manual changes. Using this new flow we have achieved dynamic power reduction up to 45% more than a purely automated flow.

  • Reducing Dynamic Power with Gate-level Clock-Gating Optimization

    Design Automation Conference (DAC)

    Reducing power consumption, improving battery life and ultimately reducing the carbon footprint of a device are becoming some of the most important concerns in digital design today. To reduce clock and flip-flop power (which is usually the most energy-consuming component in our designs today), gate-level clock-gating techniques, such as those found in the Azuro PowerCentric tool, are increasingly used over and above traditional RTL clock-gating. This technique is very effective in reducing clock and flip-flop power by a further 20-30% beyond RTL clock-gating. However, unlike RTL clock-gating, gate-level clock-gating introduces new logic paths in the design, which are redundant and added at the end of the datapath. A good scheme needs to be put in place to prevent this added logic from creating functional, testability, timing closure and gate-level simulation/equivalence verification issues. In this paper, we describe a robust implementation methodology that was used to successfully implement gate-level clock-gating using the Azuro PowerCentric tool on a 45nm low-power multi-media IP and reduce clock/flip-flop power by 22%.

  • RTL Power Optimization in Sequential Analysis Platforms

    Design Automation Conference (DAC)

    In this work we present an automated approach for RTL power optimization using sequential analysis. The approach analyzes pipelined datapaths in both forward and backward directions of clock cycles, and from this analysis it derives the conditions under which sequential stages can be clock gated. This approach becomes all the more critical when the source RTL is machine generated (also called Electronic System Level; viz. ESL) and manual analysis of RTL is not possible. This approach has been deployed for various datapath-oriented designs in the 45nm technology node. Using this technique, we have achieved power reduction on the order of 15% on top of low-power synthesis solutions. We also report power optimization obtained through an interactive mode, where opportunities were computed by sequential analysis, outside the aegis of automatic RTL modification. We also showcase an approach which combines solutions in the domains of Power Optimization, Estimation, and Logic Simulation to arrive at an integrated methodology for in-the-loop power optimization.

  • DFT Challenges in Next Generation Multi-media IP

    Asian Test Symposium (ATS)

    Multi-media based applications have increased immensely in the last few years. The need for better video quality, longer recording and playback time, more video channels and faster time to market (TTM) requires DFT solutions that use core-based testing to allow concurrent IP and SoC development, that scale to support multiple technologies, and that ease the development of timing constraints. This paper describes these challenges and the solutions used to address them.

  • The Automatic Generation of Merged-Mode Design Constraints

    Design Automation Conference (DAC)

    Multi-mode timing closure is a latent design issue that critically impacts the performance and schedule of our designs today. Even though P&R tools today support concurrent optimization of the design across multiple timing modes, our experience suggests that these solutions start to choke beyond 2-3 modes. One way to solve this issue has been to manually develop “merged-mode” constraints, which effectively capture the timing requirements across multiple different operating modes of the design. But without a way to evaluate the completeness and accuracy of the “merged-mode” timing constraints, it often becomes necessary to fix the constraints late in the design flow, causing undesirable slips in design schedules. In order to circumvent this issue and generate constraints that are correct-by-construction, Company A and Company B have worked together to develop an automated technique using Company B’s tool (Product A) to generate “merged-mode” constraints. The input to this flow is a mode-table spreadsheet, which captures the complete list of operating modes supported by the design and the configuration settings required to put the design into the corresponding operating mode. This approach also helped reduce the cycle-time of developing high-quality merged-mode constraints by 2-3X compared with manually merged constraints.


Patents

  • Dynamic frame padding in a video hardware engine

    Issued US 10547859

    A video hardware engine which supports dynamic frame padding is disclosed. The video hardware engine includes an external memory. The external memory stores a reference frame. The reference frame includes a plurality of reference pixels. A motion estimation (ME) engine receives a current LCU (largest coding unit), and defines a search area around the current LCU for motion estimation. The ME engine receives a set of reference pixels corresponding to the current LCU. The set of reference pixels of the plurality of reference pixels are received from the external memory. The ME engine pads a set of duplicate pixels along an edge of the reference frame when part of the search area is outside the reference frame.
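
    The padding idea in this abstract can be illustrated with a minimal software sketch. It is not the patented hardware design: the function name, the parameters, and the choice to replicate the nearest edge pixel by clamping coordinates are all assumptions made for illustration.

```python
def fetch_search_window(ref_frame, lcu_x, lcu_y, lcu_size, search_range):
    """Return the search window around an LCU as a list of pixel rows.

    Hypothetical illustration: ref_frame is a list of rows of pixel values,
    (lcu_x, lcu_y) is the top-left corner of the current LCU, and
    search_range is how far the window extends beyond the LCU on each side.
    Positions outside the frame are filled by duplicating the nearest edge
    pixel (coordinate clamping), which is an assumed padding rule.
    """
    height = len(ref_frame)
    width = len(ref_frame[0])
    window = []
    for y in range(lcu_y - search_range, lcu_y + lcu_size + search_range):
        src_y = min(max(y, 0), height - 1)  # clamp row index into the frame
        row = []
        for x in range(lcu_x - search_range, lcu_x + lcu_size + search_range):
            src_x = min(max(x, 0), width - 1)  # clamp column index into the frame
            row.append(ref_frame[src_y][src_x])
        window.append(row)
    return window
```

    Padding on demand in this way means the stored reference frame does not need a pre-padded border; only windows that actually cross a frame edge receive duplicated pixels.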

  • Low power ultra-HD video hardware engine

    Issued US 9973754

    A low power video hardware engine is disclosed. The video hardware engine includes a video hardware accelerator unit. A shared memory is coupled to the video hardware accelerator unit, and a scrambler is coupled to the shared memory. A vDMA (video direct memory access) engine is coupled to the scrambler, and an external memory is coupled to the vDMA engine. The scrambler receives an LCU (largest coding unit) from the vDMA engine. The LCU comprises N×N pixels, and the scrambler scrambles N×N pixels in the LCU to generate a plurality of blocks with M×M pixels. N and M are integers and M is less than N.
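
    As a rough illustration of regrouping an N×N LCU into M×M blocks, the sketch below tiles the LCU in raster order. The abstract does not specify the block ordering or whether M must divide N; both are assumptions here, and the function name is hypothetical.

```python
def split_lcu_into_blocks(lcu, m):
    """Regroup an N x N LCU (list of N rows of N pixels) into M x M blocks.

    Assumptions for illustration only: M divides N evenly, and blocks are
    emitted in raster order (left to right, then top to bottom).
    """
    n = len(lcu)
    if m <= 0 or m >= n or n % m != 0:
        raise ValueError("expected 0 < M < N with N divisible by M")
    blocks = []
    for block_y in range(0, n, m):
        for block_x in range(0, n, m):
            # Take M consecutive rows and slice M consecutive pixels from each.
            block = [row[block_x:block_x + m] for row in lcu[block_y:block_y + m]]
            blocks.append(block)
    return blocks
```

    For example, a 4×4 LCU with M = 2 yields four 2×2 blocks, each holding pixels that are spatially adjacent in the original LCU.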

  • System and method for managing cache

    Issued US 9430393

    A system includes first and second data processing components, a qualifier-based splitter component, first and second configurable cache elements and an arbiter component. The first data processing component generates a first request for a first portion of data at a first location within a memory. The second data processing component generates a second request for a second portion of data at a second location within the memory. The qualifier-based splitter component routes the first request and the second request based on a qualifier. The first configurable cache element enables or disables prefetching data within a first region of the memory. The second configurable cache element enables or disables prefetching data within a second region of the memory. The arbiter component routes the first request and the second request to the memory.

  • Method to hide or reduce access latency of a slow peripheral in a pipelined direct memory access system

    Issued US 7673091

    A bus bridge between a high speed DMA bus and a lower speed peripheral bus sets a threshold for minimum available buffer space to send a read request dependent upon a frequency ratio and the DMA read latency. Similarly, a threshold for minimum available data for a write request depends on the frequency ratio and the DMA write latency. The bus bridge can store programmable values for the DMA read latency and write latency.

    Other inventors
    • Ashutosh Tiwari
  • Software power control of circuit modules in a shared and distributed DMA system

    Issued US 7321980

    A system-on-chip integrated circuit selectively gates clocks to individual modules corresponding to the state of a corresponding bit of a peripheral enable register. A reset circuit supplies a signal to a reset input of the digital module for a normal mode if the bit indicates the power-up state and a reset mode if the bit indicates a power-down state. Return to normal mode is delayed a predetermined time after the said bit indicates the power-up state to ensure clean power up. A false acknowledge circuit for each module supplies an acknowledge signal in response to a received command if the corresponding bit indicates the power-down state.

  • Software controlled hard reset of mastering IPs

    Issued US 7315905

    A system-on-chip integrated circuit includes a peripheral initialization register that has a bit corresponding to each module. Each bit indicates a normal mode or a reset mode for the corresponding module. A direct memory access unit can receive, prioritize and queue data movement transactions between modules and can read from or write to the peripheral initialization register. A peripheral interface unit prevents a write to the peripheral initialization register changing a module from reset mode to normal mode while there is an uncompleted data movement transaction involving that module. A false acknowledge circuit for each module supplies an acknowledge signal in response to a received command if the module is in reset mode.

  • Method for automating validation of integrated circuit test logic

    Issued US 6553524

    A methodology for automatic validation of integrated circuit (IC) test hardware that is performed during extraction of the test hardware. Signal connectivity between output test ports of one or more test control blocks and serially-connected scan latches of the test hardware is automatically validated, as is inter-connectivity between the serially-connected scan latches. Every instance to which a test signal and a test data signal at an output test port (both test signal and test data ports) of a test control block fans out to is traversed until a scan latch is reached in order to provide electrical and functional verification of the test hardware.
