Why MQTT Projects Fail After Successful FAT Procedures



3rd June, 2026.

In this post, we will look at why MQTT projects can still fail after successful FAT procedures.

MQTT has become a preferred communication protocol for Industrial IoT and modern automation systems due to its simplicity, efficiency, and scalability. During Factory Acceptance Testing (FAT), MQTT-based solutions often demonstrate excellent performance, leading stakeholders to believe the project is ready for deployment. However, many of these projects encounter unexpected challenges once commissioned on-site. Real-world networks, cybersecurity policies, infrastructure limitations, and operational demands can expose weaknesses that were never evident during testing. Understanding why MQTT projects fail after a successful FAT can help engineers design more robust, reliable, and maintainable systems from the outset.



1. The site network is very different from the FAT network:


MQTT is a communication protocol commonly used to transfer data between PLCs, SCADA systems, historians, and cloud applications over a network. In an MQTT architecture, devices publish data to a central MQTT broker, while other devices subscribe to the data they need. The broker acts as an intermediary, receiving messages from publishers and forwarding them to the appropriate subscribers.
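To make the publish/subscribe flow concrete, here is a minimal in-memory sketch in Python. `ToyBroker` and `topic_matches` are hypothetical names written for illustration, not part of any MQTT library. A real broker also handles sessions, QoS levels, retained messages and the rule that wildcards do not match topics beginning with `$`. The matching follows the MQTT wildcard rules: `+` matches exactly one topic level, and `#` matches all remaining levels.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, str], None]  # (topic, payload)


def topic_matches(topic_filter: str, topic: str) -> bool:
    """True if an MQTT topic filter matches a concrete topic name.

    '+' matches exactly one level; '#' (last level only) matches the parent
    level and everything below it. '$'-prefixed topics are not special-cased.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class ToyBroker:
    """Hypothetical in-memory broker: forwards each publish to matching subscribers."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic_filter: str, callback: Callback) -> None:
        self._subscriptions[topic_filter].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        """Deliver to every matching subscriber; return the number of deliveries."""
        deliveries = 0
        for topic_filter, callbacks in self._subscriptions.items():
            if topic_matches(topic_filter, topic):
                for callback in callbacks:
                    callback(topic, payload)
                    deliveries += 1
        return deliveries


# Usage: a SCADA client and a historian subscribe with different filters.
received = []
broker = ToyBroker()
broker.subscribe("Plant1/+/Pump01/Status", lambda t, p: received.append(("SCADA", t, p)))
broker.subscribe("Plant1/#", lambda t, p: received.append(("Historian", t, p)))
broker.publish("Plant1/InletWorks/Pump01/Status", "Running")  # delivered to both
```

The publisher never needs to know who is listening; adding a cloud application later only means adding another subscription at the broker.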


During FAT, the MQTT broker, PLC, SCADA, and test clients are usually connected on the same local network switch in a controlled environment. Communication is fast and stable, so messages are exchanged without issues.


After commissioning, the same system may have to communicate through multiple network switches, firewalls, VLANs, or internet connections. Network interruptions, bandwidth limitations, or IT security restrictions can affect communication. If these real-world conditions were not tested during FAT, engineers may experience missing data, delayed updates, or frequent MQTT disconnections after the system goes live.
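One concrete way a firewall can cause frequent MQTT disconnections is an idle-session timeout shorter than the client's MQTT Keep Alive interval. An otherwise idle client only sends a PINGREQ about once per Keep Alive period, and under MQTT 3.1.1 the broker drops a client it has not heard from within one and a half Keep Alive periods. The helper below is a hypothetical pre-commissioning check; the firewall timeout is an assumed input that would have to come from the site's IT team.

```python
from typing import List, Optional, Tuple


def review_keepalive(keepalive_s: int,
                     firewall_idle_timeout_s: int) -> Tuple[List[str], Optional[float]]:
    """Flag Keep Alive settings likely to misbehave across a site firewall.

    Returns (issues, broker_detection_s), where broker_detection_s is how long
    the broker may wait (1.5 x Keep Alive) before treating a silent client as gone.
    """
    issues: List[str] = []
    if keepalive_s == 0:
        # Keep Alive 0 disables the mechanism: half-open connections can linger unnoticed.
        issues.append("keep alive disabled: lost connections may go undetected")
        return issues, None
    if keepalive_s >= firewall_idle_timeout_s:
        # An idle client may stay silent for a whole Keep Alive period,
        # long enough for the firewall to discard the TCP session.
        issues.append("keep alive >= firewall idle timeout: idle sessions may be dropped")
    return issues, 1.5 * keepalive_s


# Example values are illustrative only.
print(review_keepalive(60, 300))   # ([], 90.0)
print(review_keepalive(900, 300))  # one issue, detection after 1350.0 s
```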


2. The actual data volume is much higher than during FAT:


During FAT, MQTT communication is often tested with a limited number of PLCs, devices, and tags. The broker handles the traffic comfortably, data updates are received on time, and the system appears to perform well.


Once the plant is commissioned, the situation can be very different. Additional PLCs, remote stations, historians, dashboards, and cloud applications may start publishing and subscribing to data. Thousands of tags can be exchanged simultaneously, significantly increasing network and broker traffic.


If the MQTT architecture was not designed and tested for the actual message volume, the system may experience delayed updates, increased resource usage, or communication bottlenecks. As a result, a solution that worked flawlessly during FAT may struggle when exposed to real plant-scale operation.
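A back-of-the-envelope estimate made before FAT can show how far the test load is from the plant load. The sketch below assumes every tag is published periodically (no report-by-exception) and that each message is forwarded to every subscribing application. `avg_message_bytes` is an assumed average covering topic name, payload and protocol overhead, and the figures in the usage lines are hypothetical.

```python
def estimate_broker_load(tag_count: int, publish_interval_s: float,
                         avg_message_bytes: int, subscriber_apps: int = 1) -> dict:
    """Rough broker throughput for periodic publishing (one message per tag per interval)."""
    msgs_in = tag_count / publish_interval_s   # publishes arriving at the broker
    msgs_out = msgs_in * subscriber_apps       # copies the broker must send out
    kbit_per_msg = avg_message_bytes * 8 / 1000
    return {
        "msgs_in_per_s": msgs_in,
        "msgs_out_per_s": msgs_out,
        "kbit_in_per_s": msgs_in * kbit_per_msg,
        "kbit_out_per_s": msgs_out * kbit_per_msg,
    }


# Hypothetical FAT bench vs. full plant (SCADA, historian and cloud all subscribing).
fat = estimate_broker_load(tag_count=200, publish_interval_s=1.0, avg_message_bytes=150)
site = estimate_broker_load(tag_count=5000, publish_interval_s=1.0,
                            avg_message_bytes=150, subscriber_apps=3)
print(fat["msgs_out_per_s"], site["msgs_out_per_s"])  # 200.0 15000.0
```

In this made-up case the broker sends 75 times more messages on site than on the FAT bench, which is the kind of gap worth load-testing before the system leaves the factory.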



3. Temporary communication loss can lead to permanent data loss:

During FAT, MQTT communication is usually tested under normal operating conditions where all systems remain connected. Engineers verify that data is being transmitted successfully and often assume the solution is ready for deployment. In a live plant, however, temporary communication interruptions are inevitable. Internet links may fail, cloud services may become unavailable, or network equipment may be taken offline for maintenance. While communication is unavailable, the process itself continues to operate, generating valuable operational data.

A well-designed MQTT solution should include a store-and-forward mechanism. This allows data to be temporarily stored locally when communication is lost and automatically forwarded to the destination once connectivity is restored. Without this capability, any data generated during the outage may be permanently lost.

For example, consider a remote pumping station publishing flow, level, and runtime data to a cloud platform. If the internet connection is unavailable for two hours and no store-and-forward mechanism exists, all data generated during that period will be lost. Historical trends may contain gaps, production reports may be incomplete, and important operational events may never be recorded.
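A minimal sketch of the store-and-forward idea in Python follows. `StoreAndForward` and the `send` callback are hypothetical: `send` stands for whatever actually publishes a message and reports success. The buffer here lives in memory and is bounded, dropping the oldest data when full. A real edge device or gateway would normally persist it to disk so that it also survives a power cycle.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Message = Tuple[float, str, str]  # (source timestamp, topic, payload)


class StoreAndForward:
    """Queue every message locally and forward in order whenever the link allows."""

    def __init__(self, send: Callable[[Message], bool], max_messages: int = 100_000):
        self._send = send  # returns True only if the message was delivered
        self._buffer: Deque[Message] = deque(maxlen=max_messages)
        self.dropped = 0   # messages lost because the buffer overflowed

    def publish(self, msg: Message) -> None:
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        self._buffer.append(msg)
        self.flush()

    def flush(self) -> int:
        """Forward buffered messages oldest first; stop at the first failure."""
        sent = 0
        while self._buffer and self._send(self._buffer[0]):
            self._buffer.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._buffer)


# Simulated outage at a remote station (topic and values are illustrative).
link = {"up": True}
delivered = []


def send(msg: Message) -> bool:
    if link["up"]:
        delivered.append(msg)
    return link["up"]


saf = StoreAndForward(send)
saf.publish((0.0, "Station1/Flow", "12.4"))
link["up"] = False
for t in (5.0, 10.0, 15.0):
    saf.publish((t, "Station1/Flow", "12.6"))  # buffered while offline
link["up"] = True
saf.flush()  # backlog forwarded in original order
```

Sizing matters: if the pumping station published flow, level and runtime every 5 seconds (an assumed rate), a two-hour outage would need room for 3 × 7200 / 5 = 4,320 messages. Calling `flush()` as soon as the connection is re-established, rather than waiting for the next publish, lets the backlog drain promptly, and keeping the source timestamp in each message lets the receiving system place the late data at the correct point in time.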

Since communication outages are rarely simulated during FAT, this weakness often remains hidden until the system is operating in the real world. As a result, a project that appears successful during testing may later suffer from missing historical data and unreliable reporting.

4. Poor topic design becomes a maintenance challenge:


During FAT, MQTT communication is typically tested with a small number of devices and topics. Even a simple topic structure can appear organized and easy to manage.


After commissioning, however, the system may expand to include multiple PLCs, process areas, and thousands of data points. If topic names were not planned properly from the beginning, engineers can find it difficult to identify data sources, troubleshoot issues, or integrate new systems. A topic structure that seemed adequate during FAT can therefore become a significant maintenance and scalability challenge as the MQTT deployment grows.



Bad example:

Status

Flow

Level

Pump1

Pump2


Good example:

Plant1/InletWorks/Pump01/Status

Plant1/InletWorks/Pump01/Runtime

Plant1/AerationArea/Blower01/Status
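One way to keep the structure consistent is to build topics from named levels instead of typing them by hand. The sketch below assumes a four-level `Site/Area/Equipment/Signal` convention modelled on the good example above; the set of allowed characters is a hypothetical project rule. It also rejects `+`, `#` and spaces, which matters because `+` and `#` are wildcard characters and are not allowed in published topic names.

```python
import re

# Hypothetical project convention: Site/Area/Equipment/Signal
LEVEL_NAMES = ("site", "area", "equipment", "signal")
LEVEL_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")  # no spaces, '/', '+' or '#'


def build_topic(site: str, area: str, equipment: str, signal: str) -> str:
    """Join the four levels, rejecting any level that breaks the convention."""
    levels = (site, area, equipment, signal)
    for name, value in zip(LEVEL_NAMES, levels):
        if not LEVEL_RE.fullmatch(value):
            raise ValueError(f"invalid {name} level: {value!r}")
    return "/".join(levels)


def follows_convention(topic: str) -> bool:
    """True if an existing topic has exactly four well-formed levels."""
    levels = topic.split("/")
    return len(levels) == len(LEVEL_NAMES) and all(LEVEL_RE.fullmatch(v) for v in levels)


print(build_topic("Plant1", "InletWorks", "Pump01", "Status"))  # Plant1/InletWorks/Pump01/Status
print(follows_convention("Status"))                             # False
```

With this structure, a subscriber can request every equipment status in an area with a filter such as `Plant1/InletWorks/+/Status`, or everything in one area with `Plant1/InletWorks/#`. That is not possible with flat names like `Status` or `Pump1`.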


5. Cybersecurity requirements are introduced after FAT:

During FAT, the primary objective is usually to demonstrate that MQTT communication is working correctly. Engineers focus on ensuring that data can be exchanged successfully between PLCs, SCADA systems, MQTT brokers, and cloud applications. To simplify testing and troubleshooting, communication is often configured with minimal security restrictions.

Once the system is deployed in the plant, however, the reality can be very different. Most organizations have cybersecurity policies that govern how data can be exchanged between operational and business networks. As a result, the MQTT solution may be subjected to additional requirements such as encrypted communication, user authentication, certificate management, firewall approvals, network segmentation, and strict access controls.

For example, an MQTT application may communicate flawlessly during FAT using a simple username and password. However, before commissioning, the client's IT department may require TLS encryption, certificate-based authentication, and communication through specific approved ports. Implementing these requirements can involve significant configuration changes and additional testing.
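As an illustration of what TLS encryption and certificate-based authentication involve on the client side, here is a sketch using Python's standard `ssl` module. The function name and the certificate file names are hypothetical. Port 8883 is the IANA-registered port for MQTT over TLS, but the site's approved port may differ. The context verifies the broker's certificate and hostname, refuses anything older than TLS 1.2, and can optionally present a client certificate.

```python
import ssl
from typing import Optional

MQTT_TLS_PORT = 8883  # IANA-registered MQTT-over-TLS port; the site may mandate another


def make_mqtt_tls_context(ca_file: Optional[str] = None,
                          cert_file: Optional[str] = None,
                          key_file: Optional[str] = None) -> ssl.SSLContext:
    """Client TLS context: verify the broker, optionally authenticate with a certificate."""
    # Checks the broker's certificate chain and hostname. With ca_file=None the
    # system trust store is used; plant brokers often use a site CA file instead.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # Client certificate for certificate-based (mutual TLS) authentication.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


# Hypothetical file names issued by the client's IT department:
# ctx = make_mqtt_tls_context("site-ca.pem", "pump-station.crt", "pump-station.key")
```

If the project uses the paho-mqtt client, a context like this can be passed to `Client.tls_set_context()` before connecting on the approved port. Trying this configuration during FAT, even with test certificates, brings certificate and port issues to light before site.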

In many cases, the MQTT architecture itself is sound, but the project experiences delays because cybersecurity requirements were not identified and addressed early enough. By involving cybersecurity and IT stakeholders during the design phase rather than after FAT, many of these challenges can be avoided.

6. MQTT communication is tested, but data quality is not:

During FAT, engineers typically focus on verifying that data is successfully transmitted from the source to the destination. If the PLC value appears correctly in SCADA, a dashboard, or a cloud application, the communication test is often considered successful.

However, successful communication does not necessarily mean that the data is useful or reliable. Once the system is commissioned, operators and plant personnel begin using the data for monitoring, reporting, alarming, and decision-making. This is when data quality issues often become apparent.

For example, a flow transmitter may publish values every 60 seconds when the application requires updates every 5 seconds. Different systems may use different engineering units, such as m³/h in the PLC and L/s in the cloud application. Timestamps may be missing or generated by different devices, making historical analysis difficult. In some cases, old values may continue to be displayed even though communication has been lost, leading operators to believe the data is still current.

During FAT, the focus is often on answering the question, "Is data being transmitted?" In actual plant operation, the more important question becomes, "Can this data be trusted?" If data quality checks are not included during testing, users may lose confidence in the MQTT solution even though the communication itself is functioning correctly.
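Some of these checks can be automated at the receiving end. The sketch below assumes a hypothetical JSON payload such as `{"value": 36.0, "unit": "m3/h", "ts": 1780000000.0}`, where `ts` is a Unix timestamp set at the source. It flags missing timestamps, values older than an allowed age, and unexpected units, and it converts m³/h to L/s (1 m³/h = 1000 L / 3600 s ≈ 0.278 L/s) so that every consumer sees the same unit.

```python
import json
import time
from typing import Optional


def m3h_to_lps(value: float) -> float:
    return value * 1000.0 / 3600.0  # 1 m³ = 1000 L, 1 h = 3600 s


def assess_sample(payload: str, max_age_s: float, now: Optional[float] = None) -> dict:
    """Return the flow in L/s plus a list of data-quality issues for one message."""
    now = time.time() if now is None else now
    msg = json.loads(payload)
    issues = []

    ts = msg.get("ts")
    if ts is None:
        issues.append("missing source timestamp")
    elif now - ts > max_age_s:
        # Re-checking the last received sample against the current time also
        # exposes values that froze after a communication loss.
        issues.append(f"stale: {now - ts:.0f} s old")

    value, unit = msg.get("value"), msg.get("unit")
    value_lps = None
    if value is None:
        issues.append("missing value")
    elif unit == "L/s":
        value_lps = float(value)
    elif unit == "m3/h":
        value_lps = m3h_to_lps(float(value))
    else:
        issues.append(f"unexpected unit: {unit!r}")

    return {"value_lps": value_lps, "good": not issues, "issues": issues}


# With 5-second updates expected, treat anything older than 15 s as stale (assumed rule).
print(assess_sample('{"value": 36.0, "unit": "m3/h", "ts": 1000.0}', 15, now=1005.0))
# {'value_lps': 10.0, 'good': True, 'issues': []}
```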

7. FAT focuses on normal operation, not plant disturbance:

During FAT, MQTT communication is typically tested under ideal operating conditions. Engineers verify that process values, equipment statuses, and commands are successfully exchanged between publishers and subscribers. Since all devices are functioning normally, the MQTT system appears stable and reliable.

The real challenge begins after commissioning when the plant starts experiencing actual operational events. Pumps trip unexpectedly, instruments fail, power interruptions occur, communication links fluctuate, and multiple alarms can be generated within a short period. These situations create sudden bursts of MQTT traffic that may never have been simulated during FAT.

For example, an automation process may normally publish a few status and process values every few seconds. However, if a power failure occurs, dozens of alarms, fault statuses, and communication events may be generated simultaneously. Similarly, when power is restored, many devices may reconnect and begin publishing data at the same time. If the MQTT architecture has only been tested under normal conditions, engineers may discover delayed alarm notifications, missed event records, or performance issues during plant disturbances. As a result, a system that appeared flawless during FAT may struggle when faced with the abnormal conditions that occur in day-to-day plant operation.
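MQTT itself does not give one message priority over another, so if a burst backs up the outbound link, alarms can end up queued behind routine values. One possible mitigation at the publishing gateway is a local priority queue that always sends alarms first. The sketch below is hypothetical: the three message classes and the per-cycle `budget` (how many messages the link can carry per cycle) are assumptions for illustration.

```python
import heapq
import itertools
from typing import List, Tuple

# Lower number = higher priority. The classes are hypothetical examples.
PRIORITY = {"alarm": 0, "event": 1, "process": 2}


class OutboundQueue:
    """Orders queued publishes so alarms leave first when a burst backs up the link."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def put(self, kind: str, topic: str, payload: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), topic, payload))

    def drain(self, budget: int) -> List[Tuple[str, str]]:
        """Pop up to `budget` messages, e.g. what the link can carry this cycle."""
        out = []
        while self._heap and len(out) < budget:
            _, _, topic, payload = heapq.heappop(self._heap)
            out.append((topic, payload))
        return out


# Usage: after a trip, the alarm overtakes values that were queued earlier.
q = OutboundQueue()
q.put("process", "Plant1/InletWorks/Pump01/Runtime", "1200")
q.put("process", "Plant1/InletWorks/Pump01/Status", "Running")
q.put("alarm", "Plant1/InletWorks/Pump01/Fault", "Trip")
print(q.drain(2))
```

Whatever mitigation is chosen, the behaviour is worth verifying at FAT by deliberately generating a burst, for example by tripping several simulated devices at once.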

8. Client reconnection scenarios are rarely tested:

During FAT, MQTT communication is usually tested under stable conditions where the broker, network, and connected devices remain online throughout the test. As long as data is being exchanged successfully, the communication is often considered validated.

In real plant operation, however, temporary communication interruptions are inevitable. A network switch may reboot, an internet connection may drop, a PLC may be restarted after maintenance, or the MQTT broker may become temporarily unavailable. When such events occur, MQTT clients must automatically reconnect to the broker and resume normal communication without operator intervention. Unfortunately, these reconnection scenarios are often not tested thoroughly during FAT. As a result, engineers may discover after commissioning that some devices fail to reconnect, stop publishing data, lose subscriptions, or require manual intervention to restore communication.

A robust MQTT implementation should be capable of automatically reconnecting to the broker, re-establishing subscriptions, recovering retained data where applicable, and resuming data publishing once communication is restored. These capabilities should be verified during FAT by intentionally simulating network failures and recovery scenarios.
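Here is a sketch of the reconnection logic described above, in Python. `connect` and `subscribe` are hypothetical callbacks standing in for the real client library calls. Delays grow exponentially with random "full jitter", so that many devices recovering from the same outage do not all reconnect at the same instant; the base delay, cap and attempt limit are assumed values. Subscriptions are re-issued after every successful connection, since the broker may not have kept the client's session (for example when the client connects with a clean session). A new subscription also makes the broker send any retained messages that match the filter, which is one way the last known values are recovered.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(base_s: float = 1.0, cap_s: float = 60.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    attempt = 0
    while True:
        yield rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
        attempt = min(attempt + 1, 30)  # bound the exponent; the cap dominates long before this


def reconnect(connect: Callable[[], bool], subscribe: Callable[[str], None],
              topic_filters: Iterable[str], max_attempts: int = 10,
              sleep: Callable[[float], None] = time.sleep,
              rng: Optional[random.Random] = None) -> bool:
    """Retry connect() with backoff, then restore every subscription."""
    delays = backoff_delays(rng=rng)
    for _ in range(max_attempts):
        if connect():
            for topic_filter in topic_filters:
                subscribe(topic_filter)  # broker may have discarded the old session
            return True
        sleep(next(delays))
    return False


# Simulated broker that is unreachable for the first two attempts.
results = iter([False, False, True])
restored, waits = [], []
ok = reconnect(lambda: next(results), restored.append,
               ["Plant1/InletWorks/#"], sleep=waits.append, rng=random.Random(42))
```

Client libraries often provide their own reconnect loop (paho-mqtt, for instance, has `reconnect_delay_set()`). In that case the point is less to reimplement it than to confirm at FAT, by unplugging cables or restarting the broker, that subscriptions and publishing actually resume.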

In this post, I have covered the general reasons why MQTT projects can fail even after successful FAT procedures. I have not attempted to cover every possible cause, as these vary from project to project. Once you are familiar with these failure modes, it becomes much easier to anticipate them during testing and to troubleshoot them on site.

Thank you for reading the post. I hope you liked it and that it helps you in your own MQTT projects.




Written by Viral Nagda, Industrial Automation Engineer with 12+ years of experience…


