* Add splitwise deployment using RDMA
* Clean up CUDA
* Remove splitwise deployment on a single node and refine the code
* up
* up
* up
* Add test
* up