Data Scientist

Abzooba, a data analytics, big data, and cloud solutions company, is building a data science team with a mission to discover insightful information hidden in vast amounts of data and to help us make smarter predictive and prescriptive analytical decisions for the business problems at hand.

The team’s primary focus will range from understanding the business problems and doing statistical analysis to experimenting with the latest machine learning techniques (including classical data mining and the latest deep learning), with the aim of building high-quality prediction systems integrated with our products and solutions.

Examples of the types of tasks that this may involve are:

- Develop data pipeline stages such as cleaning, validation, and wrangling for a variety of data types, including text, images, and categorical data
- Custom design and develop metrics/scoring pipelines using machine learning techniques
- Develop feature extraction pipelines from raw data stored in a variety of formats
- Design and develop feature representations using a variety of data stores, such as SQL databases, key-value or object storage, or knowledge graphs, for better predictions
- Work with standard libraries such as scikit-learn, NumPy, and pandas to implement models for classification and regression tasks (see the sketch after this list)
- Work with TensorFlow, Keras, PyTorch, etc. to implement custom and pre-built neural network models such as RNNs and CNNs
- Develop internal A/B testing and multi-armed bandit or ensembled models and pipelines
- Work in mixed programming/scripting language environments, such as Python, Java, and C++, as application requirements dictate
- Work within state-of-the-art MLOps/CI-CD/DevOps platforms based on standard Spark, Kubernetes, and Kafka batch, streaming/real-time, or transactional distributed architectures used to host model training, test, and inference pipelines
- Work on text analytics and NLP problems such as NLU, NER, contextual embeddings, and topic modelling
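For illustration, here is a minimal sketch of the kind of classification work described above, using scikit-learn's bundled Iris dataset; the dataset and model choice are assumptions made for the example, not part of the role description:

```python
# Minimal sketch (illustrative only): a scikit-learn pipeline that imputes
# missing values, scales numeric features, and fits a classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # cleaning/validation stage
    ("scale", StandardScaler()),                   # feature scaling stage
    ("model", LogisticRegression(max_iter=1000)),  # classification stage
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```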

Responsibilities
- Select features and build and optimize classifiers using machine learning techniques
- Perform data mining and experimental analysis using state-of-the-art methods
- Process, cleanse, and verify the integrity of data used for analysis and training/inference
- Collect and understand business requirements of varying degrees of crispness
- Define and design data science techniques and pipelines that address specific business problems
- Work with datasets of varying size and complexity, including both structured and unstructured data
- Develop pipelines to process massive data streams in distributed computing environments such as Spark and Kubernetes/Docker microservices
- Develop proprietary algorithms to build customized solutions that go beyond standard industry tools and lead to innovative solutions
- Develop sophisticated visualizations of analysis output for business users
- Provide controls/analytics for all output produced to monitor and ensure that established indicators/targets are met, both during initial development and on an ongoing basis
- Identify opportunities for continuous improvement of current algorithms, solutions, and methodologies
- Proactively collaborate with business partners to monitor solution health and changing requirements, and develop actionable plans to address them while optimizing for quality, use, cost, and time-to-market, among other variables
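As a toy illustration of the internal A/B testing and multi-armed bandit pipelines named in the task list above, the following epsilon-greedy sketch allocates traffic across three variants; the payout rates and epsilon value are invented for the example:

```python
# Epsilon-greedy multi-armed bandit sketch (hypothetical payout rates).
import random

ARM_RATES = [0.05, 0.11, 0.08]  # invented conversion rate per variant
EPSILON = 0.1                   # exploration probability

counts = [0] * len(ARM_RATES)
values = [0.0] * len(ARM_RATES)  # running mean reward per arm

for _ in range(10_000):
    if random.random() < EPSILON:
        arm = random.randrange(len(ARM_RATES))                     # explore
    else:
        arm = max(range(len(ARM_RATES)), key=values.__getitem__)   # exploit
    reward = 1.0 if random.random() < ARM_RATES[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("pulls per arm:", counts)  # the best-paying arm should dominate
```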

Requirements
- Bachelor’s degree in Statistics, Computer Science, Mathematics, Machine Learning, Econometrics, Physics, Biostatistics, or a related quantitative discipline, and 3 or more years’ experience in an enterprise data science organization; a graduate degree is preferred
- Must have advanced expertise with Python as well as expertise with JSON and SQL; experience with other programming languages such as R, Java, and C++ and expertise in GraphQL are preferred
- Must have experience working with enterprise data warehouses, data marts, databases, data lakes, or other distributed or cloud-based data storage systems
- Must have experience working in cross-functional teams and the ability to communicate results to non-technical audiences
- Must have experience doing exploratory data analysis and visualization using state-of-the-art Python libraries such as pandas, NumPy, matplotlib, seaborn, plotly, and streamlit (see the sketch after this list)
- Must have experience building models/algorithms for training/inference workloads using libraries such as scikit-learn, TensorFlow, and PyTorch
- Must have a deep understanding of and experience working on at least one of the following NLP problem domains: NER, topic modelling, NLU, Q&A, NMT, or related
- Exposure to building cognitive search (information retrieval) or recommender systems (information filtering) is preferred
- Familiarity with synchronous/event-based system/data/orchestration architectures for batch, streaming/real-time, and/or transactional workloads that employ one or more of the following technologies: message queues, Kafka, RESTful microservices, Spark, Kubernetes/Docker
- Experience with cloud platform and SaaS environments and tools such as Azure, AWS, and GCP is preferred
- Familiarity with CI/CD/DevOps tools such as Bitbucket, Bamboo, Jira, and Confluence is required
- Experience doing test-driven development and using standard logging and debugging techniques is required
- Work experience in Agile (Scrum) development teams is required
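For illustration, here is a minimal exploratory-analysis sketch with pandas, seaborn, and matplotlib; the dataset and column names are synthetic, invented for the example:

```python
# Minimal EDA sketch on a synthetic dataset (columns are invented).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, size=500),
    "visits": rng.poisson(5, size=500),
    "segment": rng.choice(["new", "returning"], size=500),
})

print(df.describe())                          # summary statistics
print(df.groupby("segment")["spend"].mean())  # per-segment means

sns.histplot(data=df, x="spend", hue="segment", kde=True)
plt.title("Spend distribution by segment")
plt.show()
```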
